CN109670450B - Video-based man-vehicle object detection method
- Publication number: CN109670450B
- Application number: CN201811565548.5A
- Authority
- CN
- China
- Prior art keywords
- video
- detection method
- object detection
- conv4
- vehicle object
- Prior art date
- Legal status: Active (assumed, not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
- G06V20/584—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V20/625—License plates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Multimedia (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Image Analysis (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention provides a video-based person-and-vehicle object detection method, which comprises the following steps: A. collecting data; B. labeling the data of step A; C. generating a training set and a test set from the labeled data; D. constructing a convolutional neural network; E. performing model training on the convolutional neural network of step D; F. detecting. The beneficial effects of the invention are: the detection rate is high, and snapshots achieve a good effect even for vehicles that are difficult to detect and identify by the license plate; the method also identifies and monitors pedestrians and non-motor vehicles comparatively accurately, better supports various kinds of monitoring evidence collection, and provides a guarantee for a harmonious society, safe traffic and intelligent travel.
Description
Technical Field
The invention belongs to the technical field of traffic monitoring, and particularly relates to a video-based person-and-vehicle object detection method.
Background
In the traffic field, vehicles, pedestrians and non-motor vehicles must be detected and separated so that each can be monitored individually and illegal events can be recorded and pre-warned; detection is therefore the core of the traffic monitoring field. Vehicle detection is relatively mature and is based on license plates, with an accuracy of up to 99%; however, for unlicensed vehicles and some engineering vehicles, snapshot cameras cannot effectively locate the vehicle through license plate recognition, which complicates later evidence-collection work. Pedestrian and non-motor-vehicle detection, because the targets are comparatively small and the complexity of their pose features is far higher than in vehicle detection, is still being explored and optimized. Pedestrians and non-motor vehicles are nonetheless principal participants in traffic, so detecting them effectively is indispensable and plays an important role in the progress of the traffic field.
Disclosure of Invention
In view of the above, the present invention is directed to a video-based person-and-vehicle object detection method, so as to overcome the above-mentioned drawbacks.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a person and vehicle object detection method based on video comprises the following steps:
A. collecting data;
B. labeling the data in the step A;
C. generating a training set and a testing set for the marked data;
D. constructing a convolutional neural network;
E. performing model training on the convolutional neural network in the step D;
F. detecting.
Further, in the step a, pictures of various traffic targets in different time periods of various traffic scenes are collected.
Further, in the step B, the circumscribed rectangle of the traffic target is used as the boundary for labeling.
Further, in the step C, the training set and the test set are randomly generated at a 4:1 ratio of picture counts.
Further, the process of constructing the convolutional neural network in the step D is as follows:
D1. using a VGG network, replacing each filter of size 5*5 or larger in the convolutional layers with several filters of size 3*3 or smaller;
D2. removing the pooling layers that follow Conv1_2, Conv2_2, Conv3_2, Conv4_2 and Conv5_2 in the VGG network, and adding four groups of convolution modules Conv5_x, Conv6_x, Conv7_x and Conv8_x, where the channel number of ConvY_1 in each convolution module group is 1/2 that of ConvY_2.
Further, the training process in the step E is as follows:
E1. performing data enhancement on the training set and test set generated in step C by changing brightness and saturation, rotation, mirroring and image cropping;
E2. performing position regression and classification probability calculation respectively on the feature maps obtained from Conv4_2, Conv5_2, Conv6_2, Conv7_2 and Conv8_2.
Further, the detection process in the step F is as follows:
F1. image color conversion: converting the image color format from YUV to BGR; the conversion formulas are as follows,
B=Y+1.779*(U-128)
G=Y-0.3455*(U-128)-0.7169*(V-128)
R=Y+1.4075*(V-128);
F2. sending the BGR-format image into the trained model for detection; the model outputs the detected target type, target position (x, y, w, h) and target confidence;
F3. filtering the results: false detections are filtered out by thresholding the target confidence; the most accurate target type information is obtained through the Intersection-over-Union criterion; and the most accurate target position information (x, y, w, h) is obtained through non-maximum suppression.
Compared with the prior art, the video-based person-and-vehicle object detection method has the following advantages:
the detection rate is high, and good results are achieved even for vehicles that are difficult to detect and identify by the license plate; the method identifies and monitors pedestrians and non-motor vehicles comparatively accurately, better supports various kinds of monitoring evidence collection, and provides a guarantee for a harmonious society, safe traffic and intelligent travel.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a training flow chart according to an embodiment of the present invention;
fig. 2 is a network configuration diagram of an embodiment of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
As shown in FIG. 1, a video-based person-and-vehicle object detection method comprises the following steps:
A. collecting data;
B. labeling the data in the step A;
C. generating a training set and a testing set for the marked data;
D. constructing a convolutional neural network;
E. performing model training on the convolutional neural network in the step D;
F. detecting.
In the step A, pictures of various traffic targets are collected in different time periods (including daytime, nighttime, front light and backlight) of various traffic scenes, where the traffic targets include Bicycle, Pedestrian, Car, Truck, Bus, Tricycle and Engineering Truck.
In the step B, the circumscribed rectangle of the traffic target is used as the boundary for labeling, and the labels are Bicycle (non-motor bicycle), Pedestrian, Car, Truck, Bus, Tricycle and Engineering_Truck.
In the step C, the training set and the test set are randomly generated at a 4:1 ratio of picture counts; the training set is used to train the network weights, and the test set is used to evaluate the accuracy of the trained model and prevent overfitting.
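The 4:1 random split described above can be sketched as follows; this is a minimal illustration, not code from the patent, and the function name and seed are hypothetical choices.

```python
import random

def split_dataset(pictures, train_ratio=0.8, seed=0):
    """Randomly split annotated pictures into a training set and a test set (4:1)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = pictures[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset([f"img_{i}.jpg" for i in range(100)])
print(len(train), len(test))  # 80 20
```

Keeping the split random (rather than chronological) reduces the chance that one scene or time period dominates either set.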
The process of constructing the convolutional neural network in the step D is as follows:
D1. using a VGG network, replacing each large-size filter in the convolutional layers with several small-size filters, where small size means 3*3 and below and large size generally means 5*5 and above; this leaves the receptive field over the input data unchanged, and an improved convolutional network is designed specifically for training on vehicles and people: two 3*3 convolutions have the same receptive field as one 5*5 convolution, and three 3*3 convolutions have the same receptive field as one 7*7 convolution;
D2. removing the pooling layers that follow Conv1_2, Conv2_2, Conv3_2, Conv4_2 and Conv5_2 in the VGG network, thereby reducing the information loss caused by down-sampling, and adding four groups of convolution modules Conv5_x, Conv6_x, Conv7_x and Conv8_x, where the channel number of ConvY_1 in each group is 1/2 that of ConvY_2; by deepening the convolutional network and configuring the channels differently, denser target features can be obtained; the specific network structure is shown in fig. 2;
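The receptive-field equivalence claimed in step D1 follows from the standard rule that a stack of stride-1 convolutions has an effective receptive field of 1 + Σ(kᵢ − 1). The helper below is an illustrative sketch (not from the patent) that verifies the two equivalences:

```python
def stacked_receptive_field(kernel_sizes):
    """Effective receptive field of stride-1 convolutions applied in sequence.

    Each k*k convolution extends the receptive field by k - 1 pixels.
    """
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(stacked_receptive_field([3, 3]))     # 5  -> two 3*3 convs match one 5*5
print(stacked_receptive_field([3, 3, 3]))  # 7  -> three 3*3 convs match one 7*7
```

The stacked small filters also use fewer parameters (e.g. 2·3·3 = 18 weights per channel pair versus 25 for a single 5*5), which is the usual motivation for this VGG-style substitution.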
the training process in the step E is as follows:
E1. since deep learning requires a huge number of samples, the existing samples (the training set and test set generated in step C) are augmented by changing brightness and saturation, rotation, mirroring, image cropping and other methods;
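Two of the augmentations named above, mirroring and brightness shifting, can be sketched on a plain nested-list "image" of 8-bit grey values; this is an illustrative toy, not the patent's implementation, and real pipelines would operate on full colour arrays.

```python
def mirror(image):
    """Horizontal mirroring: reverse each pixel row."""
    return [row[::-1] for row in image]

def adjust_brightness(image, delta):
    """Shift every pixel by delta, clamped to the 8-bit range [0, 255]."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]

img = [[10, 200], [0, 255]]
print(mirror(img))                 # [[200, 10], [255, 0]]
print(adjust_brightness(img, 60))  # [[70, 255], [60, 255]]
```

Note that mirroring also requires flipping the x-coordinates of the labeled bounding boxes, a step omitted here for brevity.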
E2. position regression and classification probability calculation are performed respectively on the feature maps obtained from Conv4_2, Conv5_2, Conv6_2, Conv7_2 and Conv8_2, so that the model has multi-scale features; in this way, even a small target has distinct position features on a larger feature map, so that large and small targets can both be detected well.
The loss function Loss used is a weighted sum of the position error mbox_loc and the confidence error mbox_conf:
Loss(x,c,l,g) = (1/N) * (mbox_conf(x,c) + α * mbox_loc(x,l,g))
where the weight α takes values in the range 0-1, N is the number of positive samples matched to the Ground Truth (true target boxes), c is the class confidence prediction, l is the position prediction of the predicted box, and g is the position parameter of the Ground Truth;
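The weighted loss above reduces to simple arithmetic once the two error terms are computed; the sketch below is illustrative only (the function name and example values are not from the patent).

```python
def total_loss(mbox_conf, mbox_loc, num_positives, alpha=1.0):
    """Weighted sum of confidence and position errors, normalised by N positives."""
    if num_positives == 0:
        return 0.0  # no matched boxes: conventionally the loss is zero
    return (mbox_conf + alpha * mbox_loc) / num_positives

# e.g. conf error 6.0, loc error 2.0, 4 positive matches, alpha = 0.5:
print(total_loss(6.0, 2.0, 4, 0.5))  # (6.0 + 0.5*2.0) / 4 = 1.75
```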
The position error mbox_loc is calculated using the SmoothL1 loss:
mbox_loc(x,l,g) = Σ_{i∈Pos} Σ_{m∈{cx,cy,w,h}} x_ij^k * smoothL1(l_i^m − ĝ_j^m)
where k denotes the class of the Ground Truth, and x_ij^k indicates whether the i-th predicted box and the j-th true box match with respect to class k (1 if matched, 0 otherwise);
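The SmoothL1 term itself is the standard piecewise function, quadratic near zero and linear beyond |x| = 1; a minimal sketch (not taken from the patent):

```python
def smooth_l1(x):
    """SmoothL1 loss: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

print(smooth_l1(0.5))   # 0.125  (quadratic region)
print(smooth_l1(2.0))   # 1.5    (linear region)
```

The linear tail makes the regression less sensitive to outlier boxes than a pure L2 loss, while the quadratic region keeps gradients smooth near a perfect match.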
The confidence error mbox_conf is obtained by the softmax loss:
mbox_conf(x,c) = −Σ_{i∈Pos} x_ij^p * log(ĉ_i^p) − Σ_{i∈Neg} log(ĉ_i^0), where ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)
where p denotes the predicted class and x_ij^p indicates whether the i-th predicted box and the j-th true box match with respect to class p.
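For a single box, the softmax loss is the negative log of the softmax probability assigned to the true class; a minimal illustrative sketch (names and example scores are not from the patent):

```python
import math

def softmax(scores):
    """Softmax with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(scores, true_class):
    """Cross-entropy on softmax probabilities: -log p(true class)."""
    return -math.log(softmax(scores)[true_class])

# two equally scored classes -> probability 0.5 -> loss ln(2)
print(round(softmax_loss([0.0, 0.0], 0), 4))
```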
The detection process in the step F is as follows:
F1. image color conversion: since the image format acquired by the monitoring camera is YUV and the network input is BGR, the image color format is converted from YUV to BGR; the conversion formulas are as follows,
B=Y+1.779*(U-128)
G=Y-0.3455*(U-128)-0.7169*(V-128)
R=Y+1.4075*(V-128)
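The three formulas above can be applied per pixel as sketched below; the clamping to [0, 255] is an assumption added for 8-bit output and is not stated in the patent.

```python
def yuv_to_bgr(y, u, v):
    """Convert one YUV pixel to BGR using the coefficients above, clamped to [0, 255]."""
    b = y + 1.779 * (u - 128)
    g = y - 0.3455 * (u - 128) - 0.7169 * (v - 128)
    r = y + 1.4075 * (v - 128)
    clamp = lambda x: int(min(255, max(0, round(x))))
    return clamp(b), clamp(g), clamp(r)

# neutral chroma (U = V = 128) leaves a grey pixel unchanged:
print(yuv_to_bgr(128, 128, 128))  # (128, 128, 128)
```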
F2. sending the BGR-format image into the trained model for detection; the model outputs the detected target type, target position (x, y, w, h) and target confidence;
F3. filtering the results: false detections are filtered out by thresholding the target confidence; the most accurate target type information is obtained through the Intersection-over-Union (IoU) criterion; and the most accurate target position information (x, y, w, h) is obtained through non-maximum suppression (NMS). In this embodiment, a detection with confidence greater than or equal to 0.8 is taken as a correct target, and targets below 0.8 are filtered out; boxes with IoU greater than or equal to 0.4 are considered the same target; and the NMS threshold is taken as 0.4.
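Step F3 combines a confidence threshold, IoU, and greedy NMS; the sketch below illustrates that pipeline with the embodiment's 0.8/0.4 thresholds. It is an illustrative reconstruction, and the detection tuples and sample boxes are invented for the example.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(detections, conf_thresh=0.8, iou_thresh=0.4):
    """Drop low-confidence detections, then greedily suppress overlapping boxes."""
    boxes = sorted((d for d in detections if d[1] >= conf_thresh),
                   key=lambda d: d[1], reverse=True)
    kept = []
    for label, conf, box in boxes:
        if all(iou(box, k[2]) < iou_thresh for k in kept):
            kept.append((label, conf, box))
    return kept

dets = [("Car", 0.95, (10, 10, 50, 50)),
        ("Car", 0.85, (12, 12, 50, 50)),   # heavy overlap with the first -> suppressed
        ("Bus", 0.90, (200, 40, 80, 60)),
        ("Car", 0.50, (10, 10, 50, 50))]   # below the 0.8 confidence threshold
print(nms(dets))
```

Only the highest-confidence Car and the Bus survive: the duplicate Car box exceeds the 0.4 IoU threshold against a kept box, and the 0.5-confidence detection fails the 0.8 gate.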
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.
Claims (4)
1. A video-based human-vehicle object detection method, characterized by comprising the following steps:
A. collecting data;
B. labeling the data in the step A;
C. generating a training set and a testing set for the marked data;
D. constructing a convolutional neural network;
E. performing model training on the convolutional neural network in the step D;
F. detecting;
the convolutional neural network of the step D comprises the following steps:
Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2, Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1 and Conv8_2 convolutional layers;
the training process in the step E is as follows:
E1. c, carrying out data enhancement on the training set and the test set generated in the step C through changing brightness, saturation, rotation and mirroring and image clipping;
E2. position regression and classification probability calculation are performed respectively on the feature maps obtained from Conv4_2, Conv5_2, Conv6_2, Conv7_2 and Conv8_2;
during this process, the feature maps obtained from Conv4_2, Conv5_2, Conv6_2, Conv7_2 and Conv8_2 are computed respectively, and the results on the feature maps are finally fused to obtain the final output information.
2. The video-based person-vehicle object detection method according to claim 1, wherein: in step A, pictures containing various traffic targets are collected in different time periods of various traffic scenes.
3. The video-based person-vehicle object detection method according to claim 1, wherein: in step B, the circumscribed rectangle of the traffic target is used as the boundary for labeling.
4. The video-based person-vehicle object detection method according to claim 1, wherein: in step C, the training set and the test set are randomly generated at a 4:1 ratio of picture counts.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811565548.5A CN109670450B (en) | 2018-12-20 | 2018-12-20 | Video-based man-vehicle object detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109670450A (en) | 2019-04-23
CN109670450B (en) | 2023-07-25
Family
ID=66144134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811565548.5A Active CN109670450B (en) | 2018-12-20 | 2018-12-20 | Video-based man-vehicle object detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109670450B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110111577B (en) * | 2019-05-15 | 2020-11-27 | 武汉纵横智慧城市股份有限公司 | Non-motor vehicle identification method, device, equipment and storage medium based on big data |
CN110399800B (en) * | 2019-06-28 | 2021-05-07 | 智慧眼科技股份有限公司 | License plate detection method and system based on deep learning VGG16 framework and storage medium |
CN111461130B (en) * | 2020-04-10 | 2021-02-09 | 视研智能科技(广州)有限公司 | High-precision image semantic segmentation algorithm model and segmentation method |
CN113239962A (en) * | 2021-04-12 | 2021-08-10 | 南京速度软件技术有限公司 | Traffic participant identification method based on single fixed camera |
CN114973154A (en) * | 2022-07-29 | 2022-08-30 | 成都宜泊信息科技有限公司 | Parking lot identification method, parking lot identification system, parking lot control method, parking lot control system, parking lot equipment and parking lot control medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104537387A (en) * | 2014-12-16 | 2015-04-22 | 广州中国科学院先进技术研究所 | Method and system for classifying automobile types based on neural network |
CN107358182A (en) * | 2017-06-29 | 2017-11-17 | 维拓智能科技(深圳)有限公司 | Pedestrian detection method and terminal device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10068171B2 (en) * | 2015-11-12 | 2018-09-04 | Conduent Business Services, Llc | Multi-layer fusion in a convolutional neural network for image classification |
CN107316007B (en) * | 2017-06-07 | 2020-04-03 | 浙江捷尚视觉科技股份有限公司 | Monitoring image multi-class object detection and identification method based on deep learning |
CN107944442B (en) * | 2017-11-09 | 2019-08-13 | 北京智芯原动科技有限公司 | Based on the object test equipment and method for improving convolutional neural networks |
CN108090457A (en) * | 2017-12-27 | 2018-05-29 | 天津天地人和企业管理咨询有限公司 | A kind of motor vehicle based on video does not give precedence to pedestrian detection method |
CN108564555B (en) * | 2018-05-11 | 2021-09-21 | 中北大学 | NSST and CNN-based digital image noise reduction method |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104537387A (en) * | 2014-12-16 | 2015-04-22 | 广州中国科学院先进技术研究所 | Method and system for classifying automobile types based on neural network |
CN107358182A (en) * | 2017-06-29 | 2017-11-17 | 维拓智能科技(深圳)有限公司 | Pedestrian detection method and terminal device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109670450B (en) | Video-based man-vehicle object detection method | |
CN111368687B (en) | Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation | |
Chen et al. | Vehicle detection, tracking and classification in urban traffic | |
CN111104903B (en) | Depth perception traffic scene multi-target detection method and system | |
CN104134079B (en) | A kind of licence plate recognition method based on extremal region and extreme learning machine | |
CN104463241A (en) | Vehicle type recognition method in intelligent transportation monitoring system | |
CN111800507A (en) | Traffic monitoring method and traffic monitoring system | |
CN110222596B (en) | Driver behavior analysis anti-cheating method based on vision | |
CN103093249A (en) | Taxi identifying method and system based on high-definition video | |
CN115311241B (en) | Underground coal mine pedestrian detection method based on image fusion and feature enhancement | |
CN111881739B (en) | Automobile tail lamp state identification method | |
Razalli et al. | Emergency vehicle recognition and classification method using HSV color segmentation | |
CN111274886A (en) | Deep learning-based pedestrian red light violation analysis method and system | |
CN114329074B (en) | Traffic energy efficiency detection method and system for ramp road section | |
CN113033275A (en) | Vehicle lane-changing non-turn signal lamp analysis system based on deep learning | |
CN115424217A (en) | AI vision-based intelligent vehicle identification method and device and electronic equipment | |
Zhang et al. | Vehicle detection in UAV aerial images based on improved YOLOv3 | |
CN114359196A (en) | Fog detection method and system | |
CN104778454A (en) | Night vehicle tail lamp extraction method based on descending luminance verification | |
Zhang et al. | A front vehicle detection algorithm for intelligent vehicle based on improved gabor filter and SVM | |
CN105574490A (en) | Vehicle brand identification method and system based on headlight image characteristics | |
CN115019263A (en) | Traffic supervision model establishing method, traffic supervision system and traffic supervision method | |
CN114882469A (en) | Traffic sign detection method and system based on DL-SSD model | |
CN114372556A (en) | Driving danger scene identification method based on lightweight multi-modal neural network | |
Visshwak et al. | On-the-fly traffic sign image labeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||