CN111626128B

CN111626128B - Pedestrian detection method based on improved YOLOv3 in orchard environment

Info

Publication number: CN111626128B
Application number: CN202010341941.7A
Authority: CN
Inventors: 沈跃; 张健; 刘慧; 张礼帅; 吴边
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2020-04-27
Filing date: 2020-04-27
Publication date: 2023-07-21
Anticipated expiration: 2040-04-27
Also published as: CN111626128A

Abstract

The invention discloses a pedestrian detection method for orchard environments based on an improved YOLOv3. The method comprises the following steps: S1, acquiring images in an orchard environment and preprocessing them to build an orchard pedestrian sample set; S2, generating anchor boxes with a K-means clustering algorithm to compute pedestrian candidate boxes; S3, adding a finer feature extraction layer to the YOLOv3 network and adding a detection output on the large-scale feature layer to obtain the improved network model YOLO-Z; S4, feeding the training set into the YOLO-Z network for training under multiple environments and saving the resulting weight file; S5, introducing a Kalman filtering algorithm with corresponding improvements to make the model more robust, reduce missed detections and increase detection speed. The invention addresses the low real-time detection speed and low accuracy of pedestrian detection in orchard environments, realizes multi-task training, and ensures both the speed and the accuracy of pedestrian detection in orchard environments.

Description

Pedestrian detection method based on improved YOLOv3 in orchard environment

Technical Field

The invention relates to a pedestrian detection method for orchard environments based on an improved YOLOv3. It targets pedestrian detection by unmanned agricultural machinery in orchard environments and belongs to the technical field of deep learning and pedestrian detection.

Background

With the rapid development of artificial intelligence, intelligent agricultural equipment has reached a historic opportunity, and unmanned agricultural machinery is a key part of that equipment. Obstacle detection is the primary problem unmanned agricultural machinery faces when operating in the field, and pedestrian detection is its most critical part. Methods commonly used for pedestrian detection at present include methods based on motion characteristics, shape information, pedestrian models, stereoscopic vision, neural networks, and wavelets combined with support vector machines.

Pedestrian detection in an orchard environment faces a series of problems. (1) Multiple poses. A pedestrian is a strongly non-rigid target and may take many different poses: stationary or walking, standing or squatting. (2) Scene complexity. Pedestrians blend into the background and are difficult to separate from it. (3) Real-time performance of the pedestrian detection and tracking system. Practical applications usually impose requirements on the response speed of the detection and tracking system, yet pedestrian detection algorithms are often complex, which further limits the real-time performance of the system. (4) Occlusion. In a real environment, people frequently occlude one another. The method combines computer vision with deep learning to detect pedestrians, and provides a research foundation for realizing pedestrian detection.

Disclosure of Invention

To meet the above pedestrian detection requirements of intelligent unmanned agricultural machinery in an orchard environment, the invention provides a pedestrian detection method for orchard environments based on an improved YOLOv3. Detection is treated as a regression problem: a convolutional network processes the whole image directly and predicts the class and position of each detection.

The invention discloses a pedestrian detection method in an orchard environment based on improved YOLOv3, which comprises the following steps:

step 1: collecting pedestrian images in an orchard environment;

collecting, with a depth camera, images of pedestrians at various positions in the orchard, including images of pedestrians under different occlusion conditions, images under different weather conditions, and images of pedestrians at short, medium and long distances;

step 2: preprocessing the image acquired in the step 1, and constructing a standard pedestrian detection data set;

step 3: feeding the training set processed in step 2 into a convolutional feature extractor to extract pedestrian features, generating anchor boxes with a K-means clustering algorithm to produce predicted pedestrian bounding boxes, and performing multi-scale fusion prediction with an FPN-like network to improve the accuracy of bounding box and class prediction, with the following specific steps:

(3.1): randomly selecting the width and height of one coordinate box as the first cluster center;

(3.2): selecting the n-th cluster center on the principle that the larger the similarity distance between a box and the n-1 cluster centers already chosen, the larger the probability that the box is selected;

(3.3): repeating (3.2) until all initial cluster centers are determined;

(3.4): computing the IoU (Intersection over Union) between each remaining coordinate box and every cluster center to obtain the similarity distance (IoU loss) between the two boxes, and assigning each coordinate box to the class whose cluster center has the smallest similarity distance to it;

(3.5): after all coordinate boxes have been traversed, computing the mean width and height of the coordinate boxes in each class and using it as the cluster center for the next iteration;

(3.6): repeating (3.4) and (3.5) until the difference in total IoU loss between adjacent iterations is smaller than a threshold or the maximum number of iterations is reached, then stopping the clustering algorithm.

The improved K-means clustering algorithm mainly optimizes the selection of the initial cluster centers so that the similarity distances between them are as large as possible.
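Steps (3.1) to (3.6) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: boxes are (width, height) pairs compared as if they shared a corner, the selection probability in (3.2) is taken to be proportional to the distance to the nearest chosen center, and the names `iou_wh`, `init_centers` and `kmeans_anchors` as well as the default `max_iter` and `tol` values are hypothetical.

```python
import random


def iou_wh(box, centroid):
    """IoU of two (width, height) pairs, assuming both boxes share one corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def distance(box, centroid):
    """Similarity distance d = 1 - IoU used for clustering."""
    return 1.0 - iou_wh(box, centroid)


def init_centers(boxes, k, rng):
    # (3.1) the first center is picked uniformly at random
    centers = [rng.choice(boxes)]
    # (3.2)-(3.3) further centers: boxes far from all chosen centers are more likely
    while len(centers) < k:
        weights = [min(distance(b, c) for c in centers) for b in boxes]
        if sum(weights) == 0:  # every box coincides with a chosen center
            centers.append(rng.choice(boxes))
        else:
            centers.append(rng.choices(boxes, weights=weights, k=1)[0])
    return centers


def kmeans_anchors(boxes, k, max_iter=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    centers = init_centers(boxes, k, rng)
    prev_loss = None
    for _ in range(max_iter):
        # (3.4) assign every box to the center with the smallest distance
        clusters = [[] for _ in range(k)]
        total_loss = 0.0
        for b in boxes:
            dists = [distance(b, c) for c in centers]
            j = dists.index(min(dists))
            clusters[j].append(b)
            total_loss += dists[j]
        # (3.5) new center = mean width and height of each class
        centers = [
            (sum(w for w, _ in cl) / len(cl), sum(h for _, h in cl) / len(cl))
            if cl else centers[j]  # an empty class keeps its previous center
            for j, cl in enumerate(clusters)
        ]
        # (3.6) stop when the total IoU loss no longer changes
        if prev_loss is not None and abs(prev_loss - total_loss) < tol:
            break
        prev_loss = total_loss
    return sorted(centers, key=lambda c: c[0] * c[1])
```

Calling `kmeans_anchors(boxes, k)` returns `k` (width, height) anchors sorted by area, so the smallest anchors can be assigned to the largest-scale output layer.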

Step 4: adding a finer feature extraction layer to the YOLOv3 network and adding a detection output on the large-scale feature layer to obtain the improved network model YOLO-Z, specifically as follows:

(4.1): the training set images obtained in step 2 are resized to 608×608, the IOU threshold is set to 0.45, and the confidence threshold is set to 0.5. Each grid cell predicts B bounding boxes, each containing 1 confidence score, 4 coordinate values and C class probabilities, where B is the number of anchor boxes of the output feature layer in which the cell is located. Then, for an output feature layer of size $N \times N$, the final output dimension is $N \times N \times [B \times (4 + 1 + C)]$;
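Because each cell predicts B boxes carrying 4 + 1 + C values each, the output shape of a detection layer follows by simple arithmetic. The sketch below assumes the usual YOLOv3 downsampling strides of 32, 16 and 8, and uses B = 3 and C = 1 (a single pedestrian class) purely as illustrative values; the function name `yolo_output_shape` is hypothetical.

```python
def yolo_output_shape(input_size, stride, num_anchors, num_classes):
    """Shape (N, N, channels) of one detection layer for a square input image."""
    n = input_size // stride
    # each anchor box carries 4 coordinates, 1 confidence score and C class probabilities
    return (n, n, num_anchors * (4 + 1 + num_classes))


# Assumed strides 32/16/8; B = 3 and C = 1 are illustrative, not taken from the text.
for stride in (32, 16, 8):
    print(stride, yolo_output_shape(608, stride, num_anchors=3, num_classes=1))
```

With a 608×608 input this gives 19×19, 38×38 and 76×76 grids, each with 18 channels under the assumed B and C.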

the clustering uses the formula

$$d(\text{box}, \text{centroid}) = 1 - \mathrm{IOU}(\text{box}, \text{centroid})$$

where box is a prior box, centroid is a cluster center, and IOU(box, centroid) is the intersection-over-union of the two regions; when d(box, centroid) is less than or equal to the measurement threshold, the width and height of the anchor box are determined.
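When only widths and heights are clustered, a common convention is to align the two boxes at a shared corner before computing the overlap. That convention is an assumption here, since the text does not say how the boxes are positioned. Writing the box as $(w_b, h_b)$ and the centroid as $(w_c, h_c)$, the IOU then reduces to:

```latex
\mathrm{IOU}(\text{box}, \text{centroid})
  = \frac{\min(w_b, w_c)\,\min(h_b, h_c)}
         {w_b h_b + w_c h_c - \min(w_b, w_c)\,\min(h_b, h_c)}
```

For example, a box of $10 \times 20$ and a centroid of $11 \times 21$ give an IOU of $200 / 231 \approx 0.866$, so $d \approx 0.134$.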

The predicted bounding box is given by

$$b_x = \sigma(t_x) + c_x$$

$$b_y = \sigma(t_y) + c_y$$

$$b_w = p_w e^{t_w}$$

$$b_h = p_h e^{t_h}$$

where $c_x$ and $c_y$ are the horizontal and vertical offsets of the grid cell from the upper-left corner of the image; $p_w$ and $p_h$ are the width and height of the bounding box before prediction; $t_x$ and $t_y$ are the predicted center offset parameters and $t_w$ and $t_h$ the predicted width and height parameters; $\sigma(t_x)$ and $\sigma(t_y)$ are the horizontal and vertical distances from the center of the predicted box to the upper-left corner of the cell in which it lies; $b_x$ and $b_y$ are the abscissa and ordinate of the predicted bounding box center; and $b_w$ and $b_h$ are the width and height of the predicted bounding box.
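The decoding of raw network outputs into a box can be sketched as below. The center equations are those given in the text; the width and height equations $b_w = p_w e^{t_w}$ and $b_h = p_h e^{t_h}$ are the standard YOLOv3 forms and are assumed here, as are the parameter names `t_w` and `t_h` and the function name `decode_box`.

```python
import math


def sigmoid(t):
    """Logistic function, which keeps the center offset inside the cell (0..1)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Turn raw predictions into a box, in grid-cell units."""
    b_x = sigmoid(t_x) + c_x    # center abscissa: cell offset plus in-cell position
    b_y = sigmoid(t_y) + c_y    # center ordinate
    b_w = p_w * math.exp(t_w)   # assumed standard YOLOv3 scaling of the prior width
    b_h = p_h * math.exp(t_h)   # assumed standard YOLOv3 scaling of the prior height
    return b_x, b_y, b_w, b_h
```

With all raw outputs at zero, the box center sits in the middle of its cell and the box keeps the prior's width and height.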

The confidence of a predicted bounding box is

$$\text{confidence} = \Pr(\text{object}) \times \mathrm{IOU}_{\text{pred}}^{\text{truth}}$$

where Pr(object) is 0 or 1, 0 indicating that there is no object in the image and 1 indicating that there is one; $\mathrm{IOU}_{\text{pred}}^{\text{truth}}$ is the intersection-over-union between the predicted bounding box and the actual bounding box. The confidence score reflects whether a target is contained and, if so, how accurate the predicted location is. With the confidence threshold set to 0.5, a predicted bounding box is deleted when its confidence is less than 0.5 and kept when its confidence is greater than 0.5.
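The thresholding step can be sketched as follows. The confidence threshold of 0.5 comes from the text. Using the 0.45 IOU threshold for non-maximum suppression is an assumption about its purpose, and the corner-format boxes and the names `iou` and `filter_detections` are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, conf_thresh=0.5, iou_thresh=0.45):
    """dets: list of (box, confidence). Returns the detections that survive."""
    # Drop low-confidence boxes. The text does not say what happens at exactly
    # 0.5; this sketch keeps such boxes.
    candidates = sorted(
        (d for d in dets if d[1] >= conf_thresh), key=lambda d: d[1], reverse=True
    )
    kept = []
    for box, score in candidates:
        # Assumed use of the IOU threshold: suppress boxes that overlap a
        # higher-scoring kept box too strongly (non-maximum suppression).
        if all(iou(box, k[0]) <= iou_thresh for k in kept):
            kept.append((box, score))
    return kept
```

Sorting by confidence first means that, among overlapping boxes, the most confident one is always the one retained.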

(4.2): adding a finer feature extraction layer to the YOLOv3 network and adding a detection output on the large-scale feature layer;

the YOLOv3 network applies a large number of convolutions at each downsampling stage. According to the receptive field formula, the receptive field grows as the number of network layers increases, and the extracted features fuse more information; that is, the deeper the network, the more it attends to global information. Pedestrians occupy a small proportion of the picture, so this is a small-object detection task; in deep feature maps, small objects contribute little to the feature map and their information is severely lost. Therefore a finer feature extraction layer is added: while keeping the original output layers of YOLOv3, the output feature map is upsampled to a larger feature map, which is combined with the shallow convolution layer of the same size, and the prediction is output after several further convolution layers, giving the model YOLO-Z;
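The growth of the receptive field with depth can be illustrated with the usual recurrence $r_l = r_{l-1} + (k_l - 1) \cdot j_{l-1}$, where $j$ is the product of the strides of all earlier layers. The text refers to a receptive field formula without stating it, so this recurrence is assumed, and the layer stack below is a hypothetical toy example rather than the actual YOLOv3 backbone.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field after each layer."""
    r, jump = 1, 1  # a single input pixel sees itself; jump = cumulative stride
    sizes = []
    for kernel, stride in layers:
        r += (kernel - 1) * jump  # each layer widens the field by (k - 1) input-space steps
        jump *= stride            # downsampling makes later layers widen it faster
        sizes.append(r)
    return sizes


# Hypothetical stack: 3x3 convolutions, two of them with stride 2 (downsampling).
print(receptive_field([(3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]))
```

The output `[3, 5, 9, 13, 21]` shows why a small pedestrian is better represented in a shallow, large-scale feature map than in a deep one.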

(4.3): pedestrians are then predicted by multi-scale fusion through an FPN-like network; because the YOLOv3 algorithm treats target detection as a regression problem, a mean square error loss function is used;

the mean square error loss function formula used for class prediction is

where: $S^2$ is the grid size of the final feature map of the network; $B$ is the number of boxes predicted by each grid cell; $x$, $y$, $w$ and $h$ are the center coordinates, width and height of a box; $C_i$ is the predicted confidence that the box contains a pedestrian and $\hat{C}_i$ is the true confidence that a pedestrian is present in the box; $P_i(c)$ is the predicted pedestrian confidence and $\hat{P}_i(c)$ is the true pedestrian confidence; $\mathbb{1}_{ij}^{obj}$ indicates whether the j-th bounding box in the i-th grid cell is responsible for the object, the responsible box being the one with the largest IOU with the object's ground-truth box; $\lambda_{coord}$ is the weight coefficient of the bounding box coordinate prediction error; $\lambda_{noobj}$ is the weight of the classification error; $\mathbb{1}_{i}^{obj}$ indicates whether the center of an object falls in grid cell i; if the cell contains the center of an object, it is responsible for predicting the class probability of that object;
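For orientation, the sum-of-squared-errors loss of the original YOLO formulation can be written with these symbols as shown below, with hats marking ground-truth values. This is a sketch, not a transcription of the patent's formula: that the method uses exactly these terms, including the square roots on width and height, is an assumption, as is the indicator $\mathbb{1}_{ij}^{noobj}$ for boxes that are not responsible for any object.

```latex
\begin{aligned}
\text{loss} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
  &+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left( P_i(c) - \hat{P}_i(c) \right)^2
\end{aligned}
```

In this form $\lambda_{coord}$ scales the two localization terms, and $\lambda_{noobj}$ scales the confidence error of boxes that contain no object, so that the many empty cells do not dominate training.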

step 5: feeding the training set into the YOLO-Z network for training under multiple environments and saving the resulting weight file;

in the improved YOLO-Z network, a convolution layer is added to extract finer features and small targets are detected in the shallow layers, which yields a pedestrian detection model for orchards. Using prior knowledge of the data set, the widths and heights of the candidate boxes are obtained with the K-means clustering algorithm; the influence of the number of candidate boxes on model performance is analyzed to obtain the best-performing model under limited computing resources; and the training parameters are optimized to improve the localization accuracy of the model.

Step 6: introducing a Kalman filtering algorithm with corresponding improvements to make the model more robust, reduce missed detections and increase detection speed, specifically as follows:

the Kalman filter is an optimal recursive estimation algorithm whose tracking process has two main steps: prediction and update. Once a state space model and an observation equation have been established for the system, the filter obtains a predicted value of the state variable at the current moment from the system noise and the state variable at the previous moment, and then updates the state variable with the observed value at the current moment, which yields the state estimate.

The state space model and the observation equation, which are the basis for iterative tracking with a Kalman filter, are:

$$X_i = A_{i|i-1} X_{i-1} + w_{i-1}$$

$$Z_i = H X_i + v_i$$

where $X_i$ and $X_{i-1}$ are the system states at moments i and i-1; $A_{i|i-1}$ is the state transition matrix, which depends on the state variables of the system and the motion model of the target; $Z_i$ is the observation of the system at moment i; $H$ is the observation matrix, which relates to the state variables of the system and the observed values; $w_{i-1}$ is the system noise and $v_i$ is the measurement noise of the system, both normally distributed, with covariances Q and R respectively.
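The predict and update steps for these two equations can be sketched for a single coordinate of a pedestrian's box center. This is a textbook Kalman filter under assumed settings, not the improved variant of the invention, whose modifications are not detailed in the text: the state is (position, velocity) with a constant-velocity transition matrix A = [[1, dt], [0, 1]], the observation matrix is H = [1, 0], Q = q·I, and R = r is a scalar. All function names and default values are hypothetical.

```python
def predict(x, P, dt, q):
    """Prediction step: X = A X and P = A P A^T + Q for A = [[1, dt], [0, 1]]."""
    p, v = x
    (a, b), (c, d) = P
    x_pred = (p + dt * v, v)
    P_pred = (
        (a + dt * (b + c) + dt * dt * d + q, b + dt * d),
        (c + dt * d, d + q),
    )
    return x_pred, P_pred


def update(x, P, z, r):
    """Update step with observation Z = H X + v, H = [1, 0], noise variance r."""
    (a, b), (c, d) = P
    s = a + r                  # innovation covariance H P H^T + R
    k0, k1 = a / s, c / s      # Kalman gain K = P H^T / S
    y = z - x[0]               # innovation: measured minus predicted position
    x_new = (x[0] + k0 * y, x[1] + k1 * y)
    P_new = (                  # (I - K H) P
        ((1 - k0) * a, (1 - k0) * b),
        (c - k1 * a, d - k1 * b),
    )
    return x_new, P_new


def track(measurements, dt=1.0, q=1e-3, r=1.0):
    """Filter a sequence of positions; None marks a frame where detection failed."""
    x, P = (measurements[0], 0.0), ((1.0, 0.0), (0.0, 1.0))
    estimates = []
    for z in measurements[1:]:
        x, P = predict(x, P, dt, q)
        if z is not None:      # on a missed detection, keep the prediction only
            x, P = update(x, P, z, r)
        estimates.append(x[0])
    return estimates
```

Passing `None` for a frame shows how the prediction step alone can carry a track through a missed detection, for instance when a pedestrian is briefly occluded by a tree.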

The invention has the following advantages:

1. the improved K-means clustering algorithm optimizes the selection of the initial cluster centers so that the similarity distances between them are as large as possible, which effectively shortens clustering time and improves the clustering result;

2. a convolution layer is added in the shallow part of the network to extract finer features, and small targets are detected in the shallow layers, so that the detection accuracy of the resulting YOLO-Z model is greatly improved, its detection speed is also markedly improved, and the requirement of real-time detection is met;

3. combining the YOLO-Z model with a Kalman filtering algorithm reduces the missed-detection rate where occlusion is pronounced and further increases the detection speed there.

Drawings

Fig. 1 is a flowchart of an overall implementation process of a pedestrian detection method in an orchard environment based on improved YOLOv3 in an embodiment of the present invention.

FIG. 2 is a diagram of network-based coordinate prediction in multitasking training in accordance with an embodiment of the present invention.

FIG. 3 shows the convolutional feature extractor with the added shallow layer, based on the YOLOv3 network, in an embodiment of the present invention.

FIG. 4 shows detection results of the orchard pedestrian detection method based on improved YOLOv3 in an embodiment of the present invention: (a) stationary state; (b) moving state; (c) normal posture; (d) abnormal posture; (e) large target; (f) medium target; (g) small target.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As shown in fig. 1, the invention provides a pedestrian detection method in an orchard environment based on improved YOLOv3, which comprises the following steps:

step 1: collecting pedestrian images in an orchard environment;

as shown in fig. 2-3, step 3: feeding the training set processed in step 2 into a convolutional feature extractor to extract pedestrian features, generating anchor boxes with a K-means clustering algorithm to produce predicted pedestrian bounding boxes, and performing multi-scale fusion prediction with an FPN-like network to improve the accuracy of bounding box and class prediction, with the following specific steps:

(3.3): cycling (3.2) until all initial cluster centers are determined;

(3.4): computing the IoU between each remaining coordinate box and every cluster center to obtain the similarity distance (IoU loss) between the two boxes, and assigning each coordinate box to the class whose cluster center has the smallest similarity distance to it;

the clustering uses the formula

$$d(\text{box}, \text{centroid}) = 1 - \mathrm{IOU}(\text{box}, \text{centroid})$$

The center of the predicted bounding box is given by

$$b_x = \sigma(t_x) + c_x$$

$$b_y = \sigma(t_y) + c_y$$

The confidence of a predicted bounding box is

$$\text{confidence} = \Pr(\text{object}) \times \mathrm{IOU}_{\text{pred}}^{\text{truth}}$$

In the mean square error loss function: $S^2$ is the grid size of the final feature map of the network; $B$ is the number of boxes predicted by each grid cell; $x$, $y$, $w$ and $h$ are the center coordinates, width and height of a box; $C_i$ is the predicted confidence that the box contains a pedestrian and $\hat{C}_i$ is the true confidence that a pedestrian is present in the box; $P_i(c)$ is the predicted pedestrian confidence and $\hat{P}_i(c)$ is the true pedestrian confidence; $\mathbb{1}_{ij}^{obj}$ indicates whether the j-th bounding box in the i-th grid cell is responsible for the object, the responsible box being the one with the largest IOU with the object's ground-truth box; $\lambda_{noobj}$ is the weight of the classification error; $\mathbb{1}_{i}^{obj}$ indicates whether the center of an object falls in grid cell i; if the cell contains the center of an object, it is responsible for predicting the class probability of that object;

$$X_i = A_{i|i-1} X_{i-1} + w_{i-1}$$

$$Z_i = H X_i + v_i$$

where $X_i$ and $X_{i-1}$ are the system states at moments i and i-1; $A_{i|i-1}$ is the state transition matrix, which depends on the state variables of the system and the motion model of the target; $Z_i$ is the observation of the system at moment i; $H$ is the observation matrix, which relates to the state variables of the system and the observed values; $w_{i-1}$ is the system noise and $v_i$ is the measurement noise of the system, both normally distributed, with covariances Q and R respectively.

As shown in fig. 4, the pedestrian detection method based on improved YOLOv3 builds on YOLOv3 and targets detection difficulties in the orchard environment such as illumination and occlusion. By improving the training samples and the network structure, it proposes the YOLO-Z network and improves the K-means clustering algorithm and the Kalman filtering algorithm, which raises the precision and recall of pedestrian detection, meets the requirement of real-time detection, lowers the hardware requirements of the network model, and benefits pedestrian detection by intelligent agricultural machinery in orchards.

In summary, the invention provides a pedestrian detection method for orchard environments based on an improved YOLOv3. The method comprises the following steps: S1, acquiring images in an orchard environment and preprocessing them to build an orchard pedestrian sample set; S2, generating anchor boxes with a K-means clustering algorithm to compute pedestrian candidate boxes; S3, adding a finer feature extraction layer to the YOLOv3 network and adding a detection output on the large-scale feature layer to obtain the improved network model YOLO-Z; S4, feeding the training set into the YOLO-Z network for training under multiple environments and saving the resulting weight file; S5, introducing a Kalman filtering algorithm with corresponding improvements to make the model more robust, reduce missed detections and increase detection speed. The invention addresses the low real-time detection speed and low accuracy of pedestrian detection in orchard environments, realizes multi-task training, and ensures both the speed and the accuracy of pedestrian detection in orchard environments.

In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. The pedestrian detection method based on the improved YOLOv3 in the orchard environment is characterized by comprising the following steps of:

step 1: collecting pedestrian images in an orchard environment;

step 3: making a training set from the pedestrian detection data set of step 2, feeding the training set into a convolutional feature extractor to extract pedestrian features, generating anchor boxes with a K-means clustering algorithm to produce predicted pedestrian bounding box expansion data, and performing multi-scale fusion prediction with an FPN-like network to improve the accuracy of bounding box and class prediction;

step 4: adding a finer feature extraction layer to the YOLOv3 network and adding a detection output on the large-scale feature layer to obtain the improved network model YOLO-Z;

the step 4 is specifically as follows:

step 4.1: first, resizing the training set images obtained in step 2 to 608×608, setting the IoU (Intersection over Union) threshold to 0.45 and the confidence threshold to 0.5; each grid cell predicts B bounding boxes, each containing 1 confidence score, 4 coordinate values and C class probabilities, where B is the number of anchor boxes of the output feature layer in which the cell is located; then, for an output feature layer of size $N \times N$, the final output dimension is $N \times N \times [B \times (4 + 1 + C)]$;

the formula for clustering is

$$d(\text{box}, \text{centroid}) = 1 - \mathrm{IOU}(\text{box}, \text{centroid})$$

where box is a prior box, centroid is a cluster center, and IOU(box, centroid) is the intersection-over-union of the two regions; when d(box, centroid) is less than or equal to the measurement threshold, the width and height of the anchor box are determined;

the predicted bounding box is given by

$$b_x = \sigma(t_x) + c_x$$

$$b_y = \sigma(t_y) + c_y$$

$$b_w = p_w e^{t_w}$$

$$b_h = p_h e^{t_h}$$

where $c_x$ and $c_y$ are the horizontal and vertical offsets of the grid cell from the upper-left corner of the image; $p_w$ and $p_h$ are the width and height of the bounding box before prediction; $t_x$ and $t_y$ are the predicted center offset parameters and $t_w$ and $t_h$ the predicted width and height parameters; $\sigma(t_x)$ and $\sigma(t_y)$ are the horizontal and vertical distances from the center of the predicted box to the upper-left corner of the cell in which it lies; $b_x$ and $b_y$ are the abscissa and ordinate of the predicted bounding box center; and $b_w$ and $b_h$ are the width and height of the predicted bounding box;

the confidence of a predicted bounding box is

$$\text{confidence} = \Pr(\text{object}) \times \mathrm{IOU}_{\text{pred}}^{\text{truth}}$$

where Pr(object) is 0 or 1, 0 indicating that there is no object in the image and 1 indicating that there is one; $\mathrm{IOU}_{\text{pred}}^{\text{truth}}$ is the intersection-over-union between the predicted bounding box and the actual bounding box; the confidence score reflects whether a target is contained and, if so, how accurate the predicted location is; with the confidence threshold set to 0.5, a predicted bounding box is deleted when its confidence is less than 0.5 and kept when its confidence is greater than 0.5;

step 4.2: adding a finer feature extraction layer to the YOLOv3 network and adding a detection output on the large-scale feature layer;

according to the receptive field formula, the receptive field grows as the number of network layers increases, and the extracted features fuse more information, that is, the deeper the network, the more it attends to global information; pedestrians occupy a small proportion of the picture, so this is a small-object detection task; in deep feature maps, small objects contribute little to the feature map and their information is severely lost; therefore a finer feature extraction layer is added: while keeping the original output layers of YOLOv3, the output feature map is upsampled to a larger feature map, which is combined with the shallow convolution layer of the same size, and the prediction is output after several further convolution layers, giving the model YOLO-Z;

step 4.3: pedestrians are then predicted by multi-scale fusion through an FPN-like network; because the YOLOv3 algorithm treats target detection as a regression problem, a mean square error loss function is used;

the mean square error loss function formula used for class prediction is as follows

where: $S^2$ is the grid size of the final feature map of the network; $B$ is the number of boxes predicted by each grid cell; $x$, $y$, $w$ and $h$ are the center coordinates, width and height of a box; $C_i$ is the predicted confidence that the box contains a pedestrian and $\hat{C}_i$ is the true confidence that a pedestrian is present in the box; $P_i(c)$ is the predicted pedestrian confidence and $\hat{P}_i(c)$ is the true pedestrian confidence; $\mathbb{1}_{ij}^{obj}$ indicates whether the j-th bounding box in the i-th grid cell is responsible for the object, the responsible box being the one with the largest IOU with the object's ground-truth box; $\lambda_{coord}$ is the weight coefficient of the bounding box coordinate prediction error; $\lambda_{noobj}$ is the weight of the classification error; $\mathbb{1}_{i}^{obj}$ indicates whether the center of an object falls in grid cell i; if the cell contains the center of an object, it is responsible for predicting the class probability of that object;

step 6: an improved Kalman filtering algorithm is introduced to improve the robustness of the model, solve the problem of missed detection and improve the detection speed.

2. The pedestrian detection method in an orchard environment based on improved YOLOv3 of claim 1, wherein generating anchor boxes with the K-means clustering algorithm to produce the predicted pedestrian bounding box expansion data comprises the following specific steps:

step 3.1: randomly selecting the width and height of one coordinate box as the first cluster center;

step 3.2: selecting the n-th cluster center on the principle that the larger the similarity distance between a box and the n-1 cluster centers already chosen, the larger the probability that the box is selected;

step 3.3: repeating step 3.2 until all initial cluster centers are determined;

step 3.4: computing the IoU between each remaining coordinate box and every cluster center to obtain the similarity distance (IoU loss) between the two boxes, and assigning each coordinate box to the class whose cluster center has the smallest similarity distance to it;

step 3.5: after all coordinate boxes have been traversed, computing the mean width and height of the coordinate boxes in each class and using it as the cluster center for the next iteration;

step 3.6: repeating steps 3.4 and 3.5 until the difference in total IoU loss between adjacent iterations is smaller than a threshold or the maximum number of iterations is reached, then stopping the clustering algorithm.

3. The method for pedestrian detection in an orchard environment based on improved YOLOv3 of claim 1, wherein step 6 is specifically as follows:

the improved Kalman filtering algorithm is an optimal recursive estimation algorithm whose tracking process has two main steps: prediction and update; once a state space model and an observation equation have been established for the system, the filter obtains a predicted value of the state variable at the current moment from the system noise and the state variable at the previous moment, and then updates the state variable with the observed value at the current moment, which yields the state estimate;

$$X_i = A_{i|i-1} X_{i-1} + w_{i-1}$$

$$Z_i = H X_i + v_i$$

where $X_i$ and $X_{i-1}$ are the system states at moments i and i-1; $A_{i|i-1}$ is the state transition matrix, which depends on the state variables of the system and the motion model of the target; $Z_i$ is the observation of the system at moment i; $H$ is the observation matrix, which relates to the state variables of the system and the observed values; $w_{i-1}$ is the system noise and $v_i$ is the measurement noise of the system, both normally distributed, with covariances Q and R respectively.