CN111626128A - Improved YOLOv3-based pedestrian detection method in orchard environment - Google Patents

Improved YOLOv3-based pedestrian detection method in orchard environment

Info

Publication number
CN111626128A
CN111626128A (application number CN202010341941.7A)
Authority
CN
China
Prior art keywords
box
pedestrian
network
prediction
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010341941.7A
Other languages
Chinese (zh)
Other versions
CN111626128B (en)
Inventor
沈跃
张健
刘慧
张礼帅
吴边
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202010341941.7A priority Critical patent/CN111626128B/en
Publication of CN111626128A publication Critical patent/CN111626128A/en
Application granted granted Critical
Publication of CN111626128B publication Critical patent/CN111626128B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a pedestrian detection method for orchard environments based on improved YOLOv3. The method comprises the following steps: S1, acquiring images in an orchard environment, preprocessing the images, and producing an orchard pedestrian sample set; S2, generating anchor boxes with a K-means clustering algorithm to compute pedestrian candidate boxes; S3, adding a finer feature extraction layer to the YOLOv3 network and adding a detection output on a large-scale feature layer to obtain the improved network model YOLO-Z; S4, inputting the training set into the YOLO-Z network for training across multiple environments, and then saving the weight file; S5, introducing a Kalman filtering algorithm, with corresponding improvements, to improve the robustness of the model, address missed detections, and increase detection speed. The invention addresses the low speed and low accuracy of real-time pedestrian detection in orchard environments, realizes multi-task training, and ensures both the detection speed and the detection precision of pedestrians in an orchard environment.

Description

Improved YOLOv3-based pedestrian detection method in orchard environment
Technical Field
The invention relates to a pedestrian detection method for orchard environments based on improved YOLOv3, aimed at pedestrian detection for unmanned agricultural machinery in an orchard environment, and belongs to the technical field of deep learning and pedestrian detection.
Background
With the rapid development of artificial intelligence, intelligent agricultural equipment has entered a period of rapid adoption, with unmanned agricultural machinery at its core. Obstacle detection is the first problem faced when unmanned agricultural machinery operates in the field, and pedestrian detection is its most critical part. Common methods for pedestrian detection include those based on motion characteristics, shape information, pedestrian models, stereo vision, neural networks, and wavelets with support vector machines.
Pedestrian detection in an orchard environment faces a series of problems: (1) the multi-pose problem. The pedestrian target is severely non-rigid and may assume many different poses: still or walking, standing or squatting. (2) The complexity of the detection scene. Pedestrians blend with the background and are difficult to separate from it. (3) The real-time performance of the detection and tracking system. Practical applications place strict requirements on response speed, while pedestrian detection algorithms are often complex, which further strains real-time performance. (4) Occlusion. In real environments, heavy occlusion occurs between people, and between people and objects. Combining computer vision methods with deep learning provides a research basis for realizing pedestrian detection.
Disclosure of Invention
To meet the need of intelligent unmanned agricultural machinery for pedestrian detection in an orchard environment, the invention provides a pedestrian detection method for orchard environments based on improved YOLOv3 that treats detection as a regression problem, processes the whole image directly with a convolutional network, and predicts object class and position simultaneously.
The orchard environment pedestrian detection method based on the improved YOLOv3 comprises the following steps:
step 1: acquiring images of pedestrians in an orchard environment;
images of pedestrians are collected with a depth camera at various positions in the orchard, covering different occlusion conditions, different weather conditions, and pedestrians at different distances, including short, medium, and long range;
step 2: preprocessing the image acquired in the step 1, and constructing a standard pedestrian detection data set;
step 3: feeding the training set prepared in step 2 into a convolutional feature extractor to extract pedestrian features, generating anchor boxes with a K-means clustering algorithm to produce predicted pedestrian bounding boxes, and performing multi-scale fusion prediction with an FPN-like network to improve the accuracy of bounding-box and class prediction; the specific steps are as follows:
(3.1): randomly selecting the width and the height of a coordinate frame as a first clustering center;
(3.2): the n-th cluster center is selected so that a box with a larger similarity distance from the current n−1 cluster centers has a larger probability of being selected;
(3.3): looping (3.2) until all initial cluster centers are determined;
(3.4): computing IoU (Intersection over Union) between each remaining coordinate box and the cluster centers one by one to obtain the similarity distance (the IoU loss) between the two boxes, and assigning each coordinate box to the class of the cluster center with the smallest similarity distance;
(3.5): after all the coordinate frames are traversed, calculating the mean values of the width and the height of the coordinate frames in each class to be used as the clustering center of next iteration;
(3.6): repeating (3.4) and (3.5) until the total IoU loss changes by less than the threshold between adjacent iterations or the maximum number of iterations is reached, then stopping the clustering algorithm.
The improved K-means clustering algorithm mainly optimizes the selection of the initial cluster centers so that the similarity distances between them are as large as possible; this effectively shortens clustering time and improves the clustering result.
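For concreteness, the following is a minimal sketch of the IoU-distance K-means procedure described in steps (3.1)-(3.6). It is an illustrative reconstruction rather than the patent's implementation; the function and parameter names (kmeans_anchors, boxes as an array of labeled-box width/height pairs, k as the anchor count) are assumptions:

```python
import numpy as np

def iou_wh(box, centroids):
    """IoU between one (w, h) box and an array of (w, h) centroids,
    treating all boxes as sharing a common top-left corner."""
    inter = np.minimum(box[0], centroids[:, 0]) * np.minimum(box[1], centroids[:, 1])
    union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k, max_iter=100, tol=1e-6, seed=None):
    """Cluster labeled (w, h) boxes into k anchors with d = 1 - IoU."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = [boxes[rng.integers(len(boxes))]]       # (3.1) random first center
    while len(centroids) < k:                           # (3.2)-(3.3) spread-out init
        d = np.array([1 - iou_wh(b, np.asarray(centroids)).max() for b in boxes])
        centroids.append(boxes[rng.choice(len(boxes), p=d / d.sum())])
    centroids = np.asarray(centroids)
    prev_loss = np.inf
    for _ in range(max_iter):
        dists = np.stack([1 - iou_wh(b, centroids) for b in boxes])  # (3.4)
        assign = dists.argmin(axis=1)
        loss = dists[np.arange(len(boxes)), assign].sum()            # total IoU loss
        if abs(prev_loss - loss) < tol:                 # (3.6) stopping criterion
            break
        prev_loss = loss
        for j in range(k):                              # (3.5) per-class mean (w, h)
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids
```

The resulting centroids would serve as the anchor-box widths and heights used in step 4.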
step 4: a finer feature extraction layer is added to the YOLOv3 network, and a detection output is added on a large-scale feature layer, yielding the improved network model YOLO-Z, specifically as follows:
(4.1): first, the training-set images obtained in step 2 are resized to 608 × 608, the IOU threshold is set to 0.45, and the confidence threshold is set to 0.5. Each grid cell predicts B bounding boxes, each containing 1 confidence score, 4 coordinate values, and C class probabilities, where B is the number of anchor boxes of the output feature layer the grid cell belongs to. Then, for an S × S output feature layer, the final output dimension is S × S × B × (5 + C).
The formula used for clustering is
d(box, centroid) = 1 − IOU(box, centroid)
where box is a prior box, centroid is a cluster center, and IOU(box, centroid) is the intersection-over-union of the two areas; when d(box, centroid) is less than or equal to the measurement threshold, the width and height of the anchor box are determined.
The bounding-box prediction formulas are
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where c_x and c_y are the horizontal and vertical offsets of the grid cell from the top-left corner of the image; p_w and p_h are the width and height of the anchor box before prediction; t_x and t_y are the predicted center parameters; σ(t_x) and σ(t_y) are the horizontal and vertical distances of the predicted box center from the top-left corner of its grid cell; b_x and b_y are the abscissa and ordinate of the predicted box center; and b_w and b_h are the width and height of the predicted bounding box.
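A small sketch of these decoding formulas, assuming t_x, t_y, t_w, t_h are the raw network outputs for one box and all quantities are expressed in grid-cell units (the function name is illustrative):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply the prediction formulas above: a sigmoid-bounded center offset
    inside the cell at (cx, cy) plus exponential scaling of anchor (pw, ph)."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx = sigmoid(tx) + cx      # center abscissa
    by = sigmoid(ty) + cy      # center ordinate
    bw = pw * math.exp(tw)     # predicted width
    bh = ph * math.exp(th)     # predicted height
    return bx, by, bw, bh
```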
The confidence of a predicted bounding box is given by
Confidence = Pr(object) × IOU(pred, truth)
where Pr(object) is 0 or 1: 0 means there is no object in the image and 1 means there is an object; IOU(pred, truth) is the intersection-over-union between the predicted bounding box and the ground-truth bounding box. The confidence score reflects whether the box contains a target and, if it does, how accurate the predicted location is. The confidence threshold is set to 0.5: a predicted bounding box with confidence below 0.5 is deleted, and one with confidence above 0.5 is retained.
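The confidence definition and the 0.5 threshold can likewise be sketched as follows (an illustration under the assumption that boxes are given as (x1, y1, x2, y2) corners; the helper names are not from the patent):

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def confidence(pr_object, pred_box, truth_box):
    """Confidence = Pr(object) * IOU(pred, truth), with Pr(object) in {0, 1}."""
    return pr_object * box_iou(pred_box, truth_box)

def keep_boxes(boxes, scores, threshold=0.5):
    """Delete predicted boxes below the 0.5 confidence threshold."""
    return [b for b, s in zip(boxes, scores) if s > threshold]
```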
(4.2): adding a finer feature extraction layer to the YOLOv3 network and adding a detection output on a large-scale feature layer;
the YOLOv3 network adopts a large amount of convolution every time a downsampling is carried out, and according to a receptive field calculation formula, as the number of layers of the network increases, the receptive field increases, and the extracted features are formed by fusing more information, namely, the deeper the network, the more the network pays attention to the global information. The proportion of the pedestrians in the picture is small, the method belongs to small-size object detection, in the deep characteristic diagram, the influence of the information of the small-size object on the characteristic diagram is small, and the information loss of the small-size object is serious. Therefore, a more detailed feature extraction layer is added, on the basis of reserving the original output layer of YOLOv3, the output feature map is subjected to upsampling to obtain a size feature map, the size feature map is combined with the shallow size convolution layer, and the prediction output is carried out after a plurality of convolution layers to obtain a model YOLO-Z;
(4.3): multi-scale fusion prediction for pedestrians is then performed with an FPN-like network; since the YOLOv3 algorithm treats target detection as a regression problem, a mean-square-error loss function is adopted;
the mean square error loss function (loss function) used for class prediction is formulated as
Figure BDA0002468833560000033
Wherein: s2Representing the mesh size of the final characteristic diagram of the network, B representing each meshThe number of prediction boxes, x, y, w, h, represents the center and width and height of the box, CiRepresenting the confidence that the prediction box is positioned to the pedestrian,
Figure BDA0002468833560000041
confidence, P, that a pedestrian is actually present within the framei(c) The confidence level of the predicted pedestrian is represented,
Figure BDA0002468833560000042
the confidence of the pedestrian really exists;
Figure BDA0002468833560000043
judging whether the jth bounding box in the ith grid is in charge of the object or not and judging the IOU maximum bounding box of the real existing target frame group _ judge _ box of the object;
Figure BDA0002468833560000044
a bounding box representing the IOU maximum; lambda [ alpha ]coordIs a weight coefficient for the bounding box coordinate prediction error; lambda [ alpha ]noobjA weight representing a classification error;
Figure BDA0002468833560000045
judging whether the center of the object falls into the grid i or not, wherein the grid contains the center of the object and is responsible for predicting the class probability of the object;
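A compact sketch of this loss, under the assumptions that the responsibility masks (the largest-IOU anchor for each object) and the target tensors have already been built, and taking λ_coord = 5 and λ_noobj = 0.5 from the original YOLO paper as placeholder values:

```python
import torch

def yolo_mse_loss(pred, target, obj_mask, noobj_mask,
                  lambda_coord=5.0, lambda_noobj=0.5):
    """Mean-square-error loss over (x, y, w, h, C, P(c)...) per cell/anchor.
    obj_mask marks the box responsible for each object; noobj_mask the rest."""
    xy = ((pred[..., 0:2] - target[..., 0:2]) ** 2).sum(-1)       # center terms
    wh = ((pred[..., 2:4].clamp(min=0).sqrt()
           - target[..., 2:4].sqrt()) ** 2).sum(-1)               # sqrt w, h terms
    conf = (pred[..., 4] - target[..., 4]) ** 2                   # confidence term
    cls = ((pred[..., 5:] - target[..., 5:]) ** 2).sum(-1)        # class terms
    return (lambda_coord * (obj_mask * (xy + wh)).sum()
            + (obj_mask * conf).sum()
            + lambda_noobj * (noobj_mask * conf).sum()
            + (obj_mask * cls).sum())
```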
step 5: inputting the training set into the YOLO-Z network for training across multiple environments, and then saving its weight file;
based on the improved YOLO-Z network, the convolution layer is added, more detailed feature extraction is obtained, and a small target is detected in a shallow layer to obtain a pedestrian detection model under an orchard. The method comprises the steps of obtaining the width and the height of candidate frames by using the prior knowledge of a data set and a K-means clustering algorithm, analyzing the influence of different candidate frame numbers on the model performance, obtaining a model with optimal performance under limited computing resources, and adjusting and optimizing training parameters in order to improve the positioning accuracy of the model.
step 6: a Kalman filtering algorithm is introduced and correspondingly improved to increase the robustness of the model, address missed detections, and increase detection speed, specifically as follows:
the Kalman filtering algorithm outputs an optimal recursion algorithm, and the tracking process is mainly divided into two steps: and (4) predicting and updating. After a state space model and an observation equation are established for the system, the filter can obtain a predicted value of the state variable at the current moment according to the noise of the system and the state variable at the previous moment, and the state variable is updated by combining the observation value at the current moment to finally realize the state of prediction estimation.
The state-space model and the observation equation, which are the basis of the Kalman filter's iterative tracking, are as follows:
X_i = A_{i|i−1} X_{i−1} + w_{i−1}
Z_i = H X_i + v_i
where X_i and X_{i−1} are the system states at times i and i−1; A_{i|i−1} is the state-transition matrix, which depends on the system's state variables and the target's motion pattern; Z_i is the observation of the system at time i; H is the observation matrix, relating the system state to the observations; w_{i−1} is the system noise and v_i is the measurement noise, both following normal distributions with covariances Q and R respectively.
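The textbook predict/update recursion implied by these equations can be sketched as follows; the patent's specific improvements to the filter are not detailed in this section, so only the standard form is shown (matrix names follow the equations above, and the caller supplies A, H, Q, R and the initial state):

```python
import numpy as np

class KalmanTracker:
    """Minimal Kalman recursion for X_i = A X_{i-1} + w, Z_i = H X_i + v,
    with process-noise covariance Q and measurement-noise covariance R."""
    def __init__(self, A, H, Q, R, x0, P0):
        self.A, self.H, self.Q, self.R = A, H, Q, R
        self.x, self.P = x0, P0                          # state estimate, covariance

    def predict(self):
        self.x = self.A @ self.x                         # prior state estimate
        self.P = self.A @ self.P @ self.A.T + self.Q     # prior covariance
        return self.x

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R          # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ (z - self.H @ self.x)      # correct with observation z
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P
        return self.x
```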
The invention has the following advantages:
firstly, the improved K-means clustering algorithm optimizes the selection of the initial cluster centers so that the similarity distances between them are as large as possible, which effectively shortens clustering time and improves the clustering result;
secondly, convolutional layers are added at a shallow layer of the network to obtain finer feature extraction, and small targets are detected at that shallow layer, so the detection precision of the resulting YOLO-Z model is greatly improved and its detection speed rises markedly, meeting the requirement of real-time detection;
thirdly, the YOLO-Z model is combined with a Kalman filtering algorithm to reduce missed detections in heavily occluded places and to further accelerate detection.
Drawings
Fig. 1 is a flowchart of an overall implementation process of a pedestrian detection method in an orchard environment based on improved YOLOv3 in an embodiment of the present invention.
FIG. 2 is a diagram illustrating network coordinate prediction in multitasking training according to an embodiment of the present invention.
Fig. 3 shows the convolutional feature extractor added at a shallow layer of the YOLOv3 network in an embodiment of the present invention.
Fig. 4 illustrates the detection results of the orchard pedestrian detection method based on improved YOLOv3 in an embodiment of the invention: (a) static state; (b) moving state; (c) normal posture; (d) abnormal posture; (e) large target; (f) medium target; (g) small target.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the invention provides an orchard environment pedestrian detection method based on improved YOLOv3, which comprises the following steps:
step 1: acquiring images of pedestrians in an orchard environment;
images of pedestrians are collected with a depth camera at various positions in the orchard, covering different occlusion conditions, different weather conditions, and pedestrians at different distances, including short, medium, and long range;
step 2: preprocessing the image acquired in the step 1, and constructing a standard pedestrian detection data set;
As shown in figs. 2-3, step 3: feeding the training set prepared in step 2 into a convolutional feature extractor to extract pedestrian features, generating anchor boxes with a K-means clustering algorithm to produce predicted pedestrian bounding boxes, and performing multi-scale fusion prediction with an FPN-like network to improve the accuracy of bounding-box and class prediction; the specific steps are as follows:
(3.1): randomly selecting the width and the height of a coordinate frame as a first clustering center;
(3.2): the n-th cluster center is selected so that a box with a larger similarity distance from the current n−1 cluster centers has a larger probability of being selected;
(3.3): looping (3.2) until all initial cluster centers are determined;
(3.4): computing IoU between each remaining coordinate box and the cluster centers one by one to obtain the similarity distance (the IoU loss) between the two boxes, and assigning each coordinate box to the class of the cluster center with the smallest similarity distance;
(3.5): after all the coordinate frames are traversed, calculating the mean values of the width and the height of the coordinate frames in each class to be used as the clustering center of next iteration;
(3.6): repeating (3.4) and (3.5) until the total IoU loss changes by less than the threshold between adjacent iterations or the maximum number of iterations is reached, then stopping the clustering algorithm.
The improved K-means clustering algorithm mainly optimizes the selection of the initial cluster centers so that the similarity distances between them are as large as possible; this effectively shortens clustering time and improves the clustering result.
step 4: a finer feature extraction layer is added to the YOLOv3 network, and a detection output is added on a large-scale feature layer, yielding the improved network model YOLO-Z, specifically as follows:
(4.1): first, the training-set images obtained in step 2 are resized to 608 × 608, the IOU threshold is set to 0.45, and the confidence threshold is set to 0.5. Each grid cell predicts B bounding boxes, each containing 1 confidence score, 4 coordinate values, and C class probabilities, where B is the number of anchor boxes of the output feature layer the grid cell belongs to. Then, for an S × S output feature layer, the final output dimension is S × S × B × (5 + C).
The formula used for clustering is
d(box, centroid) = 1 − IOU(box, centroid)
where box is a prior box, centroid is a cluster center, and IOU(box, centroid) is the intersection-over-union of the two areas; when d(box, centroid) is less than or equal to the measurement threshold, the width and height of the anchor box are determined.
The bounding-box prediction formulas are
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where c_x and c_y are the horizontal and vertical offsets of the grid cell from the top-left corner of the image; p_w and p_h are the width and height of the anchor box before prediction; t_x and t_y are the predicted center parameters; σ(t_x) and σ(t_y) are the horizontal and vertical distances of the predicted box center from the top-left corner of its grid cell; b_x and b_y are the abscissa and ordinate of the predicted box center; and b_w and b_h are the width and height of the predicted bounding box.
The confidence of a predicted bounding box is given by
Confidence = Pr(object) × IOU(pred, truth)
where Pr(object) is 0 or 1: 0 means there is no object in the image and 1 means there is an object; IOU(pred, truth) is the intersection-over-union between the predicted bounding box and the ground-truth bounding box. The confidence score reflects whether the box contains a target and, if it does, how accurate the predicted location is. The confidence threshold is set to 0.5: a predicted bounding box with confidence below 0.5 is deleted, and one with confidence above 0.5 is retained.
(4.2): adding a finer feature extraction layer to the YOLOv3 network and adding a detection output on a large-scale feature layer;
the YOLOv3 network adopts a large amount of convolution every time a downsampling is carried out, and according to a receptive field calculation formula, as the number of layers of the network increases, the receptive field increases, and the extracted features are formed by fusing more information, namely, the deeper the network, the more the network pays attention to the global information. The proportion of the pedestrians in the picture is small, the method belongs to small-size object detection, in the deep characteristic diagram, the influence of the information of the small-size object on the characteristic diagram is small, and the information loss of the small-size object is serious. Therefore, a more detailed feature extraction layer is added, on the basis of reserving the original output layer of YOLOv3, the output feature map is subjected to upsampling to obtain a size feature map, the size feature map is combined with the shallow size convolution layer, and the prediction output is carried out after a plurality of convolution layers to obtain a model YOLO-Z;
(4.3): multi-scale fusion prediction for pedestrians is then performed with an FPN-like network; since the YOLOv3 algorithm treats target detection as a regression problem, a mean-square-error loss function is adopted;
the mean square error loss function (loss function) used for class prediction is formulated as
Figure BDA0002468833560000073
Wherein: s2Representing the mesh size of the final characteristic diagram of the network, B representing the number of prediction boxes of each mesh, x, y, w, h representing the center and width and height of the boxes, CiRepresenting the confidence that the prediction box is positioned to the pedestrian,
Figure BDA0002468833560000081
confidence, P, that a pedestrian is actually present within the framei(c) The confidence level of the predicted pedestrian is represented,
Figure BDA0002468833560000082
the confidence of the pedestrian really exists;
Figure BDA0002468833560000083
judging whether the jth bounding box in the ith grid is responsible for the object or not and judging the largest bounding box of the IOU of the group _ route _ box of the object;
Figure BDA0002468833560000084
a bounding box representing the IOU maximum; lambda [ alpha ]noobjRepresents a weight of the classification error;
Figure BDA0002468833560000085
judging whether the center of the object falls into the grid i or not, wherein the grid contains the center of the object and is responsible for predicting the class probability of the object;
step 5: inputting the training set into the YOLO-Z network for training across multiple environments, and then saving its weight file;
based on the improved YOLO-Z network, the convolution layer is added, more detailed feature extraction is obtained, and a small target is detected in a shallow layer to obtain a pedestrian detection model under an orchard. The method comprises the steps of obtaining the width and the height of candidate frames by using the prior knowledge of a data set and a K-means clustering algorithm, analyzing the influence of different candidate frame numbers on the model performance, obtaining a model with optimal performance under limited computing resources, and adjusting and optimizing training parameters in order to improve the positioning accuracy of the model.
step 6: a Kalman filtering algorithm is introduced and correspondingly improved to increase the robustness of the model, address missed detections, and increase detection speed, specifically as follows:
the Kalman filtering algorithm outputs an optimal recursion algorithm, and the tracking process is mainly divided into two steps: and (4) predicting and updating. After a state space model and an observation equation are established for the system, the filter can obtain a predicted value of the state variable at the current moment according to the noise of the system and the state variable at the previous moment, and the state variable is updated by combining the observation value at the current moment to finally realize the state of prediction estimation.
The state-space model and the observation equation, which are the basis of the Kalman filter's iterative tracking, are as follows:
X_i = A_{i|i−1} X_{i−1} + w_{i−1}
Z_i = H X_i + v_i
where X_i and X_{i−1} are the system states at times i and i−1; A_{i|i−1} is the state-transition matrix, which depends on the system's state variables and the target's motion pattern; Z_i is the observation of the system at time i; H is the observation matrix, relating the system state to the observations; w_{i−1} is the system noise and v_i is the measurement noise, both following normal distributions with covariances Q and R respectively.

As shown in fig. 4, the orchard-environment pedestrian detection method based on improved YOLOv3 starts from YOLOv3 and, addressing detection difficulties in the orchard environment such as illumination and occlusion, proposes the YOLO-Z network through improvements to the training samples and the network structure, and improves the K-means clustering algorithm and the Kalman filtering algorithm. This raises the accuracy and recall of pedestrian detection, meets the requirement of real-time detection, reduces the network model's hardware requirements, and facilitates pedestrian detection by intelligent agricultural machinery in orchards.
In conclusion, the invention discloses a pedestrian detection method for orchard environments based on improved YOLOv3. The method comprises the following steps: S1, acquiring images in an orchard environment, preprocessing the images, and producing an orchard pedestrian sample set; S2, generating anchor boxes with a K-means clustering algorithm to compute pedestrian candidate boxes; S3, adding a finer feature extraction layer to the YOLOv3 network and adding a detection output on a large-scale feature layer to obtain the improved network model YOLO-Z; S4, inputting the training set into the YOLO-Z network for training across multiple environments, and then saving the weight file; S5, introducing a Kalman filtering algorithm, with corresponding improvements, to improve the robustness of the model, address missed detections, and increase detection speed. The invention addresses the low speed and low accuracy of real-time pedestrian detection in orchard environments, realizes multi-task training, and ensures both the detection speed and the detection precision of pedestrians in an orchard environment.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (4)

1. An orchard environment pedestrian detection method based on improved YOLOv3 is characterized by comprising the following steps:
step 1: acquiring images of pedestrians in an orchard environment;
images of pedestrians are collected with a depth camera at various positions in the orchard, covering different occlusion conditions, different weather conditions, and pedestrians at different distances, including short, medium, and long range;
step 2: preprocessing the image acquired in the step 1, and constructing a standard pedestrian detection data set;
step 3: processing the pedestrian detection data set from step 2 to make a training set, feeding the training set into a convolutional feature extractor for pedestrian feature extraction, generating anchor boxes with a K-means clustering algorithm to produce predicted pedestrian bounding-box augmentation data, and performing multi-scale fusion prediction with an FPN-like network to improve the accuracy of bounding-box and class prediction;
step 4: a finer feature extraction layer is added to the YOLOv3 network, and a detection output is added on a large-scale feature layer, obtaining the improved network model YOLO-Z;
step 5: inputting the training set into the YOLO-Z network for training across multiple environments, and then saving its weight file;
step 6: an improved Kalman filtering algorithm is introduced to increase the robustness of the model, address missed detections, and increase detection speed.
2. The orchard-environment pedestrian detection method based on improved YOLOv3 according to claim 1, characterized in that the pedestrian bounding-box augmentation data is generated by producing anchor boxes with a K-means clustering algorithm, with the following specific steps:
step 3.1: randomly selecting the width and the height of a coordinate frame as a first clustering center;
step 3.2: the n-th cluster center is selected so that a box with a larger similarity distance from the current n−1 cluster centers has a larger probability of being selected;
step 3.3: looping step 3.2 until all initial cluster centers are determined;
step 3.4: computing IoU (Intersection over Union) between each remaining coordinate box and the cluster centers one by one to obtain the similarity distance (the IoU loss) between the two boxes, and assigning each coordinate box to the class of the cluster center with the smallest similarity distance;
step 3.5: after all the coordinate frames are traversed, calculating the mean values of the width and the height of the coordinate frames in each class to be used as the clustering center of next iteration;
step 3.6: repeating step 3.4 and step 3.5 until the total IoU loss changes by less than the threshold between adjacent iterations or the maximum number of iterations is reached, then stopping the clustering algorithm.
3. The orchard-environment pedestrian detection method based on improved YOLOv3 according to claim 1, characterized in that step 4 is specifically as follows:
step 4.1: first, the training-set images obtained in step 2 are resized to 608 × 608, the IOU (Intersection over Union) threshold is set to 0.45, and the confidence threshold is set to 0.5; each grid cell predicts B bounding boxes, each containing 1 confidence score value, 4 coordinate values, and C class probabilities, where B is the number of anchor boxes of the output feature layer the grid cell belongs to; then, for an S × S output feature layer, the final output dimension is S × S × B × (5 + C);
the formula used for clustering is
d(box, centroid) = 1 − IOU(box, centroid)
where box is a prior box, centroid is a cluster center, and IOU(box, centroid) is the intersection-over-union of the two areas; when d(box, centroid) is less than or equal to the measurement threshold, the width and height of the anchor box are determined;
the bounding-box prediction formulas are
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where c_x and c_y are the horizontal and vertical offsets of the grid cell from the top-left corner of the image; p_w and p_h are the width and height of the anchor box before prediction; t_x and t_y are the predicted center parameters; σ(t_x) and σ(t_y) are the horizontal and vertical distances of the predicted box center from the top-left corner of its grid cell; b_x and b_y are the abscissa and ordinate of the predicted box center; and b_w and b_h are the width and height of the predicted bounding box;
the confidence of a predicted bounding box is given by
Confidence = Pr(object) × IOU(pred, truth)
where Pr(object) is 0 or 1: 0 means there is no object in the image and 1 means there is an object; IOU(pred, truth) is the intersection-over-union between the predicted bounding box and the ground-truth bounding box; the confidence score reflects whether the box contains a target and, if it does, how accurate the predicted location is; the confidence threshold is set to 0.5: a predicted bounding box with confidence below 0.5 is deleted, and one with confidence above 0.5 is retained;
step 4.2: adding a finer feature extraction layer to the YOLOv3 network and adding a detection output on a large-scale feature layer;
the YOLOv3 network adopts a large amount of convolution every time a downsampling is carried out, and according to a receptive field calculation formula, the receptive field is increased along with the increase of the number of layers of the network, the extracted characteristics are formed by fusion of more information, namely the deeper the network is, the more the network focuses on global information, the smaller the proportion of pedestrians in pictures, the detection belongs to small-size objects, in a deep characteristic diagram, the influence of the information of the small-size objects on the characteristic diagram is smaller, and the information loss of the small-size objects is serious; therefore, a more detailed feature extraction layer is added, on the basis of reserving the original output layer of YOLOv3, the output feature map is subjected to upsampling to obtain a size feature map, the size feature map is combined with the shallow size convolution layer, and the prediction output is carried out after a plurality of convolution layers to obtain a model YOLO-Z;
step 4.3: multi-scale fusion prediction for pedestrians is then performed with an FPN-like network; since the YOLOv3 algorithm treats target detection as a regression problem, a mean-square-error loss function is adopted;
the mean-square-error loss function used for prediction is expressed by

loss = λ_coord Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
     + λ_coord Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
     + Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{obj} (C_i − Ĉ_i)²
     + λ_noobj Σ_{i=0..S²} Σ_{j=0..B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
     + Σ_{i=0..S²} 1_i^{obj} Σ_{c∈classes} (P_i(c) − P̂_i(c))²

where S² is the grid size of the network's final feature map; B is the number of prediction boxes per grid cell; x, y, w, h are the center coordinates and the width and height of a box; C_i is the confidence that a prediction box has located a pedestrian, and Ĉ_i is the confidence that a pedestrian is actually present in the box; P_i(c) is the predicted pedestrian class confidence, and P̂_i(c) is the true class confidence; 1_{ij}^{obj} indicates whether the j-th bounding box in grid cell i is responsible for the object, i.e. whether it is the bounding box with the largest IOU with the ground-truth box (ground_truth_box), and 1_{ij}^{noobj} marks the remaining boxes; λ_coord is the weight coefficient for the bounding-box coordinate prediction error; λ_noobj is the weight of the no-object confidence error; 1_i^{obj} indicates whether the center of an object falls in grid cell i, in which case that cell is responsible for predicting the object's class probabilities.
4. The orchard-environment pedestrian detection method based on improved YOLOv3 according to claim 1, characterized in that step 6 is specifically as follows:
the improved Kalman filtering algorithm outputs an optimal recursion algorithm, and the tracking process is mainly divided into two steps: and (4) predicting and updating. After a state space model and an observation equation are established for the system, the filter can obtain a predicted value of the state variable at the current moment according to the noise of the system and the state variable at the previous moment, and the state variable is updated by combining the observation value at the current moment to finally realize the state of prediction estimation;
the state-space model and the observation equation, which are the basis of the Kalman filter's iterative tracking, are as follows:
X_i = A_{i|i−1} X_{i−1} + w_{i−1}
Z_i = H X_i + v_i
where X_i and X_{i−1} are the system states at times i and i−1; A_{i|i−1} is the state-transition matrix, which depends on the system's state variables and the target's motion pattern; Z_i is the observation of the system at time i; H is the observation matrix, relating the system state to the observations; w_{i−1} is the system noise and v_i is the measurement noise, both following normal distributions with covariances Q and R respectively.
CN202010341941.7A 2020-04-27 2020-04-27 Pedestrian detection method based on improved YOLOv3 in orchard environment Active CN111626128B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010341941.7A CN111626128B (en) 2020-04-27 2020-04-27 Pedestrian detection method based on improved YOLOv3 in orchard environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010341941.7A CN111626128B (en) 2020-04-27 2020-04-27 Pedestrian detection method based on improved YOLOv3 in orchard environment

Publications (2)

Publication Number Publication Date
CN111626128A true CN111626128A (en) 2020-09-04
CN111626128B CN111626128B (en) 2023-07-21

Family

ID=72260566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010341941.7A Active CN111626128B (en) 2020-04-27 2020-04-27 Pedestrian detection method based on improved YOLOv3 in orchard environment

Country Status (1)

Country Link
CN (1) CN111626128B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307955A (en) * 2020-10-29 2021-02-02 广西科技大学 Optimization method based on SSD infrared image pedestrian detection
CN112329697A (en) * 2020-11-18 2021-02-05 广西师范大学 Improved YOLOv3-based on-tree fruit identification method
CN112347938A (en) * 2020-11-09 2021-02-09 南京机电职业技术学院 People stream detection method based on improved YOLOv3
CN112381021A (en) * 2020-11-20 2021-02-19 安徽一视科技有限公司 Personnel detection counting method based on deep learning
CN112381043A (en) * 2020-11-27 2021-02-19 华南理工大学 Flag detection method
CN112541483A (en) * 2020-12-25 2021-03-23 三峡大学 Dense face detection method combining YOLO and blocking-fusion strategy
CN112668662A (en) * 2020-12-31 2021-04-16 北京理工大学 Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN112766188A (en) * 2021-01-25 2021-05-07 浙江科技学院 Small-target pedestrian detection method based on improved YOLO algorithm
CN112911171A (en) * 2021-02-04 2021-06-04 上海航天控制技术研究所 Intelligent photoelectric information processing system and method based on accelerated processing
CN113111703A (en) * 2021-03-02 2021-07-13 郑州大学 Airport pavement disease foreign matter detection method based on fusion of multiple convolutional neural networks
CN113139481A (en) * 2021-04-28 2021-07-20 广州大学 Classroom people counting method based on yolov3
CN113378753A (en) * 2021-06-23 2021-09-10 华南农业大学 Improved YOLOv4-based boundary target identification method for rice field in seedling stage
CN113486764A (en) * 2021-06-30 2021-10-08 中南大学 Pothole detection method based on improved YOLOv3
CN113822169A (en) * 2021-08-30 2021-12-21 江苏大学 Orchard tree pedestrian detection method based on improved PP-YOLO

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325418A (en) * 2018-08-23 2019-02-12 华南理工大学 Based on pedestrian recognition method under the road traffic environment for improving YOLOv3
CN109934121A (en) * 2019-02-21 2019-06-25 江苏大学 A kind of orchard pedestrian detection method based on YOLOv3 algorithm
CN110070074A (en) * 2019-05-07 2019-07-30 安徽工业大学 A method of building pedestrian detection model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325418A (en) * 2018-08-23 2019-02-12 华南理工大学 Based on pedestrian recognition method under the road traffic environment for improving YOLOv3
CN109934121A (en) * 2019-02-21 2019-06-25 江苏大学 A kind of orchard pedestrian detection method based on YOLOv3 algorithm
CN110070074A (en) * 2019-05-07 2019-07-30 安徽工业大学 A method of building pedestrian detection model

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307955A (en) * 2020-10-29 2021-02-02 广西科技大学 Optimization method based on SSD infrared image pedestrian detection
CN112347938A (en) * 2020-11-09 2021-02-09 南京机电职业技术学院 People stream detection method based on improved YOLOv3
CN112347938B (en) * 2020-11-09 2023-09-26 南京机电职业技术学院 People stream detection method based on improved YOLOv3
CN112329697A (en) * 2020-11-18 2021-02-05 广西师范大学 Improved YOLOv3-based on-tree fruit identification method
CN112329697B (en) * 2020-11-18 2022-04-12 广西师范大学 Improved YOLOv3-based on-tree fruit identification method
CN112381021A (en) * 2020-11-20 2021-02-19 安徽一视科技有限公司 Personnel detection counting method based on deep learning
CN112381021B (en) * 2020-11-20 2022-07-12 安徽一视科技有限公司 Personnel detection counting method based on deep learning
CN112381043A (en) * 2020-11-27 2021-02-19 华南理工大学 Flag detection method
CN112541483A (en) * 2020-12-25 2021-03-23 三峡大学 Dense face detection method combining YOLO and blocking-fusion strategy
CN112668662A (en) * 2020-12-31 2021-04-16 北京理工大学 Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN112668662B (en) * 2020-12-31 2022-12-06 北京理工大学 Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN112766188A (en) * 2021-01-25 2021-05-07 浙江科技学院 Small-target pedestrian detection method based on improved YOLO algorithm
CN112911171A (en) * 2021-02-04 2021-06-04 上海航天控制技术研究所 Intelligent photoelectric information processing system and method based on accelerated processing
CN113111703A (en) * 2021-03-02 2021-07-13 郑州大学 Airport pavement disease foreign matter detection method based on fusion of multiple convolutional neural networks
CN113139481A (en) * 2021-04-28 2021-07-20 广州大学 Classroom people counting method based on yolov3
CN113139481B (en) * 2021-04-28 2023-09-01 广州大学 Classroom people counting method based on yolov3
CN113378753A (en) * 2021-06-23 2021-09-10 华南农业大学 Improved YOLOv4-based boundary target identification method for rice field in seedling stage
CN113486764B (en) * 2021-06-30 2022-05-03 中南大学 Pothole detection method based on improved YOLOv3
CN113486764A (en) * 2021-06-30 2021-10-08 中南大学 Pothole detection method based on improved YOLOv3
CN113822169A (en) * 2021-08-30 2021-12-21 江苏大学 Orchard tree pedestrian detection method based on improved PP-YOLO
CN113822169B (en) * 2021-08-30 2024-03-19 江苏大学 Orchard tree pedestrian detection method based on improved PP-YOLO

Also Published As

Publication number Publication date
CN111626128B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN111626128A (en) Improved YOLOv3-based pedestrian detection method in orchard environment
CN109559320B (en) Method and system for realizing visual SLAM semantic mapping function based on hole convolution deep neural network
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN109614985B (en) Target detection method based on densely connected feature pyramid network
CN110135243B (en) Pedestrian detection method and system based on two-stage attention mechanism
CN110991311B (en) Target detection method based on dense connection deep network
CN110222769B (en) Improved target detection method based on YOLOV3-tiny
CN111179217A (en) Attention mechanism-based remote sensing image multi-scale target detection method
CN109508675B (en) Pedestrian detection method for complex scene
CN109145836B (en) Ship target video detection method based on deep learning network and Kalman filtering
CN113076871B (en) Fish shoal automatic detection method based on target shielding compensation
CN110991444B (en) License plate recognition method and device for complex scene
CN111753682B (en) Hoisting area dynamic monitoring method based on target detection algorithm
CN109801297B (en) Image panorama segmentation prediction optimization method based on convolution
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
CN110334584B (en) Gesture recognition method based on regional full convolution network
CN109242019A (en) A kind of water surface optics Small object quickly detects and tracking
CN108537825B (en) Target tracking method based on transfer learning regression network
CN113688797A (en) Abnormal behavior identification method and system based on skeleton extraction
CN114565842A (en) Unmanned aerial vehicle real-time target detection method and system based on Nvidia Jetson embedded hardware
CN115565130A (en) Unattended system and monitoring method based on optical flow
CN117058235A (en) Visual positioning method crossing various indoor scenes
Ouyang et al. Aerial target detection based on the improved YOLOv3 algorithm
CN111291785A (en) Target detection method, device, equipment and storage medium
CN113793472B (en) Image type fire detector pose estimation method based on feature depth aggregation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant