CN115482489A - Improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method and system - Google Patents

Improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method and system

Info

Publication number
CN115482489A
Authority
CN
China
Prior art keywords
layer
pedestrian
pedestrian detection
image
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211141822.2A
Other languages
Chinese (zh)
Inventor
王增煜
陈申宇
陈泽涛
刘秦铭
张攀
黄海波
马灿桂
陈志健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd filed Critical Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202211141822.2A priority Critical patent/CN115482489A/en
Publication of CN115482489A publication Critical patent/CN115482489A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06Electricity, gas or water supply
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G08SIGNALLING
    • G08BSIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B13/00Burglar, theft or intruder alarms
    • G08B13/18Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength
    • G08B13/189Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems
    • G08B13/194Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems
    • G08B13/196Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems using television cameras

Abstract

The invention discloses a pedestrian detection and trajectory tracking method and system for a power distribution room based on improved YOLOv3, wherein the method comprises the following steps: S1, video frame selection and format conversion: a reasonable frame-selection interval is designed, and each captured single-frame picture is converted into a JPG-format picture that the model can process; S2, image preprocessing and pedestrian detection: the format-converted picture is preprocessed and input into a pedestrian detection model to judge whether a pedestrian is detected; if a pedestrian is detected, step S3 is performed; if not, the method ends; S3, image segmentation, a preprocessing operation performed before image recognition; S4, trajectory tracking, recognition and result acquisition: whether the alarm condition is met is judged from the trajectory tracking result, and if so, an alarm is raised. By using the improved YOLOv3 model as the detector of Deep SORT, the invention solves the inaccurate detection and tracking of traditional models and realizes effective monitoring of pedestrians in the power distribution room.

Description

Improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method and system
Technical Field
The invention belongs to the technical field of pedestrian detection, and particularly relates to a power distribution room pedestrian detection and trajectory tracking method and system based on improved YOLOv3.
Background
As the workload of power distribution room operation and construction grows year by year, power enterprises have formulated thorough safety risk management systems; nevertheless, the distribution room has numerous, widely distributed work sites and a complex working environment, and safety risks cannot be comprehensively prevented by management regulations, on-site work supervisors and safety inspectors performing their duties alone. Technical defense is urgently needed to remedy the shortcomings of human defense: with effective technical means, an on-site safety prevention and control system can ensure that key risk points are effectively covered, issue high-risk early warnings to operation and construction workers in real time, and achieve the ultimate goal of reducing safety accidents.
Traditional pedestrian recognition models detect and track inaccurately, making it difficult to realize effective monitoring of pedestrians in the power distribution room.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provides a power distribution room pedestrian detection and trajectory tracking method and system based on improved YOLOv3.
In order to achieve the purpose, the invention adopts the following technical scheme:
a pedestrian detection and trajectory tracking method for a power distribution room based on improved YOLOv3 comprises the following steps:
S1, video frame selection and format conversion: a reasonable frame-selection interval is designed based on comprehensive consideration of the scene, requirements and performance, and each captured single-frame picture is converted into a JPG-format picture that the model can process;
S2, image preprocessing and pedestrian detection: the format-converted picture is preprocessed and input into a pedestrian detection model to judge whether a pedestrian is detected; if a pedestrian is detected, step S3 is performed; if no pedestrian is detected, the method ends;
S3, image segmentation, a preprocessing operation performed before image recognition;
S4, trajectory tracking, recognition and result acquisition: whether the alarm condition is met is judged from the trajectory tracking result, and if so, an alarm is raised.
Further, the image preprocessing specifically includes image enhancement, sharpening, smoothing, denoising, gray-scale adjustment and image cropping.
Further, the pedestrian detection model is specifically an improved YOLOv3 model; the base YOLOv3 model comprises the feature extraction network Darknet-53 and YOLO multi-scale prediction layers;
the YOLOv3 model takes an input image of size 416 × 416 × 3. The feature extraction network downsamples it 5 times and outputs the resulting feature maps to the YOLO multi-scale prediction layers, where a concat mechanism expands the tensor dimensions and connects upsampled maps with shallow feature maps. Feature maps of sizes 13 × 13, 26 × 26 and 52 × 52 are output, each predicted by a corresponding grid; 3 prediction boxes at each grid point are responsible for predicting one area, and as long as the center of an object falls in that area, the object is predicted by that grid point.
Further, the YOLOv3 model takes a picture of size 416 × 416 × 3. The picture is first convolved, changing the channels to 32; one residual convolution changes the shape to 208 × 208 × 64; two more residual convolutions change it to 104 × 104 × 128; eight more change it to 52 × 52 × 256, and this layer is output as the first feature layer;
eight more residual convolutions change the shape to 26 × 26 × 512, and this layer is output as the second feature layer;
four more residual convolutions change the shape to 13 × 13 × 1024, and this layer is output as the third feature layer. The third feature layer is convolved 5 times; one path is upsampled and combined with the second feature layer, while the other path undergoes 3 × 3 and 1 × 1 convolutions to output a result of shape 13 × 13 × B. After combination with the second feature layer, one path is upsampled and combined with the first feature layer, and the other path undergoes 3 × 3 and 1 × 1 convolutions to output a result of shape 26 × 26 × B. After combination with the first feature layer, 3 × 3 and 1 × 1 convolutions give an output of shape 52 × 52 × B, where B is the number of predicted classes + 1 + 4.
Further, the YOLOv3 model scales the input image to 416 × 416 for training, then uniformly divides it into S × S grids and predicts bounding boxes in each grid for target detection; each prediction outputs the position and class of the bounding box of each type of target, and the confidence of each bounding box is computed. If the center point of an object falls on a certain grid cell, that cell is responsible for predicting the object, and three anchor boxes are generated for it;
with the three anchor boxes, each grid cell predicts three bounding boxes through dimension clustering and logistic regression. The cell responsible for an object must predict 5 values: its own position and the probability that it contains an object. The position requires 4 values, namely the center-point coordinates and the width and height of the prediction box, denoted t_x, t_y, t_w and t_h respectively, where the latter two enter the decoded box through an exponential term;
if a grid cell is offset from the top-left corner of the image by (c_x, c_y), and the anchor box has width p_w and height p_h, the corrected bounding box is computed as:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where σ(·) denotes the sigmoid function.
Further, the YOLOv3 model includes losses in three aspects, namely the prediction-box loss, the confidence loss and the class loss. The specific loss function is as follows:
L_loc = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i - x̂_i)² + (y_i - ŷ_i)² + (w_i - ŵ_i)² + (h_i - ĥ_i)² ]
L_conf = Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i - Ĉ_i)² + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i - Ĉ_i)²
L_cls = Σ_{i=0}^{S²} 1_{i}^{obj} Σ_c (p_i(c) - p̂_i(c))²
where L_loc is the prediction-box loss; λ_coord is the weight coefficient; 1_{ij}^{obj} indicates whether the j-th sliding window of cell i contains the detection target; x_i, y_i, w_i and h_i are the predicted center-point coordinates, width and height of the sliding window of cell i, and x̂_i, ŷ_i, ŵ_i and ĥ_i are the corresponding true values; L_conf is the confidence loss, expressing the overlap between the sliding window and the true object region; λ_noobj is the penalty weight for windows containing no object; C_i is the predicted confidence and Ĉ_i the corresponding true value; L_cls is the class loss; p_i(c) is the predicted conditional probability that cell i contains an object of class c, and p̂_i(c) is the corresponding true probability.
Further, the improved YOLOv3 model is specifically an improved YOLOv3 multi-scale detection network formed by replacing Darknet-53 with the novel Wide-Darknet-33 feature extraction network and adding a 104 × 104 detection layer.
Furthermore, Wide-Darknet-33 comprises 13 residual blocks, 32 convolutional layers and 1 fully connected layer; the depth is reduced by removing convolutional layers from Darknet-53 while the network is widened, making feature extraction more accurate in the width dimension;
for multi-scale detection, to reduce the miss rate of small head-and-shoulder targets against complex backgrounds, the two groups of 1 × 1 and 3 × 3 convolutions in front of each of the three YOLO layers of the YOLOv3 model are removed.
Further, when obtaining the prior boxes, clustering is performed with the K-means++ algorithm, specifically:
the initial K cluster centers are selected by the roulette-wheel method; with Q samples in total to be clustered into K classes, the clustering process is:
step one, randomly pick one point in the data set as the first cluster center;
step two, compute the distance D(x) from each point x to its nearest existing center, and sum these distances to obtain Sum(D(x));
step three, normalize each distance as D(x)/Sum(D(x)); take a random value Random from [0, 1], then repeatedly subtract, Random -= D(x)/Sum(D(x)), over the points in turn until Random ≤ 0; the point at which this occurs is the next cluster center;
step four, repeat step two and step three until K cluster centers have been selected;
and step five, run the K-means algorithm with these K initial cluster centers.
The invention also comprises a power distribution room pedestrian detection and trajectory tracking system which adopts the method of the invention and comprises a video frame selection module, an image segmentation and preprocessing module, a pedestrian detection module and a trajectory tracking and recognition module;
the video frame selection module is used for setting a frame-selection interval, based on comprehensive consideration of the scene, requirements and performance, for selecting frames from the video, and for converting each captured single-frame picture into a JPG-format picture that the model can process;
the image segmentation and preprocessing module is used for preprocessing the format-converted picture, and is also used for performing image segmentation processing;
the pedestrian detection module is used for detecting pedestrians in the input preprocessed image based on the pedestrian detection model;
and the trajectory tracking and recognition module is used for performing trajectory tracking and recognition on a pedestrian when the pedestrian detection module detects one, obtaining a result, judging whether the alarm condition is met according to the trajectory tracking result, and raising an alarm if it is.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Using the improved YOLOv3 algorithm model as the detector of Deep SORT solves the inaccurate detection and tracking of traditional models and realizes effective monitoring of pedestrians in the power distribution room. The system offers real-time detection, accurate alarming, automatic pushing of alarm information and other advantages: through video monitoring it achieves 24 × 7 all-weather, omission-free real-time detection in the power distribution room area, and intrusion alarm information for the surrounding area is automatically and intelligently pushed to the attendant on duty.
Drawings
FIG. 1 is a general flow diagram of the process of the present invention;
FIG. 2 is a diagram of the structure of YOLOv3;
FIG. 3a is a schematic diagram of the bounding boxes predicted by YOLOv3 in a single grid cell;
FIG. 3b is a schematic diagram of the bounding boxes predicted by YOLOv3 in a single grid cell;
FIG. 4 is a diagram illustrating the relative position of the modified bounding box;
FIG. 5 is a block diagram of YOLOv3 and improved YOLOv3;
FIG. 6 is a flow chart of the K-means clustering algorithm.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
As shown in fig. 1, the method for detecting pedestrians and tracking tracks in a power distribution room based on improved YOLOv3 of the present invention includes the following steps:
S1, video frame selection and format conversion: a reasonable frame-selection interval is designed based on comprehensive consideration of the scene, requirements and performance, and each captured single-frame picture is converted into a JPG-format picture that the model can process (a minimal sketch of this step is given after these steps);
S2, image preprocessing and pedestrian detection: the format-converted picture is preprocessed and input into the pedestrian detection model to judge whether a pedestrian is detected; the image preprocessing specifically includes image enhancement, sharpening, smoothing, denoising, gray-scale adjustment, image cropping and other processing;
S3, preprocessing of the picture before recognition, according to the pedestrian detection result;
S4, trajectory tracking, recognition and result acquisition: whether the alarm condition is met is judged from the trajectory tracking result, and if so, an alarm is raised.
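As an illustration of step S1, the following is a minimal sketch of frame selection and JPG conversion using OpenCV; the frame interval and output directory are assumptions chosen for the example, not values fixed by the method.

```python
import os
import cv2  # OpenCV: video decoding and JPG encoding

def select_frames(video_path, out_dir, frame_interval=25):
    """Grab every frame_interval-th frame and save it as a JPG picture."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if idx % frame_interval == 0:
            # imwrite infers JPG encoding from the .jpg extension
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```

At 25 frames per second, frame_interval=25 keeps roughly one picture per second; in practice the interval would be tuned to the scene, requirements and performance as described above.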
In this embodiment, the pedestrian detection model is specifically an improved YOLOv3 model; the base YOLOv3 model includes the feature extraction network Darknet-53 and YOLO multi-scale prediction layers;
the YOLOv3 model takes an input image of size 416 × 416 × 3. The feature extraction network downsamples it 5 times and outputs the resulting feature maps to the YOLO multi-scale prediction layers, where a concat mechanism expands the tensor dimensions and connects upsampled maps with shallow feature maps. Feature maps of sizes 13 × 13, 26 × 26 and 52 × 52 are output, each predicted by a corresponding grid; 3 prediction boxes at each grid point are responsible for predicting one area, and as long as the center of an object falls in that area, the object is predicted by that grid point (a small sketch of this rule follows). This multi-scale approach allows small objects to be detected better. The YOLOv3 model network structure is shown in FIG. 2.
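The grid-responsibility rule can be written down directly. The helper below is an illustrative assumption (not part of the patented method) mapping an object center in pixels to the grid cell that predicts it at one of the three output scales.

```python
def responsible_cell(center_x, center_y, img_size=416, grid_size=13):
    """Return (row, col) of the grid cell containing the object center,
    for one of the 13 x 13, 26 x 26 or 52 x 52 output scales."""
    stride = img_size / grid_size  # 32, 16 or 8 pixels per cell
    return int(center_y // stride), int(center_x // stride)

# An object centered at pixel (100, 200) on the 13 x 13 scale:
print(responsible_cell(100, 200))  # -> (6, 3)
```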
The YOLOv3 model takes a picture of size 416 × 416 × 3. The picture is first convolved, changing the channels to 32; one residual convolution changes the shape to 208 × 208 × 64; two more residual convolutions change it to 104 × 104 × 128; eight more change it to 52 × 52 × 256, and this layer is output as the first feature layer;
eight more residual convolutions change the shape to 26 × 26 × 512, and this layer is output as the second feature layer;
four more residual convolutions change the shape to 13 × 13 × 1024, and this layer is output as the third feature layer. The third feature layer is convolved 5 times; one path is upsampled and combined with the second feature layer, while the other path undergoes 3 × 3 and 1 × 1 convolutions to output a result of shape 13 × 13 × B. After combination with the second feature layer, one path is upsampled and combined with the first feature layer, and the other path undergoes 3 × 3 and 1 × 1 convolutions to output a result of shape 26 × 26 × B. After combination with the first feature layer, 3 × 3 and 1 × 1 convolutions give an output of shape 52 × 52 × B, where B is the number of predicted classes + 1 + 4. A compact sketch reproducing this shape progression is given below.
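The shape progression above can be checked with a compact backbone sketch. This is a minimal PyTorch reconstruction of the standard Darknet-53 stages, assuming the usual Conv-BN-LeakyReLU unit; it reproduces the three feature-layer shapes and is not the patent's exact implementation.

```python
import torch
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, k, stride=1):
    # Basic Darknet unit: convolution + batch norm + leaky ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

class Residual(nn.Module):
    # 1x1 bottleneck then 3x3 convolution, with a skip connection
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn_leaky(ch, ch // 2, 1),
            conv_bn_leaky(ch // 2, ch, 3),
        )

    def forward(self, x):
        return x + self.body(x)

def stage(in_ch, out_ch, n_blocks):
    # Stride-2 downsampling convolution followed by n residual blocks
    return nn.Sequential(conv_bn_leaky(in_ch, out_ch, 3, stride=2),
                         *[Residual(out_ch) for _ in range(n_blocks)])

class Darknet53Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = conv_bn_leaky(3, 32, 3)  # 416 x 416 x 32
        self.s1 = stage(32, 64, 1)           # 208 x 208 x 64
        self.s2 = stage(64, 128, 2)          # 104 x 104 x 128
        self.s3 = stage(128, 256, 8)         # 52 x 52 x 256  -> first feature layer
        self.s4 = stage(256, 512, 8)         # 26 x 26 x 512  -> second feature layer
        self.s5 = stage(512, 1024, 4)        # 13 x 13 x 1024 -> third feature layer

    def forward(self, x):
        x = self.s2(self.s1(self.stem(x)))
        f1 = self.s3(x)
        f2 = self.s4(f1)
        f3 = self.s5(f2)
        return f1, f2, f3

for f in Darknet53Backbone()(torch.zeros(1, 3, 416, 416)):
    print(tuple(f.shape))  # (1, 256, 52, 52), (1, 512, 26, 26), (1, 1024, 13, 13)
```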
The YOLOv3 model scales the input image to 416 × 416 for training, then uniformly divides it into S × S grids and predicts bounding boxes in each grid for target detection; each prediction outputs the position and class of the bounding box of each type of target, and the confidence of each bounding box is computed. If the center point of an object falls on a certain grid cell, that cell is responsible for predicting the object, and three anchor boxes are generated for it, as shown in FIGS. 3a and 3b.
With the three anchor boxes, each grid cell predicts three bounding boxes through dimension clustering and logistic regression. The cell responsible for an object must predict 5 values: its own position and the probability that it contains an object. The position requires 4 values, namely the center-point coordinates and the width and height of the prediction box, denoted t_x, t_y, t_w and t_h respectively, where the latter two enter the decoded box through an exponential term.
If a grid cell is offset from the top-left corner of the image by (c_x, c_y), and the anchor box has width p_w and height p_h, the corrected bounding box is computed as:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where σ(·) denotes the sigmoid function.
FIG. 4 is a schematic diagram showing the relative position of the corrected bounding box.
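The correction formulas transcribe directly into code. The following NumPy sketch is a hedged illustration; the function names and example values are assumptions.

```python
import numpy as np

def sigmoid(t):
    # Keeps the predicted center offset inside its grid cell
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply the corrections: sigmoid offset from the cell corner for the
    center, exponential scaling of the anchor for width and height."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# Raw outputs (0.2, -0.1, 0.3, 0.5) in the cell at offset (6, 3), with an
# anchor 3.6 wide and 5.2 high (all in grid units):
print(decode_box(0.2, -0.1, 0.3, 0.5, 6, 3, 3.6, 5.2))
```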
The YOLOv3 model includes losses in three aspects, namely the prediction-box loss, the confidence loss and the class loss. The specific loss function is as follows:
L_loc = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i - x̂_i)² + (y_i - ŷ_i)² + (w_i - ŵ_i)² + (h_i - ĥ_i)² ]
L_conf = Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i - Ĉ_i)² + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i - Ĉ_i)²
L_cls = Σ_{i=0}^{S²} 1_{i}^{obj} Σ_c (p_i(c) - p̂_i(c))²
where L_loc is the prediction-box loss; λ_coord is the weight coefficient; 1_{ij}^{obj} indicates whether the j-th sliding window of cell i contains the detection target; x_i, y_i, w_i and h_i are the predicted center-point coordinates, width and height of the sliding window of cell i, and x̂_i, ŷ_i, ŵ_i and ĥ_i are the corresponding true values; L_conf is the confidence loss, expressing the overlap between the sliding window and the true object region; λ_noobj is the penalty weight for windows containing no object; C_i is the predicted confidence and Ĉ_i the corresponding true value; L_cls is the class loss; p_i(c) is the predicted conditional probability that cell i contains an object of class c, and p̂_i(c) is the corresponding true probability.
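Read as code, the three terms sum into one scalar. The following PyTorch sketch assumes a (S, S, B, 5 + C) tensor layout and the conventional weights λ_coord = 5 and λ_noobj = 0.5; neither the layout nor the weights are specified here, so both are assumptions.

```python
import torch

def yolo_loss(pred, truth, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum of box, confidence and class losses over an S x S grid with B
    windows per cell. pred and truth have shape (S, S, B, 5 + C) holding
    (x, y, w, h, confidence, class probabilities); obj_mask is a boolean
    (S, S, B) tensor, True where a window is responsible for an object."""
    noobj_mask = ~obj_mask
    # L_loc: squared error on box position and size, responsible windows only
    box_err = (pred[..., :4] - truth[..., :4]) ** 2
    l_loc = lambda_coord * box_err[obj_mask].sum()
    # L_conf: confidence error, with down-weighted no-object windows
    conf_err = (pred[..., 4] - truth[..., 4]) ** 2
    l_conf = conf_err[obj_mask].sum() + lambda_noobj * conf_err[noobj_mask].sum()
    # L_cls: squared error on class probabilities for object windows
    cls_err = (pred[..., 5:] - truth[..., 5:]) ** 2
    l_cls = cls_err[obj_mask].sum()
    return l_loc + l_conf + l_cls
```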
As shown in FIG. 5, the improved YOLOv3 model is specifically an improved YOLOv3 multi-scale detection network formed by replacing Darknet-53 with the novel Wide-Darknet-33 feature extraction network and adding a 104 × 104 detection layer.
Wide-Darknet-33 comprises 13 residual blocks, 32 convolutional layers and 1 fully connected layer; the depth is reduced by removing convolutional layers from Darknet-53 while the network is widened, making feature extraction more accurate in the width dimension;
for multi-scale detection, to reduce the miss rate of small head-and-shoulder targets against complex backgrounds, the two groups of 1 × 1 and 3 × 3 convolutions in front of each of the three YOLO layers of the YOLOv3 model are removed.
This embodiment addresses the problems that, when YOLOv3 obtains the prior boxes, the K-means algorithm depends heavily on the initial values, so the clustering is inaccurate, the obtained anchor boxes match the data characteristics poorly, and detection precision is low; therefore, when obtaining the prior boxes, the K-means++ algorithm is used for clustering. FIG. 6 is a flow chart of the K-means clustering algorithm.
Clustering with the K-means++ algorithm specifically comprises the following steps:
the initial K cluster centers are selected by the roulette-wheel method; with Q samples in total to be clustered into K classes, the clustering process is as follows (a code sketch is given after these steps):
step one, randomly pick one point in the data set as the first cluster center;
step two, compute the distance D(x) from each point x to its nearest existing center, and sum these distances to obtain Sum(D(x));
step three, normalize each distance as D(x)/Sum(D(x)); take a random value Random from [0, 1], then repeatedly subtract, Random -= D(x)/Sum(D(x)), over the points in turn until Random ≤ 0; the point at which this occurs is the next cluster center;
step four, repeat step two and step three until K cluster centers have been selected;
and step five, run the K-means algorithm with these K initial cluster centers.
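A minimal sketch of the roulette-wheel initialization described above. The distance function is left generic as an assumption; for anchor-box clustering it is typically 1 - IoU between a box and a center.

```python
import random

def kmeans_pp_init(points, k, dist):
    """Pick k initial cluster centers by roulette-wheel selection."""
    centers = [random.choice(points)]                 # step one
    while len(centers) < k:                           # step four
        # step two: D(x) = distance to the nearest existing center
        d = [min(dist(p, c) for c in centers) for p in points]
        total = sum(d)                                # Sum(D(x))
        r = random.uniform(0, 1)                      # step three
        for p, dx in zip(points, d):
            r -= dx / total                           # Random -= D(x)/Sum(D(x))
            if r <= 0:
                centers.append(p)
                break
        else:
            centers.append(points[-1])                # floating-point guard
    return centers

# Plain 1-D example; existing centers have D(x) = 0 and are never re-picked:
pts = [1.0, 2.0, 10.0, 11.0, 30.0]
print(kmeans_pp_init(pts, 3, dist=lambda a, b: abs(a - b)))
```

The selected centers then seed the standard K-means iterations (step five).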
In another embodiment, a pedestrian detection and trajectory tracking system for a power distribution room is further provided, wherein the system adopts the method of the above embodiment and comprises a video frame selection module, an image segmentation and preprocessing module, a pedestrian detection module and a trajectory tracking identification module;
the video frame selection module is used for setting a certain frame selection interval to select frames of the video according to comprehensive consideration of scenes, requirements and performance, and converting the intercepted single-frame picture into a JPG format picture which can be processed by a model;
the image segmentation and preprocessing module is used for preprocessing the format-converted picture, and is also used for performing image segmentation processing;
the pedestrian detection module is used for detecting pedestrians on the input preprocessed image based on the pedestrian detection model;
and the trajectory tracking and recognition module is used for performing trajectory tracking and recognition on a pedestrian when the pedestrian detection module detects one, obtaining a result, judging whether the alarm condition is met according to the trajectory tracking result, and raising an alarm if it is. A minimal wiring sketch of the four modules is given below.
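The following sketch shows one possible wiring of the four modules over a video stream; every name here (detector, tracker, alarm and their methods) is an illustrative assumption rather than the patent's implementation.

```python
import cv2

def run_pipeline(video_path, detector, tracker, alarm, frame_interval=25):
    """Frame selection -> preprocessing -> detection -> tracking -> alarm."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_interval == 0:            # video frame-selection module
            img = detector.preprocess(frame)     # segmentation & preprocessing module
            boxes = detector.detect(img)         # improved-YOLOv3 pedestrian detection
            if boxes:                            # track only when pedestrians are found
                tracks = tracker.update(boxes)   # Deep SORT trajectory tracking
                if alarm.condition_met(tracks):  # alarm-condition judgment
                    alarm.push(tracks)           # push alarm information to the attendant
        idx += 1
    cap.release()
```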
It should also be noted that in the present specification, terms such as "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A power distribution room pedestrian detection and trajectory tracking method based on improved YOLOv3 is characterized by comprising the following steps:
s1, selecting frames of a video, carrying out format conversion, designing a reasonable video frame selection interval according to comprehensive consideration of scenes, requirements and performance, and converting an intercepted single-frame picture into a JPG format picture which can be processed by a model;
s2, image preprocessing and pedestrian detection, wherein the image preprocessing is carried out on the picture after format conversion, and the picture is input into a pedestrian detection model to judge whether a pedestrian is detected; if a pedestrian is detected, performing step S3, and if no pedestrian is detected, ending the method;
s3, image segmentation, namely preprocessing operation before image identification;
and S4, tracking and identifying the track and obtaining a result, judging whether the alarm condition is met or not according to the track tracking result, and if so, alarming.
2. The improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method according to claim 1, wherein the image preprocessing specifically comprises image enhancement, sharpening, smoothing, denoising, gray-scale adjustment and image cropping.
3. The improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method according to claim 1, wherein the pedestrian detection model is an improved YOLOv3 model, the base YOLOv3 model comprising the feature extraction network Darknet-53 and YOLO multi-scale prediction layers;
the YOLOv3 model takes an input image of size 416 × 416 × 3. The feature extraction network downsamples it 5 times and outputs the resulting feature maps to the YOLO multi-scale prediction layers, where a concat mechanism expands the tensor dimensions and connects upsampled maps with shallow feature maps. Feature maps of sizes 13 × 13, 26 × 26 and 52 × 52 are output, each predicted by a corresponding grid; 3 prediction boxes at each grid point are responsible for predicting one area, and as long as the center of an object falls in that area, the object is predicted by that grid point.
4. The improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method according to claim 3, characterized in that the YOLOv3 model takes a picture of size 416 × 416 × 3; the picture is first convolved, changing the channels to 32; one residual convolution changes the shape to 208 × 208 × 64; two more residual convolutions change it to 104 × 104 × 128; eight more change it to 52 × 52 × 256, and this layer is output as the first feature layer;
eight more residual convolutions change the shape to 26 × 26 × 512, and this layer is output as the second feature layer;
four more residual convolutions change the shape to 13 × 13 × 1024, and this layer is output as the third feature layer. The third feature layer is convolved 5 times; one path is upsampled and combined with the second feature layer, while the other path undergoes 3 × 3 and 1 × 1 convolutions to output a result of shape 13 × 13 × B. After combination with the second feature layer, one path is upsampled and combined with the first feature layer, and the other path undergoes 3 × 3 and 1 × 1 convolutions to output a result of shape 26 × 26 × B. After combination with the first feature layer, 3 × 3 and 1 × 1 convolutions give an output of shape 52 × 52 × B, where B is the number of predicted classes + 1 + 4.
5. The improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method according to claim 1, wherein the YOLOv3 model scales the input image to 416 × 416 for training, then uniformly divides it into S × S grids and predicts bounding boxes in each grid for target detection; each prediction outputs the position and class of the bounding box of each type of target, and the confidence of each bounding box is computed; if the center point of an object falls on a certain grid cell, that cell is responsible for predicting the object, and three anchor boxes are generated for it;
with the three anchor boxes, each grid cell predicts three bounding boxes through dimension clustering and logistic regression; the cell responsible for an object must predict 5 values, namely its own position and the probability that it contains an object; the position requires 4 values, namely the center-point coordinates and the width and height of the prediction box, denoted t_x, t_y, t_w and t_h respectively,
where the latter two enter the decoded box through an exponential term;
if a grid cell is offset from the top-left corner of the image by (c_x, c_y), and the anchor box has width p_w and height p_h, the corrected bounding box is computed as:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where σ(·) denotes the sigmoid function.
6. The improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method according to claim 5, wherein the YOLOv3 model includes losses in three aspects, namely the prediction-box loss, the confidence loss and the class loss, and the specific loss function is as follows:
L_loc = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i - x̂_i)² + (y_i - ŷ_i)² + (w_i - ŵ_i)² + (h_i - ĥ_i)² ]
L_conf = Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i - Ĉ_i)² + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i - Ĉ_i)²
L_cls = Σ_{i=0}^{S²} 1_{i}^{obj} Σ_c (p_i(c) - p̂_i(c))²
where L_loc is the prediction-box loss; λ_coord is the weight coefficient; 1_{ij}^{obj} indicates whether the j-th sliding window of cell i contains the detection target; x_i, y_i, w_i and h_i are the predicted center-point coordinates, width and height of the sliding window of cell i, and x̂_i, ŷ_i, ŵ_i and ĥ_i are the corresponding true values; L_conf is the confidence loss, expressing the overlap between the sliding window and the true object region; λ_noobj is the penalty weight for windows containing no object; C_i is the predicted confidence and Ĉ_i the corresponding true value; L_cls is the class loss; p_i(c) is the predicted conditional probability that cell i contains an object of class c, and p̂_i(c) is the corresponding true probability.
7. The improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method according to claim 3, wherein the improved YOLOv3 model is specifically an improved YOLOv3 multi-scale detection network formed by replacing Darknet-53 with the novel Wide-Darknet-33 feature extraction network and adding a 104 × 104 detection layer.
8. The improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method according to claim 7, wherein Wide-Darknet-33 comprises 13 residual blocks, 32 convolutional layers and 1 fully connected layer; the depth is reduced by removing convolutional layers from Darknet-53 while the network is widened, making feature extraction more accurate in the width dimension;
for multi-scale detection, to reduce the miss rate of small head-and-shoulder targets against complex backgrounds, the two groups of 1 × 1 and 3 × 3 convolutions in front of each of the three YOLO layers of the YOLOv3 model are removed.
9. The improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method according to claim 3, wherein, when the prior boxes are obtained, clustering is performed with the K-means++ algorithm, specifically:
the initial K cluster centers are selected by the roulette-wheel method; with Q samples in total to be clustered into K classes, the clustering process is:
step one, randomly pick one point in the data set as the first cluster center;
step two, compute the distance D(x) from each point x to its nearest existing center, and sum these distances to obtain Sum(D(x));
step three, normalize each distance as D(x)/Sum(D(x)); take a random value Random from [0, 1], then repeatedly subtract, Random -= D(x)/Sum(D(x)), over the points in turn until Random ≤ 0; the point at which this occurs is the next cluster center;
step four, repeat step two and step three until K cluster centers have been selected;
and step five, run the K-means algorithm with these K initial cluster centers.
10. A power distribution room pedestrian detection and trajectory tracking system, wherein the system adopts the method of any one of claims 1 to 9 and comprises a video frame selection module, an image segmentation and preprocessing module, a pedestrian detection module and a trajectory tracking and recognition module;
the video frame selection module is used for setting a frame-selection interval, based on comprehensive consideration of the scene, requirements and performance, for selecting frames from the video, and for converting each captured single-frame picture into a JPG-format picture that the model can process;
the image segmentation and preprocessing module is used for preprocessing the format-converted picture, and is also used for performing image segmentation processing;
the pedestrian detection module is used for detecting pedestrians in the input preprocessed image based on the pedestrian detection model;
and the trajectory tracking and recognition module is used for performing trajectory tracking and recognition on a pedestrian when the pedestrian detection module detects one, obtaining a result, judging whether the alarm condition is met according to the trajectory tracking result, and raising an alarm if it is.
CN202211141822.2A 2022-09-20 2022-09-20 Improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method and system Pending CN115482489A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211141822.2A CN115482489A (en) 2022-09-20 2022-09-20 Improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211141822.2A CN115482489A (en) 2022-09-20 2022-09-20 Improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method and system

Publications (1)

Publication Number Publication Date
CN115482489A true CN115482489A (en) 2022-12-16

Family

ID=84423268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211141822.2A Pending CN115482489A (en) 2022-09-20 2022-09-20 Improved YOLOv3-based power distribution room pedestrian detection and trajectory tracking method and system

Country Status (1)

Country Link
CN (1) CN115482489A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486290A (en) * 2023-06-21 2023-07-25 成都庆龙航空科技有限公司 Unmanned aerial vehicle monitoring and tracking method and device, electronic equipment and storage medium
CN116486290B (en) * 2023-06-21 2023-09-05 成都庆龙航空科技有限公司 Unmanned aerial vehicle monitoring and tracking method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination