CN114612933B - Monocular social distance detection tracking method - Google Patents
- Publication number
- CN114612933B (application CN202210241439.8A)
- Authority
- CN
- China
- Prior art keywords
- pedestrians
- module
- distance
- channel
- ghost
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/80—Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a monocular social distance detection and tracking method comprising the following steps: S1, performing pedestrian detection on a video image with a YOLOv5 model; S2, tracking the pedestrians and matching each pedestrian ID with the DeepSORT algorithm; S3, calibrating the camera by Zhang's calibration method to obtain the camera's intrinsic and distortion parameters; S4, defining a rectangular region of interest in the real scene, generating a bird's-eye view through inverse perspective transformation, and estimating the distance between pedestrians from the bird's-eye view and a road-plane scale coefficient; and S5, if the distance between pedestrians is smaller than a preset threshold, recording the pedestrians' ID information and giving an early warning. The method detects and tracks pedestrians with the YOLOv5 model and DeepSORT and, combined with camera calibration and bird's-eye-view transformation, accurately measures the distance between pedestrians in the video, with high detection and tracking accuracy and good real-time performance.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a monocular social distance detection tracking method.
Background
The World Health Organization reports that coronaviruses spread in two ways: respiratory droplets and physical contact of any form. Droplets are produced by the respiratory system when an infected person coughs or sneezes; a person within 2 meters is likely to inhale them and become infected. Keeping a certain social distance is therefore an effective way to prevent the spread of the virus and, besides wearing a mask, one of the best methods to curb an epidemic. In crowded places such as hospitals, shopping malls, and stations, staff must remind people to keep a safe social distance and wear masks. Here artificial intelligence can play an important role in social distance monitoring. Computer vision, as a subset of artificial intelligence, has solved various complex healthcare problems, including COVID-19 identification from chest CT scans or X-rays, so it is natural to apply computer vision to pedestrian social distance detection.
In a social distance detection system, three key issues are mainly considered:
(1) How to detect pedestrians in video with machine vision while ensuring high accuracy and real-time performance.
(2) How to track the detected pedestrians.
(3) How to estimate the three-dimensional distance between pedestrians.
1. Pedestrian detection method
In computer vision and target detection, pedestrian detection has been a research hotspot in recent years. It uses image processing and machine learning to locate pedestrians in images and accurately predict each pedestrian's position; a reasonably accurate pedestrian detection model is also a precondition for subsequent intelligent image analysis such as tracking, re-identification, and retrieval. With the rapid development of target detection technology, commonly used pedestrian detection algorithms achieve good results in simple scenes, but there is still considerable room for improvement in scenes with large crowds, such as streets and shopping malls.
Pedestrian detection models fall into two broad categories. Two-stage models split bounding-box generation and judgment into two processes, target positioning and target identification: candidate boxes are generated first and then classified; representative models include R-CNN, Fast R-CNN, and Faster R-CNN. One-stage models are fast enough to meet real-time requirements; representative models include SSD and the YOLO series.
2. Pedestrian tracking method
The SORT algorithm is a simple real-time multi-target tracking algorithm based on TBD (tracking-by-detection) proposed by A. Bewley et al. in 2016. It combines Kalman filtering with the Hungarian algorithm and creates a new ID when a new target enters (destroying the old ID when a target leaves), saving a large amount of data space. In 2017 the same team proposed the DeepSORT algorithm, which keeps SORT's Kalman filtering and Hungarian matching framework: the Hungarian algorithm decides whether the target in the current frame is the same as the target in the previous frame, and Kalman filtering tracks the target. DeepSORT adds a pedestrian re-identification network and appearance information to judge whether a detected pedestrian is a repeat, enabling long-term tracking through occlusion; it also uses a CNN to extract and match features, reducing the ID switching seen in SORT and achieving good tracking even in high-frame-rate video.
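The TBD association step described above can be sketched as follows. This is a deliberately simplified illustration that matches detections to tracks greedily by IoU; SORT/DeepSORT instead use Kalman-predicted boxes, the Hungarian algorithm, and (in DeepSORT) appearance embeddings. All names here are illustrative.

```python
# Simplified tracking-by-detection association: greedy IoU matching.
# (The real SORT/DeepSORT use Hungarian assignment on Kalman-predicted
# boxes; this sketch only shows the match/new-ID/lost-ID bookkeeping.)

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, iou_min=0.3):
    """Greedily match track boxes to detection boxes by descending IoU.

    Returns (matches, unmatched_detection_indices). Unmatched detections
    would spawn new IDs; tracks unmatched for too long would be destroyed.
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True)
    used_t, used_d, matches = set(), set(), []
    for score, ti, di in pairs:
        if score < iou_min or ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    unmatched = [di for di in range(len(detections)) if di not in used_d]
    return matches, unmatched
```

For example, with one track near a detection and one far-away new detection, the first pair is matched and the second is reported as unmatched (a candidate new ID).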
3. Three-dimensional space distance estimation method
There are two main approaches to three-dimensional distance estimation: monocular vision and binocular vision. Binocular vision uses two cameras photographing the same object from different angles and builds a spatial model to measure distance; it is generally more accurate but more expensive, and most cameras already installed in public places such as malls, stations, and airports are single fixed cameras. Monocular vision measures three-dimensional distance with one camera; it is less accurate than a binocular solution but cheaper. Because objects in a camera image appear large when near and small when far, a monocular camera alone cannot measure distance; additional information is required, usually obtained by calibrating the camera. Common camera calibration techniques fall into three types:
(1) Traditional calibration: a reference object of known size, usually a black-and-white chessboard, is used to establish the correspondence between the 3D world coordinate system and the 2D image coordinate system via a geometric model and mathematical operations, yielding the intrinsic and extrinsic parameters of the lens. The method is highly accurate and can calibrate any camera, but the camera position cannot change after calibration, otherwise calibration must be repeated. The most developed traditional methods are Zhang Zhengyou's planar calibration method and Tsai's two-step calibration method.
(2) Self-calibration: unlike traditional calibration, self-calibration needs no reference object; the camera is calibrated directly from several images it captures, yielding the intrinsic and extrinsic parameters. The camera may move after calibration, so the method is flexible, but its accuracy is generally low, making it suitable only where precision requirements are modest.
(3) Active-vision-based calibration: proposed by Ma, this method calibrates the camera given known camera motion parameters. Like self-calibration it needs no reference template, but it requires the camera to perform certain translational or rotational motions. It is highly accurate but only applicable where the camera can move, which excludes most scenes.
The current research on the social distance detection tracking system is as follows:
Sultanpure et al. proposed an object recognition model that continuously applies YOLO to video and pictures to help people position themselves in public places, reminding them to keep a proper social distance and wear masks. Shashi Yadav proposed a computer-vision-based method running on a Raspberry Pi 4 that detects social distance and mask wearing by continuously observing individuals and identifying violations; in this framework, modern deep learning algorithms are combined with mathematical strategies and geometric techniques to build a model covering recognition, tracking, and calibration. Agarwal et al. assembled a system that detects pedestrians with the YOLOv3 object recognition model and tracks the detected individuals with bounding boxes and assigned IDs using DeepSORT, then compared YOLOv3 with other well-known models (Faster R-CNN and SSD) on mAP, FPS, and loss. Imran Ahmed et al. presented a system that identifies people in video with YOLOv3, adding an extra layer to the neural network to compute a pedestrian information index; the recognition model uses the identified bounding-box data to distinguish people, and the Euclidean distance between bounding-box centroids measures their separation. Mahdi Rezaei et al. built a model based on computer vision and YOLOv4 that performs automated pedestrian identification indoors and outdoors with common CCTV surveillance cameras, and further combined the deep neural network with an adjusted IPM method and the SORT tracking algorithm to enhance pedestrian detection and social distance inspection.
Sergio Saponara et al. proposed an AI framework for social distance monitoring using thermal images, writing a deep-learning-based detection program built on YOLOv2 to distinguish and track pedestrians outdoors and indoors. Rinkal Keniya et al. focused on identifying whether surrounding people keep social distance: a self-made model named "SocialdistancingNet-19" detects each individual and displays a label, classifying people as a dangerous group if their distance is below a certain value.
Current social distance detection systems are more or less deficient in the accuracy of pedestrian detection and tracking and in the accuracy of distance estimation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a monocular social distance detection and tracking method based on a YOLOv5 model and DeepSORT which, combined with camera calibration and bird's-eye-view transformation, accurately detects the distance between pedestrians in a video, reminds pedestrians who violate the social distance, and offers high accuracy and good real-time performance.
The purpose of the invention is realized by the following technical scheme: the monocular social distance detection and tracking method comprises the following steps:
S1, performing pedestrian detection on a video image with a YOLOv5 model;
S2, tracking the pedestrians and matching each pedestrian ID with the DeepSORT algorithm;
S3, calibrating the camera by Zhang's calibration method to obtain the camera's intrinsic and distortion parameters;
S4, defining a rectangular region of interest in the real scene, measuring the rectangle's side lengths in the real world, generating a bird's-eye view through inverse perspective transformation, and estimating the distance between pedestrians using the bird's-eye view and a road-plane scale coefficient;
and S5, if the distance between pedestrians is smaller than a preset threshold, recording the pedestrian ID information and giving an early warning.
Further, the YOLOv5 model in step S1 has four modules: Input, Backbone, Neck, and Head. The Backbone module comprises, in order, a Focus structure, four Conv structures, and an SPP pyramid structure; a GhostBottleneck module is arranged between every two Conv structures, three GhostBottleneck modules in total; and SE layers are arranged between the second Conv structure and the second GhostBottleneck module, between the third Conv structure and the third GhostBottleneck module, and between the fourth Conv structure and the SPP pyramid structure.
Further, the GhostBottleneck module consists of two Ghost modules: the first Ghost module serves as an expansion layer to increase the number of channels; the second Ghost module reduces the number of channels; and the output of the second Ghost module is added to the input of the first Ghost module to form the output of the GhostBottleneck module.
Further, the SE layer is established on any mapping F_tr: X ∈ R^(H′×W′×C′) → U ∈ R^(H×W×C). Let the convolution kernels be V = [v_1, v_2, …, v_C], where v_c denotes the c-th kernel; the output is U = [u_1, u_2, …, u_C], with u_c expressed as:

u_c = v_c ∗ X = Σ_{s=1}^{C′} v_c^s ∗ x^s

where ∗ denotes convolution, X = [x^1, x^2, …, x^(C′)], u_c ∈ R^(H×W), and v_c^s is the 2-dimensional convolution kernel acting on channel s. The SE layer separates the spatial feature relationship and the channel feature relationship obtained by convolution, so that the model directly learns the channel feature relationship.
for the characteristic relation of the channel, the SE layer executes two operations, namely an Squeeze operation and an Excitation operation; firstly, carrying out Squeeze operation on an input channel feature graph to obtain global features of all channels; and then performing an Excitation operation, learning the dependency relationship among the channels to obtain the weight of each channel, and finally multiplying the weight by the original characteristic diagram to obtain the final characteristic.
Further, in step S4, the road-plane scale coefficients k_x and k_y are expressed as:

k_x = w / w′, k_y = h / h′

where k_x and k_y are the scale coefficients in the X and Y directions respectively, w and h are the real-world lengths of the region of interest's length and width, and w′ and h′ are the pixel lengths of the region of interest's length and width in the bird's-eye view.
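The scale coefficients and the resulting metric distance can be sketched as follows; this is a minimal illustration, and the region-of-interest dimensions used in the example are arbitrary, not values from the invention.

```python
# Road-plane scale coefficients k_x = w / w', k_y = h / h'
# (metres per bird's-eye-view pixel) and the metric distance they imply.
import math

def scale_coefficients(w, h, w_px, h_px):
    """w, h: real-world ROI length/width; w_px, h_px: their pixel lengths."""
    return w / w_px, h / h_px

def pedestrian_distance(p1, p2, k_x, k_y):
    """Metric distance between two bird's-eye-view points (u, v) in pixels."""
    du, dv = p2[0] - p1[0], p2[1] - p1[1]
    return math.hypot(k_x * du, k_y * dv)
```

For example, an 8 m × 4 m region of interest mapped to a 400 × 200 pixel bird's-eye view gives k_x = k_y = 0.02 m/pixel, so two pedestrians 100 pixels apart horizontally are 2 m apart.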
The beneficial effects of the invention are: compared with existing social distance detectors, the method detects and tracks pedestrians with the improved YOLOv5 model and DeepSORT and, combined with camera calibration and bird's-eye-view transformation, accurately detects the distance between pedestrians in the video and reminds pedestrians who violate the social distance. Pedestrian detection and tracking accuracy is high, and real-time performance is good.
Drawings
FIG. 1 is a flow chart of a monocular social distance detection tracking method of the present invention;
FIG. 2 is a structural diagram of a prior YOLOv5 module;
FIG. 3 is a structural diagram of the YOLOv5 module composition of the present invention;
FIG. 4 is a schematic diagram of the Ghost module structure according to the present invention;
FIG. 5 shows the GhostBottleneck structure when the stride is 1 and 2, respectively;
FIG. 6 is a schematic structural diagram of a SE module according to the present invention;
FIG. 7 is a region of interest map of the present invention;
FIG. 8 is a bird's eye view of a region of interest of the present invention.
Detailed Description
The monocular vision of the invention refers to a single camera. The technical scheme of the invention is further explained below with reference to the accompanying drawings.
As shown in FIG. 1, the monocular social distance detection and tracking method of the present invention includes the following steps:
S1, performing pedestrian detection on a video image with a YOLOv5 model.
although the YOLOv5 model adopts a CSP structure which can reduce network parameters, a Focus structure which can reduce information loss, and an SPP pyramid structure suitable for multi-size input, the real-time property is also considered by the social distance detection and tracking system, so the invention takes the CSP structure as an entry point and considers how to improve the model to increase the speed of pedestrian detection. YOLOv5 has four modules, which are Input, back bone, neck and Head respectively; the structure of the Backbone is shown in FIG. 2, and in the Backbone module, there is a structure consisting of 4 conv and 3 BottleneckCSPs alternately. The invention improves the Backbone, and uses the GhostBottleneck module to replace the original 3 Bottleneck CSP modules in the Backbone module, thereby improving the detection speed on the premise of not reducing the detection precision. In addition, the invention adopts a mode of increasing the SE layer, properly increases the calculation cost under proper conditions and improves the learning capability of the network characteristics. The improved model is shown in fig. 3. The Backbone module sequentially comprises a Focus structure, four Conv (convolution) structures and an SPP pyramid structure, wherein a GhostBottleneck module is arranged between every two Conv structures, and the total number of the GhostBottleneck modules is three; SE layers are arranged between the second Conv structure and the second GhostBottleneck module, between the third Conv structure and the third GhostBottleneck module, and between the fourth Conv structure and the SPP pyramid structure, and the outputs of the first two SE layers and the output of the SPP pyramid structure are jointly input into the Head module.
The GhostBottleneck module consists of two Ghost modules. A Ghost module generates more feature maps through cheap operations: starting from one set of feature maps, a series of linear transformations produces, at low cost, additional Ghost feature maps that still fully carry the feature information. The Ghost module is divided into three parts: convolution, Ghost generation, and feature-map splicing. First a feature mapping is obtained by conventional convolution; then a Φ operation is applied to each channel's feature map to generate Ghost feature maps; finally the feature maps from the first step and the Ghost feature maps are concatenated to obtain the final output. FIG. 4 shows the operations performed by the Ghost module when outputting the same number of feature maps: the output of an ordinary convolution layer contains many redundant feature maps; the Φ operation is a cheap operation similar to a 3×3 convolution, and since the primary convolution layer generates only a small number of feature maps, the Φ operation generates a "Ghost" of each of them.
The first Ghost module serves as an expansion layer to increase the number of channels, and the second Ghost module reduces the number of channels; the output of the second Ghost module is added to the input of the first Ghost module to form the output of the GhostBottleneck module, the channel counts being adjusted by the Ghost modules so that the two added paths match. FIG. 5 shows the GhostBottleneck structure when the stride is 1 and 2, respectively. When the stride is 1, BN and ReLU are used in the first Ghost module and only BN in the second. When the stride is 2, a DepthWise convolution with stride 2 is inserted between the two Ghost modules. Finally, for efficiency, the convolution in the Ghost module uses pointwise convolution in practice. Stride means the step length of the sliding window: with stride 1 the output feature map is almost the same size as the original, while stride 2, with the depthwise convolution inserted between the two Ghost modules, reduces the spatial size and the computation; either can be chosen as required.
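The computational saving that motivates replacing BottleneckCSP with GhostBottleneck can be checked with a back-of-the-envelope FLOP count. The layer shape below is an arbitrary example and the cost model is the standard multiply-add count for convolutions; neither comes from the patent.

```python
# Compare a standard convolution with a Ghost module (GhostNet-style):
# m = c_out / s intrinsic maps by ordinary convolution, the remaining
# (s - 1) * m "ghost" maps by cheap d x d per-channel linear operations.

def conv_flops(c_in, c_out, h, w, k):
    """Multiply-adds of a standard k x k convolution, stride 1, same size."""
    return c_out * h * w * c_in * k * k

def ghost_flops(c_in, c_out, h, w, k, s=2, d=3):
    """Ghost module cost under the same output shape."""
    m = c_out // s                        # intrinsic feature maps
    primary = conv_flops(c_in, m, h, w, k)
    cheap = (s - 1) * m * h * w * d * d   # one cheap op per ghost map
    return primary + cheap

std = conv_flops(64, 128, 40, 40, 3)      # 117,964,800 multiply-adds
ghost = ghost_flops(64, 128, 40, 40, 3)   #  59,904,000 multiply-adds
```

With s = 2 the Ghost module needs roughly half the FLOPs of the standard convolution, and the speedup approaches s as the input channel count grows, which is why detection speed improves without shrinking the output feature maps.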
A typical CNN feeds an input feature map into convolution kernels and outputs a new feature map through the kernel operations; the essence of convolution is feature fusion over the spatial (H, W) and channel (C) dimensions. The SE operation separates the spatial feature relationship (the H and W dimensions) from the channel feature relationship (the C dimension) learned by the convolution kernels, so that the model directly learns the channel feature relationship. The basic structure of the SE layer is shown in FIG. 6.
The SE layer can be established on any mapping F_tr: X ∈ R^(H′×W′×C′) → U ∈ R^(H×W×C). If the convolution kernels are V = [v_1, v_2, …, v_C], where v_c denotes the c-th kernel, the output can be represented as U = [u_1, u_2, …, u_C], with u_c expressed as:

u_c = v_c ∗ X = Σ_{s=1}^{C′} v_c^s ∗ x^s

where ∗ denotes convolution, X = [x^1, x^2, …, x^(C′)], u_c ∈ R^(H×W), and v_c^s is the 2-dimensional convolution kernel acting on channel s. The formula shows that each output is the sum of the convolution results over all channels, so the spatial and channel feature relationships learned by the kernels are mixed together; the SE layer separates them so that the model can directly learn the channel feature relationship.
For the characteristic relation of the channel, the SE layer executes two operations, namely an Squeeze operation and an Excitation operation; firstly, carrying out Squeeze operation on an input channel feature graph to obtain global features of all channels; then, performing an Excitation operation, learning the dependency relationship among the channels to obtain the weight of each channel, and finally multiplying the weight by the original feature map to obtain the final feature;
the Squeeze operation compresses the global space characteristics into one channel by using a global average pool to generate the statistical information of the channel; the output U generates the statistic z ∈ R by reducing its spatial dimension H × W C ,R C Represents a C-dimensional space R; the c-th statistic z c Expressed as:
the Excitation operation employs a valve mechanism in the form of sigmoid:
s=F ex (z,W)=σ(g(z,W))=σ(W 2 δ(W 1 z))
δ () is a function of the ReLU,the dimensionality reduction coefficient is r, which is a hyper-parameter; the specification operation adopts a bottleeck structure comprising two FC layers, firstly, the dimension reduction processing is carried out through the first FC layer, then, the ReLU activation is carried out, and finally, the original dimension is converted through the second FC layer;
multiplying the learned sigmoid activation values of all channels by the initial features on U to obtain final features:
x' C =F scale (u c ,s)=su c 。
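The Squeeze, Excitation, and scale steps above can be sketched in NumPy. The FC weights below are random stand-ins for learned parameters, and the (H, W, C) tensor layout is an assumption made for the illustration.

```python
# NumPy sketch of the SE (squeeze-and-excitation) computation:
# global average pool -> bottleneck FC with ReLU -> sigmoid -> channel scale.
import numpy as np

def se_layer(U, W1, W2):
    """U: feature maps (H, W, C); W1: (C/r, C); W2: (C, C/r)."""
    z = U.mean(axis=(0, 1))                 # Squeeze: statistic z in R^C
    relu = lambda x: np.maximum(x, 0.0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    s = sigmoid(W2 @ relu(W1 @ z))          # Excitation: channel weights in (0, 1)
    return U * s                            # scale each channel by its weight

rng = np.random.default_rng(0)
C, r = 16, 4                                # r is the reduction coefficient
U = rng.standard_normal((8, 8, C))
W1 = rng.standard_normal((C // r, C))
W2 = rng.standard_normal((C, C // r))
out = se_layer(U, W1, W2)
```

Because the sigmoid weights lie in (0, 1), each output channel is a damped copy of the input channel, which is exactly the per-channel re-weighting the text describes.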
s2, completing the tracking of the pedestrians and the matching work of each pedestrian ID by using a DeepsORT algorithm; the DeepsORT mainly utilizes Hungarian algorithm to distinguish whether the target in the current frame is the same as the target in the previous frame or not, and utilizes Kalman filtering to track the target. And matching the pedestrian detection model in the step S1, tracking and matching a 12-minute video of people walking on a square, wherein the video is often shielded, overlapped and crowded, and the diversity of costumes and appearances of pedestrians in real world public places is reflected. Compared with the current advanced pedestrian detection models of Faster R-CNN, SSD and YOLOv5 in precision, recall rate, FPS, IDSW and MOTA. The results are shown in Table 1.
TABLE 1

| Model | Precision | Recall | FPS | IDSW | MOTA |
| --- | --- | --- | --- | --- | --- |
| Faster R-CNN | 96.9 | 83.6 | 28 | 381 | 30.9 |
| SSD | 79.1 | 80.0 | 36 | 357 | 30.0 |
| YOLOv5 | 83.6 | 61.1 | 53 | 306 | 30.4 |
| Improved YOLOv5 | 92.6 | 75.3 | 68 | 289 | 30.6 |
S3, calibrating the camera by Zhang's calibration method to obtain the camera's intrinsic and distortion parameters. The calibration principle follows Professor Zhang Zhengyou's calibration method, and the image is then undistorted.
s4, defining a rectangular region of interest in a real scene, and measuring the length of a rectangle in the real world, as shown in FIG. 7; then generating a bird's-eye view through inverse perspective transformation, and estimating the distance between pedestrians by using the bird's-eye view and a road plane scale coefficient as shown in FIG. 8; the mapping from any point in the original image plane to the corresponding point in the aerial view can be realized through perspective change. After obtaining the bird's-eye view of the region of interest, since the bird's-eye view has the characteristics of being uniformly distributed in the horizontal direction and the vertical direction and the proportion of the bird's-eye view in the horizontal direction and the proportion of the bird's-eye view in the vertical direction are different, the proportion coefficient k of the bird's-eye view to the road plane needs to be obtained x And k y The inter-pedestrian distance can be accurately estimated in the bird's eye view; road plane proportionality coefficient k x And k y Expressed as:
wherein k is x And k y Representing the scaling coefficients in the X and Y directions, w and h are the actual lengths of the length and width of the region of interest, respectively, and w ' and h ' are the pixel lengths of the length and width of the region of interest in the bird's eye view image, respectively。
The distance between two pedestrians is then computed from their pixel offsets in the bird's-eye view: the horizontal pixel difference multiplied by k_x and the vertical pixel difference multiplied by k_y give the metric offsets, and their Euclidean combination is the distance between the two persons.
And S5, recording the ID information of the pedestrians and giving an early warning if the distance between the pedestrians is smaller than a preset threshold value.
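Step S5's check can be sketched as follows; the function name, the dictionary layout, and the 2 m default threshold (taken from the droplet-distance figure in the Background section) are illustrative assumptions, not the patent's implementation.

```python
# Record the ID pairs of pedestrians whose metric distance falls below
# the preset threshold, given bird's-eye-view positions and the
# road-plane scale coefficients k_x, k_y.
import math

def find_violations(positions, k_x, k_y, threshold=2.0):
    """positions: dict mapping pedestrian ID -> (u, v) bird's-eye pixels."""
    ids = sorted(positions)
    violations = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            du = positions[b][0] - positions[a][0]
            dv = positions[b][1] - positions[a][1]
            d = math.hypot(k_x * du, k_y * dv)
            if d < threshold:
                violations.append((a, b, round(d, 2)))
    return violations
```

With k_x = k_y = 0.02 m/pixel, pedestrians 1 and 2 at 50 pixels' separation (1 m) are flagged, while pedestrian 3 at a larger separation is not.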
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and the invention is not limited to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations are within the scope of the invention.
Claims (2)
1. The monocular social distance detection and tracking method is characterized by comprising the following steps of:
S1, carrying out pedestrian detection on a video image by using a YOLOv5 model; the YOLOv5 model has four modules, namely Input, Backbone, Neck, and Head; the Backbone module sequentially comprises a Focus structure, four Conv structures, and an SPP pyramid structure; a GhostBottleneck module is arranged between every two Conv structures, three GhostBottleneck modules in total; SE layers are arranged between the second Conv structure and the second GhostBottleneck module, between the third Conv structure and the third GhostBottleneck module, and between the fourth Conv structure and the SPP pyramid structure;
the GhostBottleneck module consists of two Ghost modules, wherein the first Ghost module serves as an expansion layer and is used for increasing the number of channels; the second Ghost module is used for reducing the number of channels; the output of the second Ghost module is added with the input of the first Ghost module to be used as the output of the Ghost Bottleneck module;
the SE layer is built on any mapping F_tr: X ∈ R^(H′×W′×C′) → U ∈ R^(H×W×C); the set of convolution kernels is V = [v_1, v_2, ..., v_C], where v_c denotes the c-th convolution kernel; the output is then U = [u_1, u_2, ..., u_C], where u_c is expressed as:
u_c = v_c * X = Σ_{s=1}^{C′} v_c^s * x^s
where * denotes the convolution operation, X = [x^1, x^2, ..., x^(C′)], u_c ∈ R^(H×W), and v_c^s denotes the 2-dimensional convolution kernel acting on the s-th channel; the SE layer separates the spatial feature relationship and the channel feature relationship obtained by convolution, so that the model directly learns the feature relationship of the channels;
for the channel feature relationship, the SE layer performs two operations, a Squeeze operation and an Excitation operation: first, the Squeeze operation is applied to the input channel feature map to obtain the global features of all channels; then, the Excitation operation learns the dependency relationships among the channels to obtain the weight of each channel; finally, the weights are multiplied by the original feature map to obtain the final features;
the Squeeze operation compresses the global spatial features of each channel into a single value by global average pooling, generating the statistics of the channels; the output U is shrunk through its spatial dimensions H×W to produce the statistic z ∈ R^C, where R^C denotes the C-dimensional real space; the c-th element z_c is expressed as:
z_c = F_sq(u_c) = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)
the Excitation operation adopts a gating mechanism in the form of a sigmoid:
s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))
where δ(·) is the ReLU function, W_1 ∈ R^((C/r)×C) and W_2 ∈ R^(C×(C/r)), r being the dimensionality-reduction ratio, a hyper-parameter; the Excitation operation adopts a bottleneck structure comprising two FC layers: dimensionality reduction is first performed by the first FC layer, followed by ReLU activation, and finally the original dimension is restored by the second FC layer;
multiplying the learned sigmoid activation value s_c of each channel by the initial feature u_c on U yields the final feature:
x′_c = F_scale(u_c, s_c) = s_c · u_c;
s2, completing the tracking of pedestrians and the matching of each pedestrian ID by using the DeepSORT algorithm;
s3, calibrating the camera by Zhang's calibration method to obtain the internal parameters and distortion parameters of the camera;
s4, defining a rectangular region of interest in the real scene, measuring the real-world lengths of the rectangle's sides, generating a bird's-eye view through inverse perspective transformation, and estimating the distance between pedestrians using the bird's-eye view and the road plane scale coefficients;
and S5, recording pedestrian ID information and giving an early warning if the distance between pedestrians is smaller than a preset threshold value.
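The SE layer described in claim 1 (Squeeze by global average pooling, Excitation through a two-FC bottleneck with ReLU and sigmoid, then channel-wise rescaling) can be sketched in NumPy as follows; the weight shapes follow the claim, while the random weights, the function names, and the channel-last layout are illustrative assumptions rather than the patent's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_layer(U, W1, W2):
    """Squeeze-and-Excitation on a feature map U of shape (H, W, C).

    W1: (C/r, C) dimension-reducing FC weights (r is the reduction ratio);
    W2: (C, C/r) dimension-restoring FC weights.
    """
    # Squeeze: global average pool over spatial dims -> statistic z in R^C
    z = U.mean(axis=(0, 1))
    # Excitation: bottleneck of two FC layers with ReLU, gated by a sigmoid
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))
    # Scale: multiply each channel u_c of U by its activation s_c
    return U * s  # broadcasts s over the (H, W) spatial dimensions

rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 16, 4
U = rng.standard_normal((H, W, C))
W1 = rng.standard_normal((C // r, C))
W2 = rng.standard_normal((C, C // r))
out = se_layer(U, W1, W2)
```

Because the sigmoid activations lie in (0, 1), the layer can only attenuate channels, never amplify them, which is how the learned channel weights reweight the original feature map.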
2. The monocular social distance detection and tracking method according to claim 1, wherein in step S4 the road plane scale coefficients k_x and k_y are expressed as:
k_x = w / w′, k_y = h / h′
where k_x and k_y represent the scale factors in the X and Y directions, w and h are the real-world lengths of the length and width of the region of interest, respectively, and w′ and h′ are the pixel lengths of the length and width of the region of interest in the bird's-eye view, respectively.
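The scale coefficients of claim 2 reduce to a ratio of real-world length to pixel length per axis. A minimal sketch, assuming the real-world lengths are given in metres and using illustrative function and parameter names:

```python
def road_plane_scale(w, h, w_px, h_px):
    """Road-plane scale factors mapping bird's-eye-view pixels to real lengths.

    w, h: real-world length and width of the region of interest (e.g. metres).
    w_px, h_px: corresponding pixel lengths in the bird's-eye view.
    """
    kx = w / w_px  # real-world length per pixel along X
    ky = h / h_px  # real-world length per pixel along Y
    return kx, ky

# A 10 m x 5 m region spanning 500 x 250 px gives 0.02 m/px on both axes
kx, ky = road_plane_scale(10.0, 5.0, 500, 250)
```

Multiplying a pixel offset in the bird's-eye view by the matching coefficient then yields the real-world offset used in the distance estimate of step S4.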
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210241439.8A CN114612933B (en) | 2022-03-11 | 2022-03-11 | Monocular social distance detection tracking method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114612933A (en) | 2022-06-10
CN114612933B (en) | 2023-04-07
Family
ID=81863981
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210241439.8A Active CN114612933B (en) | 2022-03-11 | 2022-03-11 | Monocular social distance detection tracking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114612933B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115497303A (en) * | 2022-08-19 | 2022-12-20 | 招商新智科技有限公司 | Expressway vehicle speed detection method and system under complex detection condition |
CN116580066B (en) * | 2023-07-04 | 2023-10-03 | 广州英码信息科技有限公司 | Pedestrian target tracking method under low frame rate scene and readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111476822A (en) * | 2020-04-08 | 2020-07-31 | 浙江大学 | Laser radar target detection and motion tracking method based on scene flow |
CN113283408A (en) * | 2021-07-22 | 2021-08-20 | 中国人民解放军国防科技大学 | Monitoring video-based social distance monitoring method, device, equipment and medium |
WO2022000094A1 (en) * | 2020-07-03 | 2022-01-06 | Invision Ai, Inc. | Video-based tracking systems and methods |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220028535A1 (en) * | 2020-07-27 | 2022-01-27 | VergeSense, Inc. | Method for mitigating disease transmission in a facility |
CN112683228A (en) * | 2020-11-26 | 2021-04-20 | 深兰人工智能(深圳)有限公司 | Monocular camera ranging method and device |
CN113192646B (en) * | 2021-04-25 | 2024-03-22 | 北京易华录信息技术股份有限公司 | Target detection model construction method and device for monitoring distance between different targets |
CN113985897A (en) * | 2021-12-15 | 2022-01-28 | 北京工业大学 | Mobile robot path planning method based on pedestrian trajectory prediction and social constraint |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||