CN114612933B - Monocular social distance detection tracking method - Google Patents

Monocular social distance detection tracking method

Info

Publication number
CN114612933B
CN114612933B (Application CN202210241439.8A)
Authority
CN
China
Prior art keywords
pedestrians
module
distance
channel
ghost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210241439.8A
Other languages
Chinese (zh)
Other versions
CN114612933A (en)
Inventor
匡平 (Kuang Ping)
冯旭东 (Feng Xudong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202210241439.8A priority Critical patent/CN114612933B/en
Publication of CN114612933A publication Critical patent/CN114612933A/en
Application granted granted Critical
Publication of CN114612933B publication Critical patent/CN114612933B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a monocular social distance detection and tracking method, which comprises the following steps: S1, performing pedestrian detection on a video image with a YOLOv5 model; S2, tracking the pedestrians and matching each pedestrian ID using the DeepSORT algorithm; S3, calibrating the camera by Zhang's calibration method to obtain the camera's intrinsic parameters and distortion parameters; S4, defining a rectangular region of interest in the real scene, generating a bird's-eye view by inverse perspective transformation, and estimating the distance between pedestrians using the bird's-eye view and the road-plane scale coefficients; and S5, if the distance between pedestrians is smaller than a preset threshold, recording the pedestrian ID information and giving an early warning. Pedestrian detection and tracking are based on the YOLOv5 model and DeepSORT, and, combined with camera calibration and bird's-eye view transformation, the distance between pedestrians in the video can be accurately estimated. Pedestrian detection and tracking accuracy is high, and the real-time performance is good.

Description

Monocular social distance detection tracking method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a monocular social distance detection tracking method.
Background
The World Health Organization reports two modes of coronavirus transmission: respiratory droplets and physical contact. These droplets are produced by the respiratory system when an infected person coughs or sneezes; a person within 2 meters is likely to inhale them and become infected. Keeping a certain social distance is an effective way to prevent the spread of the virus and, apart from wearing a mask, one of the best ways to prevent the spread of an epidemic. In places with dense pedestrian traffic, such as hospitals, shopping malls and stations, staff need to remind people to keep a safe social distance and wear masks. In this situation, artificial intelligence can play an important role in supporting social distance monitoring. As a subset of artificial intelligence, computer vision has been very successful in solving various complex healthcare problems and has already been applied to COVID-19 identification based on chest CT scans or X-rays, so it is natural to apply computer vision to pedestrian social distance detection.
In a social distance detection system, three key issues are mainly considered:
(1) How to use machine vision to detect pedestrians in video while ensuring high accuracy and real-time performance.
(2) How to track the detected pedestrian.
(3) How to perform three-dimensional distance estimation on pedestrians.
1. Pedestrian detection method
In the field of computer vision and object detection, pedestrian detection has been a research hotspot in recent years. Pedestrian detection mainly uses image processing and machine learning methods to locate pedestrians in images and accurately predict the position of each pedestrian; an accurate pedestrian detection model is also a prerequisite for subsequent intelligent image analysis such as tracking, re-identification and retrieval. With the rapid development of object detection technology, commonly used pedestrian detection algorithms already achieve good results in simple scenes, but in scenes with large crowds, such as real-world streets and shopping malls, pedestrian detection performance still has considerable room for improvement.
Pedestrian detection models can be divided into two broad categories. One is the two-stage model, which mainly comprises target localization and target recognition: bounding-box generation and judgment are split into two stages, candidate boxes are generated first and then classified; representative models include R-CNN, Fast R-CNN and Faster R-CNN. The other is the one-stage model, which is fast enough to meet real-time requirements; representative models include SSD and the YOLO series.
2. Pedestrian tracking method
The SORT algorithm is a simple real-time multi-object tracking algorithm based on the TBD (Tracking-by-Detection) paradigm, proposed by A. Bewley et al. in 2016. It combines Kalman filtering with the Hungarian algorithm and can create a new ID when a new target enters (and destroy the old ID when a target leaves), saving a large amount of data space. In 2017 the same team proposed the DeepSORT algorithm, which keeps SORT's Kalman filtering and Hungarian matching framework: the Hungarian algorithm decides whether the target in the current frame is the same as the target in the previous frame, and Kalman filtering tracks the target. DeepSORT adds a pedestrian re-identification network and appearance information to judge whether a detected pedestrian is a repeat, enabling long-term tracking of occluded targets; it also uses a CNN to extract and match features, which reduces the ID switches seen in SORT and achieves good tracking performance in high-speed video.
3. Three-dimensional space distance estimation method
There are two main solutions to three-dimensional spatial distance estimation: monocular vision and binocular vision. Binocular vision measures distance by building a spatial model from images of the same object captured by two cameras from different angles; this is generally more accurate but also more expensive, and most public places such as shopping malls, stations and airports currently use fixed single cameras. Monocular vision, i.e. measuring three-dimensional distance with one camera, is less accurate than a binocular solution but cheaper. Because objects in a camera image appear large when near and small when far, distance cannot be measured with a monocular camera alone; additional conditions are required, the usual one being to calibrate the monocular camera, a process known as camera calibration. The commonly used camera calibration techniques fall into three categories:
(1) Traditional calibration methods: a reference object of known size, usually a black-and-white chessboard, is used to establish the correspondence between the 3D world coordinate system defined by the reference object and the 2D image coordinate system through a geometric model and mathematical operations, yielding the intrinsic and extrinsic parameters of the lens. This approach is highly accurate and can calibrate any camera, but the camera position must not change after calibration, otherwise calibration has to be repeated. The most developed traditional methods are Zhang Zhengyou's planar calibration method and the two-step calibration method proposed by Tsai.
(2) Self-calibration methods: unlike traditional calibration, self-calibration needs no reference object; the camera is calibrated directly from several images it captures to obtain the intrinsic and extrinsic parameters. A camera calibrated this way may change position, so the method is flexible, but its accuracy is generally low and it is suitable only for applications with modest precision requirements.
(3) Active-vision-based calibration: proposed by Ma, this method calibrates the camera given knowledge of its motion parameters. Like self-calibration it requires no reference template, but it does require the camera to perform certain translational or rotational motions. The method is accurate, but it only suits scenes where the camera can move and is therefore unsuitable for most scenarios.
The current research on the social distance detection tracking system is as follows:
Sultanpure et al. proposed an object recognition model that continuously applies YOLO to video and pictures to locate people in public places and remind them to keep a proper social distance and wear masks. Shashi Yadav proposed a computer-vision-based method that runs a model on a Raspberry Pi 4 to detect social distance and mask wearing, continuously observing individuals and identifying violations; in this framework, modern deep learning algorithms are combined with mathematical strategies and geometric techniques to build a model covering recognition, tracking and calibration. Agarwal et al. assembled a system that detects pedestrians with the YOLOv3 object recognition model, tracks the detected individuals with bounding boxes and assigned IDs using DeepSORT, and then compares the YOLOv3 results with other well-known models (Faster R-CNN and SSD) on mAP, FPS and loss. Imran Ahmed et al. presented a system that identifies people in video using YOLOv3, adding an extra layer in the neural network to compute pedestrian information; the recognition model uses the detected bounding boxes to distinguish people and finally computes the Euclidean distance between bounding-box centroids. Mahdi Rezaei et al. built a model based on computer vision and YOLOv4 that uses ordinary CCTV surveillance cameras to perform automated pedestrian identification indoors and outdoors, and further combined the deep neural network model with an adjusted IPM method and the SORT tracking algorithm to enhance pedestrian detection and social distance checking. Sergio Saponara et al. proposed an AI framework for social distance classification using thermal images, writing a YOLOv2-based deep-learning detection program for distinguishing and tracking pedestrians in outdoor and indoor situations. Rinkal Keniya et al. focused on identifying whether surrounding people keep a social distance; they detect individuals and display labels using a self-built model named SocialdistancingNet-19, classifying a pair as a dangerous group if the distance is below a certain value.
Current social distance detection systems are more or less deficient in the accuracy of pedestrian detection and tracking and in the accuracy of distance estimation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a monocular social distance detection and tracking method that is based on a YOLOv5 model and DeepSORT, accurately estimates the distance between pedestrians in a video with the help of camera calibration and bird's-eye view transformation, can remind pedestrians who violate the social distance, and offers high accuracy and good real-time performance.
The purpose of the invention is realized by the following technical scheme: the monocular social distance detection and tracking method comprises the following steps:
S1, carrying out pedestrian detection on a video image by using a YOLOv5 model;
S2, tracking the pedestrians and matching each pedestrian ID by using the DeepSORT algorithm;
S3, calibrating the camera by Zhang's calibration method to obtain the intrinsic parameters and distortion parameters of the camera;
S4, defining a rectangular region of interest in the real scene, measuring the side lengths of the rectangle in the real world, generating a bird's-eye view through inverse perspective transformation, and estimating the distance between pedestrians by using the bird's-eye view and the road-plane scale coefficients;
and S5, recording the pedestrian ID information and giving an early warning if the distance between pedestrians is smaller than a preset threshold value.
Further, the YOLOv5 model in step S1 has four modules: Input, Backbone, Neck and Head. The Backbone module sequentially comprises a Focus structure, four Conv structures and an SPP pyramid structure; a GhostBottleneck module is arranged between every two Conv structures, three GhostBottleneck modules in total. SE layers are arranged between the second Conv structure and the second GhostBottleneck module, between the third Conv structure and the third GhostBottleneck module, and between the fourth Conv structure and the SPP pyramid structure.
Further, the GhostBottleneck module consists of two Ghost modules: the first Ghost module serves as an expansion layer and is used for increasing the number of channels; the second Ghost module is used for reducing the number of channels; the output of the second Ghost module is added to the input of the first Ghost module to serve as the output of the GhostBottleneck module.
Further, the SE layer is established on any mapping F_tr : X ∈ R^(H′×W′×C′) → U ∈ R^(H×W×C). The convolution kernels are V = [v_1, v_2, …, v_C], where v_c denotes the c-th convolution kernel; the output is then represented as U = [u_1, u_2, …, u_C], with u_c expressed as:

u_c = v_c ∗ X = Σ_{s=1}^{C′} v_c^s ∗ x^s

where ∗ denotes the convolution operation, v_c = [v_c^1, v_c^2, …, v_c^{C′}], X = [x^1, x^2, …, x^{C′}], u_c ∈ R^(H×W), and v_c^s is a 2-dimensional convolution kernel acting on the s-th channel. The SE layer separates the spatial feature relationship and the channel feature relationship obtained by convolution, so that the model directly learns the channel feature relationship.

For the channel feature relationship, the SE layer performs two operations, a Squeeze operation and an Excitation operation. First, the Squeeze operation is applied to the input channel feature map to obtain the global features of each channel; then the Excitation operation learns the dependency between channels to obtain the weight of each channel, and finally the weights are multiplied with the original feature map to obtain the final features.
Further, in step S4, the road-plane scale coefficients k_x and k_y are expressed as:

k_x = w / w′,  k_y = h / h′

where k_x and k_y are the scale coefficients in the X and Y directions respectively, w and h are the real-world lengths of the length and width of the region of interest, and w′ and h′ are the pixel lengths of the length and width of the region of interest in the bird's-eye view.
The beneficial effects of the invention are: compared with existing social distance detectors, the invention detects and tracks pedestrians based on an improved YOLOv5 model and DeepSORT, and, combined with camera calibration and bird's-eye view transformation, accurately estimates the distance between pedestrians in the video so that pedestrians violating the social distance can be reminded. The pedestrian detection and tracking accuracy is high, and the real-time performance is good.
Drawings
FIG. 1 is a flow chart of a monocular social distance detection tracking method of the present invention;
FIG. 2 is a structural diagram of the original YOLOv5 Backbone module;
FIG. 3 is a structural diagram of the improved YOLOv5 Backbone module of the present invention;
FIG. 4 is a schematic diagram of the Ghost module structure according to the present invention;
FIG. 5 shows the structure of the GhostBottleneck of the present invention when the stride is 1 and 2, respectively;
FIG. 6 is a schematic structural diagram of a SE module according to the present invention;
FIG. 7 is a region of interest map of the present invention;
FIG. 8 is a bird's eye view of a region of interest of the present invention.
Detailed Description
The monocular vision of the invention refers to a single camera. The technical scheme of the invention is further explained by combining the attached drawings.
As shown in fig. 1, a monocular social distance detecting and tracking method of the present invention includes the following steps:
s1, carrying out pedestrian detection on a video image by using a YOLOv5 model;
although the YOLOv5 model adopts a CSP structure which can reduce network parameters, a Focus structure which can reduce information loss, and an SPP pyramid structure suitable for multi-size input, the real-time property is also considered by the social distance detection and tracking system, so the invention takes the CSP structure as an entry point and considers how to improve the model to increase the speed of pedestrian detection. YOLOv5 has four modules, which are Input, back bone, neck and Head respectively; the structure of the Backbone is shown in FIG. 2, and in the Backbone module, there is a structure consisting of 4 conv and 3 BottleneckCSPs alternately. The invention improves the Backbone, and uses the GhostBottleneck module to replace the original 3 Bottleneck CSP modules in the Backbone module, thereby improving the detection speed on the premise of not reducing the detection precision. In addition, the invention adopts a mode of increasing the SE layer, properly increases the calculation cost under proper conditions and improves the learning capability of the network characteristics. The improved model is shown in fig. 3. The Backbone module sequentially comprises a Focus structure, four Conv (convolution) structures and an SPP pyramid structure, wherein a GhostBottleneck module is arranged between every two Conv structures, and the total number of the GhostBottleneck modules is three; SE layers are arranged between the second Conv structure and the second GhostBottleneck module, between the third Conv structure and the third GhostBottleneck module, and between the fourth Conv structure and the SPP pyramid structure, and the outputs of the first two SE layers and the output of the SPP pyramid structure are jointly input into the Head module.
The GhostBottleneck module consists of two Ghost modules. A Ghost module generates additional feature maps through cheap operations: starting from a set of intrinsic feature maps, a series of linear transformations produces, at low cost, many ghost feature maps that still carry sufficient feature information. The Ghost module is divided into three parts: convolution, ghost generation and feature-map concatenation. First, feature maps are obtained with an ordinary convolution; then the Φ operation is applied to the feature map of each channel to generate ghost feature maps; finally the feature maps from the first step and the ghost feature maps are concatenated to obtain the output. FIG. 4 shows the operations the Ghost module performs when producing the same number of output feature maps: the output of an ordinary convolution layer contains many redundant feature maps, while the Φ operation is a cheap operation, similar to a 3×3 convolution, applied to the usually small feature maps produced by the original convolution layer to generate the "ghosts" of the corresponding feature maps. A sketch of the module follows.
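A minimal PyTorch sketch of a Ghost module, following the published GhostNet formulation; the ratio and kernel defaults and the parameter names are assumptions.

```python
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of a Ghost module: an ordinary convolution produces a few
    intrinsic feature maps, a cheap depthwise convolution (the Phi operation)
    generates the ghost maps, and the two sets are concatenated."""
    def __init__(self, in_ch, out_ch, kernel_size=1, ratio=2, dw_size=3, relu=True):
        super().__init__()
        init_ch = math.ceil(out_ch / ratio)       # intrinsic feature maps
        cheap_ch = init_ch * (ratio - 1)           # ghost feature maps
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, 1, kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity(),
        )
        self.cheap_operation = nn.Sequential(      # Phi: cheap depthwise conv
            nn.Conv2d(init_ch, cheap_ch, dw_size, 1, dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity(),
        )
        self.out_ch = out_ch

    def forward(self, x):
        x1 = self.primary_conv(x)
        x2 = self.cheap_operation(x1)
        out = torch.cat([x1, x2], dim=1)           # concatenate intrinsic + ghost maps
        return out[:, :self.out_ch, :, :]          # trim to the requested width
```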
The first Ghost module serves as an expansion layer to increase the number of channels, and the second Ghost module reduces the number of channels; the output of the second Ghost module is added to the input of the first Ghost module to form the output of the GhostBottleneck module, the channel counts of the two added paths being matched through the Ghost modules. FIG. 5 shows the GhostBottleneck structure for stride 1 and stride 2. When the stride is 1, BN and ReLU are used in the first Ghost module and only BN in the second. When the stride is 2, a DepthWise convolution with stride 2 is inserted between the two Ghost modules. Finally, for efficiency, the convolutions inside the Ghost module use pointwise convolution in practice. The stride is the step length of the convolution: with a stride of 1 the feature map stays almost the same size as the original, while a stride of 2 needs the extra depthwise convolution in the middle and reduces the amount of computation; either can be chosen as required. A sketch of the bottleneck follows.
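A minimal PyTorch sketch of the GhostBottleneck built from two Ghost modules as sketched above, with the stride-2 depthwise convolution and a shortcut that matches the channel count; choices beyond what FIG. 5 describes (e.g. the shortcut layout) are assumptions.

```python
import torch.nn as nn

class GhostBottleneck(nn.Module):
    """Sketch: GhostModule (expand) -> [depthwise conv if stride == 2] ->
    GhostModule (reduce), added to a shortcut of the input."""
    def __init__(self, in_ch, out_ch, mid_ch=None, dw_size=3, stride=1):
        super().__init__()
        mid_ch = mid_ch or out_ch
        layers = [GhostModule(in_ch, mid_ch, relu=True)]        # expand channels
        if stride == 2:                                          # downsampling variant
            layers += [nn.Conv2d(mid_ch, mid_ch, dw_size, stride, dw_size // 2,
                                 groups=mid_ch, bias=False),
                       nn.BatchNorm2d(mid_ch)]
        layers.append(GhostModule(mid_ch, out_ch, relu=False))   # reduce channels
        self.conv = nn.Sequential(*layers)

        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:                                                    # match shape for the add
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, in_ch, dw_size, stride, dw_size // 2,
                          groups=in_ch, bias=False),
                nn.BatchNorm2d(in_ch),
                nn.Conv2d(in_ch, out_ch, 1, 1, 0, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        return self.conv(x) + self.shortcut(x)   # second Ghost output + first Ghost input
```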
A typical CNN feeds an input feature map into convolution kernels and outputs a new feature map; the essence of convolution is feature fusion over the spatial dimensions (H, W) and the channel dimension (C). The SE operation separates the spatial feature relationship (H and W dimensions) from the channel feature relationship (C) learned by the convolution kernels, so that the model learns the channel (C-dimension) feature relationship directly; the basic structure of the SE layer is shown in FIG. 6.
The SE layer can be established on any mapping F_tr : X ∈ R^(H′×W′×C′) → U ∈ R^(H×W×C). If the convolution kernels are V = [v_1, v_2, …, v_C], where v_c denotes the c-th convolution kernel, the output can be represented as U = [u_1, u_2, …, u_C], with u_c expressed as:

u_c = v_c ∗ X = Σ_{s=1}^{C′} v_c^s ∗ x^s

where ∗ denotes the convolution operation, v_c = [v_c^1, v_c^2, …, v_c^{C′}], X = [x^1, x^2, …, x^{C′}], u_c ∈ R^(H×W), and v_c^s is a 2-dimensional convolution kernel acting on the s-th channel. The formula shows that the output is produced by summing the convolution results over all channels, so the spatial feature relationship and the channel feature relationship learned by the convolution kernels are mixed together; the SE layer separates the two so that the model can directly learn the channel feature relationship.

For the channel feature relationship, the SE layer performs two operations, a Squeeze operation and an Excitation operation. First, the Squeeze operation is applied to the input channel feature map to obtain the global features of each channel; then the Excitation operation learns the dependency between channels to obtain the weight of each channel, and finally the weights are multiplied with the original feature map to obtain the final features.
the Squeeze operation compresses the global space characteristics into one channel by using a global average pool to generate the statistical information of the channel; the output U generates the statistic z ∈ R by reducing its spatial dimension H × W C ,R C Represents a C-dimensional space R; the c-th statistic z c Expressed as:
Figure GDA0004047589990000071
the Excitation operation employs a valve mechanism in the form of sigmoid:
s=F ex (z,W)=σ(g(z,W))=σ(W 2 δ(W 1 z))
δ () is a function of the ReLU,
Figure GDA0004047589990000072
the dimensionality reduction coefficient is r, which is a hyper-parameter; the specification operation adopts a bottleeck structure comprising two FC layers, firstly, the dimension reduction processing is carried out through the first FC layer, then, the ReLU activation is carried out, and finally, the original dimension is converted through the second FC layer;
multiplying the learned sigmoid activation values of all channels by the initial features on U to obtain final features:
x' C =F scale (u c ,s)=su c
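A minimal PyTorch sketch of the SE layer as described by the formulas above: global average pooling for the Squeeze, two FC layers with ReLU and sigmoid for the Excitation, then channel-wise rescaling. The default reduction coefficient r = 16 is an assumption.

```python
import torch.nn as nn

class SELayer(nn.Module):
    """Sketch of a Squeeze-and-Excitation layer."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r, bias=False),  # dimensionality reduction
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels, bias=False),  # restore original dimension
            nn.Sigmoid(),                                    # per-channel gate s
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                 # Squeeze: average over H x W per channel
        s = self.fc(z)                         # Excitation: learned channel weights
        return u * s.view(b, c, 1, 1)          # rescale the original features on U
```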
s2, completing the tracking of the pedestrians and the matching work of each pedestrian ID by using a DeepsORT algorithm; the DeepsORT mainly utilizes Hungarian algorithm to distinguish whether the target in the current frame is the same as the target in the previous frame or not, and utilizes Kalman filtering to track the target. And matching the pedestrian detection model in the step S1, tracking and matching a 12-minute video of people walking on a square, wherein the video is often shielded, overlapped and crowded, and the diversity of costumes and appearances of pedestrians in real world public places is reflected. Compared with the current advanced pedestrian detection models of Faster R-CNN, SSD and YOLOv5 in precision, recall rate, FPS, IDSW and MOTA. The results are shown in Table 1.
TABLE 1

Model            Precision  Recall  FPS  IDSW  MOTA
Faster R-CNN     96.9       83.6    28   381   30.9
SSD              79.1       80.0    36   357   30.0
YOLOv5           83.6       61.1    53   306   30.4
Improved YOLOv5  92.6       75.3    68   289   30.6
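The following sketch shows, purely as an illustration, how steps S1 and S2 can be wired together: a YOLOv5 detector feeding a DeepSORT-style tracker frame by frame. The stock yolov5s weights stand in for the improved model, and the deep_sort_realtime package, the video file name and the tracker parameters are assumptions; any DeepSORT implementation that accepts detections and returns confirmed tracks with IDs plays the same role.

```python
import cv2
import torch
from deep_sort_realtime.deepsort_tracker import DeepSort  # assumed third-party package

detector = torch.hub.load('ultralytics/yolov5', 'yolov5s')  # stand-in for the improved model
tracker = DeepSort(max_age=30)                               # assumed tracker parameters

cap = cv2.VideoCapture('square_walk.mp4')                    # hypothetical input video
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = detector(rgb)                                  # S1: pedestrian detection
    dets = []
    for *xyxy, conf, cls in results.xyxy[0].tolist():
        if int(cls) == 0:                                    # COCO class 0 = person
            x1, y1, x2, y2 = xyxy
            dets.append(([x1, y1, x2 - x1, y2 - y1], conf, 'person'))
    tracks = tracker.update_tracks(dets, frame=frame)        # S2: Kalman + Hungarian matching
    for t in tracks:
        if t.is_confirmed():
            x1, y1, x2, y2 = map(int, t.to_ltrb())
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, f'ID {t.track_id}', (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cap.release()
```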
S3, calibrating the camera by Zhang's calibration method to obtain the intrinsic parameters and distortion parameters of the camera; the calibration principle follows Professor Zhang Zhengyou's calibration method, after which the image is undistorted. A minimal OpenCV sketch of this step is given below.
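A minimal sketch of step S3 using OpenCV's implementation of Zhang's method; the chessboard size, square size and image paths are assumptions.

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                      # inner corners of the chessboard (assumed)
square = 0.025                        # square edge length in metres (assumed)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob('calib/*.jpg'):                     # hypothetical calibration images
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Intrinsic matrix K and distortion coefficients via Zhang's method
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

frame = cv2.imread('frame.jpg')                           # hypothetical scene frame
undistorted = cv2.undistort(frame, K, dist)               # distortion removal
```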
s4, defining a rectangular region of interest in a real scene, and measuring the length of a rectangle in the real world, as shown in FIG. 7; then generating a bird's-eye view through inverse perspective transformation, and estimating the distance between pedestrians by using the bird's-eye view and a road plane scale coefficient as shown in FIG. 8; the mapping from any point in the original image plane to the corresponding point in the aerial view can be realized through perspective change. After obtaining the bird's-eye view of the region of interest, since the bird's-eye view has the characteristics of being uniformly distributed in the horizontal direction and the vertical direction and the proportion of the bird's-eye view in the horizontal direction and the proportion of the bird's-eye view in the vertical direction are different, the proportion coefficient k of the bird's-eye view to the road plane needs to be obtained x And k y The inter-pedestrian distance can be accurately estimated in the bird's eye view; road plane proportionality coefficient k x And k y Expressed as:
Figure GDA0004047589990000081
wherein k is x And k y Representing the scaling coefficients in the X and Y directions, w and h are the actual lengths of the length and width of the region of interest, respectively, and w ' and h ' are the pixel lengths of the length and width of the region of interest in the bird's eye view image, respectively。
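A minimal sketch of the inverse perspective transformation and the scale coefficients, continuing from the undistorted frame of the previous sketch; the ROI corner pixels and the measured side lengths w and h are assumptions to be replaced by on-site measurements.

```python
import cv2
import numpy as np

# ROI corners in the undistorted image, ordered TL, TR, BR, BL (assumed values)
src = np.float32([[420, 300], [880, 310], [1010, 650], [260, 640]])
w_px, h_px = 400, 600                        # chosen pixel size of the bird's-eye ROI
dst = np.float32([[0, 0], [w_px, 0], [w_px, h_px], [0, h_px]])

M = cv2.getPerspectiveTransform(src, dst)    # homography: image plane -> bird's-eye view
birds_eye = cv2.warpPerspective(undistorted, M, (w_px, h_px))

w, h = 6.0, 9.0                              # measured real-world ROI sides in metres (assumed)
k_x, k_y = w / w_px, h / h_px                # metres per pixel along X and Y
```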
Each pixel in the bird's-eye view thus represents a distance of k_x in the horizontal direction and k_y in the vertical direction, so the real-world distance between two pedestrians can be computed from the horizontal and vertical pixel offsets between them in the bird's-eye view. A sketch of the distance computation and the step S5 check follows.
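A minimal sketch of the distance estimation and the step S5 check, reusing the homography M and the scale coefficients k_x, k_y from the previous sketch; the 2-metre threshold and the helper names are assumptions.

```python
import itertools
import cv2
import numpy as np

SOCIAL_DISTANCE_M = 2.0                      # preset threshold (assumed 2 metres)

def ground_point(box):
    """Bottom-centre of an (x1, y1, x2, y2) box, mapped to bird's-eye coordinates."""
    x1, y1, x2, y2 = box
    pt = np.float32([[[(x1 + x2) / 2.0, y2]]])        # pedestrian's foot point
    return cv2.perspectiveTransform(pt, M)[0, 0]      # (u, v) in the bird's-eye view

def pairwise_violations(tracks):
    """tracks: dict {pedestrian_id: (x1, y1, x2, y2)} in the undistorted frame."""
    pts = {pid: ground_point(box) for pid, box in tracks.items()}
    violations = []
    for (ida, pa), (idb, pb) in itertools.combinations(pts.items(), 2):
        dx = (pa[0] - pb[0]) * k_x                    # horizontal offset in metres
        dy = (pa[1] - pb[1]) * k_y                    # vertical offset in metres
        if np.hypot(dx, dy) < SOCIAL_DISTANCE_M:
            violations.append((ida, idb))             # record IDs for the early warning
    return violations
```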
And S5, recording the ID information of the pedestrians and giving an early warning if the distance between the pedestrians is smaller than a preset threshold value.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention and should not be construed as limiting the invention to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations are within the scope of the invention.

Claims (2)

1. The monocular social distance detection and tracking method is characterized by comprising the following steps:

S1, carrying out pedestrian detection on a video image by using a YOLOv5 model; the YOLOv5 model has four modules, namely Input, Backbone, Neck and Head; the Backbone module sequentially comprises a Focus structure, four Conv structures and an SPP pyramid structure, a GhostBottleneck module is arranged between every two Conv structures, three GhostBottleneck modules in total; SE layers are arranged between the second Conv structure and the second GhostBottleneck module, between the third Conv structure and the third GhostBottleneck module, and between the fourth Conv structure and the SPP pyramid structure;

the GhostBottleneck module consists of two Ghost modules, wherein the first Ghost module serves as an expansion layer and is used for increasing the number of channels; the second Ghost module is used for reducing the number of channels; the output of the second Ghost module is added to the input of the first Ghost module to serve as the output of the GhostBottleneck module;

the SE layer is established on any mapping F_tr : X ∈ R^(H′×W′×C′) → U ∈ R^(H×W×C); the convolution kernels are V = [v_1, v_2, …, v_C], where v_c denotes the c-th convolution kernel, and the output is represented as U = [u_1, u_2, …, u_C], with u_c expressed as:

u_c = v_c ∗ X = Σ_{s=1}^{C′} v_c^s ∗ x^s

where ∗ denotes the convolution operation, v_c = [v_c^1, v_c^2, …, v_c^{C′}], X = [x^1, x^2, …, x^{C′}], u_c ∈ R^(H×W), and v_c^s is a 2-dimensional convolution kernel acting on the s-th channel; the SE layer separates the spatial feature relationship and the channel feature relationship obtained by convolution, so that the model directly learns the channel feature relationship;

for the channel feature relationship, the SE layer performs two operations, a Squeeze operation and an Excitation operation; first, the Squeeze operation is applied to the input channel feature map to obtain the global features of each channel; then the Excitation operation learns the dependency between channels to obtain the weight of each channel, and finally the weights are multiplied with the original feature map to obtain the final features;

the Squeeze operation compresses the global spatial features into a single value per channel using global average pooling to generate the channel statistics; the output U is reduced over its spatial dimensions H × W to a statistic z ∈ R^C, where R^C denotes a C-dimensional space; the c-th statistic z_c is expressed as:

z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)

the Excitation operation adopts a gating mechanism in the form of a sigmoid:

s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))

where δ(·) is the ReLU function, W_1 ∈ R^((C/r)×C), W_2 ∈ R^(C×(C/r)), and the dimensionality-reduction coefficient r is a hyper-parameter; the Excitation operation adopts a bottleneck structure comprising two FC layers: the first FC layer performs dimensionality reduction, ReLU activation follows, and the second FC layer restores the original dimensionality;

the learned sigmoid activation value of each channel is multiplied by the original features on U to obtain the final features:

x′_c = F_scale(u_c, s_c) = s_c · u_c

S2, completing the tracking of the pedestrians and the matching of each pedestrian ID by using the DeepSORT algorithm;

S3, calibrating the camera by Zhang's calibration method to obtain the intrinsic parameters and distortion parameters of the camera;

S4, defining a rectangular region of interest in the real scene, measuring the side lengths of the rectangle in the real world, generating a bird's-eye view through inverse perspective transformation, and estimating the distance between pedestrians by using the bird's-eye view and the road-plane scale coefficients;

and S5, recording the pedestrian ID information and giving an early warning if the distance between pedestrians is smaller than a preset threshold value.
2. The monocular social distance detection and tracking method according to claim 1, wherein in step S4 the road-plane scale coefficients k_x and k_y are expressed as:

k_x = w / w′,  k_y = h / h′

where k_x and k_y are the scale coefficients in the X and Y directions respectively, w and h are the real-world lengths of the length and width of the region of interest, and w′ and h′ are the pixel lengths of the length and width of the region of interest in the bird's-eye view.
CN202210241439.8A 2022-03-11 2022-03-11 Monocular social distance detection tracking method Active CN114612933B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210241439.8A CN114612933B (en) 2022-03-11 2022-03-11 Monocular social distance detection tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210241439.8A CN114612933B (en) 2022-03-11 2022-03-11 Monocular social distance detection tracking method

Publications (2)

Publication Number Publication Date
CN114612933A CN114612933A (en) 2022-06-10
CN114612933B true CN114612933B (en) 2023-04-07

Family

ID=81863981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210241439.8A Active CN114612933B (en) 2022-03-11 2022-03-11 Monocular social distance detection tracking method

Country Status (1)

Country Link
CN (1) CN114612933B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115497303A (en) * 2022-08-19 2022-12-20 招商新智科技有限公司 Expressway vehicle speed detection method and system under complex detection condition
CN116580066B (en) * 2023-07-04 2023-10-03 广州英码信息科技有限公司 Pedestrian target tracking method under low frame rate scene and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476822A (en) * 2020-04-08 2020-07-31 浙江大学 Laser radar target detection and motion tracking method based on scene flow
CN113283408A (en) * 2021-07-22 2021-08-20 中国人民解放军国防科技大学 Monitoring video-based social distance monitoring method, device, equipment and medium
WO2022000094A1 (en) * 2020-07-03 2022-01-06 Invision Ai, Inc. Video-based tracking systems and methods

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220028535A1 (en) * 2020-07-27 2022-01-27 VergeSense, Inc. Method for mitigating disease transmission in a facility
CN112683228A (en) * 2020-11-26 2021-04-20 深兰人工智能(深圳)有限公司 Monocular camera ranging method and device
CN113192646B (en) * 2021-04-25 2024-03-22 北京易华录信息技术股份有限公司 Target detection model construction method and device for monitoring distance between different targets
CN113985897A (en) * 2021-12-15 2022-01-28 北京工业大学 Mobile robot path planning method based on pedestrian trajectory prediction and social constraint

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476822A (en) * 2020-04-08 2020-07-31 浙江大学 Laser radar target detection and motion tracking method based on scene flow
WO2022000094A1 (en) * 2020-07-03 2022-01-06 Invision Ai, Inc. Video-based tracking systems and methods
CN113283408A (en) * 2021-07-22 2021-08-20 中国人民解放军国防科技大学 Monitoring video-based social distance monitoring method, device, equipment and medium

Also Published As

Publication number Publication date
CN114612933A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
CN111462200B (en) Cross-video pedestrian positioning and tracking method, system and equipment
CN109934848B (en) Method for accurately positioning moving object based on deep learning
Sidla et al. Pedestrian detection and tracking for counting applications in crowded situations
CN104778690B (en) A kind of multi-target orientation method based on camera network
CN114612933B (en) Monocular social distance detection tracking method
CN109064484B (en) Crowd movement behavior identification method based on fusion of subgroup component division and momentum characteristics
CN111144207B (en) Human body detection and tracking method based on multi-mode information perception
CN111160291B (en) Human eye detection method based on depth information and CNN
WO2018076392A1 (en) Pedestrian statistical method and apparatus based on recognition of parietal region of human body
CN104794737A (en) Depth-information-aided particle filter tracking method
Tian et al. Absolute head pose estimation from overhead wide-angle cameras
CN103729620A (en) Multi-view pedestrian detection method based on multi-view Bayesian network
Saif et al. Crowd density estimation from autonomous drones using deep learning: challenges and applications
CN113762009B (en) Crowd counting method based on multi-scale feature fusion and double-attention mechanism
CN102005052A (en) Occluded human body tracking method based on kernel density estimation
CN104123569A (en) Video person number information statistics method based on supervised learning
CN111881841A (en) Face detection and recognition method based on binocular vision
CN115880643A (en) Social distance monitoring method and device based on target detection algorithm
CN108694348B (en) Tracking registration method and device based on natural features
CN106023252A (en) Multi-camera human body tracking method based on OAB algorithm
Min et al. COEB-SLAM: A Robust VSLAM in Dynamic Environments Combined Object Detection, Epipolar Geometry Constraint, and Blur Filtering
Lo et al. Vanishing point-based line sampling for real-time people localization
Li et al. Robust object tracking in crowd dynamic scenes using explicit stereo depth
CN114372996A (en) Pedestrian track generation method oriented to indoor scene
Anbalagan et al. Deep learning based real-time COVID norms violation detection system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant