CN116805337B - Crowd positioning method based on trans-scale visual transformation network - Google Patents
- Publication number: CN116805337B (application CN202311074895.9A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06T7/75 — Determining position or orientation of objects or cameras using feature-based methods involving models
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06V10/454 — Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/52 — Scale-space analysis, e.g. wavelet analysis
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Image or video recognition using neural networks
Abstract
An embodiment of the invention discloses a crowd positioning method based on a cross-scale visual transformation network, comprising the following steps: constructing a feature extraction module to obtain a multi-scale feature map of the training crowd image; constructing a multi-scale coding fusion module to obtain a multi-scale coding sequence of the training crowd image; constructing a cross-window transformation network module to obtain a multi-scale long-distance dependency sequence of the training crowd image; constructing a multi-scale decoding fusion module to obtain a multi-scale decoding feature map and a distance conversion map of the training crowd image; constructing a loss calculation module to obtain the optimal crowd-image cross-scale visual transformation network positioning model; and obtaining the positioning result of an input crowd image using the optimal positioning model. The invention combines a convolutional neural network with the cross-window transformation network module, learning multi-scale feature information and long-distance dependencies from crowd images simultaneously, thereby improving the positioning accuracy for crowd images.
Description
Technical Field
The invention belongs to the fields of digital image processing, computer vision, pattern recognition and artificial intelligence, and particularly relates to a crowd positioning method based on a cross-scale visual transformation network.
Background
Crowd analysis can prevent crowd gathering and trampling accidents, and has great potential for improving public and traffic safety. Crowd localization is a critical task within crowd analysis: it aims to predict the location of each individual while estimating the total number of individuals in the crowd. Compared with crowd counting, which only estimates the total number of individuals, crowd localization provides detailed information about the spatial distribution of the crowd and can support effective crowd management and emergency response. Crowd localization faces significant challenges such as illumination changes, occlusion, and perspective effects, and many approaches have been proposed to overcome them. These methods fall into three main categories: detection-based methods, regression-based methods, and map-based methods.
Most detection-based methods use point-level annotations to generate pseudo bounding boxes. Liu et al. initialize the size of the pseudo bounding box with the nearest-neighbor distance between head center points and adjust it by iterative updates to train a reliable target detector. Considering that the size of a human head is related to its distance from the camera, Lian et al. predict the size of the pseudo bounding box using depth information. However, in very dense scenes these approaches perform poorly on crowd-localization tasks because of occlusion and blur.
Regression-based methods directly regress the coordinates of head points and output a confidence score. Song et al. propose a point-based counting and localization framework that predicts a set of candidate points to represent heads directly, based on predefined anchor points. Liang et al. propose an end-to-end crowd-localization model that uses trainable query instances rather than a large number of predefined anchor points. However, these methods lack correlation information between head points and other pixels, so their localization is not accurate enough.
Map-based approaches generate a trainable map that reflects the relationship between head points and adjacent pixels to guide model training. Idrees et al. use a density map for head localization, where the local maxima of the density map are the head-point locations. Abousamra et al. propose a topological approach that slightly expands head points into point masks and uses the point-mask map as supervision. Xu et al. generate a distance label map according to the distances between head points, which avoids the problem of head overlap in dense areas. Liang et al. propose an inverse focal distance map that better represents the correlation between head points and other pixels. These methods make full use of spatial information but do not take complete multi-scale information into account.
In the field of crowd analysis, some methods implement crowd counting and localization based on a transformation network. For example, Gao et al. propose an extended convolutional shifted-window transformation network for crowd localization, which learns feature maps using a shifted-window transformation network and a feature pyramid network. Lin et al. combine global attention and local attention for crowd counting within a transformation-network framework. However, these methods do not consider complete multi-scale information during learning. In contrast, the method of the present disclosure considers multi-scale information in the encoding stage, the decoding stage, and the loss function.
Disclosure of Invention
The invention aims to solve the technical problem that the multi-scale variation of heads in crowd images strongly affects the positioning result; the invention therefore provides a crowd positioning method based on a cross-scale visual transformation network.
To achieve this purpose, the invention provides a crowd positioning method based on a cross-scale visual transformation network, comprising the following steps:
step S1, constructing a feature extraction module using a pre-trained deep learning model, performing label processing on the training crowd images to obtain the label distance conversion map corresponding to the training crowd images, and inputting the training crowd images into the feature extraction module to obtain the multi-scale feature maps of the training crowd images;
S2, constructing a multi-scale coding fusion module, inputting the multi-scale feature maps of the training crowd images into the multi-scale coding fusion module, and fusing the multi-scale feature maps with it to obtain the multi-scale coding sequences of the training crowd images;
S3, constructing a cross-window transformation network module, inputting the multi-scale coding sequences of the training crowd images into the cross-window transformation network module, and learning long-distance dependencies at different scales with it to obtain the multi-scale long-distance dependency sequences of the training crowd images;
S4, constructing a multi-scale decoding fusion module, inputting the multi-scale long-distance dependency sequences of the training crowd images into the multi-scale decoding fusion module, and fusing the multi-scale long-distance dependency sequences with it to obtain the multi-scale decoding feature maps and the distance conversion map of the training crowd images;
s5, sequentially connecting the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module to form a crowd image trans-scale visual transformation network positioning model, constructing a loss calculation module, inputting a distance conversion diagram of the training crowd image and a label distance conversion diagram of the training crowd image into the loss calculation module, and optimizing the crowd image trans-scale visual transformation network positioning model by utilizing the obtained loss value to obtain an optimal crowd image trans-scale visual transformation network positioning model;
and S6, in the testing stage, calculating the distance conversion map of the input crowd image using the optimal crowd-image cross-scale visual transformation network positioning model, and post-processing the distance conversion map to obtain the positioning result of the input crowd image.
Optionally, the step S1 includes the steps of:
step S11, determining VGG-16 as a pre-training deep learning model, initializing parameters of the pre-training deep learning model, and removing a final global pooling layer and a full connection layer in the pre-training deep learning model to obtain the feature extraction module;
step S12, preprocessing and label processing are carried out on the training crowd images to obtain a label distance conversion chart corresponding to the training crowd images;
and S13, inputting the preprocessed training crowd images into the feature extraction module to obtain a multi-scale feature map of the training crowd images.
Optionally, the label distance conversion map corresponding to the training crowd image is expressed as:

F(x, y) = 1 / ( P(x, y)^(α·P(x, y) + β) + C )

wherein F(x, y) represents the label distance conversion map obtained after label processing of the training crowd image, (x, y) represents the pixel coordinates of the training crowd image, P(x, y) represents the distance value at pixel (x, y) (the distance from the pixel to its nearest labeled head point), α and β are adjustable parameters, and C is a constant.
Optionally, the step S2 includes the steps of:
s21, fusing the multi-scale feature images of the training crowd images at corresponding positions and the multi-scale feature images shallower than the multi-scale feature images, and converting the multi-scale feature images into a multi-scale sequence;
and S22, performing dimension reduction processing on the multi-scale feature map to convert the multi-scale feature map into feature vectors, and adding and fusing the feature vectors and the multi-scale sequences to obtain the multi-scale coding sequences of the training crowd images.
Optionally, the step S3 includes the steps of:
step S31, constructing a cross window basic unit based on a layer standardization module, a cross window multi-head self-attention module, a multi-layer perceptron module and a residual error structure, wherein the layer standardization module is used for carrying out data distribution standardization, the cross window multi-head self-attention module is used for learning global dependency relations, the multi-layer perceptron module is used for reducing the number of parameters, and the residual error structure acts on the outputs of the cross window multi-head self-attention module and the multi-layer perceptron module and is used for relieving the problem of gradient disappearance or explosion;
step S32, connecting B cross-window basic units in parallel to construct the cross-window transformation network module;
and step S33, inputting the multi-scale coding sequence to the cross window transformation network module, and constructing long-distance dependency relations of different scales on the multi-scale coding sequence by utilizing the cross window transformation network module to obtain a multi-scale long-distance dependency relation sequence of the training crowd image.
Optionally, the multi-scale long-distance dependency sequence is expressed as:

S_i = CSWin(LN(L_i)) + L_i
L_i′ = MLP(LN(S_i)) + S_i

wherein L_i′ represents the multi-scale long-distance dependency sequence, L_i represents the multi-scale coding sequence, LN represents the layer normalization module, MLP represents the multi-layer perceptron module, and CSWin represents the cross-window multi-head self-attention module.
Optionally, the step S4 includes the steps of:
s41, fusing the multi-scale long-distance dependency relationship sequence of the training crowd image and the multi-scale long-distance dependency relationship sequence deeper than the multi-scale long-distance dependency relationship sequence, and converting the multi-scale long-distance dependency relationship sequence into a decoding characteristic diagram;
and step S42, fusing the decoding characteristic diagram and a decoding characteristic diagram of a layer deeper than the decoding characteristic diagram to obtain a multi-scale decoding characteristic diagram of the training crowd image, and performing convolution operation of 1 gamma 1 and double up-sampling on the multi-scale decoding characteristic diagram to obtain a distance conversion diagram of the training crowd image.
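The 1×1 convolution and double up-sampling at the end of step S42 can be sketched as follows; shapes and weights are illustrative assumptions, and nearest-neighbour up-sampling stands in for whatever up-sampling the patent uses:

```python
import numpy as np

def conv1x1(feat, weight):
    # feat: (C, H, W); weight: (C_out, C). A 1x1 convolution is a
    # per-pixel linear map over channels.
    c, h, w = feat.shape
    out = weight @ feat.reshape(c, h * w)
    return out.reshape(weight.shape[0], h, w)

def upsample2x(feat):
    # Nearest-neighbour double up-sampling of a (C, H, W) map.
    return feat.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
decoded = rng.standard_normal((64, 32, 32))   # fused multi-scale decoding feature map
w = rng.standard_normal((1, 64)) * 0.1        # 1x1 conv collapsing 64 channels to 1
distance_map = upsample2x(conv1x1(decoded, w))
print(distance_map.shape)                      # (1, 64, 64)
```

The 1×1 convolution only mixes channels, so it maps the multi-channel decoding feature map to a single-channel distance map without changing spatial layout.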
Optionally, the loss function in the loss calculation module is expressed as:
L = L_MSE(E, G) + γ · L_MSSSIM(E, G)

L_MSSSIM(E, G) = (1/Q) · Σ_{q=1..Q} (1/(N·M)) · Σ_{n=1..N} Σ_{m=1..M} L_SSIM(E_qnm, G_qnm)

wherein L represents the total loss, L_MSE(E, G) represents the MSE loss, L_MSSSIM(E, G) represents the multi-scale SSIM loss, E represents the distance conversion map of the training crowd image, G represents the label distance conversion map of the training crowd image, γ is an adjustable parameter, Q represents the number of training crowd images, E_q represents the distance conversion map of the q-th training crowd image, G_q represents the label distance conversion map of the q-th training crowd image, L_SSIM(E_qnm, G_qnm) represents the SSIM loss, N represents the number of individuals in a single training crowd image, M represents the number of windows selected for the SSIM loss, E_qnm represents the distance conversion map under the m-th window of the n-th person in the q-th training crowd image, and G_qnm represents the label distance conversion map under the m-th window of the n-th person in the q-th training crowd image.
Optionally, the SSIM loss is expressed as:

L_SSIM(E_qnm, G_qnm) = 1 − ( (2·μ_E·μ_G + φ_1) · (2·σ_EG + φ_2) ) / ( (μ_E² + μ_G² + φ_1) · (σ_E² + σ_G² + φ_2) )

wherein μ_E represents the mean of the distance conversion map of the training crowd image, μ_G represents the mean of the label distance conversion map of the training crowd image, σ_E² represents the variance of the distance conversion map of the training crowd image, σ_G² represents the variance of the label distance conversion map of the training crowd image, σ_EG represents the covariance between the distance conversion map and the label distance conversion map of the training crowd image, and φ_1 and φ_2 are constants.
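A minimal sketch of the combined loss, assuming the standard SSIM form above and treating the whole map as a single window (the patent selects M windows per person; the window-selection step is omitted here, and the constants φ_1, φ_2 are illustrative):

```python
import numpy as np

def ssim_loss(e, g, phi1=1e-4, phi2=9e-4):
    # L_SSIM = 1 - SSIM(e, g) in its standard form.
    mu_e, mu_g = e.mean(), g.mean()
    var_e, var_g = e.var(), g.var()
    cov = ((e - mu_e) * (g - mu_g)).mean()
    ssim = ((2 * mu_e * mu_g + phi1) * (2 * cov + phi2)) / \
           ((mu_e**2 + mu_g**2 + phi1) * (var_e + var_g + phi2))
    return 1.0 - ssim

def total_loss(E, G, gamma=0.5):
    # L = L_MSE(E, G) + gamma * L_MSSSIM(E, G); one full-map window here.
    mse = ((E - G) ** 2).mean()
    return mse + gamma * ssim_loss(E, G)

rng = np.random.default_rng(1)
E = rng.random((8, 8))
print(total_loss(E, E))   # identical maps give (near-)zero loss
```

Identical prediction and label maps drive both terms to zero, which is the sanity check one would run before training.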
Optionally, obtaining the positioning result of the input crowd image after post-processing the distance conversion map includes:
step S61, obtaining all local maximum points in the distance conversion map through 3×3 max pooling;
step S62, setting a first threshold and a second threshold, the first threshold being greater than the second threshold, and comparing the local maxima in the distance conversion map with the two thresholds: a point whose local maximum is greater than the first threshold is confirmed as an individual head point, and if the global maximum of the distance conversion map is smaller than the second threshold, the input crowd image is confirmed to contain no person.
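The two-threshold post-processing of steps S61 and S62 can be sketched as follows; the threshold values are illustrative, and the 3×3 max filter is implemented directly in NumPy:

```python
import numpy as np

def local_maxima_points(dmap, t1=0.5, t2=0.1):
    # 3x3 max filter (stride 1): a pixel is a local maximum if it equals
    # the maximum of its 3x3 neighbourhood.
    h, w = dmap.shape
    padded = np.pad(dmap, 1, constant_values=-np.inf)
    pooled = np.stack([padded[dy:dy + h, dx:dx + w]
                       for dy in range(3) for dx in range(3)]).max(axis=0)
    if dmap.max() < t2:          # global maximum below the second threshold:
        return []                # the image is judged to contain no person
    ys, xs = np.where((dmap == pooled) & (dmap > t1))
    return list(zip(ys.tolist(), xs.tolist()))

dmap = np.zeros((8, 8)); dmap[3, 4] = 0.9; dmap[6, 1] = 0.3
print(local_maxima_points(dmap))   # [(3, 4)]
```

The weaker response at (6, 1) survives the local-maximum test but is rejected by the first threshold, which is exactly the filtering behaviour step S62 describes.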
The beneficial effects of the invention are as follows: the multi-scale feature maps of the crowd image are extracted by a convolutional neural network; the multi-scale coding fusion module fuses the multi-scale feature maps into a multi-scale coding sequence; the cross-window transformation network module then models long-distance dependencies on the multi-scale coding sequence to obtain a multi-scale long-distance dependency sequence; finally, the multi-scale decoding fusion module fuses the multi-scale long-distance dependency sequences. This improves the representation capability of the multi-scale feature maps and the accuracy of crowd-image localization.
Drawings
FIG. 1 is a flow chart of a crowd positioning method based on a cross-scale visual transformation network according to one embodiment of the invention.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
Fig. 1 is a flowchart of a crowd positioning method based on a cross-scale visual transformation network according to an embodiment of the present invention. Taking Fig. 1 as an example, several implementation processes of the invention are described below. As shown in Fig. 1, the crowd positioning method based on the cross-scale visual transformation network includes the following steps:
step S1, constructing a feature extraction module by utilizing a pre-training deep learning model, performing label processing on training crowd images to obtain a label distance conversion diagram corresponding to the training crowd images, and inputting the training crowd images into the feature extraction module to obtain a multi-scale feature diagram F of the training crowd images i ;
Further, the step S1 includes the steps of:
step S11, determining a pre-training deep learning model, initializing parameters of the pre-training deep learning model, and removing a final global pooling layer and a full connection layer in the pre-training deep learning model to obtain the feature extraction module;
in an embodiment of the present invention, the pre-training deep learning model may be VGG-16, and after parameter initialization, a model component before the last global pooling layer in the model is selected, that is, the last global pooling layer and the full connection layer in the pre-training deep learning model are removed, and the remaining model part forms a feature extraction module. In an embodiment of the present invention, the feature extraction module is composed of 4 modules, which may be named Stage1, stage2, stage3 and Stage4, and the feature extraction module generates feature graphs with different scales from Stage1, stage2, stage3 and Stage4, respectively.
Step S12, preprocessing and label processing are carried out on the training crowd images in a training set, and a label distance conversion chart corresponding to the training crowd images is obtained;
in an embodiment of the present invention, the preprocessing the training crowd image may include: the training crowd image is randomly and horizontally turned over, the probability is set to be 0.5, all pixels of the training crowd image are scaled down to be within a preset range, for example, between 0 and 1, then the average value of the pixels of the training crowd image is subtracted from each pixel value in the training crowd image, the average value of the pixels of the training crowd image is divided by the variance of the pixels of the training crowd image, finally the training crowd image is cut to be a fixed size H W, wherein H is the height of the cut training crowd image, W is the width of the cut training crowd image, and in one embodiment of the invention, H=256 and W=256.
In an embodiment of the present invention, the label distance conversion map obtained after label processing of the training crowd image may be expressed as:

F(x, y) = 1 / ( P(x, y)^(α·P(x, y) + β) + C )

wherein F(x, y) represents the label distance conversion map, (x, y) represents the pixel coordinates of the training crowd image, P(x, y) represents the distance value at pixel (x, y) (the distance from the pixel to its nearest labeled head point), α and β are adjustable parameters, and C is a constant.
In one embodiment of the present invention, α=0.02, β=0.75, and c=1.
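With these parameter values, the label-map construction can be sketched as follows; P(x, y) is taken as the Euclidean distance to the nearest annotated head point (an assumption consistent with inverse-distance label maps), computed by brute force over the head list for clarity:

```python
import numpy as np

def label_distance_map(shape, heads, alpha=0.02, beta=0.75, C=1.0):
    # heads: list of (y, x) head-point annotations.
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    # P(x, y): Euclidean distance from each pixel to its nearest head point.
    P = np.min([np.hypot(ys - hy, xs - hx) for hy, hx in heads], axis=0)
    # F(x, y) = 1 / (P^(alpha*P + beta) + C): equals 1 at head points
    # (0^0.75 = 0) and decays with distance.
    return 1.0 / (P ** (alpha * P + beta) + C)

F = label_distance_map((16, 16), [(4, 4), (10, 12)])
print(F[4, 4], F.max())   # 1.0 1.0
```

The map peaks at exactly the annotated head points and stays strictly positive elsewhere, which is what lets the post-processing stage recover head locations as local maxima.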
Step S13, inputting the preprocessed training crowd image into the feature extraction module to obtain the multi-scale feature maps F_i of the training crowd image.
In one embodiment of the present invention, C_i is the number of channels of the i-th-scale feature map of the training crowd image; for example, one can set C_1 = 128, C_2 = 256, C_3 = 512 and C_4 = 512, in which case F_1, F_2, F_3 and F_4 are output from Stage1, Stage2, Stage3 and Stage4 of the feature extraction module, respectively.
Step S2, constructing a multi-scale coding fusion module, inputting the multi-scale feature maps F_i of the training crowd image into the multi-scale coding fusion module, and fusing the multi-scale feature maps with it to obtain the multi-scale coding sequences L_i of the training crowd image;
Further, the step S2 includes the steps of:
Step S21, fusing the multi-scale feature map F_i of the training crowd image at the corresponding positions with the shallower multi-scale feature maps F_j (0 < j < i), and converting the result into the multi-scale sequence G_i;
In one embodiment of the invention, step S21 is described taking the generation of G_4 from F_4 as an example. Because convolution is involved in extracting the multi-scale feature maps, a 1×1 region of F_4 corresponds to a 2^(4-i) × 2^(4-i) region of F_i (i = 1, 2, 3). First, F_4 is converted into a sequence D_4, i.e. a sequence composed of C_4-dimensional feature vectors, and the remaining shallower multi-scale feature maps F_i (i = 1, 2, 3) are likewise converted into sequences D_i, where N_i = 2^(4-i) × 2^(4-i). The sequences D_i are then converted into sequences D_i′ using a linear layer. Finally, the sequences D_i′ (i = 1, 2, 3, 4) are fused to obtain the multi-scale sequence G_4. G_1, G_2 and G_3 can be obtained in the same way.
Step S22, performing dimension-reduction processing on the multi-scale feature map F_i to convert it into feature vectors, and adding and fusing these with the multi-scale sequence G_i to obtain the multi-scale coding sequence L_i of the training crowd image.
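The sequence conversion and fusion of steps S21 and S22 can be sketched at shape level as follows; all resolutions, channel counts and linear projections are illustrative assumptions, and region average pooling stands in for the patent's correspondence-based fusion:

```python
import numpy as np

rng = np.random.default_rng(0)
H4 = W4 = 4                                   # deepest-map resolution (illustrative)
C = [128, 256, 512, 512]                      # channel counts C_1..C_4
# Shallower maps are 2^(4-i) times larger than the deepest map at scale i.
feats = [rng.standard_normal((C[i], H4 << (3 - i), W4 << (3 - i))) for i in range(4)]

def region_tokens(f, out_hw):
    # Average each (r x r) region so every scale yields out_hw*out_hw tokens,
    # then flatten to a (tokens, C) sequence.
    c, h, w = f.shape
    r = h // out_hw
    pooled = f.reshape(c, out_hw, r, out_hw, r).mean(axis=(2, 4))
    return pooled.reshape(c, -1).T

D = 512
proj = [rng.standard_normal((C[i], D)) * 0.02 for i in range(4)]  # hypothetical linear layers
seqs = [region_tokens(f, H4) @ p for f, p in zip(feats, proj)]
G4 = np.sum(seqs, axis=0)                      # fused multi-scale sequence
print(G4.shape)                                # (16, 512)
```

Each deepest-map token ends up carrying information from the matching 2^(4-i) × 2^(4-i) regions of every shallower scale, which is the cross-scale correspondence step S21 relies on.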
S3, constructing a cross-window transformation network module, inputting the multi-scale coding sequences L_i of the training crowd image into the cross-window transformation network module, and learning long-distance dependencies at different scales with it to obtain the multi-scale long-distance dependency sequences L_i′ of the training crowd image;
Further, the step S3 includes the steps of:
step S31, constructing a cross window basic unit based on a layer standardization module, a cross window multi-head self-attention module, a multi-layer perceptron module and a residual error structure;
in the step, firstly, a layer standardization module is utilized to conduct data distribution standardization so as to accelerate model convergence, then a cross window multi-head self-attention module is utilized to learn global dependency, finally, a multi-layer perceptron module is utilized to reduce the number of parameters, and in addition, a residual structure can be utilized to relieve the problem of gradient disappearance or explosion, for example, the residual structure can act on the outputs of the cross window multi-head self-attention module and the multi-layer perceptron module. Wherein the layer normalization module, residual error structure and multi-layer perceptron module are common computing modules in the art, which are not described too much in this disclosure.
Step S32, connecting B cross-window basic units in parallel to construct the cross-window transformation network module;
wherein the number B of cross-window basic units corresponds to the number of scales of the multi-scale coding sequence L_i. In one embodiment of the invention, the number of scales of L_i is 4, so B = 4. The B cross-window basic units are connected in parallel to obtain the cross-window transformation network module, and each cross-window basic unit corresponds to the sequence L_i at one scale of the training crowd image.
Step S33, inputting the multi-scale coding sequences L_i into the cross-window transformation network module, and constructing long-distance dependencies at different scales on the multi-scale coding sequences with it to obtain the multi-scale long-distance dependency sequences L_i′ of the training crowd image.
In one embodiment of the invention, the multi-scale long-distance dependency sequence L_i' can be expressed as:

L_i' = MLP(LN(S_i)) + S_i

S_i = CSWin(LN(L_i)) + L_i

wherein L_i' represents the multi-scale long-distance dependency sequence, L_i represents the multi-scale coding sequence, LN represents the layer normalization module, MLP represents the multi-layer perceptron module, and CSWin represents the cross window multi-head self-attention module.
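The two residual equations above can be sketched as a small PyTorch module. This is a minimal illustration under stated assumptions, not the patented implementation: a standard `nn.MultiheadAttention` stands in for the cross window multi-head self-attention, and the dimension and MLP expansion ratio are hypothetical.

```python
import torch
import torch.nn as nn

class CrossWindowBasicUnit(nn.Module):
    """One basic unit: LN -> attention -> residual, then LN -> MLP -> residual."""
    def __init__(self, dim, num_heads=2, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        # stand-in for the cross window multi-head self-attention (CSWin)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                                    # x: (batch, sequence, dim)
        h = self.ln1(x)
        s = self.attn(h, h, h, need_weights=False)[0] + x    # S_i = CSWin(LN(L_i)) + L_i
        return self.mlp(self.ln2(s)) + s                     # L_i' = MLP(LN(S_i)) + S_i
```

Connecting B such units in parallel, one per scale, would form the cross window transformation network module of step S32.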
In one embodiment of the present invention, the cross window multi-head self-attention module is composed of a horizontal stripe self-attention module and a vertical stripe self-attention module, and the output of the i-th-scale multi-scale coding sequence L_i from the cross window multi-head self-attention module is expressed as:

CSWin(L_i) = Concat(head_1, …, head_k, …, head_K) W_0

wherein CSWin represents the cross window multi-head self-attention module, Concat represents the concatenation function, K represents the number of heads in the cross window multi-head self-attention module, head_k represents the cross window multi-head self-attention of the k-th head, W_0 represents the mapping matrix, H_Att-k(L_i) represents the horizontal stripe self-attention, and V_Att-k(L_i) represents the vertical stripe self-attention.
In one embodiment of the present invention, K = 2.
In one embodiment of the present invention, the horizontal stripe self-attention output of the i-th-scale multi-scale coding sequence L_i is expressed as:

H_Att-k(L_i^a) = softmax( Q_ik^a (K_ik^a)^T / √d_k ) V_ik^a

wherein Q_ik^a = (L_i^a)^T W_ik^Q, K_ik^a = (L_i^a)^T W_ik^K and V_ik^a = (L_i^a)^T W_ik^V are the query, key and value respectively, L_i^a is the a-th horizontal stripe sequence in L_i, A is the number of horizontal stripe sequences within L_i, W_ik^Q, W_ik^K and W_ik^V are the mapping matrices of the query, key and value respectively, C_i is the channel number of the i-th-scale feature map of the training crowd image, and d_k is the dimension of the k-th head in the cross window multi-head self-attention. The vertical stripe self-attention V_Att-k(L_i) is obtained similarly.
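The stripe partitioning behind the horizontal and vertical self-attention can be sketched as follows. This is a simplified single-head sketch under assumptions: the query, key and value mapping matrices W_ik^Q, W_ik^K, W_ik^V are taken as the identity, the channel dimension C stands in for d_k, and `stripe_w` is a hypothetical stripe-width parameter.

```python
import torch

def stripe_attention(x, stripe_w, horizontal=True):
    """Self-attention restricted to stripes of width stripe_w.

    x: (B, H, W, C) feature map. Single-head sketch with identity
    query/key/value projections and C used in place of d_k."""
    if not horizontal:                    # vertical stripes: swap the spatial axes
        x = x.transpose(1, 2)
    B, H, W, C = x.shape
    # partition into stripes and flatten each stripe into a token sequence
    t = x.reshape(B, H // stripe_w, stripe_w * W, C)
    attn = torch.softmax(t @ t.transpose(-2, -1) / C ** 0.5, dim=-1)  # softmax(QK^T / sqrt(d))
    out = (attn @ t).reshape(B, H, W, C)
    return out.transpose(1, 2) if not horizontal else out
```

Concatenating one horizontal head and one vertical head (K = 2) and applying the mapping matrix W_0 would complete the cross window attention output described above.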
S4, constructing a multi-scale decoding fusion module, inputting the multi-scale long-distance dependency relation sequence L_i' of the training crowd image into the multi-scale decoding fusion module, and fusing L_i' with the multi-scale decoding fusion module to obtain a multi-scale decoding feature map and a distance conversion map E of the training crowd image.
Further, the step S4 includes the steps of:
step S41, the multi-scale long-distance dependency relation sequence L_i' of the training crowd image and a deeper multi-scale long-distance dependency relation sequence L_j' (j > i > 0) are fused and converted into a decoding feature map F_i';
In one embodiment of the invention, step S41 is illustrated by the generation of F_1'. First, the multi-scale long-distance dependency relation sequence L_i' of the training crowd image is transformed into a feature map Z_i (i = 1, 2, 3, 4) of the same size as the multi-scale feature map F_i. The feature maps Z_i (i = 1, 2, 3, 4) are then up-sampled to the same size as the multi-scale feature map F_1 and added to obtain the decoding feature map F_1'. The decoding feature maps F_i' (i = 2, 3, 4) can be obtained in the same way.
Step S42, the decoding feature map F_i' and the decoding feature map F_{i+1}' of a layer deeper than it are fused to obtain the multi-scale decoding feature map of the training crowd image, and a 1×1 convolution operation and double up-sampling are performed on the multi-scale decoding feature map to obtain the distance conversion map E of the training crowd image.
In one embodiment of the invention, step S42 is illustrated by the generation of the multi-scale decoding feature map. First, a 1×1 convolution operation is performed on the decoding feature map F_4' to generate a decoded map. This map is then up-sampled by a factor of two, added to the decoding feature map F_3', and passed through a 1×1 convolution operation to generate the next decoded map. Repeating this process yields the multi-scale decoding feature map.
In one embodiment of the invention, a 1×1 convolution operation and double up-sampling are performed on the multi-scale decoding feature map to obtain the distance conversion map E of the training crowd image.
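The fuse-then-project pattern of steps S41 and S42 can be illustrated with toy tensors. The helper name `decode_fuse`, the channel count, and the spatial sizes are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def decode_fuse(deep, shallow, conv1x1):
    """2x up-sample the deeper map, add the shallower map, apply a 1x1 conv."""
    up = F.interpolate(deep, scale_factor=2, mode='bilinear', align_corners=False)
    return conv1x1(up + shallow)

f4 = torch.randn(1, 32, 8, 8)      # deeper decoding feature map (toy size)
f3 = torch.randn(1, 32, 16, 16)    # shallower decoding feature map (toy size)
fused = decode_fuse(f4, f3, nn.Conv2d(32, 32, kernel_size=1))
# a final 1x1 conv to one channel plus double up-sampling would yield
# a distance conversion map like E
```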
S5, sequentially connecting the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module to form a crowd image trans-scale visual transformation network positioning model, constructing a loss calculation module, inputting a distance conversion diagram E of the training crowd image and a label distance conversion diagram of the training crowd image into the loss calculation module, and optimizing the crowd image trans-scale visual transformation network positioning model by utilizing the obtained loss value to obtain an optimal crowd image trans-scale visual transformation network positioning model;
further, the step S5 includes the steps of:
step S51, sequentially connecting the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module to form a crowd image trans-scale visual transformation network positioning model;
step S52, constructing a loss calculation module, and inputting a distance conversion diagram of the training crowd image and a label distance conversion diagram of the training crowd image into the loss calculation module;
in one embodiment of the present invention, when training is performed on the UCF-QNRF database, the loss function of the constructed loss calculation module is expressed as:
L = L_MSE(E, G) + γ L_MSSSIM(E, G)

L_MSE(E, G) = (1/Q) Σ_{q=1}^{Q} ‖E_q − G_q‖²

L_MSSSIM(E, G) = (1/(QNM)) Σ_{q=1}^{Q} Σ_{n=1}^{N} Σ_{m=1}^{M} L_SSIM(E_qnm, G_qnm)
wherein L represents the total loss, L_MSE(E, G) represents the MSE loss, L_MSSSIM(E, G) represents the multi-scale SSIM loss, E represents the distance conversion map of the training crowd images, G represents the label distance conversion map of the training crowd images, γ is an adjustable parameter, Q represents the number of training crowd images, E_q represents the distance conversion map of the q-th training crowd image, G_q represents the label distance conversion map of the q-th training crowd image, L_SSIM(E_qnm, G_qnm) represents the SSIM loss, N represents the number of individuals in a single training crowd image, M represents the number of windows selected to perform the SSIM loss, E_qnm represents the distance conversion map under the m-th window of the n-th person in the q-th training crowd image, and G_qnm represents the label distance conversion map under the m-th window of the n-th person in the q-th training crowd image.
In one embodiment of the invention, the SSIM loss L_SSIM can be expressed as:

L_SSIM(E, G) = 1 − (2μ_E μ_G + φ_1)(2σ_EG + φ_2) / ((μ_E² + μ_G² + φ_1)(σ_E² + σ_G² + φ_2))
wherein E represents the distance conversion map of the training crowd image, G represents the label distance conversion map of the training crowd image, μ_E represents the mean of the distance conversion map of the training crowd image, μ_G represents the mean of the label distance conversion map of the training crowd image, σ_E² represents the variance of the distance conversion map of the training crowd image, σ_G² represents the variance of the label distance conversion map of the training crowd image, σ_EG represents the covariance between the distance conversion map and the label distance conversion map of the training crowd image, and φ_1 and φ_2 are constants.
In one embodiment of the present invention, γ = 0.1, φ_1 = 1×10^-4, φ_2 = 9×10^-4.
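With these constants the loss can be sketched as follows. This simplification computes a single global SSIM window instead of the per-person, multi-window MS-SSIM term of the patent, and the 1 − SSIM form of the loss term is an assumption.

```python
import torch
import torch.nn.functional as F

def ssim(e, g, phi1=1e-4, phi2=9e-4):
    """SSIM between a predicted and a label distance conversion map (one window)."""
    mu_e, mu_g = e.mean(), g.mean()
    var_e, var_g = e.var(unbiased=False), g.var(unbiased=False)
    cov = ((e - mu_e) * (g - mu_g)).mean()
    return ((2 * mu_e * mu_g + phi1) * (2 * cov + phi2)) / \
           ((mu_e ** 2 + mu_g ** 2 + phi1) * (var_e + var_g + phi2))

def total_loss(e, g, gamma=0.1):
    # MSE term plus a weighted SSIM-based term (assumed 1 - SSIM form)
    return F.mse_loss(e, g) + gamma * (1 - ssim(e, g))
```

For identical prediction and label maps both terms vanish, so the total loss is (numerically) zero.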
And step S53, optimizing the crowd image trans-scale visual transformation network positioning model by using the obtained loss value to obtain an optimal crowd image trans-scale visual transformation network positioning model.
In the step, iterative computation can be performed by means of a random gradient descent method to optimize the crowd image trans-scale visual transformation network positioning model.
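A minimal sketch of such iterative optimization by stochastic gradient descent, with a single convolution and an MSE loss standing in for the full cross-scale model and its composite loss (all sizes are toy assumptions):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Conv2d(3, 1, 3, padding=1)   # toy stand-in for the localization model
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
images = torch.randn(2, 3, 32, 32)            # toy training crowd images
labels = torch.rand(2, 1, 32, 32)             # toy label distance conversion maps

losses = []
for _ in range(50):                           # iterative optimization by SGD
    opt.zero_grad()
    loss = F.mse_loss(model(images), labels)  # MSE in place of the full loss
    loss.backward()
    opt.step()
    losses.append(loss.item())
```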
And S6, in the testing stage, calculating a distance conversion diagram of the input crowd image by using the optimal crowd image cross-scale visual transformation network positioning model, and performing post-processing on the distance conversion diagram to obtain a positioning result of the input crowd image.
In an embodiment of the present invention, the step of obtaining the positioning result of the input crowd image after performing post-processing on the distance conversion map may include the following steps:
step S61, obtaining all local maximum points in the distance conversion map through 3×3 maximum pooling;
step S62, setting two thresholds, namely a first threshold T_max and a second threshold T_min, where T_max > T_min; the local maxima in the distance conversion map are compared with the two thresholds, a point whose local maximum is larger than the first threshold T_max is identified as an individual head point, and if the global maximum of the distance conversion map is less than the second threshold T_min, it is confirmed that no person exists in the input crowd image.
In one embodiment of the invention, T_max is 110/255 of the global maximum of the distance conversion map, and T_min = 0.1.
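Steps S61 and S62 with these embodiment values can be sketched as follows; `locate_heads` is an illustrative helper name, not from the patent.

```python
import torch
import torch.nn.functional as F

def locate_heads(dist_map, t_min=0.1):
    """Head points from a distance conversion map: 3x3 max pooling finds local
    maxima; T_max is 110/255 of the global maximum, T_min a fixed floor."""
    g_max = dist_map.max()
    if g_max < t_min:                          # global maximum below T_min: no person
        return torch.zeros(0, 2, dtype=torch.long)
    t_max = g_max * 110.0 / 255.0
    pooled = F.max_pool2d(dist_map[None, None], 3, stride=1, padding=1)[0, 0]
    peaks = (dist_map == pooled) & (dist_map > t_max)
    return peaks.nonzero()                     # (row, col) of individual head points

# toy distance map with two clear peaks
m = torch.zeros(16, 16)
m[4, 4], m[10, 12] = 1.0, 0.9
points = locate_heads(m)
```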
When tested on a large publicly available crowd image database, for example the UCF-QNRF database, the crowd image positioning accuracy of the invention reaches 85.6% Average Precision, 80.6% Average Recall and 83.1% Average F1-measure. The method learns the multi-scale feature map of the crowd image through a convolutional neural network, fuses the multi-scale feature map with the multi-scale coding fusion module to generate a multi-scale coding sequence, models the multi-scale coding sequence with the cross window transformation network module to generate a multi-scale long-distance dependency relation sequence, and finally fuses the multi-scale long-distance dependency relation sequence with the multi-scale decoding fusion module, so that multi-scale feature information and long-distance dependency relations are learned simultaneously; the accuracy of locating crowd images is thereby improved to a great extent, demonstrating the effectiveness of the method.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of the principles of the present invention and are in no way limiting. Accordingly, any modification, equivalent replacement, improvement, etc. made without departing from the spirit and scope of the present invention should be included in the scope of the present invention. Furthermore, the appended claims are intended to cover all such changes and modifications that fall within the scope and boundary of the appended claims, or equivalents of such scope and boundary.
Claims (10)
1. A crowd positioning method based on a trans-scale visual transformation network, the method comprising the steps of:
step S1, a feature extraction module is constructed by utilizing a pre-training deep learning model, label processing is carried out on training crowd images, a label distance conversion diagram corresponding to the training crowd images is obtained, the training crowd images are input into the feature extraction module, and a multi-scale feature diagram of the training crowd images is obtained;
s2, constructing a multi-scale coding fusion module, inputting the multi-scale feature images of the training crowd images into the multi-scale coding fusion module, and fusing the multi-scale feature images by using the multi-scale coding fusion module to obtain a multi-scale coding sequence of the training crowd images;
s3, constructing a cross window transformation network module, inputting the multi-scale coding sequence of the training crowd image into the cross window transformation network module, and learning long-distance dependency relations of different scales by utilizing the cross window transformation network module to obtain the multi-scale long-distance dependency relation sequence of the training crowd image;
s4, constructing a multi-scale decoding fusion module, inputting the multi-scale long-distance dependency relationship sequence of the training crowd image into the multi-scale decoding fusion module, and fusing the multi-scale long-distance dependency relationship sequence by using the multi-scale decoding fusion module to obtain a multi-scale decoding feature map and a distance conversion map of the training crowd image;
s5, sequentially connecting the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module to form a crowd image trans-scale visual transformation network positioning model, constructing a loss calculation module, inputting a distance conversion diagram of the training crowd image and a label distance conversion diagram of the training crowd image into the loss calculation module, and optimizing the crowd image trans-scale visual transformation network positioning model by utilizing the obtained loss value to obtain an optimal crowd image trans-scale visual transformation network positioning model;
and S6, in the testing stage, calculating a distance conversion diagram of the input crowd image by using the optimal crowd image cross-scale visual transformation network positioning model, and performing post-processing on the distance conversion diagram to obtain a positioning result of the input crowd image.
2. The method according to claim 1, wherein the step S1 comprises the steps of:
step S11, determining VGG-16 as a pre-training deep learning model, initializing parameters of the pre-training deep learning model, and removing a final global pooling layer and a full connection layer in the pre-training deep learning model to obtain the feature extraction module;
step S12, preprocessing and label processing are carried out on the training crowd images to obtain a label distance conversion chart corresponding to the training crowd images;
and S13, inputting the preprocessed training crowd images into the feature extraction module to obtain a multi-scale feature map of the training crowd images.
3. The method according to claim 1 or 2, wherein the label distance conversion map corresponding to the training crowd image is represented as:
;
wherein F (x, y) represents a label distance conversion diagram obtained after label processing of the training crowd image, (x, y) represents training crowd image pixel coordinates, P (x, y) represents training crowd image pixel values, alpha and beta are adjustable parameters, and C is a constant.
4. The method according to claim 1, wherein said step S2 comprises the steps of:
s21, fusing the multi-scale feature images of the training crowd images at corresponding positions and the multi-scale feature images shallower than the multi-scale feature images, and converting the multi-scale feature images into a multi-scale sequence;
and S22, performing dimension reduction processing on the multi-scale feature map to convert the multi-scale feature map into feature vectors, and adding and fusing the feature vectors and the multi-scale sequences to obtain the multi-scale coding sequences of the training crowd images.
5. The method according to claim 1, wherein said step S3 comprises the steps of:
step S31, constructing a cross window basic unit based on a layer standardization module, a cross window multi-head self-attention module, a multi-layer perceptron module and a residual structure, wherein the layer standardization module is used for normalizing the data distribution, the cross window multi-head self-attention module is used for learning global dependency relations, the multi-layer perceptron module is used for reducing the number of parameters, and the residual structure acts on the outputs of the cross window multi-head self-attention module and the multi-layer perceptron module and is used for alleviating gradient vanishing or explosion;
step S32, connecting B cross window basic units in parallel, and constructing to obtain the cross window transformation network module;
and step S33, inputting the multi-scale coding sequence to the cross window transformation network module, and constructing long-distance dependency relations of different scales on the multi-scale coding sequence by utilizing the cross window transformation network module to obtain a multi-scale long-distance dependency relation sequence of the training crowd image.
6. The method of claim 5, wherein the multi-scale long-range dependency sequence is expressed as:
L_i' = MLP(LN(S_i)) + S_i;

S_i = CSWin(LN(L_i)) + L_i;

wherein L_i' represents the multi-scale long-distance dependency sequence, L_i represents the multi-scale coding sequence, LN represents the layer normalization module, MLP represents the multi-layer perceptron module, and CSWin represents the cross window multi-head self-attention module.
7. The method according to claim 1, wherein said step S4 comprises the steps of:
s41, fusing the multi-scale long-distance dependency relationship sequence of the training crowd image and the multi-scale long-distance dependency relationship sequence deeper than the multi-scale long-distance dependency relationship sequence, and converting the multi-scale long-distance dependency relationship sequence into a decoding characteristic diagram;
and step S42, fusing the decoding feature map and a decoding feature map of a layer deeper than it to obtain a multi-scale decoding feature map of the training crowd image, and performing a 1×1 convolution operation and double up-sampling on the multi-scale decoding feature map to obtain a distance conversion map of the training crowd image.
8. The method of claim 1, wherein the loss function in the loss calculation module is expressed as:
L = L_MSE(E, G) + γ L_MSSSIM(E, G);

L_MSE(E, G) = (1/Q) Σ_{q=1}^{Q} ‖E_q − G_q‖²;

L_MSSSIM(E, G) = (1/(QNM)) Σ_{q=1}^{Q} Σ_{n=1}^{N} Σ_{m=1}^{M} L_SSIM(E_qnm, G_qnm);
wherein L represents the total loss, L_MSE(E, G) represents the MSE loss, L_MSSSIM(E, G) represents the multi-scale SSIM loss, E represents the distance conversion map of the training crowd images, G represents the label distance conversion map of the training crowd images, γ is an adjustable parameter, Q represents the number of training crowd images, E_q represents the distance conversion map of the q-th training crowd image, G_q represents the label distance conversion map of the q-th training crowd image, L_SSIM(E_qnm, G_qnm) represents the SSIM loss, N represents the number of individuals in a single training crowd image, M represents the number of windows selected to perform the SSIM loss, E_qnm represents the distance conversion map under the m-th window of the n-th person in the q-th training crowd image, and G_qnm represents the label distance conversion map under the m-th window of the n-th person in the q-th training crowd image.
9. The method of claim 8, wherein the SSIM loss L_SSIM is expressed as:

L_SSIM(E, G) = 1 − (2μ_E μ_G + φ_1)(2σ_EG + φ_2) / ((μ_E² + μ_G² + φ_1)(σ_E² + σ_G² + φ_2));
wherein μ_E represents the mean of the distance conversion map of the training crowd image, μ_G represents the mean of the label distance conversion map of the training crowd image, σ_E² represents the variance of the distance conversion map of the training crowd image, σ_G² represents the variance of the label distance conversion map of the training crowd image, σ_EG represents the covariance between the distance conversion map and the label distance conversion map of the training crowd image, and φ_1 and φ_2 are constants.
10. The method of claim 1, wherein the performing post-processing on the distance conversion map to obtain a positioning result of the input crowd image comprises:
step S61, obtaining all local maximum points in the distance conversion map through 3×3 maximum pooling;
step S62, a first threshold and a second threshold are set, wherein the first threshold is greater than the second threshold; the local maxima in the distance conversion map are compared with the first threshold and the second threshold, a point whose local maximum is larger than the first threshold is confirmed to be an individual head point, and if the global maximum of the distance conversion map is smaller than the second threshold, it is confirmed that no person exists in the input crowd image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311074895.9A CN116805337B (en) | 2023-08-25 | 2023-08-25 | Crowd positioning method based on trans-scale visual transformation network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116805337A CN116805337A (en) | 2023-09-26 |
CN116805337B true CN116805337B (en) | 2023-10-27 |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804992A (en) * | 2017-05-08 | 2018-11-13 | 电子科技大学 | A kind of Demographics' method based on deep learning |
CN111242036A (en) * | 2020-01-14 | 2020-06-05 | 西安建筑科技大学 | Crowd counting method based on encoding-decoding structure multi-scale convolutional neural network |
CN113139489A (en) * | 2021-04-30 | 2021-07-20 | 广州大学 | Crowd counting method and system based on background extraction and multi-scale fusion network |
CN114120361A (en) * | 2021-11-19 | 2022-03-01 | 西南交通大学 | Crowd counting and positioning method based on coding and decoding structure |
CN114445765A (en) * | 2021-12-23 | 2022-05-06 | 上海师范大学 | Crowd counting and density estimating method based on coding and decoding structure |
CN115311508A (en) * | 2022-08-09 | 2022-11-08 | 北京邮电大学 | Single-frame image infrared dim target detection method based on depth U-type network |
CN116091764A (en) * | 2022-12-28 | 2023-05-09 | 天津师范大学 | Cloud image segmentation method based on fusion transformation network |
CN116246305A (en) * | 2023-01-31 | 2023-06-09 | 天津师范大学 | Pedestrian retrieval method based on hybrid component transformation network |
Also Published As
Publication number | Publication date |
---|---|
CN116805337A (en) | 2023-09-26 |
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||