CN116805337B - Crowd positioning method based on trans-scale visual transformation network - Google Patents

Crowd positioning method based on trans-scale visual transformation network

Info

Publication number
CN116805337B
Authority
CN
China
Prior art keywords
scale
training
module
crowd image
crowd
Prior art date
Legal status
Active
Application number
CN202311074895.9A
Other languages
Chinese (zh)
Other versions
CN116805337A (en)
Inventor
张重
连宇
刘爽
郭蓬
高嵩
Current Assignee
Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Tianjin Normal University
CATARC Tianjin Automotive Engineering Research Institute Co Ltd
Original Assignee
Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Tianjin Normal University
CATARC Tianjin Automotive Engineering Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by Guoqi Beijing Intelligent Network Association Automotive Research Institute Co Ltd, Tianjin Normal University, and CATARC Tianjin Automotive Engineering Research Institute Co Ltd
Priority to CN202311074895.9A
Publication of CN116805337A
Application granted
Publication of CN116805337B
Legal status: Active

Classifications

    • G06T 7/75: Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06V 10/454: Local feature extraction using biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks


Abstract

The embodiment of the invention discloses a crowd positioning method based on a trans-scale visual transformation network, which comprises the following steps: constructing a feature extraction module to obtain a multi-scale feature map of a training crowd image; constructing a multi-scale coding fusion module to obtain a multi-scale coding sequence of the training crowd image; constructing a cross window transformation network module to obtain a multi-scale long-distance dependency relationship sequence of the training crowd image; constructing a multi-scale decoding fusion module to obtain a multi-scale decoding feature map and a distance conversion map of the training crowd image; constructing a loss calculation module to obtain an optimal crowd image trans-scale visual transformation network positioning model; and obtaining the positioning result of an input crowd image using the optimal crowd image trans-scale visual transformation network positioning model. By combining a convolutional neural network with the cross window transformation network module, the invention learns multi-scale feature information and long-distance dependency relationships from the crowd image simultaneously, thereby further improving the positioning accuracy for crowd images.

Description

Crowd positioning method based on trans-scale visual transformation network
Technical Field
The invention belongs to the fields of digital image processing, computer vision, pattern recognition and artificial intelligence, and particularly relates to a crowd positioning method based on a trans-scale vision transformation network.
Background
Crowd analysis can help prevent crowd gathering and trampling accidents and has great potential for improving public and traffic safety. Crowd localization is a critical task in crowd analysis that aims to predict the location of each individual while estimating the total number of individuals in the crowd. Compared with crowd counting, which only estimates the total number of individuals, crowd localization provides detailed information about the spatial distribution of the crowd and can support effective crowd management and emergency response. Crowd positioning faces significant challenges such as illumination, occlusion, and perspective effects, and many approaches have been proposed to overcome them. These methods fall into three main categories: detection-based methods, regression-based methods, and map-based methods.
Most detection-based methods utilize point-level annotations to generate pseudo-bounding boxes. Liu et al. initialize the size of each pseudo-bounding box with the nearest-neighbor distance between head center points and adjust the pseudo-bounding boxes through iterative updates to train a reliable target detector. Considering that the size of a human head is related to its distance from the camera, Lian et al. predict the size of the pseudo-bounding box using depth information. However, in very dense scenes these approaches do not perform well on the crowd-localization task because of occlusion and blurring.
Regression-based methods directly regress the coordinates of head points and output a confidence score. Song et al. propose a point-based counting and localization framework that directly predicts a set of candidate points representing heads based on predefined anchor points. Liang et al. propose an end-to-end crowd-localization model that uses trainable query instances rather than a large number of predefined anchor points. However, these methods lack correlation information between head points and other pixels, so their localization is not accurate enough.
Map-based approaches generate a trainable map that reflects the relationship between head points and neighboring pixels to guide model training. Idrees et al. use a density map for head localization, where the local maxima of the density map are the head point locations. Abousamra et al. propose a topological approach that slightly expands each head point into a point mask and uses the point mask map as supervision. Xu et al. generate a distance label map from the distances between head points, which avoids the problem of head overlap in dense areas. Liang et al. propose an inverse focal distance map that better represents the correlation between head points and other pixels. These methods make full use of spatial information but do not take complete multi-scale information into account.
In the field of crowd analysis, some methods implement crowd counting and localization with transformation networks. For example, Gao et al. propose a dilated-convolution shifted-window transformation network for crowd localization, which learns feature maps using a shifted-window transformation network and a feature pyramid network. Lin et al. combine global attention and local attention for crowd counting within a transformation network framework. However, these methods do not consider complete multi-scale information in the learning process. In contrast, the method of the present disclosure considers multi-scale information in the encoding stage, the decoding stage and the loss function.
Disclosure of Invention
The technical problem to be solved by the invention is that the multi-scale variation of head sizes in crowd images has a great influence on the positioning result; the invention therefore provides a crowd positioning method based on a trans-scale visual transformation network.
To this end, the crowd positioning method based on a trans-scale visual transformation network provided by the invention comprises the following steps:
Step S1: a feature extraction module is constructed using a pre-trained deep learning model; label processing is performed on the training crowd images to obtain the label distance conversion maps corresponding to the training crowd images; and the training crowd images are input into the feature extraction module to obtain the multi-scale feature maps of the training crowd images;
Step S2: a multi-scale coding fusion module is constructed; the multi-scale feature maps of the training crowd images are input into the multi-scale coding fusion module, and the multi-scale feature maps are fused by the multi-scale coding fusion module to obtain the multi-scale coding sequences of the training crowd images;
Step S3: a cross window transformation network module is constructed; the multi-scale coding sequences of the training crowd images are input into the cross window transformation network module, which learns long-distance dependency relationships of different scales to obtain the multi-scale long-distance dependency relationship sequences of the training crowd images;
Step S4: a multi-scale decoding fusion module is constructed; the multi-scale long-distance dependency relationship sequences of the training crowd images are input into the multi-scale decoding fusion module, and the multi-scale long-distance dependency relationship sequences are fused by the multi-scale decoding fusion module to obtain the multi-scale decoding feature maps and the distance conversion maps of the training crowd images;
Step S5: the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module are connected in sequence to form a crowd image trans-scale visual transformation network positioning model; a loss calculation module is constructed; the distance conversion map of the training crowd image and the label distance conversion map of the training crowd image are input into the loss calculation module, and the crowd image trans-scale visual transformation network positioning model is optimized using the obtained loss value to obtain an optimal crowd image trans-scale visual transformation network positioning model;
Step S6: in the testing stage, the distance conversion map of the input crowd image is calculated using the optimal crowd image trans-scale visual transformation network positioning model, and post-processing is performed on the distance conversion map to obtain the positioning result of the input crowd image.
Optionally, the step S1 includes the steps of:
Step S11: VGG-16 is determined as the pre-trained deep learning model, parameters of the pre-trained deep learning model are initialized, and the final global pooling layer and fully connected layers in the pre-trained deep learning model are removed to obtain the feature extraction module;
Step S12: preprocessing and label processing are performed on the training crowd images to obtain the label distance conversion maps corresponding to the training crowd images;
Step S13: the preprocessed training crowd images are input into the feature extraction module to obtain the multi-scale feature maps of the training crowd images.
Optionally, the label distance conversion map corresponding to the training crowd image is expressed as:
wherein F(x, y) represents the label distance conversion map obtained after label processing of the training crowd image, (x, y) represents the pixel coordinates of the training crowd image, P(x, y) represents the pixel value of the training crowd image, α and β are adjustable parameters, and C is a constant.
Optionally, the step S2 includes the steps of:
Step S21: the multi-scale feature maps of the training crowd images are fused with the shallower multi-scale feature maps at the corresponding positions and converted into multi-scale sequences;
Step S22: dimension reduction is performed on the multi-scale feature maps to convert them into feature vectors, and the feature vectors are added to and fused with the multi-scale sequences to obtain the multi-scale coding sequences of the training crowd images.
Optionally, the step S3 includes the steps of:
Step S31: a cross window basic unit is constructed based on a layer normalization module, a cross window multi-head self-attention module, a multi-layer perceptron module and a residual structure, wherein the layer normalization module is used to normalize the data distribution, the cross window multi-head self-attention module is used to learn global dependency relationships, the multi-layer perceptron module is used to reduce the number of parameters, and the residual structure acts on the outputs of the cross window multi-head self-attention module and the multi-layer perceptron module to alleviate gradient vanishing or explosion;
Step S32: B cross window basic units are connected in parallel to construct the cross window transformation network module;
Step S33: the multi-scale coding sequences are input into the cross window transformation network module, which builds long-distance dependency relationships of different scales on the multi-scale coding sequences to obtain the multi-scale long-distance dependency relationship sequences of the training crowd images.
Optionally, the multi-scale long-distance dependency relationship sequence is expressed as:
L_i' = MLP(LN(S_i)) + S_i
S_i = CSWin(LN(L_i)) + L_i
wherein L_i' represents the multi-scale long-distance dependency relationship sequence, L_i represents the multi-scale coding sequence, LN represents the layer normalization module, MLP represents the multi-layer perceptron module, and CSWin represents the cross window multi-head self-attention module.
Optionally, the step S4 includes the steps of:
Step S41: the multi-scale long-distance dependency relationship sequences of the training crowd images are fused with the deeper multi-scale long-distance dependency relationship sequences and converted into decoding feature maps;
Step S42: the decoding feature maps are fused with the decoding feature maps of the next deeper layer to obtain the multi-scale decoding feature maps of the training crowd images, and a 1×1 convolution and 2× up-sampling are performed on the multi-scale decoding feature maps to obtain the distance conversion maps of the training crowd images.
Optionally, the loss function in the loss calculation module is expressed as:
L = L_MSE(E, G) + γ·L_MSSSIM(E, G)
wherein L represents the total loss, L_MSE(E, G) represents the MSE loss, L_MSSSIM(E, G) represents the multi-scale SSIM loss, E represents the distance conversion map of the training crowd image, G represents the label distance conversion map of the training crowd image, γ is an adjustable parameter, Q represents the number of training crowd images, E_q represents the distance conversion map of the q-th training crowd image, G_q represents the label distance conversion map of the q-th training crowd image, L_SSIM(E_qnm, G_qnm) represents the SSIM loss, N represents the number of individuals in a single training crowd image, M represents the number of windows selected for computing the SSIM loss, E_qnm represents the distance conversion map in the m-th window of the n-th person in the q-th training crowd image, and G_qnm represents the label distance conversion map in the m-th window of the n-th person in the q-th training crowd image.
Optionally, the SSIM loss L_SSIM is expressed as:
wherein μ_E represents the mean of the distance conversion map of the training crowd image, μ_G represents the mean of the label distance conversion map of the training crowd image, σ_E represents the variance of the distance conversion map of the training crowd image, σ_G represents the variance of the label distance conversion map of the training crowd image, σ_EG represents the covariance between the distance conversion map and the label distance conversion map of the training crowd image, and φ_1 and φ_2 are constants.
Optionally, the step of obtaining the positioning result of the input crowd image after performing post-processing on the distance conversion map includes:
Step S61: all local maximum points in the distance conversion map are obtained through 3×3 max pooling;
Step S62: a first threshold and a second threshold are set, the first threshold being greater than the second threshold; the local maxima in the distance conversion map are compared with the first threshold and the second threshold; points whose local maximum is greater than the first threshold are identified as individual head points; and if the global maximum of the distance conversion map is less than the second threshold, it is confirmed that there is no person in the input crowd image.
The beneficial effects of the invention are as follows: a multi-scale feature map of the crowd image is extracted by a convolutional neural network; the multi-scale feature map is fused by the multi-scale coding fusion module to obtain a multi-scale coding sequence; long-distance dependency relationships are then modeled on the multi-scale coding sequence by the cross window transformation network module to obtain a multi-scale long-distance dependency relationship sequence; and finally the multi-scale long-distance dependency relationship sequence is fused by the multi-scale decoding fusion module. This improves the representation capability of the multi-scale feature map and the accuracy of crowd image positioning.
Drawings
FIG. 1 is a flowchart of a crowd positioning method based on a trans-scale visual transformation network according to one embodiment of the invention.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
Fig. 1 is a flowchart of a crowd positioning method based on a trans-scale visual transformation network according to an embodiment of the present invention. Several implementations of the invention are described below with reference to Fig. 1. As shown in Fig. 1, the crowd positioning method based on the trans-scale visual transformation network includes the following steps:
Step S1: a feature extraction module is constructed using a pre-trained deep learning model; label processing is performed on the training crowd images to obtain the label distance conversion maps corresponding to the training crowd images; and the training crowd images are input into the feature extraction module to obtain the multi-scale feature maps F_i of the training crowd images.
Further, the step S1 includes the steps of:
Step S11: a pre-trained deep learning model is determined, its parameters are initialized, and the final global pooling layer and fully connected layers in the pre-trained deep learning model are removed to obtain the feature extraction module;
In an embodiment of the present invention, the pre-trained deep learning model may be VGG-16. After parameter initialization, the model components before the last global pooling layer are retained; that is, the last global pooling layer and the fully connected layers of the pre-trained deep learning model are removed, and the remaining part forms the feature extraction module. In an embodiment of the present invention, the feature extraction module consists of 4 modules, which may be named Stage1, Stage2, Stage3 and Stage4, and the feature extraction module generates feature maps of different scales from Stage1, Stage2, Stage3 and Stage4, respectively.
Step S12: the training crowd images in the training set are preprocessed and label-processed to obtain the label distance conversion maps corresponding to the training crowd images;
In an embodiment of the present invention, preprocessing the training crowd image may include: randomly flipping the training crowd image horizontally with a probability of 0.5; scaling all pixels of the training crowd image into a preset range, for example between 0 and 1; subtracting the pixel mean of the training crowd image from each pixel value and dividing the result by the pixel variance of the training crowd image; and finally cropping the training crowd image to a fixed size H×W, where H is the height and W is the width of the cropped training crowd image. In one embodiment of the invention, H = 256 and W = 256.
In an embodiment of the present invention, the label distance conversion map corresponding to the training crowd image, obtained after label processing of the training crowd image, may be expressed as:
wherein F(x, y) represents the label distance conversion map obtained after label processing of the training crowd image, (x, y) represents the pixel coordinates of the training crowd image, P(x, y) represents the pixel value of the training crowd image, α and β are adjustable parameters, and C is a constant.
In one embodiment of the present invention, α = 0.02, β = 0.75 and C = 1.
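The closed form of the label distance conversion map did not survive extraction above, but the parameters α = 0.02, β = 0.75 and C = 1 match the inverse focal distance style of label cited in the background. The sketch below therefore assumes a map of the form F(x, y) = 1 / (D(x, y)^(α·D(x, y) + β) + C), with D(x, y) the distance from pixel (x, y) to the nearest annotated head point; treat this as an illustrative assumption, not the patent's exact formula.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def label_distance_map(head_points, h, w, alpha=0.02, beta=0.75, c=1.0):
    """Hypothetical inverse-focal-distance style label map (assumed form, see text above).

    head_points: iterable of (row, col) head annotations; assumes at least one point.
    Returns an (h, w) float32 map that peaks at 1 / c on head points and decays with distance.
    """
    mask = np.ones((h, w), dtype=bool)            # True where there is NO head point
    for r, col in head_points:
        mask[int(r), int(col)] = False
    d = distance_transform_edt(mask)              # Euclidean distance to the nearest head point
    return (1.0 / (np.power(d, alpha * d + beta) + c)).astype(np.float32)
```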
Step S13: the training crowd images obtained after preprocessing are input into the feature extraction module to obtain the multi-scale feature maps F_i of the training crowd images.
In one embodiment of the present invention, C_i is the number of channels of the i-th-scale feature map of the training crowd image; for example, C_1 = 128, C_2 = 256, C_3 = 512 and C_4 = 512 may be set, in which case F_1, F_2, F_3 and F_4 are output by Stage1, Stage2, Stage3 and Stage4 of the feature extraction module, respectively.
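A minimal sketch of such a feature extraction module is given below, slicing the torchvision VGG-16 so that the four stage outputs carry 128, 256, 512 and 512 channels as stated above; the exact slice boundaries are an assumption of this illustration.

```python
import torch.nn as nn
from torchvision.models import vgg16

class FeatureExtractor(nn.Module):
    """Truncated VGG-16 returning four feature maps F1..F4 (128/256/512/512 channels)."""

    def __init__(self, pretrained=True):
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1" if pretrained else None).features
        # Assumed stage split: slice the 31-layer `features` stack at pooling boundaries
        # so the stage outputs carry 128, 256, 512 and 512 channels respectively.
        self.stage1 = features[:10]    # conv blocks 1-2  -> 128 channels
        self.stage2 = features[10:17]  # conv block 3     -> 256 channels
        self.stage3 = features[17:24]  # conv block 4     -> 512 channels
        self.stage4 = features[24:31]  # conv block 5     -> 512 channels

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        return f1, f2, f3, f4
```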
Step S2: a multi-scale coding fusion module is constructed; the multi-scale feature maps F_i of the training crowd images are input into the multi-scale coding fusion module, and the multi-scale feature maps are fused by the multi-scale coding fusion module to obtain the multi-scale coding sequences L_i of the training crowd images.
Further, the step S2 includes the steps of:
Step S21: the multi-scale feature map F_i of the training crowd image is fused with the shallower multi-scale feature maps F_j (0 < j < i) at the corresponding positions and converted into a multi-scale sequence G_i.
In one embodiment of the invention, step S21 is illustrated by the generation of G_4 from F_4. Because convolution is involved in extracting the multi-scale feature maps, a 1×1 region of the multi-scale feature map F_4 corresponds to a 2^(4-i) × 2^(4-i) region of the multi-scale feature map F_i (i = 1, 2, 3). First, the multi-scale feature map F_4 is converted into a sequence D_4 composed of C_4 feature vectors, and the remaining shallower multi-scale feature maps F_i (i = 1, 2, 3) are likewise converted into sequences D_i, where N_i = 2^(4-i) × 2^(4-i). A linear layer is then used to convert each sequence D_i into a sequence of matching size. Finally, the sequences D_i (i = 1, 2, 3, 4) are combined to obtain the multi-scale sequence G_4. G_1, G_2 and G_3 can be obtained in the same way.
Step S22: dimension reduction is performed on the multi-scale feature map F_i to convert it into feature vectors, and the feature vectors are added to and fused with the multi-scale sequence G_i to obtain the multi-scale coding sequence L_i of the training crowd image.
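The fusion arithmetic above is only partially legible in the source text, so the following PyTorch sketch shows one plausible realization rather than the disclosed one: for each scale i, every feature map up to F_i is pooled onto F_i's spatial grid, flattened into a token sequence, linearly projected to a shared width, and summed. The pooling choice, projection width and summation are assumptions of this illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEncodingFusion(nn.Module):
    """Illustrative fusion of the four stage outputs into one coding sequence per scale."""

    def __init__(self, channels=(128, 256, 512, 512), embed_dim=512):
        super().__init__()
        # One linear projection per scale so every flattened map lands in a common width.
        self.proj = nn.ModuleList([nn.Linear(c, embed_dim) for c in channels])

    def forward(self, feats):
        # feats: (F1, F2, F3, F4), F_j of shape (B, C_j, H_j, W_j), from shallow to deep.
        seqs = []
        for i, f_i in enumerate(feats):
            b, _, h, w = f_i.shape
            fused = 0
            for j in range(i + 1):                                  # F_i plus all shallower maps F_j
                pooled = F.adaptive_avg_pool2d(feats[j], (h, w))    # align to F_i's spatial grid
                tokens = pooled.flatten(2).transpose(1, 2)          # (B, h*w, C_j) token sequence
                fused = fused + self.proj[j](tokens)                # project to a shared width and sum
            seqs.append(fused)                                      # coding sequence L_i
        return seqs
```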
Step S3: a cross window transformation network module is constructed; the multi-scale coding sequences L_i of the training crowd images are input into the cross window transformation network module, which learns long-distance dependency relationships of different scales to obtain the multi-scale long-distance dependency relationship sequences L_i' of the training crowd images.
Further, the step S3 includes the steps of:
Step S31: a cross window basic unit is constructed based on a layer normalization module, a cross window multi-head self-attention module, a multi-layer perceptron module and a residual structure.
In this step, the layer normalization module is first used to normalize the data distribution and accelerate model convergence, the cross window multi-head self-attention module is then used to learn global dependency relationships, and the multi-layer perceptron module is finally used to reduce the number of parameters; in addition, a residual structure can be used to alleviate gradient vanishing or explosion, for example by acting on the outputs of the cross window multi-head self-attention module and the multi-layer perceptron module. The layer normalization module, the residual structure and the multi-layer perceptron module are common computing modules in the art and are not described in detail in this disclosure.
Step S32: B cross window basic units are used to construct the cross window transformation network module.
Here the number B of cross window basic units corresponds to the number of scales of the multi-scale coding sequences L_i; in one embodiment of the invention, the number of scales of L_i is 4, so B = 4. The B cross window basic units are connected in parallel to obtain the cross window transformation network module, and each cross window basic unit corresponds to the sequence L_i at one scale of the training crowd image.
Step S33: the multi-scale coding sequences L_i are input into the cross window transformation network module, which builds long-distance dependency relationships of different scales on the multi-scale coding sequences L_i to obtain the multi-scale long-distance dependency relationship sequences L_i' of the training crowd image.
In one embodiment of the invention, the multi-scale long-distance dependency relationship sequence L_i' can be expressed as:
L_i' = MLP(LN(S_i)) + S_i
S_i = CSWin(LN(L_i)) + L_i
wherein L_i' represents the multi-scale long-distance dependency relationship sequence, L_i represents the multi-scale coding sequence, LN represents the layer normalization module, MLP represents the multi-layer perceptron module, and CSWin represents the cross window multi-head self-attention module.
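By way of illustration, a skeletal PyTorch version of the cross window basic unit defined by the two equations above is given below. The cross window multi-head self-attention itself is stood in for by a generic nn.MultiheadAttention layer purely so that the block runs end to end; the stripe-based attention described next is what would replace it, and the embedding width of 512 is an assumption.

```python
import torch.nn as nn

class CrossWindowBasicUnit(nn.Module):
    """LN -> attention -> residual, then LN -> MLP -> residual,
    i.e. S_i = Attn(LN(L_i)) + L_i and L_i' = MLP(LN(S_i)) + S_i."""

    def __init__(self, dim=512, num_heads=2, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Placeholder for the cross window multi-head self-attention (two heads, per K = 2 below).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, l_i):                                      # l_i: (B, N, dim) coding sequence
        x = self.norm1(l_i)
        s_i = self.attn(x, x, x, need_weights=False)[0] + l_i    # first residual branch
        l_i_prime = self.mlp(self.norm2(s_i)) + s_i              # second residual branch
        return l_i_prime
```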
In one embodiment of the present invention, the cross window multi-head self-attention module is composed of a horizontal stripe self-attention module and a vertical stripe self-attention module, and for the i-th-scale multi-scale coding sequence L_i the output of the cross window multi-head self-attention module is expressed as:
CSWin(L_i) = Concat(head_1, …, head_k, …, head_K) W_0
wherein CSWin represents the cross window multi-head self-attention module, Concat represents the concatenation function, K represents the number of heads in the cross window multi-head self-attention module, head_k represents the cross window self-attention of the k-th head, W_0 represents the mapping matrix, H_Att-k(L_i) represents horizontal stripe self-attention, and V_Att-k(L_i) represents vertical stripe self-attention.
In one embodiment of the present invention, K = 2.
In one embodiment of the present invention, the horizontal stripe self-attention output of the i-th-scale multi-scale coding sequence L_i is expressed as:
wherein Q_ik^a = (L_i^a)^T W_ik^Q, K_ik^a = (L_i^a)^T W_ik^K and V_ik^a = (L_i^a)^T W_ik^V are the query, key and value respectively, L_i^a is the a-th horizontal stripe sequence in L_i, A is the number of horizontal stripe sequences in L_i, W_ik^Q, W_ik^K and W_ik^V are the mapping matrices of the query, key and value respectively, C_i is the number of channels of the i-th-scale feature map of the training crowd image, and d_k is the dimension of the k-th head in the cross window multi-head self-attention. The vertical stripe self-attention V_Att-k(L_i) is obtained in the same way.
Step S4: a multi-scale decoding fusion module is constructed; the multi-scale long-distance dependency relationship sequences L_i' of the training crowd image are input into the multi-scale decoding fusion module and fused by the multi-scale decoding fusion module to obtain the multi-scale decoding feature maps of the training crowd image and the distance conversion map E.
Further, the step S4 includes the steps of:
Step S41: the multi-scale long-distance dependency relationship sequence L_i' of the training crowd image is fused with the deeper multi-scale long-distance dependency relationship sequences L_j' (j > i > 0) and converted into a decoding feature map F_i'.
In one embodiment of the invention, step S41 is illustrated by the generation of F_1'. First, the multi-scale long-distance dependency relationship sequences L_i' of the training crowd image are transformed into feature maps Z_i (i = 1, 2, 3, 4) of the same size as the corresponding multi-scale feature maps F_i. The feature maps Z_i (i = 1, 2, 3, 4) are then up-sampled to the same size as the multi-scale feature map F_1 and added together to obtain the decoding feature map F_1'. The decoding feature maps F_i' (i = 2, 3, 4) can be obtained in the same way.
Step S42: the decoding feature map F_i' is fused with the decoding feature map F_{i+1}' of the next deeper layer to obtain the multi-scale decoding feature maps of the training crowd image, and a 1×1 convolution and 2× up-sampling are performed on the multi-scale decoding feature map to obtain the distance conversion map E of the training crowd image.
In one embodiment of the invention, step S42 is illustrated by the generation of the multi-scale decoding feature maps. First, a 1×1 convolution is performed on the decoding feature map F_4' to generate a multi-scale decoding feature map. This multi-scale decoding feature map is then up-sampled by a factor of two and added to the decoding feature map F_3', after which a 1×1 convolution is performed to generate the next multi-scale decoding feature map. The remaining multi-scale decoding feature maps are obtained in the same way.
In one embodiment of the invention, a 1×1 convolution and 2× up-sampling are performed on the multi-scale decoding feature map to obtain the distance conversion map E of the training crowd image.
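A compact sketch of this top-down decoding fusion is given below, assuming FPN-style 1×1 lateral convolutions, bilinear 2× up-sampling and illustrative channel widths; it is a sketch of the scheme described above, not the exact disclosed architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDecodingFusion(nn.Module):
    """Top-down fusion of the decoding feature maps F1'..F4' into a distance conversion map."""

    def __init__(self, channels=(128, 256, 512, 512), mid=128):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, mid, kernel_size=1) for c in channels])
        self.head = nn.Conv2d(mid, 1, kernel_size=1)   # final 1x1 conv producing the distance map

    def forward(self, decoded):
        # decoded: [F1', F2', F3', F4'] with decreasing resolution; start from the deepest map.
        x = self.lateral[3](decoded[3])
        for i in (2, 1, 0):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = x + self.lateral[i](decoded[i])        # fuse with the next shallower decoded map
        # Final 1x1 convolution and 2x up-sampling to obtain the distance conversion map E.
        e = F.interpolate(self.head(x), scale_factor=2, mode="bilinear", align_corners=False)
        return e
```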
Step S5: the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module are connected in sequence to form a crowd image trans-scale visual transformation network positioning model; a loss calculation module is constructed; the distance conversion map E of the training crowd image and the label distance conversion map of the training crowd image are input into the loss calculation module, and the crowd image trans-scale visual transformation network positioning model is optimized using the obtained loss value to obtain an optimal crowd image trans-scale visual transformation network positioning model.
Further, the step S5 includes the steps of:
step S51, sequentially connecting the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module to form a crowd image trans-scale visual transformation network positioning model;
step S52, constructing a loss calculation module, and inputting a distance conversion diagram of the training crowd image and a label distance conversion diagram of the training crowd image into the loss calculation module;
in one embodiment of the present invention, when training is performed in the UCF-QNRF database, the Loss function Loss of the constructed Loss calculation module is expressed as:
L = L_MSE(E, G) + γ·L_MSSSIM(E, G)
wherein L represents the total loss, L_MSE(E, G) represents the MSE loss, L_MSSSIM(E, G) represents the multi-scale SSIM loss, E represents the distance conversion map of the training crowd image, G represents the label distance conversion map of the training crowd image, γ is an adjustable parameter, Q represents the number of training crowd images, E_q represents the distance conversion map of the q-th training crowd image, G_q represents the label distance conversion map of the q-th training crowd image, L_SSIM(E_qnm, G_qnm) represents the SSIM loss, N represents the number of individuals in a single training crowd image, M represents the number of windows selected for computing the SSIM loss, E_qnm represents the distance conversion map in the m-th window of the n-th person in the q-th training crowd image, and G_qnm represents the label distance conversion map in the m-th window of the n-th person in the q-th training crowd image.
In one embodiment of the invention, the SSIM loss L_SSIM can be expressed as:
wherein E represents the distance conversion map of the training crowd image, G represents the label distance conversion map of the training crowd image, μ_E represents the mean of the distance conversion map of the training crowd image, μ_G represents the mean of the label distance conversion map of the training crowd image, σ_E represents the variance of the distance conversion map of the training crowd image, σ_G represents the variance of the label distance conversion map of the training crowd image, σ_EG represents the covariance between the distance conversion map and the label distance conversion map of the training crowd image, and φ_1 and φ_2 are constants.
In one embodiment of the present invention, γ = 0.1, φ_1 = 1×10^-4 and φ_2 = 9×10^-4.
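A hedged sketch of this objective follows: the MSE term plus γ times an SSIM-based term, with the SSIM written in its standard windowed form using φ1 and φ2 as stabilizing constants. The per-person window selection described above is not reproduced; a uniform sliding window over the whole map and the convention "loss = 1 - SSIM" are assumptions of this illustration.

```python
import torch.nn.functional as F

def ssim_map(e, g, win=7, phi1=1e-4, phi2=9e-4):
    """Standard windowed SSIM between predicted map e and label map g, both (B, 1, H, W)."""
    mu_e = F.avg_pool2d(e, win, stride=1, padding=win // 2)
    mu_g = F.avg_pool2d(g, win, stride=1, padding=win // 2)
    var_e = F.avg_pool2d(e * e, win, stride=1, padding=win // 2) - mu_e ** 2
    var_g = F.avg_pool2d(g * g, win, stride=1, padding=win // 2) - mu_g ** 2
    cov = F.avg_pool2d(e * g, win, stride=1, padding=win // 2) - mu_e * mu_g
    return ((2 * mu_e * mu_g + phi1) * (2 * cov + phi2)) / (
        (mu_e ** 2 + mu_g ** 2 + phi1) * (var_e + var_g + phi2)
    )

def total_loss(e, g, gamma=0.1, windows=(3, 7, 11)):
    """L = L_MSE(E, G) + gamma * L_MSSSIM(E, G); several window sizes stand in for 'multi-scale'."""
    mse = F.mse_loss(e, g)
    msssim = sum(1.0 - ssim_map(e, g, win=w).mean() for w in windows) / len(windows)
    return mse + gamma * msssim
```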
Step S53: the crowd image trans-scale visual transformation network positioning model is optimized using the obtained loss value to obtain the optimal crowd image trans-scale visual transformation network positioning model.
In this step, iterative optimization can be performed by stochastic gradient descent to optimize the crowd image trans-scale visual transformation network positioning model.
Step S6: in the testing stage, the distance conversion map of the input crowd image is calculated using the optimal crowd image trans-scale visual transformation network positioning model, and post-processing is performed on the distance conversion map to obtain the positioning result of the input crowd image.
In an embodiment of the present invention, obtaining the positioning result of the input crowd image after performing post-processing on the distance conversion map may include the following steps:
Step S61: all local maximum points in the distance conversion map are obtained through 3×3 max pooling;
Step S62: two thresholds are set, namely a first threshold T_max and a second threshold T_min with T_max > T_min; the local maxima in the distance conversion map are compared with the first and second thresholds, and points whose local maximum is greater than the first threshold T_max are identified as individual head points; if the global maximum of the distance conversion map is less than the second threshold T_min, it is confirmed that there is no person in the input crowd image.
In one embodiment of the invention, T_max is 110/255 of the global maximum of the distance conversion map, and T_min = 0.1.
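A direct sketch of this post-processing in PyTorch: local maxima are found by comparing the distance conversion map with its 3×3 max-pooled copy, the first threshold T_max is taken as 110/255 of the global maximum and the second threshold T_min = 0.1, as stated above.

```python
import torch
import torch.nn.functional as F

def locate_heads(dist_map, t_min=0.1, t_max_ratio=110.0 / 255.0):
    """Extract head coordinates from a (1, 1, H, W) distance conversion map."""
    if dist_map.max() < t_min:                 # global maximum below T_min: no person present
        return torch.empty((0, 2), dtype=torch.long)
    # A pixel is a local maximum if it equals its 3x3 max-pooled neighbourhood.
    pooled = F.max_pool2d(dist_map, kernel_size=3, stride=1, padding=1)
    local_max = (dist_map == pooled)
    t_max = t_max_ratio * dist_map.max()       # first threshold, relative to the global maximum
    keep = local_max & (dist_map > t_max)
    ys, xs = torch.nonzero(keep[0, 0], as_tuple=True)
    return torch.stack([ys, xs], dim=1)        # (num_heads, 2) head point coordinates
```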
When tested on publicly available large crowd image databases, for example the UCF-QNRF database, the crowd image positioning of the invention reaches 85.6% Average Precision, 80.6% Average Recall and 83.1% Average F1-measure. The method learns the multi-scale feature maps of the crowd image through a convolutional neural network, fuses the multi-scale feature maps with the multi-scale coding fusion module to generate the multi-scale coding sequences, models long-distance dependency relationships on the multi-scale coding sequences with the cross window transformation network module to generate the multi-scale long-distance dependency relationship sequences, and finally fuses the multi-scale long-distance dependency relationship sequences with the multi-scale decoding fusion module, so that multi-scale feature information and long-distance dependency relationships are learned simultaneously; this greatly improves the accuracy of crowd image positioning and demonstrates the effectiveness of the method.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of the principles of the present invention and are in no way limiting. Accordingly, any modification, equivalent replacement or improvement made without departing from the spirit and scope of the present invention should be included in the scope of the present invention. Furthermore, the appended claims are intended to cover all such changes and modifications that fall within the scope and boundary of the appended claims, or equivalents of such scope and boundary.

Claims (10)

1. A crowd positioning method based on a trans-scale visual transformation network, the method comprising the steps of:
step S1: constructing a feature extraction module using a pre-trained deep learning model, performing label processing on training crowd images to obtain label distance conversion maps corresponding to the training crowd images, and inputting the training crowd images into the feature extraction module to obtain multi-scale feature maps of the training crowd images;
step S2: constructing a multi-scale coding fusion module, inputting the multi-scale feature maps of the training crowd images into the multi-scale coding fusion module, and fusing the multi-scale feature maps by the multi-scale coding fusion module to obtain multi-scale coding sequences of the training crowd images;
step S3: constructing a cross window transformation network module, inputting the multi-scale coding sequences of the training crowd images into the cross window transformation network module, and learning long-distance dependency relationships of different scales by the cross window transformation network module to obtain multi-scale long-distance dependency relationship sequences of the training crowd images;
step S4: constructing a multi-scale decoding fusion module, inputting the multi-scale long-distance dependency relationship sequences of the training crowd images into the multi-scale decoding fusion module, and fusing the multi-scale long-distance dependency relationship sequences by the multi-scale decoding fusion module to obtain multi-scale decoding feature maps and distance conversion maps of the training crowd images;
step S5: sequentially connecting the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module to form a crowd image trans-scale visual transformation network positioning model, constructing a loss calculation module, inputting the distance conversion map of the training crowd image and the label distance conversion map of the training crowd image into the loss calculation module, and optimizing the crowd image trans-scale visual transformation network positioning model using the obtained loss value to obtain an optimal crowd image trans-scale visual transformation network positioning model; and
step S6: in a testing stage, calculating a distance conversion map of an input crowd image using the optimal crowd image trans-scale visual transformation network positioning model, and performing post-processing on the distance conversion map to obtain a positioning result of the input crowd image.
2. The method according to claim 1, wherein the step S1 comprises the steps of:
step S11: determining VGG-16 as the pre-trained deep learning model, initializing parameters of the pre-trained deep learning model, and removing the final global pooling layer and fully connected layers in the pre-trained deep learning model to obtain the feature extraction module;
step S12: preprocessing and performing label processing on the training crowd images to obtain the label distance conversion maps corresponding to the training crowd images; and
step S13: inputting the preprocessed training crowd images into the feature extraction module to obtain the multi-scale feature maps of the training crowd images.
3. The method according to claim 1 or 2, wherein the label distance conversion map corresponding to the training crowd image is represented as:
wherein F(x, y) represents the label distance conversion map obtained after label processing of the training crowd image, (x, y) represents the pixel coordinates of the training crowd image, P(x, y) represents the pixel value of the training crowd image, α and β are adjustable parameters, and C is a constant.
4. The method according to claim 1, wherein said step S2 comprises the steps of:
step S21: fusing the multi-scale feature maps of the training crowd images with the shallower multi-scale feature maps at the corresponding positions and converting them into multi-scale sequences; and
step S22: performing dimension reduction on the multi-scale feature maps to convert them into feature vectors, and adding and fusing the feature vectors with the multi-scale sequences to obtain the multi-scale coding sequences of the training crowd images.
5. The method according to claim 1, wherein said step S3 comprises the steps of:
step S31: constructing a cross window basic unit based on a layer normalization module, a cross window multi-head self-attention module, a multi-layer perceptron module and a residual structure, wherein the layer normalization module is used to normalize the data distribution, the cross window multi-head self-attention module is used to learn global dependency relationships, the multi-layer perceptron module is used to reduce the number of parameters, and the residual structure acts on the outputs of the cross window multi-head self-attention module and the multi-layer perceptron module to alleviate gradient vanishing or explosion;
step S32: connecting B cross window basic units in parallel to construct the cross window transformation network module; and
step S33: inputting the multi-scale coding sequences into the cross window transformation network module, and building long-distance dependency relationships of different scales on the multi-scale coding sequences by the cross window transformation network module to obtain the multi-scale long-distance dependency relationship sequences of the training crowd images.
6. The method of claim 5, wherein the multi-scale long-distance dependency relationship sequence is expressed as:
L_i' = MLP(LN(S_i)) + S_i
S_i = CSWin(LN(L_i)) + L_i
wherein L_i' represents the multi-scale long-distance dependency relationship sequence, L_i represents the multi-scale coding sequence, LN represents the layer normalization module, MLP represents the multi-layer perceptron module, and CSWin represents the cross window multi-head self-attention module.
7. The method according to claim 1, wherein said step S4 comprises the steps of:
step S41: fusing the multi-scale long-distance dependency relationship sequences of the training crowd images with the deeper multi-scale long-distance dependency relationship sequences and converting them into decoding feature maps; and
step S42: fusing the decoding feature maps with the decoding feature maps of the next deeper layer to obtain the multi-scale decoding feature maps of the training crowd images, and performing a 1×1 convolution and 2× up-sampling on the multi-scale decoding feature maps to obtain the distance conversion maps of the training crowd images.
8. The method of claim 1, wherein the loss function in the loss calculation module is expressed as:
L = L_MSE(E, G) + γ·L_MSSSIM(E, G);
wherein L represents the total loss, L_MSE(E, G) represents the MSE loss, L_MSSSIM(E, G) represents the multi-scale SSIM loss, E represents the distance conversion map of the training crowd image, G represents the label distance conversion map of the training crowd image, γ is an adjustable parameter, Q represents the number of training crowd images, E_q represents the distance conversion map of the q-th training crowd image, G_q represents the label distance conversion map of the q-th training crowd image, L_SSIM(E_qnm, G_qnm) represents the SSIM loss, N represents the number of individuals in a single training crowd image, M represents the number of windows selected for computing the SSIM loss, E_qnm represents the distance conversion map in the m-th window of the n-th person in the q-th training crowd image, and G_qnm represents the label distance conversion map in the m-th window of the n-th person in the q-th training crowd image.
9. The method of claim 8, wherein the SSIM loss L_SSIM is expressed as:
wherein μ_E represents the mean of the distance conversion map of the training crowd image, μ_G represents the mean of the label distance conversion map of the training crowd image, σ_E represents the variance of the distance conversion map of the training crowd image, σ_G represents the variance of the label distance conversion map of the training crowd image, σ_EG represents the covariance between the distance conversion map and the label distance conversion map of the training crowd image, and φ_1 and φ_2 are constants.
10. The method of claim 1, wherein the performing post-processing on the distance conversion map to obtain a positioning result of the input crowd image comprises:
step S61: obtaining all local maximum points in the distance conversion map through 3×3 max pooling; and
step S62: setting a first threshold and a second threshold, wherein the first threshold is greater than the second threshold; comparing the local maxima in the distance conversion map with the first threshold and the second threshold; identifying points whose local maximum is greater than the first threshold as individual head points; and, if the global maximum of the distance conversion map is less than the second threshold, confirming that there is no person in the input crowd image.
CN202311074895.9A 2023-08-25 2023-08-25 Crowd positioning method based on trans-scale visual transformation network Active CN116805337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311074895.9A CN116805337B (en) 2023-08-25 2023-08-25 Crowd positioning method based on trans-scale visual transformation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311074895.9A CN116805337B (en) 2023-08-25 2023-08-25 Crowd positioning method based on trans-scale visual transformation network

Publications (2)

Publication Number Publication Date
CN116805337A (en) 2023-09-26
CN116805337B (en) 2023-10-27

Family

ID=88079738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311074895.9A Active CN116805337B (en) 2023-08-25 2023-08-25 Crowd positioning method based on trans-scale visual transformation network

Country Status (1)

Country Link
CN (1) CN116805337B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804992A (en) * 2017-05-08 2018-11-13 电子科技大学 A kind of Demographics' method based on deep learning
CN111242036A (en) * 2020-01-14 2020-06-05 西安建筑科技大学 Crowd counting method based on encoding-decoding structure multi-scale convolutional neural network
CN113139489A (en) * 2021-04-30 2021-07-20 广州大学 Crowd counting method and system based on background extraction and multi-scale fusion network
CN114120361A (en) * 2021-11-19 2022-03-01 西南交通大学 Crowd counting and positioning method based on coding and decoding structure
CN114445765A (en) * 2021-12-23 2022-05-06 上海师范大学 Crowd counting and density estimating method based on coding and decoding structure
CN115311508A (en) * 2022-08-09 2022-11-08 北京邮电大学 Single-frame image infrared dim target detection method based on depth U-type network
CN116091764A (en) * 2022-12-28 2023-05-09 天津师范大学 Cloud image segmentation method based on fusion transformation network
CN116246305A (en) * 2023-01-31 2023-06-09 天津师范大学 Pedestrian retrieval method based on hybrid component transformation network

Also Published As

Publication number Publication date
CN116805337A (en) 2023-09-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant