CN116805337B - Crowd positioning method based on trans-scale visual transformation network - Google Patents

Crowd positioning method based on trans-scale visual transformation network

Info

Publication number
CN116805337B
Authority
CN
China
Prior art keywords
scale
training
module
crowd image
crowd
Prior art date
Legal status
Active
Application number
CN202311074895.9A
Other languages
Chinese (zh)
Other versions
CN116805337A (en)
Inventor
张重
连宇
刘爽
郭蓬
高嵩
Current Assignee
Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Tianjin Normal University
CATARC Tianjin Automotive Engineering Research Institute Co Ltd
Original Assignee
Guoqi Beijing Intelligent Network Association Automotive Research Institute Co ltd
Tianjin Normal University
CATARC Tianjin Automotive Engineering Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by Guoqi Beijing Intelligent Network Association Automotive Research Institute Co Ltd, Tianjin Normal University, and CATARC Tianjin Automotive Engineering Research Institute Co Ltd
Priority to CN202311074895.9A
Publication of CN116805337A
Application granted
Publication of CN116805337B
Legal status: Active

Classifications

    • G06T 7/75: Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06V 10/454: Local feature extraction using biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks


Abstract

The embodiment of the invention discloses a crowd positioning method based on a trans-scale visual transformation network, which comprises the following steps: constructing a feature extraction module to obtain a multi-scale feature map of a training crowd image; constructing a multi-scale coding fusion module to obtain a multi-scale coding sequence of the training crowd image; constructing a cross window transformation network module to obtain a multi-scale long-distance dependency relationship sequence of the training crowd image; constructing a multi-scale decoding fusion module to obtain a multi-scale decoding feature map and a distance conversion map of the training crowd image; constructing a loss calculation module to obtain an optimal crowd image trans-scale visual transformation network positioning model; and obtaining the positioning result of an input crowd image using the optimal crowd image trans-scale visual transformation network positioning model. By combining a convolutional neural network with the cross window transformation network module, the invention learns multi-scale feature information and long-distance dependency relationships from the crowd image simultaneously, thereby further improving the positioning accuracy for crowd images.

Description

Crowd positioning method based on trans-scale visual transformation network
Technical Field
The invention belongs to the fields of digital image processing, computer vision, pattern recognition and artificial intelligence, and particularly relates to a crowd positioning method based on a trans-scale vision transformation network.
Background
Crowd analysis can help prevent crowd gathering and trampling accidents and has great potential for improving public and traffic safety. Crowd localization is a critical task in crowd analysis that aims to predict the location of each individual while estimating the total number of individuals in the crowd. Compared with crowd counting, which only estimates the total number of individuals, crowd localization provides detailed information about the spatial distribution of the crowd and can support effective crowd management and emergency response. Crowd positioning faces significant challenges such as illumination, occlusion, and perspective effects, and many approaches have been proposed to overcome them. These methods fall into three main categories: detection-based methods, regression-based methods, and map-based methods.
Most detection-based methods utilize point-level annotations to generate pseudo-bounding boxes. Liu et al. initialize the size of each pseudo-bounding box with the nearest-neighbor distance between head center points and adjust the pseudo-bounding boxes through iterative updates to train a reliable target detector. Considering that the size of a human head is related to its distance from the camera, Lian et al. predict the size of the pseudo-bounding box using depth information. However, in very dense scenes these approaches do not perform well on the crowd-localization task because of occlusion and blurring.
Regression-based methods directly regress the coordinates of head points and output a confidence score. Song et al. propose a point-based counting and localization framework that directly predicts a set of candidate points representing heads based on predefined anchor points. Liang et al. propose an end-to-end crowd-localization model that uses trainable query instances rather than a large number of predefined anchor points. However, these methods lack correlation information between head points and other pixels, so their localization is not accurate enough.
Map-based approaches generate a trainable map that reflects the relationship between head points and neighboring pixels to guide model training. Idrees et al. use a density map for head localization, where the local maxima of the density map are the head point locations. Abousamra et al. propose a topological approach that slightly expands each head point into a point mask and uses the point mask map as supervision. Xu et al. generate a distance label map from the distances between head points, which avoids the problem of head overlap in dense areas. Liang et al. propose an inverse focal distance map that better represents the correlation between head points and other pixels. These methods make full use of spatial information but do not take complete multi-scale information into account.
In the field of crowd analysis, some methods implement crowd counting and localization with transformation networks. For example, Gao et al. propose a dilated-convolution shifted-window transformation network for crowd localization, which learns feature maps using a shifted-window transformation network and a feature pyramid network. Lin et al. combine global attention and local attention for crowd counting within a transformation network framework. However, these methods do not consider complete multi-scale information in the learning process. In contrast, the method of the present disclosure considers multi-scale information in the encoding stage, the decoding stage and the loss function.
Disclosure of Invention
The technical problem to be solved by the invention is that the multi-scale variation of head sizes in crowd images has a great influence on the positioning result; the invention therefore provides a crowd positioning method based on a trans-scale visual transformation network.
To this end, the crowd positioning method based on a trans-scale visual transformation network provided by the invention comprises the following steps:
Step S1: a feature extraction module is constructed using a pre-trained deep learning model; label processing is performed on the training crowd images to obtain the label distance conversion maps corresponding to the training crowd images; and the training crowd images are input into the feature extraction module to obtain the multi-scale feature maps of the training crowd images;
Step S2: a multi-scale coding fusion module is constructed; the multi-scale feature maps of the training crowd images are input into the multi-scale coding fusion module, and the multi-scale feature maps are fused by the multi-scale coding fusion module to obtain the multi-scale coding sequences of the training crowd images;
Step S3: a cross window transformation network module is constructed; the multi-scale coding sequences of the training crowd images are input into the cross window transformation network module, which learns long-distance dependency relationships of different scales to obtain the multi-scale long-distance dependency relationship sequences of the training crowd images;
Step S4: a multi-scale decoding fusion module is constructed; the multi-scale long-distance dependency relationship sequences of the training crowd images are input into the multi-scale decoding fusion module, and the multi-scale long-distance dependency relationship sequences are fused by the multi-scale decoding fusion module to obtain the multi-scale decoding feature maps and the distance conversion maps of the training crowd images;
Step S5: the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module are connected in sequence to form a crowd image trans-scale visual transformation network positioning model; a loss calculation module is constructed; the distance conversion map of the training crowd image and the label distance conversion map of the training crowd image are input into the loss calculation module, and the crowd image trans-scale visual transformation network positioning model is optimized using the obtained loss value to obtain an optimal crowd image trans-scale visual transformation network positioning model;
Step S6: in the testing stage, the distance conversion map of the input crowd image is calculated using the optimal crowd image trans-scale visual transformation network positioning model, and post-processing is performed on the distance conversion map to obtain the positioning result of the input crowd image.
Optionally, the step S1 includes the steps of:
Step S11: VGG-16 is determined as the pre-trained deep learning model, parameters of the pre-trained deep learning model are initialized, and the final global pooling layer and fully connected layers in the pre-trained deep learning model are removed to obtain the feature extraction module;
Step S12: preprocessing and label processing are performed on the training crowd images to obtain the label distance conversion maps corresponding to the training crowd images;
Step S13: the preprocessed training crowd images are input into the feature extraction module to obtain the multi-scale feature maps of the training crowd images.
Optionally, the label distance conversion map corresponding to the training crowd image is expressed as:
wherein F(x, y) represents the label distance conversion map obtained after label processing of the training crowd image, (x, y) represents the pixel coordinates of the training crowd image, P(x, y) represents the pixel value of the training crowd image, α and β are adjustable parameters, and C is a constant.
Optionally, the step S2 includes the steps of:
Step S21: the multi-scale feature maps of the training crowd images are fused with the shallower multi-scale feature maps at the corresponding positions and converted into multi-scale sequences;
Step S22: dimension reduction is performed on the multi-scale feature maps to convert them into feature vectors, and the feature vectors are added to and fused with the multi-scale sequences to obtain the multi-scale coding sequences of the training crowd images.
Optionally, the step S3 includes the steps of:
Step S31: a cross window basic unit is constructed based on a layer normalization module, a cross window multi-head self-attention module, a multi-layer perceptron module and a residual structure, wherein the layer normalization module is used to normalize the data distribution, the cross window multi-head self-attention module is used to learn global dependency relationships, the multi-layer perceptron module is used to reduce the number of parameters, and the residual structure acts on the outputs of the cross window multi-head self-attention module and the multi-layer perceptron module to alleviate gradient vanishing or explosion;
Step S32: B cross window basic units are connected in parallel to construct the cross window transformation network module;
Step S33: the multi-scale coding sequences are input into the cross window transformation network module, which builds long-distance dependency relationships of different scales on the multi-scale coding sequences to obtain the multi-scale long-distance dependency relationship sequences of the training crowd images.
Optionally, the multi-scale long-distance dependency relationship sequence is expressed as:
L_i' = MLP(LN(S_i)) + S_i
S_i = CSWin(LN(L_i)) + L_i
wherein L_i' represents the multi-scale long-distance dependency relationship sequence, L_i represents the multi-scale coding sequence, LN represents the layer normalization module, MLP represents the multi-layer perceptron module, and CSWin represents the cross window multi-head self-attention module.
Optionally, the step S4 includes the steps of:
Step S41: the multi-scale long-distance dependency relationship sequences of the training crowd images are fused with the deeper multi-scale long-distance dependency relationship sequences and converted into decoding feature maps;
Step S42: the decoding feature maps are fused with the decoding feature maps of the next deeper layer to obtain the multi-scale decoding feature maps of the training crowd images, and a 1×1 convolution and 2× up-sampling are performed on the multi-scale decoding feature maps to obtain the distance conversion maps of the training crowd images.
Optionally, the loss function in the loss calculation module is expressed as:
L = L_MSE(E, G) + γ·L_MSSSIM(E, G)
wherein L represents the total loss, L_MSE(E, G) represents the MSE loss, L_MSSSIM(E, G) represents the multi-scale SSIM loss, E represents the distance conversion map of the training crowd image, G represents the label distance conversion map of the training crowd image, γ is an adjustable parameter, Q represents the number of training crowd images, E_q represents the distance conversion map of the q-th training crowd image, G_q represents the label distance conversion map of the q-th training crowd image, L_SSIM(E_qnm, G_qnm) represents the SSIM loss, N represents the number of individuals in a single training crowd image, M represents the number of windows selected for computing the SSIM loss, E_qnm represents the distance conversion map in the m-th window of the n-th person in the q-th training crowd image, and G_qnm represents the label distance conversion map in the m-th window of the n-th person in the q-th training crowd image.
Optionally, the SSIM loss L_SSIM is expressed as:
wherein μ_E represents the mean of the distance conversion map of the training crowd image, μ_G represents the mean of the label distance conversion map of the training crowd image, σ_E represents the variance of the distance conversion map of the training crowd image, σ_G represents the variance of the label distance conversion map of the training crowd image, σ_EG represents the covariance between the distance conversion map and the label distance conversion map of the training crowd image, and φ_1 and φ_2 are constants.
Optionally, the step of obtaining the positioning result of the input crowd image after performing post-processing on the distance conversion map includes:
Step S61: all local maximum points in the distance conversion map are obtained through 3×3 max pooling;
Step S62: a first threshold and a second threshold are set, the first threshold being greater than the second threshold; the local maxima in the distance conversion map are compared with the first threshold and the second threshold; points whose local maximum is greater than the first threshold are identified as individual head points; and if the global maximum of the distance conversion map is less than the second threshold, it is confirmed that there is no person in the input crowd image.
The beneficial effects of the invention are as follows: a multi-scale feature map of the crowd image is extracted by a convolutional neural network; the multi-scale feature map is fused by the multi-scale coding fusion module to obtain a multi-scale coding sequence; long-distance dependency relationships are then modeled on the multi-scale coding sequence by the cross window transformation network module to obtain a multi-scale long-distance dependency relationship sequence; and finally the multi-scale long-distance dependency relationship sequence is fused by the multi-scale decoding fusion module. This improves the representation capability of the multi-scale feature map and the accuracy of crowd image positioning.
Drawings
FIG. 1 is a flowchart of a crowd positioning method based on a trans-scale visual transformation network according to one embodiment of the invention.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
Fig. 1 is a flowchart of a crowd positioning method based on a trans-scale visual transformation network according to an embodiment of the present invention. Several implementations of the invention are described below with reference to Fig. 1. As shown in Fig. 1, the crowd positioning method based on the trans-scale visual transformation network includes the following steps:
Step S1: a feature extraction module is constructed using a pre-trained deep learning model; label processing is performed on the training crowd images to obtain the label distance conversion maps corresponding to the training crowd images; and the training crowd images are input into the feature extraction module to obtain the multi-scale feature maps F_i of the training crowd images.
Further, the step S1 includes the steps of:
Step S11: a pre-trained deep learning model is determined, its parameters are initialized, and the final global pooling layer and fully connected layers in the pre-trained deep learning model are removed to obtain the feature extraction module;
In an embodiment of the present invention, the pre-trained deep learning model may be VGG-16. After parameter initialization, the model components before the last global pooling layer are retained; that is, the last global pooling layer and the fully connected layers of the pre-trained deep learning model are removed, and the remaining part forms the feature extraction module. In an embodiment of the present invention, the feature extraction module consists of 4 modules, which may be named Stage1, Stage2, Stage3 and Stage4, and the feature extraction module generates feature maps of different scales from Stage1, Stage2, Stage3 and Stage4, respectively.
Step S12: the training crowd images in the training set are preprocessed and label-processed to obtain the label distance conversion maps corresponding to the training crowd images;
In an embodiment of the present invention, preprocessing the training crowd image may include: randomly flipping the training crowd image horizontally with a probability of 0.5; scaling all pixels of the training crowd image into a preset range, for example between 0 and 1; subtracting the pixel mean of the training crowd image from each pixel value and dividing the result by the pixel variance of the training crowd image; and finally cropping the training crowd image to a fixed size H×W, where H is the height and W is the width of the cropped training crowd image. In one embodiment of the invention, H = 256 and W = 256.
In an embodiment of the present invention, the label distance conversion map corresponding to the training crowd image, obtained after label processing of the training crowd image, may be expressed as:
wherein F(x, y) represents the label distance conversion map obtained after label processing of the training crowd image, (x, y) represents the pixel coordinates of the training crowd image, P(x, y) represents the pixel value of the training crowd image, α and β are adjustable parameters, and C is a constant.
In one embodiment of the present invention, α = 0.02, β = 0.75 and C = 1.
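The closed form of the label distance conversion map did not survive extraction above, but the parameters α = 0.02, β = 0.75 and C = 1 match the inverse focal distance style of label cited in the background. The sketch below therefore assumes a map of the form F(x, y) = 1 / (D(x, y)^(α·D(x, y) + β) + C), with D(x, y) the distance from pixel (x, y) to the nearest annotated head point; treat this as an illustrative assumption, not the patent's exact formula.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def label_distance_map(head_points, h, w, alpha=0.02, beta=0.75, c=1.0):
    """Hypothetical inverse-focal-distance style label map (assumed form, see text above).

    head_points: iterable of (row, col) head annotations; assumes at least one point.
    Returns an (h, w) float32 map that peaks at 1 / c on head points and decays with distance.
    """
    mask = np.ones((h, w), dtype=bool)            # True where there is NO head point
    for r, col in head_points:
        mask[int(r), int(col)] = False
    d = distance_transform_edt(mask)              # Euclidean distance to the nearest head point
    return (1.0 / (np.power(d, alpha * d + beta) + c)).astype(np.float32)
```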
Step S13: the training crowd images obtained after preprocessing are input into the feature extraction module to obtain the multi-scale feature maps F_i of the training crowd images.
In one embodiment of the present invention, C_i is the number of channels of the i-th-scale feature map of the training crowd image; for example, C_1 = 128, C_2 = 256, C_3 = 512 and C_4 = 512 may be set, in which case F_1, F_2, F_3 and F_4 are output by Stage1, Stage2, Stage3 and Stage4 of the feature extraction module, respectively.
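A minimal sketch of such a feature extraction module is given below, slicing the torchvision VGG-16 so that the four stage outputs carry 128, 256, 512 and 512 channels as stated above; the exact slice boundaries are an assumption of this illustration.

```python
import torch.nn as nn
from torchvision.models import vgg16

class FeatureExtractor(nn.Module):
    """Truncated VGG-16 returning four feature maps F1..F4 (128/256/512/512 channels)."""

    def __init__(self, pretrained=True):
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1" if pretrained else None).features
        # Assumed stage split: slice the 31-layer `features` stack at pooling boundaries
        # so the stage outputs carry 128, 256, 512 and 512 channels respectively.
        self.stage1 = features[:10]    # conv blocks 1-2  -> 128 channels
        self.stage2 = features[10:17]  # conv block 3     -> 256 channels
        self.stage3 = features[17:24]  # conv block 4     -> 512 channels
        self.stage4 = features[24:31]  # conv block 5     -> 512 channels

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        return f1, f2, f3, f4
```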
Step S2: a multi-scale coding fusion module is constructed; the multi-scale feature maps F_i of the training crowd images are input into the multi-scale coding fusion module, and the multi-scale feature maps are fused by the multi-scale coding fusion module to obtain the multi-scale coding sequences L_i of the training crowd images.
Further, the step S2 includes the steps of:
Step S21: the multi-scale feature map F_i of the training crowd image is fused with the shallower multi-scale feature maps F_j (0 < j < i) at the corresponding positions and converted into a multi-scale sequence G_i.
In one embodiment of the invention, step S21 is illustrated by the generation of G_4 from F_4. Because convolution is involved in extracting the multi-scale feature maps, a 1×1 region of the multi-scale feature map F_4 corresponds to a 2^(4-i) × 2^(4-i) region of the multi-scale feature map F_i (i = 1, 2, 3). First, the multi-scale feature map F_4 is converted into a sequence D_4 composed of C_4 feature vectors, and the remaining shallower multi-scale feature maps F_i (i = 1, 2, 3) are likewise converted into sequences D_i, where N_i = 2^(4-i) × 2^(4-i). A linear layer is then used to convert each sequence D_i into a sequence of matching size. Finally, the sequences D_i (i = 1, 2, 3, 4) are combined to obtain the multi-scale sequence G_4. G_1, G_2 and G_3 can be obtained in the same way.
Step S22: dimension reduction is performed on the multi-scale feature map F_i to convert it into feature vectors, and the feature vectors are added to and fused with the multi-scale sequence G_i to obtain the multi-scale coding sequence L_i of the training crowd image.
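The fusion arithmetic above is only partially legible in the source text, so the following PyTorch sketch shows one plausible realization rather than the disclosed one: for each scale i, every feature map up to F_i is pooled onto F_i's spatial grid, flattened into a token sequence, linearly projected to a shared width, and summed. The pooling choice, projection width and summation are assumptions of this illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEncodingFusion(nn.Module):
    """Illustrative fusion of the four stage outputs into one coding sequence per scale."""

    def __init__(self, channels=(128, 256, 512, 512), embed_dim=512):
        super().__init__()
        # One linear projection per scale so every flattened map lands in a common width.
        self.proj = nn.ModuleList([nn.Linear(c, embed_dim) for c in channels])

    def forward(self, feats):
        # feats: (F1, F2, F3, F4), F_j of shape (B, C_j, H_j, W_j), from shallow to deep.
        seqs = []
        for i, f_i in enumerate(feats):
            b, _, h, w = f_i.shape
            fused = 0
            for j in range(i + 1):                                  # F_i plus all shallower maps F_j
                pooled = F.adaptive_avg_pool2d(feats[j], (h, w))    # align to F_i's spatial grid
                tokens = pooled.flatten(2).transpose(1, 2)          # (B, h*w, C_j) token sequence
                fused = fused + self.proj[j](tokens)                # project to a shared width and sum
            seqs.append(fused)                                      # coding sequence L_i
        return seqs
```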
Step S3: a cross window transformation network module is constructed; the multi-scale coding sequences L_i of the training crowd images are input into the cross window transformation network module, which learns long-distance dependency relationships of different scales to obtain the multi-scale long-distance dependency relationship sequences L_i' of the training crowd images.
Further, the step S3 includes the steps of:
Step S31: a cross window basic unit is constructed based on a layer normalization module, a cross window multi-head self-attention module, a multi-layer perceptron module and a residual structure.
In this step, the layer normalization module is first used to normalize the data distribution and accelerate model convergence, the cross window multi-head self-attention module is then used to learn global dependency relationships, and the multi-layer perceptron module is finally used to reduce the number of parameters; in addition, a residual structure can be used to alleviate gradient vanishing or explosion, for example by acting on the outputs of the cross window multi-head self-attention module and the multi-layer perceptron module. The layer normalization module, the residual structure and the multi-layer perceptron module are common computing modules in the art and are not described in detail in this disclosure.
Step S32: B cross window basic units are used to construct the cross window transformation network module.
Here the number B of cross window basic units corresponds to the number of scales of the multi-scale coding sequences L_i; in one embodiment of the invention, the number of scales of L_i is 4, so B = 4. The B cross window basic units are connected in parallel to obtain the cross window transformation network module, and each cross window basic unit corresponds to the sequence L_i at one scale of the training crowd image.
Step S33: the multi-scale coding sequences L_i are input into the cross window transformation network module, which builds long-distance dependency relationships of different scales on the multi-scale coding sequences L_i to obtain the multi-scale long-distance dependency relationship sequences L_i' of the training crowd image.
In one embodiment of the invention, the multi-scale long-distance dependency relationship sequence L_i' can be expressed as:
L_i' = MLP(LN(S_i)) + S_i
S_i = CSWin(LN(L_i)) + L_i
wherein L_i' represents the multi-scale long-distance dependency relationship sequence, L_i represents the multi-scale coding sequence, LN represents the layer normalization module, MLP represents the multi-layer perceptron module, and CSWin represents the cross window multi-head self-attention module.
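By way of illustration, a skeletal PyTorch version of the cross window basic unit defined by the two equations above is given below. The cross window multi-head self-attention itself is stood in for by a generic nn.MultiheadAttention layer purely so that the block runs end to end; the stripe-based attention described next is what would replace it, and the embedding width of 512 is an assumption.

```python
import torch.nn as nn

class CrossWindowBasicUnit(nn.Module):
    """LN -> attention -> residual, then LN -> MLP -> residual,
    i.e. S_i = Attn(LN(L_i)) + L_i and L_i' = MLP(LN(S_i)) + S_i."""

    def __init__(self, dim=512, num_heads=2, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Placeholder for the cross window multi-head self-attention (two heads, per K = 2 below).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, l_i):                                      # l_i: (B, N, dim) coding sequence
        x = self.norm1(l_i)
        s_i = self.attn(x, x, x, need_weights=False)[0] + l_i    # first residual branch
        l_i_prime = self.mlp(self.norm2(s_i)) + s_i              # second residual branch
        return l_i_prime
```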
In one embodiment of the present invention, the cross window multi-head self-attention module is composed of a horizontal stripe self-attention module and a vertical stripe self-attention module, and for the i-th-scale multi-scale coding sequence L_i the output of the cross window multi-head self-attention module is expressed as:
CSWin(L_i) = Concat(head_1, …, head_k, …, head_K) W_0
wherein CSWin represents the cross window multi-head self-attention module, Concat represents the concatenation function, K represents the number of heads in the cross window multi-head self-attention module, head_k represents the cross window self-attention of the k-th head, W_0 represents the mapping matrix, H_Att-k(L_i) represents horizontal stripe self-attention, and V_Att-k(L_i) represents vertical stripe self-attention.
In one embodiment of the present invention, K = 2.
In one embodiment of the present invention, the horizontal stripe self-attention output of the i-th-scale multi-scale coding sequence L_i is expressed as:
wherein Q_ik^a = (L_i^a)^T W_ik^Q, K_ik^a = (L_i^a)^T W_ik^K and V_ik^a = (L_i^a)^T W_ik^V are the query, key and value respectively, L_i^a is the a-th horizontal stripe sequence in L_i, A is the number of horizontal stripe sequences in L_i, W_ik^Q, W_ik^K and W_ik^V are the mapping matrices of the query, key and value respectively, C_i is the number of channels of the i-th-scale feature map of the training crowd image, and d_k is the dimension of the k-th head in the cross window multi-head self-attention. The vertical stripe self-attention V_Att-k(L_i) is obtained in the same way.
Step S4: a multi-scale decoding fusion module is constructed; the multi-scale long-distance dependency relationship sequences L_i' of the training crowd image are input into the multi-scale decoding fusion module and fused by the multi-scale decoding fusion module to obtain the multi-scale decoding feature maps of the training crowd image and the distance conversion map E.
Further, the step S4 includes the steps of:
Step S41: the multi-scale long-distance dependency relationship sequence L_i' of the training crowd image is fused with the deeper multi-scale long-distance dependency relationship sequences L_j' (j > i > 0) and converted into a decoding feature map F_i'.
In one embodiment of the invention, step S41 is illustrated by the generation of F_1'. First, the multi-scale long-distance dependency relationship sequences L_i' of the training crowd image are transformed into feature maps Z_i (i = 1, 2, 3, 4) of the same size as the corresponding multi-scale feature maps F_i. The feature maps Z_i (i = 1, 2, 3, 4) are then up-sampled to the same size as the multi-scale feature map F_1 and added together to obtain the decoding feature map F_1'. The decoding feature maps F_i' (i = 2, 3, 4) can be obtained in the same way.
Step S42: the decoding feature map F_i' is fused with the decoding feature map F_{i+1}' of the next deeper layer to obtain the multi-scale decoding feature maps of the training crowd image, and a 1×1 convolution and 2× up-sampling are performed on the multi-scale decoding feature map to obtain the distance conversion map E of the training crowd image.
In one embodiment of the invention, step S42 is illustrated by the generation of the multi-scale decoding feature maps. First, a 1×1 convolution is performed on the decoding feature map F_4' to generate a multi-scale decoding feature map. This multi-scale decoding feature map is then up-sampled by a factor of two and added to the decoding feature map F_3', after which a 1×1 convolution is performed to generate the next multi-scale decoding feature map. The remaining multi-scale decoding feature maps are obtained in the same way.
In one embodiment of the invention, a 1×1 convolution and 2× up-sampling are performed on the multi-scale decoding feature map to obtain the distance conversion map E of the training crowd image.
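A compact sketch of this top-down decoding fusion is given below, assuming FPN-style 1×1 lateral convolutions, bilinear 2× up-sampling and illustrative channel widths; it is a sketch of the scheme described above, not the exact disclosed architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDecodingFusion(nn.Module):
    """Top-down fusion of the decoding feature maps F1'..F4' into a distance conversion map."""

    def __init__(self, channels=(128, 256, 512, 512), mid=128):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, mid, kernel_size=1) for c in channels])
        self.head = nn.Conv2d(mid, 1, kernel_size=1)   # final 1x1 conv producing the distance map

    def forward(self, decoded):
        # decoded: [F1', F2', F3', F4'] with decreasing resolution; start from the deepest map.
        x = self.lateral[3](decoded[3])
        for i in (2, 1, 0):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = x + self.lateral[i](decoded[i])        # fuse with the next shallower decoded map
        # Final 1x1 convolution and 2x up-sampling to obtain the distance conversion map E.
        e = F.interpolate(self.head(x), scale_factor=2, mode="bilinear", align_corners=False)
        return e
```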
Step S5: the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module are connected in sequence to form a crowd image trans-scale visual transformation network positioning model; a loss calculation module is constructed; the distance conversion map E of the training crowd image and the label distance conversion map of the training crowd image are input into the loss calculation module, and the crowd image trans-scale visual transformation network positioning model is optimized using the obtained loss value to obtain an optimal crowd image trans-scale visual transformation network positioning model.
Further, the step S5 includes the steps of:
step S51, sequentially connecting the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module to form a crowd image trans-scale visual transformation network positioning model;
step S52, constructing a loss calculation module, and inputting a distance conversion diagram of the training crowd image and a label distance conversion diagram of the training crowd image into the loss calculation module;
in one embodiment of the present invention, when training is performed in the UCF-QNRF database, the Loss function Loss of the constructed Loss calculation module is expressed as:
L = L_MSE(E, G) + γ·L_MSSSIM(E, G)
wherein L represents the total loss, L_MSE(E, G) represents the MSE loss, L_MSSSIM(E, G) represents the multi-scale SSIM loss, E represents the distance conversion map of the training crowd image, G represents the label distance conversion map of the training crowd image, γ is an adjustable parameter, Q represents the number of training crowd images, E_q represents the distance conversion map of the q-th training crowd image, G_q represents the label distance conversion map of the q-th training crowd image, L_SSIM(E_qnm, G_qnm) represents the SSIM loss, N represents the number of individuals in a single training crowd image, M represents the number of windows selected for computing the SSIM loss, E_qnm represents the distance conversion map in the m-th window of the n-th person in the q-th training crowd image, and G_qnm represents the label distance conversion map in the m-th window of the n-th person in the q-th training crowd image.
In one embodiment of the invention, the SSIM loss L_SSIM can be expressed as:
wherein E represents the distance conversion map of the training crowd image, G represents the label distance conversion map of the training crowd image, μ_E represents the mean of the distance conversion map of the training crowd image, μ_G represents the mean of the label distance conversion map of the training crowd image, σ_E represents the variance of the distance conversion map of the training crowd image, σ_G represents the variance of the label distance conversion map of the training crowd image, σ_EG represents the covariance between the distance conversion map and the label distance conversion map of the training crowd image, and φ_1 and φ_2 are constants.
In one embodiment of the present invention, γ = 0.1, φ_1 = 1×10^-4 and φ_2 = 9×10^-4.
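A hedged sketch of this objective follows: the MSE term plus γ times an SSIM-based term, with the SSIM written in its standard windowed form using φ1 and φ2 as stabilizing constants. The per-person window selection described above is not reproduced; a uniform sliding window over the whole map and the convention "loss = 1 - SSIM" are assumptions of this illustration.

```python
import torch.nn.functional as F

def ssim_map(e, g, win=7, phi1=1e-4, phi2=9e-4):
    """Standard windowed SSIM between predicted map e and label map g, both (B, 1, H, W)."""
    mu_e = F.avg_pool2d(e, win, stride=1, padding=win // 2)
    mu_g = F.avg_pool2d(g, win, stride=1, padding=win // 2)
    var_e = F.avg_pool2d(e * e, win, stride=1, padding=win // 2) - mu_e ** 2
    var_g = F.avg_pool2d(g * g, win, stride=1, padding=win // 2) - mu_g ** 2
    cov = F.avg_pool2d(e * g, win, stride=1, padding=win // 2) - mu_e * mu_g
    return ((2 * mu_e * mu_g + phi1) * (2 * cov + phi2)) / (
        (mu_e ** 2 + mu_g ** 2 + phi1) * (var_e + var_g + phi2)
    )

def total_loss(e, g, gamma=0.1, windows=(3, 7, 11)):
    """L = L_MSE(E, G) + gamma * L_MSSSIM(E, G); several window sizes stand in for 'multi-scale'."""
    mse = F.mse_loss(e, g)
    msssim = sum(1.0 - ssim_map(e, g, win=w).mean() for w in windows) / len(windows)
    return mse + gamma * msssim
```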
Step S53: the crowd image trans-scale visual transformation network positioning model is optimized using the obtained loss value to obtain the optimal crowd image trans-scale visual transformation network positioning model.
In this step, iterative optimization can be performed by stochastic gradient descent to optimize the crowd image trans-scale visual transformation network positioning model.
Step S6: in the testing stage, the distance conversion map of the input crowd image is calculated using the optimal crowd image trans-scale visual transformation network positioning model, and post-processing is performed on the distance conversion map to obtain the positioning result of the input crowd image.
In an embodiment of the present invention, obtaining the positioning result of the input crowd image after performing post-processing on the distance conversion map may include the following steps:
Step S61: all local maximum points in the distance conversion map are obtained through 3×3 max pooling;
Step S62: two thresholds are set, namely a first threshold T_max and a second threshold T_min with T_max > T_min; the local maxima in the distance conversion map are compared with the first and second thresholds, and points whose local maximum is greater than the first threshold T_max are identified as individual head points; if the global maximum of the distance conversion map is less than the second threshold T_min, it is confirmed that there is no person in the input crowd image.
In one embodiment of the invention, T_max is 110/255 of the global maximum of the distance conversion map, and T_min = 0.1.
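A direct sketch of this post-processing in PyTorch: local maxima are found by comparing the distance conversion map with its 3×3 max-pooled copy, the first threshold T_max is taken as 110/255 of the global maximum and the second threshold T_min = 0.1, as stated above.

```python
import torch
import torch.nn.functional as F

def locate_heads(dist_map, t_min=0.1, t_max_ratio=110.0 / 255.0):
    """Extract head coordinates from a (1, 1, H, W) distance conversion map."""
    if dist_map.max() < t_min:                 # global maximum below T_min: no person present
        return torch.empty((0, 2), dtype=torch.long)
    # A pixel is a local maximum if it equals its 3x3 max-pooled neighbourhood.
    pooled = F.max_pool2d(dist_map, kernel_size=3, stride=1, padding=1)
    local_max = (dist_map == pooled)
    t_max = t_max_ratio * dist_map.max()       # first threshold, relative to the global maximum
    keep = local_max & (dist_map > t_max)
    ys, xs = torch.nonzero(keep[0, 0], as_tuple=True)
    return torch.stack([ys, xs], dim=1)        # (num_heads, 2) head point coordinates
```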
When tested on publicly available large crowd image databases, for example the UCF-QNRF database, the crowd image positioning of the invention reaches 85.6% Average Precision, 80.6% Average Recall and 83.1% Average F1-measure. The method learns the multi-scale feature maps of the crowd image through a convolutional neural network, fuses the multi-scale feature maps with the multi-scale coding fusion module to generate the multi-scale coding sequences, models long-distance dependency relationships on the multi-scale coding sequences with the cross window transformation network module to generate the multi-scale long-distance dependency relationship sequences, and finally fuses the multi-scale long-distance dependency relationship sequences with the multi-scale decoding fusion module, so that multi-scale feature information and long-distance dependency relationships are learned simultaneously; this greatly improves the accuracy of crowd image positioning and demonstrates the effectiveness of the method.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of the principles of the present invention and are in no way limiting. Accordingly, any modification, equivalent replacement or improvement made without departing from the spirit and scope of the present invention should be included in the scope of the present invention. Furthermore, the appended claims are intended to cover all such changes and modifications that fall within the scope and boundary of the appended claims, or equivalents of such scope and boundary.

Claims (10)

1. A crowd positioning method based on a trans-scale visual transformation network, the method comprising the steps of:
step S1: constructing a feature extraction module using a pre-trained deep learning model, performing label processing on training crowd images to obtain label distance conversion maps corresponding to the training crowd images, and inputting the training crowd images into the feature extraction module to obtain multi-scale feature maps of the training crowd images;
step S2: constructing a multi-scale coding fusion module, inputting the multi-scale feature maps of the training crowd images into the multi-scale coding fusion module, and fusing the multi-scale feature maps by the multi-scale coding fusion module to obtain multi-scale coding sequences of the training crowd images;
step S3: constructing a cross window transformation network module, inputting the multi-scale coding sequences of the training crowd images into the cross window transformation network module, and learning long-distance dependency relationships of different scales by the cross window transformation network module to obtain multi-scale long-distance dependency relationship sequences of the training crowd images;
step S4: constructing a multi-scale decoding fusion module, inputting the multi-scale long-distance dependency relationship sequences of the training crowd images into the multi-scale decoding fusion module, and fusing the multi-scale long-distance dependency relationship sequences by the multi-scale decoding fusion module to obtain multi-scale decoding feature maps and distance conversion maps of the training crowd images;
step S5: sequentially connecting the feature extraction module, the multi-scale coding fusion module, the cross window transformation network module and the multi-scale decoding fusion module to form a crowd image trans-scale visual transformation network positioning model, constructing a loss calculation module, inputting the distance conversion map of the training crowd image and the label distance conversion map of the training crowd image into the loss calculation module, and optimizing the crowd image trans-scale visual transformation network positioning model using the obtained loss value to obtain an optimal crowd image trans-scale visual transformation network positioning model; and
step S6: in a testing stage, calculating a distance conversion map of an input crowd image using the optimal crowd image trans-scale visual transformation network positioning model, and performing post-processing on the distance conversion map to obtain a positioning result of the input crowd image.
2. The method according to claim 1, wherein the step S1 comprises the steps of:
step S11: determining VGG-16 as the pre-trained deep learning model, initializing parameters of the pre-trained deep learning model, and removing the final global pooling layer and fully connected layers in the pre-trained deep learning model to obtain the feature extraction module;
step S12: preprocessing and performing label processing on the training crowd images to obtain the label distance conversion maps corresponding to the training crowd images; and
step S13: inputting the preprocessed training crowd images into the feature extraction module to obtain the multi-scale feature maps of the training crowd images.
3. The method according to claim 1 or 2, wherein the label distance conversion map corresponding to the training crowd image is represented as:
wherein F(x, y) represents the label distance conversion map obtained after label processing of the training crowd image, (x, y) represents the pixel coordinates of the training crowd image, P(x, y) represents the pixel value of the training crowd image, α and β are adjustable parameters, and C is a constant.
4. The method according to claim 1, wherein said step S2 comprises the steps of:
step S21: fusing the multi-scale feature maps of the training crowd images with the shallower multi-scale feature maps at the corresponding positions and converting them into multi-scale sequences; and
step S22: performing dimension reduction on the multi-scale feature maps to convert them into feature vectors, and adding and fusing the feature vectors with the multi-scale sequences to obtain the multi-scale coding sequences of the training crowd images.
5. The method according to claim 1, wherein said step S3 comprises the steps of:
step S31: constructing a cross window basic unit based on a layer normalization module, a cross window multi-head self-attention module, a multi-layer perceptron module and a residual structure, wherein the layer normalization module is used to normalize the data distribution, the cross window multi-head self-attention module is used to learn global dependency relationships, the multi-layer perceptron module is used to reduce the number of parameters, and the residual structure acts on the outputs of the cross window multi-head self-attention module and the multi-layer perceptron module to alleviate gradient vanishing or explosion;
step S32: connecting B cross window basic units in parallel to construct the cross window transformation network module; and
step S33: inputting the multi-scale coding sequences into the cross window transformation network module, and building long-distance dependency relationships of different scales on the multi-scale coding sequences by the cross window transformation network module to obtain the multi-scale long-distance dependency relationship sequences of the training crowd images.
6. The method of claim 5, wherein the multi-scale long-distance dependency relationship sequence is expressed as:
L_i' = MLP(LN(S_i)) + S_i
S_i = CSWin(LN(L_i)) + L_i
wherein L_i' represents the multi-scale long-distance dependency relationship sequence, L_i represents the multi-scale coding sequence, LN represents the layer normalization module, MLP represents the multi-layer perceptron module, and CSWin represents the cross window multi-head self-attention module.
7. The method according to claim 1, wherein said step S4 comprises the steps of:
step S41: fusing the multi-scale long-distance dependency relationship sequences of the training crowd images with the deeper multi-scale long-distance dependency relationship sequences and converting them into decoding feature maps; and
step S42: fusing the decoding feature maps with the decoding feature maps of the next deeper layer to obtain the multi-scale decoding feature maps of the training crowd images, and performing a 1×1 convolution and 2× up-sampling on the multi-scale decoding feature maps to obtain the distance conversion maps of the training crowd images.
8. The method of claim 1, wherein the loss function in the loss calculation module is expressed as:
L = L_MSE(E, G) + γ·L_MSSSIM(E, G);
wherein L represents the total loss, L_MSE(E, G) represents the MSE loss, L_MSSSIM(E, G) represents the multi-scale SSIM loss, E represents the distance conversion map of the training crowd image, G represents the label distance conversion map of the training crowd image, γ is an adjustable parameter, Q represents the number of training crowd images, E_q represents the distance conversion map of the q-th training crowd image, G_q represents the label distance conversion map of the q-th training crowd image, L_SSIM(E_qnm, G_qnm) represents the SSIM loss, N represents the number of individuals in a single training crowd image, M represents the number of windows selected for computing the SSIM loss, E_qnm represents the distance conversion map in the m-th window of the n-th person in the q-th training crowd image, and G_qnm represents the label distance conversion map in the m-th window of the n-th person in the q-th training crowd image.
9. The method of claim 8, wherein the SSIM loss L_SSIM is expressed as:
wherein μ_E represents the mean of the distance conversion map of the training crowd image, μ_G represents the mean of the label distance conversion map of the training crowd image, σ_E represents the variance of the distance conversion map of the training crowd image, σ_G represents the variance of the label distance conversion map of the training crowd image, σ_EG represents the covariance between the distance conversion map and the label distance conversion map of the training crowd image, and φ_1 and φ_2 are constants.
10. The method of claim 1, wherein the performing post-processing on the distance conversion map to obtain a positioning result of the input crowd image comprises:
step S61: obtaining all local maximum points in the distance conversion map through 3×3 max pooling; and
step S62: setting a first threshold and a second threshold, wherein the first threshold is greater than the second threshold; comparing the local maxima in the distance conversion map with the first threshold and the second threshold; identifying points whose local maximum is greater than the first threshold as individual head points; and, if the global maximum of the distance conversion map is less than the second threshold, confirming that there is no person in the input crowd image.
CN202311074895.9A 2023-08-25 2023-08-25 Crowd positioning method based on trans-scale visual transformation network Active CN116805337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311074895.9A CN116805337B (en) 2023-08-25 2023-08-25 Crowd positioning method based on trans-scale visual transformation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311074895.9A CN116805337B (en) 2023-08-25 2023-08-25 Crowd positioning method based on trans-scale visual transformation network

Publications (2)

Publication Number Publication Date
CN116805337A (en) 2023-09-26
CN116805337B (en) 2023-10-27

Family

ID=88079738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311074895.9A Active CN116805337B (en) 2023-08-25 2023-08-25 Crowd positioning method based on trans-scale visual transformation network

Country Status (1)

Country Link
CN (1) CN116805337B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804992A (en) * 2017-05-08 2018-11-13 电子科技大学 A kind of Demographics' method based on deep learning
CN111242036A (en) * 2020-01-14 2020-06-05 西安建筑科技大学 Crowd counting method based on encoding-decoding structure multi-scale convolutional neural network
CN113139489A (en) * 2021-04-30 2021-07-20 广州大学 Crowd counting method and system based on background extraction and multi-scale fusion network
CN114120361A (en) * 2021-11-19 2022-03-01 西南交通大学 Crowd counting and positioning method based on coding and decoding structure
CN114445765A (en) * 2021-12-23 2022-05-06 上海师范大学 Crowd counting and density estimating method based on coding and decoding structure
CN115311508A (en) * 2022-08-09 2022-11-08 北京邮电大学 Single-frame image infrared dim target detection method based on depth U-type network
CN116091764A (en) * 2022-12-28 2023-05-09 天津师范大学 Cloud image segmentation method based on fusion transformation network
CN116246305A (en) * 2023-01-31 2023-06-09 天津师范大学 Pedestrian retrieval method based on hybrid component transformation network

Also Published As

Publication number Publication date
CN116805337A (en) 2023-09-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant