CN114550109B - Pedestrian flow detection method and system - Google Patents

Pedestrian flow detection method and system

Info

Publication number
CN114550109B
CN114550109B
Authority
CN
China
Prior art keywords
image set
training
image
detection model
amplified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210454852.2A
Other languages
Chinese (zh)
Other versions
CN114550109A (en)
Inventor
Li Jinze
Zhao Zhengjie
Zhang Shu
Zhang Ning
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Microelectronics of CAS
Original Assignee
Institute of Microelectronics of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Microelectronics of CAS filed Critical Institute of Microelectronics of CAS
Priority to CN202210454852.2A priority Critical patent/CN114550109B/en
Publication of CN114550109A publication Critical patent/CN114550109A/en
Application granted granted Critical
Publication of CN114550109B publication Critical patent/CN114550109B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention relates to a pedestrian flow detection method and system, belongs to the technical field of image preprocessing and recognition, and solves the problems that the training time of existing self-attention models exceeds the expected time and that a fully connected layer performs poorly on two-class classification. The method comprises the following steps: acquiring a pre-training image set, a training image set and a verification image set, and amplifying the pre-training image set and the training image set by data augmentation; constructing a detection model comprising a self-attention module and a support vector machine that replaces the fully connected layer; pre-training the detection model with the pre-training image set; formally training the pre-trained detection model with the training image set and then verifying with the verification image set to generate a trained detection model; and acquiring an image to be detected and sending it to the trained detection model to obtain a recognition result. Pre-training on the pre-training set significantly reduces training time, and the support vector machine improves two-class classification performance.

Description

Pedestrian flow detection method and system
Technical Field
The invention relates to the technical field of image preprocessing and recognition, in particular to a pedestrian flow detection method and system.
Background
In recent years, with the rapid development of and breakthroughs in deep learning, deep neural networks have achieved good results in the vision field: recognition accuracy keeps rising and new recognition methods keep emerging, most of them built on CNN structures. Although CNNs offer good recognition accuracy, their network structures have become increasingly complex, the number of layers has grown from tens to hundreds, and training difficulty and computation have risen accordingly. Parameter counts in the hundreds of billions severely limit the deployment of such models on embedded systems or mobile terminals.
The self-attention mechanism is another method for extracting image features. Unlike a CNN, which obtains the global receptive field of an image through sliding convolutions and stacked layers, the self-attention mechanism simplifies the model using two encoders and directly extracts globally correlated features based on query, key, and value. However, self-attention networks have two problems. First, training takes too long: according to Google's published figures, training a self-attention model takes about 3 days (using 24 TPUs), which clearly far exceeds the expected time. Second, self-attention models usually target multi-class problems; for a two-class problem, simply forcing the output of the last fully connected layer down to 2 outputs does not achieve the desired effect.
Disclosure of Invention
In view of the foregoing analysis, embodiments of the present invention provide a pedestrian traffic detection method and system to solve the problems that the training time of existing self-attention models exceeds the expected time and that two-class classification with a fully connected layer performs poorly.
In one aspect, an embodiment of the present invention provides a pedestrian traffic detection method, including: acquiring a pre-training image set, a training image set and a verification image set, wherein the pre-training image set and the training image set are amplified in a data amplification mode; constructing a detection model, wherein the detection model comprises a self-attention module and a support vector machine replacing a full connection layer; pre-training the detection model by using the amplified pre-training image set; performing formal training on the pre-trained detection model by using the amplified training image set, and then performing verification by using the verification image set to generate a trained detection model; and acquiring an image to be detected, and sending the image to be detected to the trained detection model to acquire a recognition result.
The beneficial effects of the above technical scheme are as follows: pre-training on the pre-training data set before training on the training data set can significantly reduce the training time required by the self-attention model without reducing its performance, a great improvement over conventional training methods. In addition, replacing the last fully connected layer with a support vector machine greatly improves two-class classification performance, further reduces training time and model scale, and preserves accuracy.
In a further improvement of the method, the pre-training image set consists of single-pedestrian images and the training image set consists of crowd images, where the resolution of the training image set is higher than that of the pre-training image set.
In a further improvement of the method, augmenting the pre-training image set and the training image set using data augmentation comprises: amplifying both sets with a mirror augmentation matrix so that the same target can be recognized at different angles and in different directions in the images; and amplifying both sets with a scaling augmentation matrix so that persons at different distances and scenes of different sizes can be recognized in the images.
In a further improvement of the method, the detection model further comprises a preprocessing module for preprocessing the images in the amplified image set to strengthen the edge features of targets in the images; the self-attention module performs feature extraction on the preprocessed images; and the support vector machine performs two-class classification on the image features extracted by the self-attention module, where the amplified image set is the amplified pre-training image set during pre-training, or the amplified training image set during formal training.
In a further improvement of the method, preprocessing the images in the amplified image set using the preprocessing module comprises: extracting features of the longitudinal texture of the images with the following first feature matrix: K1 = [1, 2, 1; 0, 0, 0; -1, -2, -1]; and extracting features of the transverse texture of the images with the following second feature matrix: K2 = [1, 0, -1; 2, 0, -2; 1, 0, -1].
In a further improvement of the method, performing feature extraction on the preprocessed image using the self-attention module comprises: segmenting each image in the amplified image set into a plurality of patches using the preprocessing module and flattening each patch to generate the normalized picture feature z_0:

z_0 = [x_class; x_p^1·E; x_p^2·E; …; x_p^N·E] + E_pos

where x_class is a binary judgment token added before the first dimension of the picture to judge whether a pedestrian is present; x_p^1, x_p^2, …, x_p^N are the patches into which each picture is divided, each of size 16×16; E is the embedding matrix of shape (P^2·C)×D; and E_pos is the position coding vector, which position-codes the divided picture. Multi-head self-attention calculation with a residual network is then performed on the normalized picture features using a multi-head self-attention module to obtain the feature calculation result; and the feature calculation result is classified with a multilayer perceptron, from which the coordinates of the target to be detected are extracted.
In a further improvement of the method, performing two-class classification on the image features extracted by the self-attention module using the support vector machine comprises: mapping the processed picture features into a high-dimensional space with an RBF kernel:

K(z_m, z_n) = exp(-||z_m - z_n||^2 / (2σ^2))

and detecting the samples in the high-dimensional space to obtain a linear interface by solving:

min_{ω,b} (1/2)||ω||^2 + F·Σ_i ζ_i
s.t. y_i·(ω^T·x_i + b) ≥ 1 - ζ_i, ζ_i ≥ 0

where z_m, z_n are image features extracted by the self-attention module, σ is the standard deviation of the extracted image features z_m, z_n, ω and b are the coefficients of the linear interface, ω is a high-dimensional parameter matrix that depends on the image features, ω^T is its transpose, ζ_i is the slack variable that realizes the soft margin, x_i and y_i are the embedded samples in the high-dimensional space and their labels, and F is a hyper-parameter of the support vector machine.
In a further improvement of the method, pre-training the detection model using the amplified pre-training image set comprises: performing forward propagation training of the preset model parameters on the detection model using the amplified pre-training image set and calculating the prediction result q, then calculating the cross entropy loss:

L(p, q) = -[p·log(q) + (1 - p)·log(1 - q)];

and performing back propagation training of the preset model parameters on the detection model using the cross entropy loss as the loss function.
In a further improvement of the method, acquiring the image to be detected and sending it to the trained detection model to obtain the recognition result comprises: acquiring the image to be detected by photographing, by taking video frames, and/or from the internet; and sending the image to be detected to the trained detection model so that the detection model recognizes it and obtains the recognition result.
In another aspect, an embodiment of the present invention provides a pedestrian traffic detection system, including: the image set generation module is used for acquiring a pre-training image set, a training image set and a verification image set, wherein the pre-training image set and the training image set are amplified in a data amplification mode; the detection model generation module comprises a self-attention module and a support vector machine replacing a full connection layer; the pre-training module is used for pre-training the detection model by using the amplified pre-training image set; the training module is used for carrying out formal training on the pre-trained detection model by using the amplified training image set and then carrying out verification by using the verification image set so as to generate a trained detection model; and the target identification module is used for acquiring an image to be detected and sending the image to be detected to the trained detection model to acquire an identification result.
Compared with the prior art, the invention can realize at least one of the following beneficial effects:
1. pre-training on the pre-training data set and then training on the training data set can significantly reduce the training time required by the self-attention model without degrading its performance, a great improvement over conventional training methods;
2. using the first feature matrix and the second feature matrix to extract the longitudinal and transverse textures of the images in the amplified image set benefits the subsequent classification judgment;
3. replacing the last fully connected layer with a support vector machine greatly improves two-class classification performance over the fully connected layer, further reduces training time and model scale, and preserves accuracy.
In the invention, the technical schemes can be combined with each other to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout the drawings;
FIG. 1 is a flow chart of a pedestrian traffic detection method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a detection model according to an embodiment of the invention;
FIG. 3 is a block diagram of a self-attention module according to an embodiment of the present invention;
FIG. 4 shows training results obtained by pre-training the detection model using the pre-training image set, according to an embodiment of the present invention;
FIG. 5 is a diagram of an image in an augmented training image set and a pre-processed image obtained after pre-processing the image, according to an embodiment of the invention;
FIG. 6 is a diagram of actual test results according to an embodiment of the present invention;
FIG. 7 is a block diagram of a pedestrian traffic detection system according to an embodiment of the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
A specific embodiment of the present invention discloses a pedestrian traffic detection method, as shown in fig. 1, the pedestrian traffic detection method includes: in step S102, a pre-training image set, a training image set, and a verification image set are obtained, wherein the pre-training image set and the training image set are amplified in a data amplification manner; in step S104, constructing a detection model, wherein the detection model comprises a self-attention module and a support vector machine replacing a full connection layer; in step S106, pre-training the detection model using the amplified pre-training image set; in step S108, performing formal training on the pre-trained detection model by using the amplified training image set, and then performing verification by using the verification image set to generate a trained detection model; in step S110, an image to be detected is obtained, and the image to be detected is sent to the trained detection model to obtain a recognition result.
Hereinafter, the respective steps of the pedestrian traffic detection method according to the embodiment of the present invention will be described in detail with reference to fig. 1 to 3.
In step S102, a pre-training image set, a training image set, and a verification image set are obtained, where the pre-training image set and the training image set are amplified using data augmentation. A pre-training dataset P and a dataset C are acquired, and dataset C is divided into a training set T and a validation set V with a ratio split_ratio = 0.7. The pre-training image set consists of single-pedestrian images; the training image set consists of crowd images, and the resolution of the training image set is higher than that of the pre-training image set. Specifically, amplifying the pre-training image set and the training image set using data augmentation comprises: amplifying both sets with a mirror augmentation matrix so that the same target can be recognized at different angles and in different directions in the images; and amplifying both sets with a scaling augmentation matrix so that persons at different distances and scenes of different sizes can be recognized in the images.
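As a minimal PyTorch sketch of this step (assuming torchvision and scikit-learn are available; the flip probability, the 0.8 to 1.2 scale range, and the placeholder dataset indices are illustrative choices, not values from the patent):

```python
from torchvision import transforms
from sklearn.model_selection import train_test_split

# Mirror augmentation (the mirror matrix H1) and random scaling (the
# scaling matrix H2); RandomAffine draws its scale factor randomly,
# matching the randomly generated H2 parameters described below.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                # mirror, H1
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)),  # scaling, H2
    transforms.ToTensor(),
])

# Divide dataset C into training set T and validation set V, split_ratio = 0.7.
sample_ids = list(range(1000))  # placeholder indices standing in for dataset C
train_ids, val_ids = train_test_split(sample_ids, train_size=0.7)
```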
In step S104, a detection model is constructed, wherein the detection model includes a preprocessing module, a self-attention module, and a support vector machine instead of a fully connected layer. The preprocessing module is used for preprocessing the image so as to strengthen the edge characteristics of the target in the image. The self-attention module is used for extracting features of the preprocessed image. The support vector machine is used for carrying out secondary classification on the image features extracted from the attention module.
In step S106, the detection model is pre-trained using the amplified pre-training image set. Pre-training comprises: preprocessing the images in the amplified image set using the preprocessing module to strengthen the edge features of targets in the images; performing feature extraction on the preprocessed images using the self-attention module; and performing two-class classification on the extracted image features using the support vector machine (FIG. 4 shows the pre-training results), where the amplified image set here is the amplified pre-training image set.
Specifically, pre-training the detection model using the amplified pre-training image set comprises: performing forward propagation training of the preset model parameters on the detection model using the amplified pre-training image set and calculating the prediction result q, then calculating the cross entropy loss:

L(p, q) = -[p·log(q) + (1 - p)·log(1 - q)];

and performing back propagation training of the preset model parameters on the detection model using the cross entropy loss as the loss function.
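A sketch of one pre-training iteration under the scheme above; the stand-in model, learning rate, and dummy batch are assumptions for illustration, while the BCE loss matches the cross entropy L(p, q) given above:

```python
import torch
import torch.nn as nn

# Stand-in for the detection model (64x128 single-pedestrian inputs).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 128, 1), nn.Sigmoid())
criterion = nn.BCELoss()  # L(p, q) = -[p*log(q) + (1-p)*log(1-q)]
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

images = torch.rand(8, 3, 64, 128)          # dummy batch from the amplified set
labels = torch.randint(0, 2, (8,)).float()  # ground truth p

q = model(images).squeeze(1)                # forward propagation: prediction q
loss = criterion(q, labels)                 # cross entropy loss
optimizer.zero_grad()
loss.backward()                             # back propagation of preset parameters
optimizer.step()
```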
In step S108, the pre-trained detection model is formally trained using the amplified training image set and then verified using the verification image set to generate the trained detection model; during formal training the amplified image set is the amplified training image set. Referring to FIG. 2, formal training likewise comprises preprocessing the images in the amplified image set using the preprocessing module to strengthen the edge features of targets in the images, performing feature extraction on the preprocessed images using the self-attention module, and performing two-class classification on the extracted image features using the support vector machine.
FIG. 5 shows an image before and after preprocessing. Specifically, preprocessing the images in the amplified image set using the preprocessing module comprises: extracting features of the longitudinal texture of the images with the following first feature matrix:

K1 = [1, 2, 1; 0, 0, 0; -1, -2, -1];

and extracting features of the transverse texture of the images with the following second feature matrix:

K2 = [1, 0, -1; 2, 0, -2; 1, 0, -1].
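K1 and K2 are the vertical and horizontal Sobel operators. A sketch of the preprocessing step with PyTorch, assuming a grayscale input; combining the two responses into a single edge-strength map is an assumption, since the patent specifies only the two kernels:

```python
import torch
import torch.nn.functional as F

# First feature matrix K1: longitudinal (vertical) texture.
K1 = torch.tensor([[ 1.,  2.,  1.],
                   [ 0.,  0.,  0.],
                   [-1., -2., -1.]]).view(1, 1, 3, 3)
# Second feature matrix K2: transverse (horizontal) texture.
K2 = torch.tensor([[1., 0., -1.],
                   [2., 0., -2.],
                   [1., 0., -1.]]).view(1, 1, 3, 3)

def strengthen_edges(gray):                # gray: (N, 1, H, W) tensor
    gy = F.conv2d(gray, K1, padding=1)     # longitudinal edge response
    gx = F.conv2d(gray, K2, padding=1)     # transverse edge response
    return torch.sqrt(gx ** 2 + gy ** 2)   # combined edge-strength map
```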
Referring to FIG. 3, performing feature extraction on the preprocessed image using the self-attention module comprises: segmenting each image in the amplified image set into a plurality of patches using the preprocessing module and flattening each patch to generate the normalized picture feature z_0:

z_0 = [x_class; x_p^1·E; x_p^2·E; …; x_p^N·E] + E_pos

where x_class is a binary judgment token (class token) added before the first dimension of the picture to judge whether a pedestrian is present; x_p^1, x_p^2, …, x_p^N are the patches into which each picture is divided, each of size 16×16; E is the embedding matrix of shape (P^2·C)×D; and E_pos is the position coding vector, which position-codes the divided picture. A multi-head self-attention module then performs multi-head self-attention calculation on the normalized picture features to obtain the feature calculation result; and a multilayer perceptron classifies the feature calculation result, from which the coordinates of the target to be detected are extracted.
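A condensed sketch of the patch flattening and embedding that produces z_0; the 16×16 patch size and the class-token layout follow the text, while the 224×224 input size and zero-initialized parameters are assumptions:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch=16, channels=3, dim=512):
        super().__init__()
        n = (img_size // patch) ** 2                          # N = HW / P^2 patches
        self.proj = nn.Linear(patch * patch * channels, dim)  # embedding matrix E
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))       # binary judgment token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))   # position coding E_pos
        self.patch = patch

    def forward(self, x):                             # x: (B, C, H, W)
        b, c, _, _ = x.shape
        p = self.patch
        # Split the image into P x P patches and flatten each one.
        x = x.unfold(2, p, p).unfold(3, p, p)         # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        z = self.proj(x)                              # x_p^j * E for each patch
        cls = self.cls.expand(b, -1, -1)
        return torch.cat([cls, z], dim=1) + self.pos  # z_0
```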
Performing two-class classification on the image features extracted by the self-attention module using the support vector machine comprises: mapping the processed picture features into a high-dimensional space with an RBF kernel:

K(z_m, z_n) = exp(-||z_m - z_n||^2 / (2σ^2))

and detecting the samples in the high-dimensional space to obtain a linear interface by solving:

min_{ω,b} (1/2)||ω||^2 + F·Σ_i ζ_i
s.t. y_i·(ω^T·x_i + b) ≥ 1 - ζ_i, ζ_i ≥ 0

where z_m, z_n are image features extracted by the self-attention module, σ is the standard deviation of the extracted image features z_m, z_n, ω and b are the coefficients of the linear interface, ω is a high-dimensional parameter matrix that depends on the image features, ω^T is its transpose, ζ_i is the slack variable that realizes the soft margin, x_i and y_i are the embedded samples in the high-dimensional space and their labels, and F is a hyper-parameter of the support vector machine.
In step S110, the image to be detected is acquired and sent to the trained detection model to obtain the recognition result (see FIG. 6). Specifically, this comprises: acquiring the image to be detected by photographing, by taking video frames, and/or from the internet; and sending it to the trained detection model so that the model recognizes the image and obtains the recognition result.
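A hypothetical end-to-end inference sketch composing the pieces sketched in this description: the `augment` pipeline from the augmentation sketch, a feature extractor `vit` standing in for the self-attention network, and a trained `svm` classifier (sketched later in the detailed embodiment); none of these names come from the patent itself:

```python
import torch
from PIL import Image

img = Image.open("crowd.jpg")          # photo, video frame, or internet image
x = augment(img).unsqueeze(0)          # reuse the preprocessing pipeline, (1, C, H, W)
with torch.no_grad():
    feats = vit(x)                     # 768-dim features from the self-attention model
count = int(svm.predict(feats.numpy()).sum())  # pedestrians recognized
print(f"pedestrian flow in the current picture: {count}")
```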
In another embodiment of the present invention, a pedestrian traffic detection system is disclosed, and referring to fig. 7, the pedestrian traffic detection system includes: an image set generating module 702, configured to obtain a pre-training image set, a training image set, and a verification image set, where the pre-training image set and the training image set are amplified in a data augmentation manner; a detection model generation module 704, including a self-attention module and a support vector machine instead of a fully connected layer; a pre-training module 706, configured to pre-train the detection model using the amplified pre-training image set; a training module 708, configured to perform formal training on the pre-trained detection model using the amplified training image set, and then perform verification using the verification image set to generate a trained detection model; and the target recognition module 710 is configured to obtain an image to be detected, and send the image to be detected to the trained detection model to obtain a recognition result.
Hereinafter, a pedestrian flow rate detection method according to an embodiment of the present invention is described in detail by way of specific examples.
In a first aspect of the embodiments of the present invention, an embodiment of a pedestrian traffic detection method is provided. The method comprises the following steps:
s1, acquiring a pedestrian image set M and a crowd image set C, using the data set M as a pre-training image set, and dividing the training image set T and a verification image set V by split _ ratio =0.7 in the data set C;
s2, building a self-attention model by using a PyTorch depth learning framework;
s3, pre-training the self-attention model by using the pedestrian image set M database to preset model parameters; performing large-scale formal training on the self-attention model by using the crowd image set C database;
s4, obtaining a to-be-detected pedestrian flow picture, sending the to-be-detected pedestrian flow picture to the trained self-attention model, and obtaining a recognition result;
and S5, obtaining the people flow data of the current picture according to the recognition result.
The pedestrian flow detection system provided by the invention broadens the application scenarios of pedestrian flow detection, simplifies the model implementation, and reduces training difficulty.
In some embodiments, step S1, acquiring the pedestrian image set M and the crowd image set C, and generating a training image set according to the pedestrian image set M and the crowd image set C, further includes:
the Pedestrian image set M uses MIT-CBCL Pedestrian Database, is a single Pedestrian image set M, and is in a ppm format and 64 x 128 in resolution; the crowd image set C uses Caltech Peerstrong Detection Benchmark, is a Pedestrian database with larger scale at present, adopts a vehicle-mounted camera to shoot, and has the resolution of 640 multiplied by 480; the training image set is augmented by data augmentation, and since the identification target is a pedestrian image, a mirror image augmentation matrix is selected for the training image set MH 1Image scaling and amplification matrixH 2The amplification is carried out, and the amplification is carried out,H 2and parameters are randomly generated to ensure the robustness of the model when the model identifies pedestrians with different distances and different angles in the image.H 1H 2Are all matrices, which are for imagesAnd transforming the matrix.
The training image set and test set are divided using the train_test_split() function of the scikit-learn library: 70% of the data set goes to the training image set and the remaining 30% to the test set.
Specifically, the MIT-CBCL Pedestrian Database contains 924 pictures in total, with a shoulder-to-foot distance of about 80 pixels; it contains only front and back views and no negative samples. The Caltech Pedestrian Detection Benchmark labels approximately 250,000 frames, 350,000 bounding boxes, and 2,300 pedestrians.
In some embodiments, step S2, building the self-attention model using the PyTorch deep learning framework, further comprises:
A Transformer network is built with PyTorch deep learning. The Transformer network divides the picture x ∈ R^(H×W×C) into patches, each of size 16×16, and expands it to

x_p ∈ R^(N×(P^2·C)), N = HW / P^2

where P is the patch size, C is the number of picture channels, H is the picture height, W is the picture width, and N is computed from H, W, and P. A binary judgment token x_class is added before the first dimension of the picture to judge whether a pedestrian is present, and the picture features are flattened:

z_0 = [x_class; x_p^1·E; x_p^2·E; …; x_p^N·E] + E_pos

where x_p^1, x_p^2, …, x_p^N are the patches into which each picture is divided, each of size 16×16; E is the embedding matrix; and E_pos is the position coding vector, whose role is to encode the split picture into the required shape. Specifically, dividing a picture merely cuts it into blocks of size (P^2×C); a coding matrix is also needed to process the divided picture and convert it into the input dimension required by the network, i.e., to encode the pixels. Multiplying each picture block by the coding matrix E yields a (1×D) vector, the encoding of that picture block, which can then be input to the network for processing and analysis. E is an embedding matrix of shape (P^2·C)×D, where D is the embedding dimension, usually 512, meaning each picture embedding is a 512-dimensional vector. The normalized (LN) picture feature z_0 then undergoes multi-head self-attention (MSA) computation with a residual network:

z'_l = MSA(LN(z_{l-1})) + z_{l-1}
the normalized image features are processed using multi-headed self-attention computation, and the output result is attention-weighted image features. The features described above are input into three fully-connected layers, and the three results are outputQKVFor every two featuresQAndKinner products are made, and the obtained result is used asVThe processing of the image features is corresponding to each image featureVAnd performing weighting processing. It should be noted that the generationQKVThe parameters of the full connection layer can be trained, and the three parameters are trained in the process of training the network, so that the three parameters can well reflect the image characteristics. Then calculating the characteristic to obtain z'1Classification was performed using a MultiLayer Perceptron (MLP):
Figure 357296DEST_PATH_IMAGE018
after 6-layer MSA and 6-layer MLP are used, MLP is used to extract the coordinates of the object to be detectedR=(x,y,w,h)=MLP(Z 1)。RIs a detection box, which consists of 4 parameters:xandyto detect the coordinates of the upper left corner of the box,wandhthe width and height of the detection frame, respectively. The multi-layer perceptron is used for classification, and particularly, the full connection layer is used in the network to realize the function of the multi-layer perceptron. The input of the multilayer perceptron is the aforementioned result of multi-head self-attention calculation, and the method comprises the steps of firstly carrying out Normalization (Layer Normalization) operation on input features, then inputting the operation result into the multilayer perceptron (namely, a fully-connected network Layer in a network), and finally adding the output of the network and the input of the multilayer perceptron at the beginning to obtain a final image feature extraction result.
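One encoder block of this network can be sketched as follows; the 512-dim embedding and the 6-layer depth follow the text, while the head count of 8 and the MLP hidden width are assumptions:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=512, heads=8, mlp_dim=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, z):
        y = self.ln1(z)
        z = z + self.msa(y, y, y)[0]    # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        z = z + self.mlp(self.ln2(z))   # z_l  = MLP(LN(z'_l)) + z'_l
        return z

encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])  # 6 MSA + 6 MLP layers
```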
Taking the Vision Transformer network as the feature extractor, the extracted feature result z_l, whose dimension is 768, serves as the input for building the support vector machine model. The hybrid support vector machine model uses an RBF kernel to map the feature vectors into a high-dimensional space:

K(z_m, z_n) = exp(-||z_m - z_n||^2 / (2σ^2))
The samples are detected in the high-dimensional space, where the interface is linear; ω and b are the coefficients of the interface obtained when the support vector machine performs class judgment, so that the interface of the high-dimensional space classifies the input image features. The interface coefficients are not one-dimensional: because the processed picture features are mapped into a high-dimensional space, the corresponding coefficients are also multidimensional, with the specific dimension depending on the image features, here 768. ω is therefore a 768-dimensional parameter matrix that needs to be optimized:

min_{ω,b} (1/2)||ω||^2 + F·Σ_i ζ_i
s.t. y_i·(ω^T·x_i + b) ≥ 1 - ζ_i, ζ_i ≥ 0

where ζ_i is the slack variable that realizes the soft-margin SVM. This is the simplified form of the support vector machine objective to be optimized.
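The classifier described here corresponds to a soft-margin SVM with an RBF kernel; a sketch using the scikit-learn library the text itself cites, where the regularization parameter C plays the role of the hyper-parameter F and the random features are placeholders:

```python
import numpy as np
from sklearn.svm import SVC

# z_train: (n_samples, 768) feature vectors z_l from the self-attention
# network; y_train: pedestrian / non-pedestrian labels.
z_train = np.random.randn(100, 768)     # placeholder features
y_train = np.random.randint(0, 2, 100)  # placeholder labels

# RBF kernel K(z_m, z_n) = exp(-||z_m - z_n||^2 / (2*sigma^2)),
# i.e. gamma = 1 / (2*sigma^2); C corresponds to the hyper-parameter F.
svm = SVC(kernel="rbf", gamma="scale", C=1.0)
svm.fit(z_train, y_train)
predictions = svm.predict(z_train)
```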
Cross entropy is used as the loss function for network training. After the image features pass through the hybrid self-attention network and support vector machine model, the probability q that the detection result is a pedestrian is output; with the actual result p, the cross entropy is:

L(p, q) = -[p·log(q) + (1 - p)·log(1 - q)]

and the model is back-propagated using this cross entropy as the loss function to train the model parameters.
PyTorch can be regarded as NumPy with GPU support, or as a powerful deep neural network framework with automatic differentiation. Besides Facebook, it has been adopted by Twitter, CMU, Salesforce, and other institutions.
Unlike the "sliding weighting" idea of a convolutional network, the attention mechanism computes the query, key, and value of each part of a picture and obtains the weight of that part's value from the match between query and key, passing the weight to the next layer. In this way the model can understand the image globally.
The support vector machine method is built on VC dimension theory and the structural risk minimization principle. On classification problems the SVM is robust, generalizes well to unknown data, and performs especially well compared with other traditional machine learning algorithms when data are scarce. The model is built with the sklearn.svm.SVC class of the scikit-learn library, which is implemented on top of libsvm.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by a computer program that is stored in a computer-readable storage medium and that, when executed, may include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), or a Random Access Memory (RAM). The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated. It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, where the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples; within the idea of an embodiment of the invention, also technical features in the above embodiment or in different embodiments may be combined and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.
Those skilled in the art will appreciate that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program, which is stored in a computer readable storage medium, to instruct related hardware. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (9)

1. A pedestrian flow detection method is characterized by comprising the following steps:
acquiring a pre-training image set, a training image set and a verification image set, wherein the pre-training image set and the training image set are amplified in a data amplification mode, the pre-training image set is a single pedestrian image, the training image set is a crowd image, and the resolution of the training image set is higher than that of the pre-training image set;
constructing a detection model, wherein the detection model comprises a self-attention module and a support vector machine replacing the last fully connected layer, the support vector machine being used to perform two-class classification on the image features extracted by the self-attention module;
pre-training the detection model parameters by using the amplified pre-training image set;
performing formal training on the pre-trained detection model by using the amplified training image set, and then performing verification by using the verification image set to generate a trained detection model;
and acquiring an image to be detected, and sending the image to be detected to the detection model after training so as to acquire a recognition result.
2. The pedestrian traffic detection method of claim 1, wherein augmenting the pre-training image set and the training image set using data augmentation comprises:
amplifying the pre-training image set and the training image set by using a mirror image amplification matrix to identify the same target at different angles and different directions in the images; and
and amplifying the pre-training image set and the training image set using a scaling augmentation matrix to identify persons at different distances and scenes of different sizes in the images.
3. The pedestrian traffic detection method according to claim 2, wherein the detection model further includes a preprocessing module, wherein,
preprocessing the images in the amplified image set by using the preprocessing module so as to strengthen the edge characteristics of the target in the images;
and performing feature extraction on the preprocessed image by using the self-attention module, wherein the amplified image set is an amplified pre-training image set in the pre-training process, or the amplified image set is an amplified training image set in the formal training process.
4. The pedestrian traffic detection method according to claim 3, wherein preprocessing the images in the amplified image set using the preprocessing module includes:
extracting features of the longitudinal texture of the images in the amplified image set with the following first feature matrix:
K1 = [1, 2, 1; 0, 0, 0; -1, -2, -1]; and
extracting features of the transverse texture of the images in the amplified image set with the following second feature matrix:
K2 = [1, 0, -1; 2, 0, -2; 1, 0, -1].
5. The pedestrian flow detection method according to claim 4, wherein performing feature extraction on the preprocessed image using the self-attention module comprises:
segmenting each image in the amplified image set into a plurality of patches using the preprocessing module and flattening each patch to generate the normalized picture feature z_0:

z_0 = [x_class; x_p^1·E; x_p^2·E; …; x_p^N·E] + E_pos

wherein x_class is a binary judgment token added before the first dimension of the picture to judge whether a pedestrian is present; x_p^1, x_p^2, …, x_p^N are the patches into which each picture is divided, each of size 16×16; E is the embedding matrix of shape (P^2·C)×D; and E_pos is the position coding vector, representing position coding of the divided picture;
performing multi-head self-attention calculation on the normalized picture features using a multi-head self-attention module to obtain feature calculation results; and
classifying the feature calculation results using a multilayer perceptron and extracting the coordinates of the target to be detected from the classified feature calculation results.
6. The pedestrian flow detection method according to claim 4, wherein performing two-class classification on the image features extracted by the self-attention module using the support vector machine comprises:
mapping the processed picture features into a high-dimensional space using an RBF kernel:

K(z_m, z_n) = exp(-||z_m - z_n||^2 / (2σ^2))

detecting samples in the high-dimensional space to obtain a linear interface by:

min_{ω,b} (1/2)||ω||^2 + F·Σ_i ζ_i
s.t. y_i·(ω^T·x_i + b) ≥ 1 - ζ_i, ζ_i ≥ 0

wherein z_m, z_n are the image features extracted by the self-attention module, σ is the standard deviation of the extracted image features z_m, z_n, ω and b are the coefficients of the linear interface, ω is a high-dimensional parameter matrix that depends on the image features, ω^T is the transposed matrix, ζ_i is the slack variable, x_i and y_i are the embedded samples in the high-dimensional space and their labels, and F is a hyper-parameter of the support vector machine.
7. The pedestrian traffic detection method according to claim 2, wherein pre-training the detection model using the amplified pre-training image set comprises:
performing forward propagation training of preset model parameters on the detection model using the amplified pre-training image set and calculating the prediction result q, then calculating the cross entropy loss:

L(p, q) = -[p·log(q) + (1 - p)·log(1 - q)];

and performing back propagation training of the preset model parameters on the detection model using the cross entropy loss as a loss function.
8. The pedestrian flow detection method according to claim 1, wherein obtaining an image to be detected and sending the image to be detected to the trained detection model to obtain a recognition result comprises:
acquiring the image to be detected by using a photographing mode, a video frame taking mode and/or an internet mode;
and sending the image to be detected to the trained detection model so that the detection model identifies the image to be detected to obtain an identification result.
9. A pedestrian flow detection system, comprising:
the image set generation module is used for acquiring a pre-training image set, a training image set and a verification image set, wherein the pre-training image set and the training image set are amplified in a data augmentation mode, the pre-training image set is a single pedestrian image, the training image set is a crowd image, and the resolution of the training image set is higher than that of the pre-training image set;
the detection model generation module comprises a self-attention module and a support vector machine replacing the last fully connected layer, the support vector machine being used to perform two-class classification on the image features extracted by the self-attention module;
the pre-training module is used for pre-training the detection model parameters by using the amplified pre-training image set;
the training module is used for carrying out formal training on the pre-trained detection model by using the amplified training image set and then carrying out verification by using the verification image set so as to generate a trained detection model;
and the target identification module is used for acquiring an image to be detected and sending the image to be detected to the trained detection model to acquire an identification result.
CN202210454852.2A 2022-04-28 2022-04-28 Pedestrian flow detection method and system Active CN114550109B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210454852.2A CN114550109B (en) 2022-04-28 2022-04-28 Pedestrian flow detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210454852.2A CN114550109B (en) 2022-04-28 2022-04-28 Pedestrian flow detection method and system

Publications (2)

Publication Number Publication Date
CN114550109A CN114550109A (en) 2022-05-27
CN114550109B (en) 2022-07-19

Family

ID=81666610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210454852.2A Active CN114550109B (en) 2022-04-28 2022-04-28 Pedestrian flow detection method and system

Country Status (1)

Country Link
CN (1) CN114550109B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490099A (en) * 2019-07-31 2019-11-22 武汉大学 A kind of subway common location stream of people's analysis method based on machine vision
US20200218937A1 (en) * 2019-01-03 2020-07-09 International Business Machines Corporation Generative adversarial network employed for decentralized and confidential ai training
US20200285896A1 (en) * 2019-03-09 2020-09-10 Tongji University Method for person re-identification based on deep model with multi-loss fusion training strategy
CN113936302A (en) * 2021-11-03 2022-01-14 厦门市美亚柏科信息股份有限公司 Training method and device for pedestrian re-recognition model, computing equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210419A (en) * 2019-06-05 2019-09-06 中国科学院长春光学精密机械与物理研究所 The scene Recognition system and model generating method of high-resolution remote sensing image
CN111639692B (en) * 2020-05-25 2022-07-22 南京邮电大学 Shadow detection method based on attention mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200218937A1 (en) * 2019-01-03 2020-07-09 International Business Machines Corporation Generative adversarial network employed for decentralized and confidential ai training
US20200285896A1 (en) * 2019-03-09 2020-09-10 Tongji University Method for person re-identification based on deep model with multi-loss fusion training strategy
CN110490099A (en) * 2019-07-31 2019-11-22 武汉大学 A kind of subway common location stream of people's analysis method based on machine vision
CN113936302A (en) * 2021-11-03 2022-01-14 厦门市美亚柏科信息股份有限公司 Training method and device for pedestrian re-recognition model, computing equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Efficient-network vehicle model recognition with fused attention mechanism; Liu Changyuan et al.; Journal of Zhejiang University (Engineering Science); 2022-04-01; Vol. 56, No. 4; pp. 775-781 *

Also Published As

Publication number Publication date
CN114550109A (en) 2022-05-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant