CN110414338B - Pedestrian re-identification method based on sparse attention network - Google Patents

Pedestrian re-identification method based on sparse attention network

Info

Publication number
CN110414338B
CN110414338B · CN201910543465.4A
Authority
CN
China
Prior art keywords
layer
image
pedestrian
convolution
residual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910543465.4A
Other languages
Chinese (zh)
Other versions
CN110414338A (en)
Inventor
Zhang Canlong
Xie Sheng
Li Zhixin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wanzhida Technology Co ltd
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201910543465.4A priority Critical patent/CN110414338B/en
Publication of CN110414338A publication Critical patent/CN110414338A/en
Application granted granted Critical
Publication of CN110414338B publication Critical patent/CN110414338B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2136Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on sparsity criteria, e.g. with an overcomplete basis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-identification method based on a sparse attention network. Shallow features are first transmitted losslessly to the deep layers through short connections; the main convolution features of the image are then extracted by a backbone residual network consisting of continuously stacked residual modules; the detail features of the image that are easily lost are extracted by normalized compression-excitation modules embedded in the backbone residual network; finally, the two sets of features are multiplied together, the features obtained by the first part are added, and the result is fed into a fully connected layer and a classification-regression layer to obtain classification and regression results. The sparse attention network of the present invention can effectively extract detailed pedestrian features on multiple pedestrian re-identification data sets.

Description

Pedestrian re-identification method based on sparse attention network
Technical Field
The invention relates to the technical field of computer vision, in particular to a pedestrian re-identification method based on a sparse attention network.
Background
Pedestrian re-identification refers to re-confirming the identity of the same pedestrian across different surveillance scenes, so as to compensate for the visual limitations of a single camera. It can be widely applied to intelligent image understanding, intelligent video analysis, intelligent video investigation, and related fields. Current pedestrian re-identification methods fall mainly into two categories: methods based on hand-crafted features and methods based on deep convolutional neural networks. The former comprises two parts, hand-crafted feature extraction and feature similarity measurement; the latter integrates feature expression and similarity measurement into a single model and, by jointly optimizing the two, achieves performance far beyond that of traditional methods. With the rapid development of deep learning and the recent appearance of large-scale pedestrian re-identification data sets, deep pedestrian re-identification has developed rapidly and has become the mainstream approach in the field.
The paper "Deep Residual Learning for Image Recognition" (published in Conference on Computer Vision and Pattern Recognition Conference) proposes Residual Learning on the basis of a classical convolutional neural network, so that the convolutional neural network becomes deeper and has better Recognition effect, and a new era of widely used Deep Learning in various fields is opened. The paper "Show, attentive and tele:" Neural Image capture Generation with Visual Attention mechanism "(published in International Conference on Machine Learning), applies the Attention mechanism in the natural language processing field to intelligent Image processing, and makes a leap improvement in the Image description equi-directional direction, opening a new era of the application of the Attention mechanism to intelligent Image processing. The paper "Squeeze-and-Excitation Networks" (published in Conference on Computer Vision and Pattern Recognition Conference) adds an attention module (compression-Excitation module) on the basis of the residual network, so that the model can extract more detailed features of the picture and improve the accuracy. The paper "Beyond Part Models: Person Retrieval with referred Part firing (model for region concentration: pedestrian search using Refined region Pooling)" (published in "European Conference on Computer Vision") proposes to divide the pedestrian picture level into six parts evenly on the basis of the residual network, which can make the model focus more on the details, thereby improving the classification accuracy.
Most existing pedestrian re-identification methods take the residual network as their basic framework and improve pedestrian classification accuracy by modifying its structure, but these improvements do not exploit the attention mechanism's strength at focusing on details, so a large number of effective features are easily lost when the model extracts image features. It is therefore necessary to devise a method that allows the model to extract more detailed image features during deep learning.
Disclosure of Invention
The invention aims to solve the problem that a great amount of effective features are lost when the existing pedestrian re-identification method is used for deep learning, and provides a pedestrian re-identification method based on a sparse attention network.
In order to solve the problems, the invention is realized by the following technical scheme:
the pedestrian re-identification method based on the sparse attention network comprises the following steps:
step 1, dividing the images in the known pedestrian re-identification data set into a training set and a testing set, and respectively preprocessing the images in the training set and the testing set;
step 2, copying all training images in the training set obtained in the step 1 to respectively obtain an original training image and a copied training image;
step 3, as for the original training image obtained in the step 2, firstly, sending the original training image into a convolutional layer to extract convolution characteristics of the image, then sending the extracted convolution characteristics into a maximum pooling layer to extract maximum pooling characteristics of the image, and then sending the extracted maximum pooling characteristics into 3 first residual error modules which are repeatedly superposed to extract first residual error convolution characteristics of the image;
step 4, sending the first residual convolution characteristics obtained in the step 3 into a first normalization compression-excitation module to extract first attention characteristics of the image;
step 5, multiplying the first residual convolution characteristic obtained in the step 3 and the first attention characteristic obtained in the step 4 to obtain a first sparse attention characteristic;
step 6, adding the copied training image obtained in the step 2 and the first sparse attention feature obtained in the step 5 to obtain a first-stage image feature;
step 7, copying all the first-stage image characteristics obtained in the step 6 to respectively obtain the original first-stage image characteristics and copied first-stage image characteristics;
step 8, sending the original first-stage image feature obtained in the step 7 into 4 second residual error modules which are repeatedly superposed to extract a second residual error convolution feature of the image;
step 9, sending the second residual convolution characteristics obtained in the step 8 into a second normalized compression-excitation module to extract second attention characteristics of the image;
step 10, multiplying the second residual convolution characteristic obtained in the step 8 by the second attention characteristic obtained in the step 9 to obtain a second sparse attention characteristic;
step 11, adding the copied first-stage image characteristics obtained in the step 7 and the second sparse attention characteristics obtained in the step 10 to obtain second-stage image characteristics;
step 12, copying all the second-stage image characteristics obtained in the step 11 to respectively obtain the original second-stage image characteristics and copied second-stage image characteristics;
step 13, sending the original second-stage image feature obtained in the step 12 into 6 repeatedly-superposed third residual error modules to extract a third residual error convolution feature of the image;
step 14, sending the third residual convolution characteristic obtained in the step 13 to a third normalized compression-excitation module to extract a third attention characteristic of the image;
step 15, multiplying the third residual convolution characteristic obtained in the step 13 by the third attention characteristic obtained in the step 14 to obtain a third sparse attention characteristic;
step 16, adding the copied second-stage image features obtained in the step 12 and the third sparse attention features obtained in the step 15 to obtain third-stage image features;
step 17, copying all the third-stage image characteristics obtained in the step 16 to respectively obtain the original third-stage image characteristics and the copied third-stage image characteristics;
step 18, sending the original third-stage image feature obtained in step 17 into 3 fourth residual error modules which are repeatedly superposed to extract a fourth residual error convolution feature of the image;
step 19, sending the fourth residual convolution characteristic obtained in the step 18 into a fourth normalized compression-excitation module to extract a fourth attention characteristic of the image;
step 20, multiplying the fourth residual convolution characteristic obtained in the step 18 and the fourth attention characteristic obtained in the step 19 to obtain a fourth sparse attention characteristic;
step 21, adding the copied third-stage image feature obtained in the step 17 and the fourth sparse attention feature obtained in the step 20 to obtain a fourth-stage image feature;
step 22, sending all the fourth-stage image features obtained in the step 21 into an average pooling layer to extract average pooling features of the images;
step 23, sending all the average pooling characteristics obtained in the step 22 into a classification layer, thereby obtaining a pedestrian category prediction model;
step 24, testing the pedestrian category prediction model obtained in the step 23 by using all the test images in the test set obtained in the step 2, thereby obtaining a final pedestrian category prediction model;
and 25, screening all pedestrian images from the video acquired in real time, sending all the pedestrian images into a final pedestrian category prediction model for identification and classification, and finding out all the pedestrian images of the specified object.
In the step 1, the pedestrian re-identification data sets are Market-1501 and DukeMTMC-reID.
In step 1, the preprocessing processes of the training images in the training set and the test images in the test set are respectively as follows: the preprocessing process of the training images in the training set comprises the following steps: firstly, cutting a training image, horizontally turning the cut image, and then normalizing the turned training image; the preprocessing process of the test images in the test set comprises the following steps: and cutting the test image.
In the scheme, the first residual error module, the second residual error module, the third residual error module and the fourth residual error module have the same structure and each comprise 3 convolutional layers and 1 short connection; the first convolutional layer has C/4 filters with stride 1 and kernel size 1 × 1, the second convolutional layer has C/4 filters with stride 1 and kernel size 3 × 3, and the third convolutional layer has C filters with stride 1 and kernel size 1 × 1; the short connection links the head of the first convolutional layer to the tail of the third convolutional layer, and the output of the whole residual module is obtained by adding the input of the first convolutional layer to the output of the third convolutional layer; the channel value C of the first residual module is 256, the channel value C of the second residual module is 512, the channel value C of the third residual module is 1024, and the channel value C of the fourth residual module is 2048.
In the above solution, the first normalized compression-excitation module, the second normalized compression-excitation module, the third normalized compression-excitation module, and the fourth normalized compression-excitation module have the same structure, and each of them includes 7 layers: wherein the first layer is an average pooling layer; the second layer is a dimensionality reduction layer with C/16 filters of step size 1 and kernel size 1 × 1; the third layer is a batch normalization layer which executes C/16 normalization operations; the fourth layer is a linear rectifying layer; the fifth layer is a dimension-up layer with C filters with step size 1 and kernel size 1 × 1; the sixth layer is a batch normalization layer which executes C normalization operations; the seventh layer is a Sigmoid activation layer;
the channel value C of the first normalized compression-excitation module is 256, the channel value C of the second normalized compression-excitation module is 512, the channel value C of the third normalized compression-excitation module is 1024, and the channel value C of the fourth normalized compression-excitation module is 2048.
In the above scheme, the linear rectification function executed by the fourth layer, i.e. the linear rectification layer, is:
f(x) = max(0, x)
where x is the input feature of the fourth layer.
In the above scheme, the Sigmoid activation function executed by the seventh layer, i.e. the Sigmoid activation layer, is:
σ(z) = 1 / (1 + e^(-z))
where z is the input feature of the seventh layer.
Compared with the prior art, the invention combines several advanced network structures and designs a sparse attention mechanism on top of them, and therefore has the following characteristics:
(1) By using a sparse normalized compression-excitation network, i.e. adding a small number of attention modules to the residual network structure, the sparse attention mechanism effectively avoids the loss of necessary feature-map information during convolution.
(2) A sparse attention mechanism is provided, i.e. a small number of attention modules (or other feature-extraction modules) are added to a deep network model, so that the model keeps its previous feature-extraction ability with essentially unchanged complexity, while gaining the ability to focus on effective information that would otherwise be discarded when the feature map is reduced.
(3) Normalizing the compression-excitation module yields a normalized compression-excitation module that allows more features to be activated by the activation function than previous attention feature-extraction modules.
Drawings
Fig. 1 is a schematic structural diagram of a pedestrian re-identification model (sparse normalized compression-excitation network) according to the present invention.
Fig. 2 is a schematic structural diagram of a residual error module.
Fig. 3 is a schematic diagram of the structure of a normalized compression-excitation module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to specific examples.
The pedestrian re-identification model constructed by the invention is a sparse normalized compression-excitation network, as shown in fig. 1, and mainly comprises a trunk layer positioned in the middle, 4 short connections positioned on one side of the trunk layer, and 4 normalized compression-excitation modules positioned on the other side of the trunk layer.
(1) Trunk (backbone) layer:
The first layer is a convolutional layer composed of filters with a kernel size of 7 × 7; it performs down-sampling, so the picture becomes 1/4 of its original size and the amount of computation is reduced.
The second layer is a maximum pooling layer, i.e. the maximum value is taken in each 2 × 2 pixel region, again to reduce the amount of model computation.
The third through sixteenth layers form the backbone network, built by sequentially stacking 3 first residual modules (ResNet module 1), 4 second residual modules (ResNet module 2), 6 third residual modules (ResNet module 3) and 3 fourth residual modules (ResNet module 4). The first through fourth residual modules share the same structure; the only difference is the number of feature maps entering and leaving each module, i.e. the channel value C. When a pedestrian picture is input into the deep convolutional neural network, the output features are mainly extracted by this backbone network.
Referring to fig. 2, the residual modules (ResNet modules) extract the main image features; each consists of a short connection and 3 convolutional layers. The first convolutional layer has C/4 filters with stride 1 and kernel size 1 × 1 and performs a convolution to extract convolution features of the image; the second convolutional layer has C/4 filters with stride 1 and kernel size 3 × 3; the third convolutional layer has C filters with stride 1 and kernel size 1 × 1. The residual module also contains a short connection linking the head of the first convolutional layer to the tail of the third convolutional layer; that is, the image features entering the residual module reach the tail of the third convolutional layer along two paths, the short connection and the three-layer convolution, and the values of the two paths are added to give the output of the residual module.
The first residual error module, the second residual error module, the third residual error module and the fourth residual error module have the same structure, and the difference is as follows: the channel value C of the first residual module is 256, the channel value C of the second residual module is 512, the channel value C of the third residual module is 1024, and the channel value C of the fourth residual module is 2048.
Each residual module contains three convolutional layers: the first convolution layer is provided with C/4 filters (filters) with step length of 1 and kernel size of 1 multiplied by 1 to carry out convolution operation and extract convolution characteristics of the image; the second convolution layer is provided with C/4 filters (filters) with the step length of 1 and the kernel size of 3 multiplied by 3 for convolution operation to extract the convolution characteristics of the image; the third convolution layer has C filters (filters) with step size 1 and kernel size 1 × 1, and performs convolution operation to extract convolution features of the image. In addition, each residual module also has a short connection connecting the head of the first convolution layer and the tail of the third convolution layer, namely: the input of the residual error module reaches the tail part of the third convolution layer through two operation paths of short connection and three-layer convolution operation, and the values of the two paths are added to obtain the output of the residual error module.
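As an illustration only, the following PyTorch sketch shows one such bottleneck residual module under the parameters stated above (C/4, C/4 and C filters, all with stride 1, plus a head-to-tail short connection). The class name, the ReLU nonlinearities between the convolutions and the absence of batch normalization inside the block are assumptions; the patent text does not specify them.

```python
import torch.nn as nn

class ResidualBottleneck(nn.Module):
    """Residual module sketch: 1x1 (C/4) -> 3x3 (C/4) -> 1x1 (C) convolutions
    plus a short connection from the head of the first layer to the tail of the third."""

    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 4
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1, stride=1)        # C/4 filters, 1x1
        self.conv2 = nn.Conv2d(mid, mid, kernel_size=3, stride=1, padding=1)  # C/4 filters, 3x3
        self.conv3 = nn.Conv2d(mid, channels, kernel_size=1, stride=1)        # C filters, 1x1
        self.relu = nn.ReLU(inplace=True)                                     # assumed nonlinearity

    def forward(self, x):
        identity = x                      # short-connection path
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        return out + identity             # add the two paths at the tail of the third layer
```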
The seventeenth layer is an average pooling layer; it gathers the scattered feature values into a single vector so that the classification function of the next layer can classify them conveniently. The eighteenth layer is a fully connected layer that uses a Softmax function to predict and classify probabilities over 751 values.
(2) Short connection:
the short connection can transmit the pictures of the network shallow layer into the deep layer without loss, so that the information loss in the convolution process can be reduced.
(3) Normalized compression-excitation module:
the normalized compression-excitation module (NSE module) is an attention module for extracting image detail features. Different from a compression-excitation module in a compression-excitation network, the normalization compression-excitation module adds normalization operation on the basis of the compression-excitation module, so that more effective features can pass through an activation function, and a model can extract more effective features. Specifically, a batch normalization layer is added after a dimensionality reduction full-connection layer and a dimensionality increasing full-connection layer in a compression-excitation module, all photos in training are normalized to be 0 in the mean value of all pixel values of each photo, and 1 in the variance.
Referring to fig. 3, the first normalized compression-excitation module, the second normalized compression-excitation module, the third normalized compression-excitation module, and the fourth normalized compression-excitation module are identical in structure, except that: the channel value C of the first normalized compression-excitation module is 256, the channel value C of the second normalized compression-excitation module is 512, the channel value C of the third normalized compression-excitation module is 1024, and the channel value C of the fourth normalized compression-excitation module is 2048.
Each normalized compression-excitation module contains seven layers of operations: the first layer is an average pooling layer, i.e. the pixel values of each of the C feature maps are averaged; the second layer is a dimension-reduction layer that reduces the C maps from the previous layer to C/16 maps using C/16 filters with stride 1 and kernel size 1 × 1; the third layer is a batch normalization layer that performs C/16 normalization operations; the fourth layer is a linear rectification function (ReLU), computed as
f(x) = max(0, x)
wherein x is the input feature of the fourth layer; the fifth layer is a dimension-raising layer that raises the C/16 maps from the previous layer back to C maps using C filters with stride 1 and kernel size 1 × 1; the sixth layer is a batch normalization layer that performs C normalization operations; the seventh layer is a Sigmoid activation function, computed as
σ(z) = 1 / (1 + e^(-z))
Where z is the input feature of the seventh layer.
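A minimal PyTorch sketch of this seven-layer normalized compression-excitation module, assuming the 1 × 1 filters are written as 1 × 1 convolutions and the reduction ratio is 16 (the C/16 figure above), might look as follows; treating the module output as per-channel attention weights is an assumption consistent with how the module output is multiplied with the residual features described below.

```python
import torch.nn as nn

class NormalizedSqueezeExcitation(nn.Module):
    """Normalized compression-excitation (NSE) module sketch: a squeeze-excitation block
    with batch normalization after both the dimension-reduction and dimension-raising layers."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = channels // reduction
        self.pool = nn.AdaptiveAvgPool2d(1)                  # layer 1: average pooling per channel
        self.down = nn.Conv2d(channels, mid, kernel_size=1)  # layer 2: dimension reduction, C/16 filters
        self.bn1 = nn.BatchNorm2d(mid)                       # layer 3: batch normalization (C/16)
        self.relu = nn.ReLU(inplace=True)                    # layer 4: linear rectification f(x) = max(0, x)
        self.up = nn.Conv2d(mid, channels, kernel_size=1)    # layer 5: dimension raising, C filters
        self.bn2 = nn.BatchNorm2d(channels)                  # layer 6: batch normalization (C)
        self.sigmoid = nn.Sigmoid()                          # layer 7: Sigmoid activation

    def forward(self, x):
        w = self.pool(x)
        w = self.relu(self.bn1(self.down(w)))
        w = self.sigmoid(self.bn2(self.up(w)))
        return w                                             # attention weights in (0, 1), one per channel
```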
Compared with previous models that use dozens of attention modules, the pedestrian re-identification model of the invention does not stack an attention module behind every residual module. Only where the C value of the residual modules changes are four improved attention modules (normalized compression-excitation modules) sparsely placed behind the residual modules, which extracts image detail features more effectively. Finally, the features extracted by the attention module are multiplied by the features extracted by the residual modules, the lossless shallow feature map carried by the short connection is added, and the result is fed into the next residual module with a changed C value.
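Combining the two sketches above, one stage of the sparse attention network could be assembled as follows: a stack of residual modules, a single NSE module, element-wise multiplication of their outputs, and the short connection added back. Keeping the channel count constant inside the stage is a simplifying assumption, since the transitions where C changes between stages are not detailed in the text.

```python
import torch.nn as nn

class SparseAttentionStage(nn.Module):
    """One stage sketch: residual trunk, one NSE attention module, and a short connection."""

    def __init__(self, channels: int, num_blocks: int):
        super().__init__()
        # Reuses the ResidualBottleneck and NormalizedSqueezeExcitation sketches above.
        self.trunk = nn.Sequential(*[ResidualBottleneck(channels) for _ in range(num_blocks)])
        self.nse = NormalizedSqueezeExcitation(channels)

    def forward(self, x):
        skip = x                    # copied input carried losslessly by the short connection
        feat = self.trunk(x)        # residual convolution features
        attn = self.nse(feat)       # attention features from the sparse NSE module
        return feat * attn + skip   # sparse attention features plus the shallow features
```

Under this sketch, the four stages would be instantiated with (C, number of residual modules) = (256, 3), (512, 4), (1024, 6) and (2048, 3), matching the channel values listed above.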
A pedestrian re-identification method based on a sparse attention network comprises the following specific steps:
(I) processing the given pedestrian re-identification data sets:
(1) carrying out image preprocessing on the large pedestrian re-identification data sets Market-1501 and DukeMTMC-reID:
(1.1) resize all images to 288 × 144 pixels.
(1.2) split the whole data set into a training set and a test set at a ratio of 7:3. Crop the training photos to 256 × 128 pixels, flip them horizontally, and finally normalize each pedestrian photo so that the mean of all its pixel values is 0 and the variance is 1; resize the test-set photos to 256 × 128 pixels without further processing (see the preprocessing sketch below).
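A possible torchvision rendering of this preprocessing is sketched below; the per-photo zero-mean, unit-variance normalization is written as a Lambda transform, and the resize/crop ordering simply follows the description above.

```python
import torchvision.transforms as T

# Training-set preprocessing: resize to 288x144, crop to 256x128, horizontal flip,
# then normalize each photo so the mean of its pixel values is 0 and the variance is 1.
train_transform = T.Compose([
    T.Resize((288, 144)),
    T.RandomCrop((256, 128)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Lambda(lambda t: (t - t.mean()) / (t.std() + 1e-6)),  # per-photo normalization
])

# Test-set preprocessing: resize to 256x128 only.
test_transform = T.Compose([
    T.Resize((256, 128)),
    T.ToTensor(),
])
```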
(II) training the constructed pedestrian re-identification model, namely the sparse normalized compression-excitation network, with the training set to obtain a pedestrian category prediction model:
The training is roughly divided into four parts: the first part transmits shallow features losslessly to the deep layers through short connections; the second part extracts the main convolution features of the image through a backbone residual network built from continuously stacked residual modules; the third part extracts the easily lost detail features of the image through the sparse attention modules (normalized compression-excitation modules) embedded in the backbone residual network; the fourth part multiplies the features obtained by the second and third parts, adds the features obtained by the first part, and feeds the result into the fully connected layer and the classification-regression layer to obtain classification and regression results. The sparse attention network of the invention can effectively extract detailed pedestrian features on multiple pedestrian re-identification data sets.
(2) A process of residual error feature extraction for training images in the training set, namely:
the first stage is as follows:
(2.1) copy the input image into two identical photos. The first copy is convolved with C = 64 filters of kernel size 7 × 7 to extract the convolution features of the image; the stride of the convolution is 2, i.e. the convolution is performed at every other pixel.
(2.2) send the convolution features obtained in step (2.1) into a maximum pooling layer with C = 64 channels and a 2 × 2 window (keeping the pixel with the maximum value among the 4 pixels) to extract image features; the stride of the pooling operation is 1, i.e. pooling is performed at every pixel.
(2.3) send the image features obtained in step (2.2) into the three repeatedly stacked first residual modules for feature extraction; the channel value C of the first residual module is 256.
(3) send the residual convolution features obtained in step (2.3) into the first normalized compression-excitation module for attention feature extraction; the channel value C of the first normalized compression-excitation module is 256.
(4) multiply the residual convolution features obtained in step (2) by the attention features obtained in step (3) to obtain the sparse attention features.
(5) add the second copied image obtained in step (2.1) to the sparse attention features obtained in step (4) to obtain the first-stage image features.
And a second stage:
(6) send the first-stage image features obtained in step (5) into the second-stage sparse attention feature extraction module, i.e. repeat steps (2) to (5), to obtain the second-stage image features. In the second stage, the channel value C of the second residual module and the second normalized compression-excitation module is 512.
And a third stage:
(7) send the second-stage image features obtained in step (6) into the third-stage sparse attention feature extraction module, i.e. repeat steps (2) to (5), to obtain the third-stage image features. In the third stage, the channel value C of the third residual module and the third normalized compression-excitation module is 1024.
A fourth stage:
(8) send the third-stage image features obtained in step (7) into the fourth-stage sparse attention feature extraction module, i.e. repeat steps (2) to (5), to obtain the fourth-stage image features.
In the fourth stage, the channel value C of the fourth residual module and the fourth normalized compression-excitation module is 2048.
The fifth stage:
(9) send the fourth-stage image features obtained in step (8) into the average pooling layer, averaging the pixel values of each of the 2048 feature maps.
(10) send the average pooling features obtained in step (9) into the classification layer; the classifier's Softmax function converts the 2048 features into 751 probability values, each between 0 and 1 and summing to 1, and the index of the highest probability value gives the predicted pedestrian category.
The calculation formula of the Softmax function is as follows:
S_i = e^(V_i) / Σ_j e^(V_j)
where V_i is the output of the preceding output unit of the classifier, i indexes the C classes, and S_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements; Softmax converts the C-class pedestrian outputs into relative probabilities for easier understanding and comparison, and C here is 751.
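For illustration, the average pooling layer and the Softmax classification layer of steps (9) and (10) could be written in PyTorch as follows; the 2048-to-751 mapping matches the figures above, and returning both the predicted class index and the probabilities is an assumption made for clarity.

```python
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Average pooling + fully connected layer + Softmax over 751 pedestrian classes."""

    def __init__(self, in_features: int = 2048, num_classes: int = 751):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # one averaged value per feature map
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        v = self.pool(x).flatten(1)                    # 2048-dimensional pooled feature vector
        logits = self.fc(v)                            # V_i in the Softmax formula
        probs = F.softmax(logits, dim=1)               # S_i = exp(V_i) / sum_j exp(V_j)
        return probs.argmax(dim=1), probs              # predicted class index and class probabilities
```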
(III) testing the pedestrian category prediction model by using the test set to obtain a final pedestrian category prediction model:
the testing set tests the tested prediction model of the pedestrian category to verify the training effect and performance of the model.
(IV) carrying out pedestrian re-identification using the final pedestrian category prediction model:
screening all pedestrian images from the video acquired in real time, sending all the pedestrian images into a final pedestrian category prediction model for identification and classification, and finding out all the pedestrian images of the specified object so as to finish pedestrian re-identification.
It should be noted that, although the above embodiments of the present invention are illustrative, the invention is not limited to them. The sparse concept of the invention covers both sparse attention modules and sparse short connections. For sparse attention modules, the invention is not limited to adding exactly four attention modules to a model; it also covers adding one, two, three, or four attention modules sparsely, in various orders, within a model. For sparse short connections, likewise, not only four short connections may be added to the model, but also one, two, three, or four short connections sparsely added in various orders. Other embodiments obtained by those skilled in the art in light of the teachings of the present invention, without departing from its principles, are considered to be within the scope of the present invention.

Claims (7)

1. The pedestrian re-identification method based on the sparse attention network is characterized by comprising the following steps of:
step 1, dividing the images in the known pedestrian re-identification data set into a training set and a testing set, and respectively preprocessing the images in the training set and the testing set;
step 2, copying all training images in the training set obtained in the step 1 to respectively obtain an original training image and a copied training image;
step 3, as for the original training image obtained in the step 2, firstly, sending the original training image into a convolutional layer to extract convolution characteristics of the image, then sending the extracted convolution characteristics into a maximum pooling layer to extract maximum pooling characteristics of the image, and then sending the extracted maximum pooling characteristics into 3 first residual error modules which are repeatedly superposed to extract first residual error convolution characteristics of the image;
step 4, sending the first residual convolution characteristics obtained in the step 3 into a first normalization compression-excitation module to extract first attention characteristics of the image;
step 5, multiplying the first residual convolution characteristic obtained in the step 3 and the first attention characteristic obtained in the step 4 to obtain a first sparse attention characteristic;
step 6, adding the copy training image obtained in the step 2 and the first sparse attention feature obtained in the step 5 to obtain a first-stage image feature;
step 7, copying all the first-stage image characteristics obtained in the step 6 to respectively obtain the original first-stage image characteristics and copied first-stage image characteristics;
step 8, sending the original first-stage image feature obtained in the step 7 into 4 second residual error modules which are repeatedly superposed to extract a second residual error convolution feature of the image;
step 9, sending the second residual convolution characteristics obtained in the step 8 into a second normalized compression-excitation module to extract second attention characteristics of the image;
step 10, multiplying the second residual convolution characteristic obtained in the step 8 by the second attention characteristic obtained in the step 9 to obtain a second sparse attention characteristic;
step 11, adding the copied first-stage image characteristics obtained in the step 7 and the second sparse attention characteristics obtained in the step 10 to obtain second-stage image characteristics;
step 12, copying all the second-stage image characteristics obtained in the step 11 to respectively obtain the original second-stage image characteristics and copied second-stage image characteristics;
step 13, sending the original second-stage image feature obtained in the step 12 into 6 repeatedly-superposed third residual error modules to extract a third residual error convolution feature of the image;
step 14, sending the third residual convolution characteristic obtained in the step 13 to a third normalized compression-excitation module to extract a third attention characteristic of the image;
step 15, multiplying the third residual convolution characteristic obtained in the step 13 by the third attention characteristic obtained in the step 14 to obtain a third sparse attention characteristic;
step 16, adding the copied second-stage image features obtained in the step 12 and the third sparse attention features obtained in the step 15 to obtain third-stage image features;
step 17, copying all the third-stage image characteristics obtained in the step 16 to respectively obtain the original third-stage image characteristics and the copied third-stage image characteristics;
step 18, sending the original third-stage image feature obtained in step 17 into 3 fourth residual error modules which are repeatedly superposed to extract a fourth residual error convolution feature of the image;
step 19, sending the fourth residual convolution characteristic obtained in the step 18 into a fourth normalized compression-excitation module to extract a fourth attention characteristic of the image;
step 20, multiplying the fourth residual convolution characteristic obtained in the step 18 and the fourth attention characteristic obtained in the step 19 to obtain a fourth sparse attention characteristic;
step 21, adding the copied third-stage image feature obtained in the step 17 and the fourth sparse attention feature obtained in the step 20 to obtain a fourth-stage image feature;
step 22, sending all the fourth-stage image features obtained in the step 21 into an average pooling layer to extract average pooling features of the images;
step 23, sending all the average pooling characteristics obtained in the step 22 into a classification layer, thereby obtaining a pedestrian category prediction model;
step 24, testing the pedestrian category prediction model obtained in the step 23 by using all the test images in the test set obtained in the step 2, thereby obtaining a final pedestrian category prediction model;
and 25, screening all pedestrian images from the video acquired in real time, sending all the pedestrian images into a final pedestrian category prediction model for identification and classification, and finding out all the pedestrian images of the specified object.
2. The sparse attention network-based pedestrian re-identification method of claim 1, wherein in step 1, the pedestrian re-identification data sets are Market-1501 and DukeMTMC-reID.
3. The pedestrian re-identification method based on the sparse attention network as claimed in claim 1, wherein in step 1, the pre-processing of the training images in the training set and the test images in the test set are respectively as follows:
the preprocessing process of the training images in the training set comprises the following steps: firstly, cutting a training image, horizontally turning the cut image, and then normalizing the turned training image;
the preprocessing process of the test images in the test set comprises the following steps: and cutting the test image.
4. The pedestrian re-identification method based on the sparse attention network as claimed in claim 1, wherein the first residual module, the second residual module, the third residual module and the fourth residual module have the same structure and comprise 3 convolutional layers and 1 short connection; wherein the first layer of convolutional layers has C/4 filters with step size of 1 and kernel size of 1 × 1, the second layer of convolutional layers has C/4 filters with step size of 1 and kernel size of 3 × 3, and the third layer of convolutional layers has C filters with step size of 1 and kernel size of 1 × 1; the head of the first layer of convolution layer and the tail of the third layer of convolution layer are connected in a short-circuit mode, and the output of the whole residual error module is obtained after the input of the first layer of convolution layer and the output of the third layer of convolution layer are added;
the channel value C of the first residual error module is 256, the channel value C of the second residual error module is 512, the channel value C of the third residual error module is 1024, and the channel value C of the fourth residual error module is 2048.
5. The sparse attention network-based pedestrian re-identification method of claim 1, wherein the first normalized compression-excitation module, the second normalized compression-excitation module, the third normalized compression-excitation module and the fourth normalized compression-excitation module have the same structure and comprise 7 layers: wherein the first layer is an average pooling layer; the second layer is a dimensionality reduction layer with C/16 filters of step size 1 and kernel size 1 × 1; the third layer is a batch normalization layer which executes C/16 normalization operations; the fourth layer is a linear rectifying layer; the fifth layer is a dimension-up layer with C filters with step size 1 and kernel size 1 × 1; the sixth layer is a batch normalization layer which executes C normalization operations; the seventh layer is a Sigmoid activation layer;
the channel value C of the first normalized compression-excitation module is 256, the channel value C of the second normalized compression-excitation module is 512, the channel value C of the third normalized compression-excitation module is 1024, and the channel value C of the fourth normalized compression-excitation module is 2048.
6. The sparse attention network-based pedestrian re-identification method of claim 5, wherein the fourth layer, the linear rectification layer, performs a linear rectification function as follows:
f(x) = max(0, x)
where x is the input feature of the fourth layer.
7. The sparse attention network-based pedestrian re-identification method of claim 5, wherein the Sigmoid activation function executed by the seventh, Sigmoid activation layer is:
σ(z) = 1 / (1 + e^(-z))
where z is the input feature of the seventh layer.
CN201910543465.4A 2019-06-21 2019-06-21 Pedestrian re-identification method based on sparse attention network Active CN110414338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910543465.4A CN110414338B (en) 2019-06-21 2019-06-21 Pedestrian re-identification method based on sparse attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910543465.4A CN110414338B (en) 2019-06-21 2019-06-21 Pedestrian re-identification method based on sparse attention network

Publications (2)

Publication Number Publication Date
CN110414338A CN110414338A (en) 2019-11-05
CN110414338B true CN110414338B (en) 2022-03-15

Family

ID=68359592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910543465.4A Active CN110414338B (en) 2019-06-21 2019-06-21 Pedestrian re-identification method based on sparse attention network

Country Status (1)

Country Link
CN (1) CN110414338B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161224A (en) * 2019-12-17 2020-05-15 沈阳铸造研究所有限公司 Casting internal defect grading evaluation system and method based on deep learning
CN111325161B (en) * 2020-02-25 2023-04-18 四川翼飞视科技有限公司 Method for constructing human face detection neural network based on attention mechanism
CN112016434A (en) * 2020-08-25 2020-12-01 安徽索贝数码科技有限公司 Lens motion identification method based on attention mechanism 3D residual error network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9336436B1 (en) * 2013-09-30 2016-05-10 Google Inc. Methods and systems for pedestrian avoidance
CN105938544A (en) * 2016-04-05 2016-09-14 大连理工大学 Behavior identification method based on integrated linear classifier and analytic dictionary
WO2017201638A1 (en) * 2016-05-23 2017-11-30 Intel Corporation Human detection in high density crowds
CN107610154A (en) * 2017-10-12 2018-01-19 广西师范大学 The spatial histogram of multi-source target represents and tracking
CN108010051A (en) * 2017-11-29 2018-05-08 广西师范大学 Multisource video subject fusion tracking based on AdaBoost algorithms
CN109583502A (en) * 2018-11-30 2019-04-05 天津师范大学 A kind of pedestrian's recognition methods again based on confrontation erasing attention mechanism
CN109800710A (en) * 2019-01-18 2019-05-24 北京交通大学 Pedestrian's weight identifying system and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9773163B2 (en) * 2013-11-14 2017-09-26 Click-It, Inc. Entertainment device safety system and related methods of use
JP6688990B2 (en) * 2016-04-28 2020-04-28 パナソニックIpマネジメント株式会社 Identification device, identification method, identification program, and recording medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9336436B1 (en) * 2013-09-30 2016-05-10 Google Inc. Methods and systems for pedestrian avoidance
CN105938544A (en) * 2016-04-05 2016-09-14 大连理工大学 Behavior identification method based on integrated linear classifier and analytic dictionary
WO2017201638A1 (en) * 2016-05-23 2017-11-30 Intel Corporation Human detection in high density crowds
CN107610154A (en) * 2017-10-12 2018-01-19 广西师范大学 The spatial histogram of multi-source target represents and tracking
CN108010051A (en) * 2017-11-29 2018-05-08 广西师范大学 Multisource video subject fusion tracking based on AdaBoost algorithms
CN109583502A (en) * 2018-11-30 2019-04-05 天津师范大学 A kind of pedestrian's recognition methods again based on confrontation erasing attention mechanism
CN109800710A (en) * 2019-01-18 2019-05-24 北京交通大学 Pedestrian's weight identifying system and method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Attention-Aware Compositional Network for Person Re-identification; Jing Xu et al.; Computer Vision and Pattern Recognition; 20180516; 2119-2128 *
CA3Net: Contextual-Attentional Attribute-Appearance Network for Person Re-Identification; Jiawei Liu et al.; Computer Vision and Pattern Recognition; 20181119; 1-9 *
Person Re-identification with Cascaded Pairwise Convolutions; Yicheng Wang et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 20181217; 1470-1478 *
Where-and-When to Look: Deep Siamese Attention Networks for Video-Based Person Re-Identification; Lin Wu et al.; Computer Vision and Pattern Recognition; 20181014; 1412-1424 *
Deep neural network image recognition based on attention convolution modules; Yuan Jiajie; Computer Engineering and Applications; 20190130; 1-13 *

Also Published As

Publication number Publication date
CN110414338A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN112308158B (en) Multi-source field self-adaptive model and method based on partial feature alignment
CN110263705B (en) Two-stage high-resolution remote sensing image change detection system oriented to remote sensing technical field
CN108960141B (en) Pedestrian re-identification method based on enhanced deep convolutional neural network
CN108256482B (en) Face age estimation method for distributed learning based on convolutional neural network
CN109447977B (en) Visual defect detection method based on multispectral deep convolutional neural network
CN110414338B (en) Pedestrian re-identification method based on sparse attention network
CN113052210A (en) Fast low-illumination target detection method based on convolutional neural network
CN109410184B (en) Live broadcast pornographic image detection method based on dense confrontation network semi-supervised learning
CN111368754B (en) Airport runway foreign matter detection method based on global context information
CN111325165A (en) Urban remote sensing image scene classification method considering spatial relationship information
CN112750129B (en) Image semantic segmentation model based on feature enhancement position attention mechanism
CN108090472A (en) Pedestrian based on multichannel uniformity feature recognition methods and its system again
CN113902622B (en) Spectrum super-resolution method based on depth priori joint attention
CN115965864A (en) Lightweight attention mechanism network for crop disease identification
CN114187308A (en) HRNet self-distillation target segmentation method based on multi-scale pooling pyramid
CN114882497A (en) Method for realizing fruit classification and identification based on deep learning algorithm
CN114782997A (en) Pedestrian re-identification method and system based on multi-loss attention adaptive network
CN112364747A (en) Target detection method under limited sample
CN114359578A (en) Application method and system of pest and disease damage identification intelligent terminal
CN114463651A (en) Crop pest and disease identification method based on ultra-lightweight efficient convolutional neural network
CN109165675A (en) Image classification method based on periodically part connection convolutional neural networks
CN110490876B (en) Image segmentation method based on lightweight neural network
Sangamesh et al. A Novel Approach for Recognition of Face by Using Squeezenet Pre-Trained Network
Nie Face expression classification using squeeze-excitation based VGG16 network
CN117173595A (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231026

Address after: 518000 1002, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Patentee after: Shenzhen Wanzhida Technology Co.,Ltd.

Address before: 541004 No. 15 Yucai Road, Qixing District, Guilin, the Guangxi Zhuang Autonomous Region

Patentee before: Guangxi Normal University

TR01 Transfer of patent right