WO2024021394A1 - Pedestrian re-identification method and device fusing global features and ladder-type local features - Google Patents

Pedestrian re-identification method and device fusing global features and ladder-type local features Download PDF

Info

Publication number
WO2024021394A1
WO2024021394A1 (PCT/CN2022/133947, CN2022133947W)
Authority
WO
WIPO (PCT)
Prior art keywords
pedestrian
feature
identification
features
local
Prior art date
Application number
PCT/CN2022/133947
Other languages
English (en)
French (fr)
Inventor
张登银
王敬余
赵乾
Original Assignee
南京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京邮电大学 filed Critical 南京邮电大学
Priority to US18/094,880 (published as US20230162522A1)
Publication of WO2024021394A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • The invention belongs to the technical field of digital image processing and relates to a pedestrian re-identification method and device fusing global features and ladder-type local features; specifically, it relates to a pedestrian re-identification method fusing global features with ladder-type local features guided by block weights.
  • The pedestrian re-identification problem is a cross-camera image retrieval problem that aims to use a query image to retrieve, from an image gallery, images of pedestrians with the same identity. Original pedestrian video images are first collected from multiple cameras, and other images of the same pedestrian are then confirmed through feature extraction and similarity measurement.
  • According to the loss type, person re-identification is divided into representation learning and metric learning.
  • Representation learning treats the pedestrian re-identification problem as an image classification and verification problem.
  • Metric learning maps image features into a high-dimensional feature space and measures the similarity of two images by distance. Since 2014, more robust features extracted by convolutional neural networks, combined with simpler distance measurement formulas to find more accurate pedestrian images, have greatly improved the accuracy and generalization ability of pedestrian re-identification models.
  • Many scholars have proposed higher-quality algorithms in this research direction, and pedestrian re-identification research has seen explosive growth.
  • The present invention provides a pedestrian re-identification method and device fusing global features and ladder-type local features. Based on the fusion of global features with ladder-type local features guided by block weights, it significantly improves the pedestrian re-identification effect without introducing excessive computation, and it addresses the low accuracy of pedestrian re-identification algorithms caused by image occlusion, changes in shooting angle, low resolution, and similar phenomena.
  • A pedestrian re-identification method includes:
  • the construction method of the pedestrian re-identification network model includes:
  • The pedestrian re-identification network includes a backbone network, an improved global feature branch, and a block-weight-guided ladder-type local feature extraction branch. The backbone network is ResNet50 loaded with pre-trained weights. The improved global feature branch is connected to backbone stage Conv5_x and includes a channel attention module, a multiple receptive field fusion module, a GeM pooling layer, and a fully connected layer; it is configured to extract pedestrian global features. The block-weight-guided ladder-type local feature extraction branch is connected after backbone stage Conv4_x and includes a ladder blocking layer, a pooling layer, a spatial attention module, and a fully connected layer; it is configured to extract pedestrian local features. The pedestrian global features and pedestrian local features are concatenated as the final pedestrian features.
  • The construction method of the improved global feature branch includes:
  • The feature map obtained from backbone stage Conv5_x is used as input. Salient pedestrian information is first extracted by the channel attention module; feature information of pedestrians under different receptive fields is then obtained and fused by the multiple receptive field fusion module; GeM pooling is then applied by the GeM pooling layer to obtain a 2048-dimensional feature vector constrained by a hard-sample-mining triplet loss. Meanwhile, the feature vector is connected to a fully connected layer for dimensionality reduction, yielding a 512-dimensional global feature constrained by a cross-entropy loss; the triplet loss and the cross-entropy loss are jointly optimized during training.
  • In the channel attention module, the input feature map is processed by both maximum pooling and average pooling to obtain two one-dimensional vectors, which are then fed into a weight-sharing multi-layer perceptron; the outputs are added element-wise and passed through a Sigmoid activation to obtain the corresponding attention weights.
  • The GeM pooling layer formula is:

$$f^{(k)}=\left(\frac{1}{|X_k|}\sum_{x\in X_k}x^{p_k}\right)^{1/p_k}$$

  • where X is the input of the GeM pooling layer, f is its output, and p_k is a hyperparameter learned during back-propagation.
  • The multiple receptive field fusion module contains 3 branches: the input feature X is convolved by three branches with 3×3 kernels and dilation rates of 1, 2, and 3 respectively to obtain 3 feature maps, which are fused into the final output X′.
  • The construction method of the block-weight-guided ladder-type local feature extraction branch includes:
  • Block weights are calculated for the 9 local feature maps obtained after the feature map from backbone stage Conv4_x passes through the spatial attention module and the ladder blocking layer, and the block weights are used to guide the cross-entropy loss.
  • The ladder blocking layer first divides the complete pedestrian feature map evenly into 12 horizontal strips; initially the 1st strip is the starting strip and every 4 consecutive strips form one local region; the starting strip is then shifted downward with a stride of 1 for ladder-type partitioning, finally yielding 9 local feature maps.
  • The spatial attention module first applies channel-wise maximum pooling and average pooling to the input H×W×C features to obtain two H×W×1 channel descriptions, which are concatenated along the channel dimension; then a 7×7 convolution layer with a Sigmoid activation yields the H×W×1 spatial attention weight coefficients.
  • The block weight calculation method includes: feeding the H×W×1 spatial attention weight coefficients output by the spatial attention module into the ladder blocking layer to obtain 9 local coefficient blocks, and dividing the coefficient sum of each local coefficient block by the total coefficient sum of the 9 coefficient blocks to obtain 9 block weights.
  • The improved global feature branch loss is L_global = L_Softmax + L_tri_hard, where L_Softmax is the cross-entropy loss and L_tri_hard is the hard-sample-mining triplet loss:

$$L_{Softmax}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}f_i+b_{y_i}}}{\sum_{k=1}^{H}e^{W_k^{T}f_i+b_k}}$$

  • where N is the batch size, H is the number of pedestrian identities, f_i is the feature vector of image i, its true label is y_i, W is the weight, b is the bias, W_k^T is the transpose of the weight vector of the k-th pedestrian, and b_k is the bias vector of the k-th pedestrian.
  • A hard-example-mining triplet loss function is used for training. The triplet loss selects an anchor an, a positive sample pos, and a negative sample neg to form a triplet: P pedestrians are selected in each batch and K images are selected per pedestrian, so all triplets come from the P×K images. The loss is computed by finding, via Euclidean distance, the positive sample farthest from the anchor and the negative sample nearest to it:

$$L_{tri\_hard}=\frac{1}{P\times K}\sum_{an\in A}\left[\max_{pos\in A}d_{an,pos}-\min_{neg\in B}d_{an,neg}+mar\right]_{+}$$

  • where mar is a set hyperparameter, d_an,pos is the distance between the anchor and a positive sample, d_an,neg is the distance between the anchor and a negative sample, and A and B denote different sample sets within the P×K images, i.e., the selected positive and negative samples do not overlap; minimizing the loss maximizes the anchor-negative distance and minimizes the anchor-positive distance.
  • The block-weight-guided ladder-type local feature branch loss is

$$L_{local}=\sum_{i=1}^{n}W_i\,L_{Softmax\_i}$$

  • where n is the number of local feature blocks, L_Softmax_i denotes the cross-entropy loss of the i-th local feature map, and W_i is the block weight of the i-th local feature map.
  • the present invention provides a pedestrian re-identification device, including a processor and a storage medium;
  • the storage medium is used to store instructions
  • the processor is configured to operate according to the instructions to perform the steps of the method according to the first aspect.
  • the present invention provides a storage medium on which a computer program is stored;
  • when the computer program is executed by a processor, the steps of the method described in the first aspect are implemented.
  • The goal of this invention is to learn a more robust pedestrian feature representation to cope with complex pedestrian re-identification scenarios and achieve a good recognition effect.
  • The present invention designs a pedestrian re-identification method fusing global features and ladder-type local features.
  • The method uses ResNet50 as the backbone network to extract features from pedestrian images and then connects two branches: the global feature branch and the block-weight-guided ladder-type local feature branch.
  • The global branch introduces a channel attention module to extract more salient information from the feature map; it is then connected to the multiple receptive field fusion module, which extracts and then fuses features of the same input under different receptive fields to fully capture pedestrian context information.
  • The local branch introduces the ladder blocking layer, which partitions the feature map horizontally in a ladder pattern and can extract more detailed pedestrian information; at the same time, it guides the cross-entropy loss by calculating block weights so that the trained model pays more attention to important pedestrian information. Finally, a dual-branch joint training strategy is used to train the model.
  • The present invention proposes a pedestrian re-identification method fusing global features with block-weight-guided ladder-type local features, which improves the accuracy of pedestrian re-identification.
  • First, the ResNet50 network is used as the backbone to extract global features of pedestrian images; these are then sent into the designed branch networks to extract global and local features respectively; finally, the features of the two branches are fused, so the resulting features contain both more abstract global features and local features with more detailed information, and are therefore more robust.
  • The present invention adopts generalized-mean pooling as the aggregation module; this pooling lies between maximum pooling and mean pooling and, through a unified pooling type, can better capture feature differences.
  • The multiple receptive field fusion module can effectively aggregate features of different receptive fields, further improving pedestrian re-identification performance.
  • Figure 1 is a framework diagram of a pedestrian re-identification network according to an embodiment of the present invention
  • Figure 2 is a schematic diagram of a channel attention module according to an embodiment of the present invention.
  • Figure 3 is a schematic diagram of a multiple receptive field fusion module according to an embodiment of the present invention.
  • Figure 4 is a schematic diagram of a spatial attention module according to an embodiment of the present invention.
  • The provided pedestrian re-identification method, which fuses global features with block-weight-guided ladder-type local features, includes the following steps:
  • Step 1: Construct the pedestrian re-identification network, including the backbone network, the improved global feature branch, and the block-weight-guided ladder-type local feature extraction branch, as shown in Figure 1;
  • the backbone ResNet50 is divided into 5 stages, and the stride of the last convolution stage is changed from 2 to 1 so that the feature maps produced by Conv4_x and Conv5_x have the same size;
  • This embodiment adopts a dual-branch network for joint training, with the joint loss L_total = L_global + L_local, where L_global denotes the improved global feature branch loss and L_local the block-weight-guided ladder-type local feature branch loss; the global branch training loss is L_global = L_Softmax + L_tri_hard:

$$L_{Softmax}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}f_i+b_{y_i}}}{\sum_{k=1}^{H}e^{W_k^{T}f_i+b_k}}$$

  • where N is the batch size, H is the number of pedestrian identities, f_i is the feature vector of image i, its true label is y_i, W is the weight, b is the bias, W_k^T is the transpose of the weight vector of the k-th pedestrian, and b_k is the bias vector of the k-th pedestrian.
  • A hard-example-mining triplet loss function is used for training. The triplet loss selects an anchor an, a positive sample pos, and a negative sample neg to form a triplet; P pedestrians are selected in each batch and K images per pedestrian, so all triplets come from the P×K images. The loss is computed by finding, via Euclidean distance, the positive sample farthest from the anchor and the negative sample nearest to it:

$$L_{tri\_hard}=\frac{1}{P\times K}\sum_{an\in A}\left[\max_{pos\in A}d_{an,pos}-\min_{neg\in B}d_{an,neg}+mar\right]_{+}$$

  • where mar is a set hyperparameter, d_an,pos is the anchor-positive distance, d_an,neg is the anchor-negative distance, and A and B denote different sample sets within the P×K images, i.e., the selected positive and negative samples do not overlap; minimizing the loss maximizes the anchor-negative distance and minimizes the anchor-positive distance;
  • the local branch training formula is:

$$L_{local}=\sum_{i=1}^{n}W_i\,L_{Softmax\_i}$$

  • where n is the number of local feature blocks (n = 9 in this embodiment), L_Softmax_i denotes the cross-entropy loss of the i-th local feature map, and W_i is the block weight of the i-th local feature map.
  • The channel attention module in this embodiment is shown in Figure 2.
  • The input feature map is processed by both maximum pooling and average pooling to obtain two one-dimensional vectors, which are then fed into a weight-sharing multi-layer perceptron;
  • the corresponding attention weights are obtained by adding the outputs element-wise and applying a Sigmoid activation;
  • the multiple receptive field fusion module contains 3 branches:
  • the input pedestrian feature X is convolved by three branches with 3×3 kernels and dilation rates of 1, 2, and 3 respectively to obtain 3 feature maps, which are fused into the final output X′;
  • the pooling formula of the GeM pooling layer is:

$$f^{(k)}=\left(\frac{1}{|X_k|}\sum_{x\in X_k}x^{p_k}\right)^{1/p_k}$$

  • where X is the input of the pooling layer, f is its output, and p_k is a hyperparameter learned during back-propagation;
  • the ladder blocking layer first divides the complete pedestrian feature map evenly into 12 horizontal strips; initially the 1st strip is the starting strip and every 4 consecutive strips form one local region; the starting strip is then shifted downward with a stride of 1 for ladder-type partitioning, finally yielding 9 local feature maps.
  • The spatial attention module first applies channel-wise maximum pooling and average pooling to the H×W×C features output by Conv4_x to obtain two H×W×1 channel descriptions, which are concatenated along the channel dimension; then a 7×7 convolution layer with a Sigmoid activation yields the H×W×1 spatial attention weight coefficients.
  • The above H×W×1 spatial attention weight coefficients are fed into the ladder blocking layer to obtain 9 local coefficient blocks;
  • the coefficient sum of each coefficient block is divided by the total coefficient sum of the 9 coefficient blocks to obtain 9 block weights.
  • Step 2: Train the pedestrian re-identification network and obtain the trained pedestrian re-identification network model.
  • Training data are obtained from public data sources and preprocessed; the preprocessed image data are divided into a training set and a test set; the training set is fed into the pedestrian re-identification network for training; the trained network is tested on the test set, and if it meets the preset requirements training stops and the trained pedestrian re-identification network is obtained, otherwise the training process continues;
  • the data come from several public datasets, such as Market1501, DukeMTMC-ReID, and MSMT17; images extracted from the datasets are preprocessed by methods such as horizontal flipping and random erasing;
  • This embodiment uses a loss function to measure the predictive ability of the deep learning model and supervises the model training process with it, thereby narrowing the gap between ground-truth and predicted values;
  • the initial feature map is obtained through the backbone network ResNet50 from the input image, which is first resized to 384×128×3;
  • for the global branch, the input is a feature map with 2048 channels;
  • the channel attention module computes channel attention weight coefficients that are multiplied with the input to obtain features with attention weights;
  • the output feature map still has 2048 channels;
  • the features with attention weights are input into the multiple receptive field fusion module: the input pedestrian feature X passes through three convolution branches, each with 3×3 kernels, 2048 kernels, and dilation rates of 1, 2, and 3 respectively, to obtain 3 feature maps;
  • the three feature maps have the same size and 2048 channels each, and are added and fused to form the final output;
  • the 2048-channel feature map is GeM-pooled into a 1×1×2048 feature vector, which is constrained by a triplet loss;
  • the feature vector is also connected to a fully connected layer for dimensionality reduction to obtain a 512-dimensional feature;
  • a label-smoothed cross-entropy loss is used for classification learning, and the triplet loss and cross-entropy loss are jointly optimized during training;
  • for the local branch, the input is a feature map with 1024 channels;
  • 9 local feature maps are obtained through the ladder blocking layer and pooled into 9 1024-dimensional feature vectors, which are then reduced by a fully connected layer to 9 256-dimensional feature vectors; all the 256-dimensional features are fed into fully connected layers, and classification is learned with cross-entropy losses;
  • the computed block weight is multiplied by the cross-entropy loss of each local feature map, and the results are summed to obtain the final local branch loss;
  • the training phase jointly trains the local branch and the global branch, and training stops when the total loss is minimal.
  • Step 3: Extract the pedestrian features of the image to be recognized with the trained model, match the extracted features against the features of each gallery image, and output the top-N pedestrian images ranked by similarity to the image to be recognized;
  • the 512-dimensional feature vector from the global branch and the nine 256-dimensional feature vectors from the local branch are concatenated as the final feature;
  • by computing the cosine similarity between the query image and the gallery images, the model performance evaluation results mAP, rank-1, rank-5, and rank-10 are obtained.
  • this embodiment provides a pedestrian re-identification device, including a processor and a storage medium;
  • the storage medium is used to store instructions
  • the processor is configured to operate according to the instructions to perform the steps of the method according to Embodiment 1.
  • this embodiment provides a storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, the steps of the method described in Embodiment 1 are implemented.
  • embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

A pedestrian re-identification method and device fusing global features and ladder-type local features. The method includes: using a pre-trained pedestrian re-identification network model to extract pedestrian features from an image to be recognized and from gallery images respectively; matching the pedestrian features of the image to be recognized against those of the gallery images by similarity, and outputting the top-N most similar pedestrian images as the pedestrian re-identification result. The pedestrian re-identification network includes a backbone network, an improved global feature branch, and a block-weight-guided ladder-type local feature extraction branch; the network is trained on public datasets to obtain the trained pedestrian re-identification network model. The method can cope with complex pedestrian re-identification scenarios and achieves a good recognition effect.

Description

Pedestrian re-identification method and device fusing global features and ladder-type local features
Technical Field
The invention belongs to the technical field of digital image processing and relates to a pedestrian re-identification method and device fusing global features and ladder-type local features, and specifically to a pedestrian re-identification method fusing global features with ladder-type local features guided by block weights.
Background Art
The pedestrian re-identification problem is a cross-camera image retrieval problem that aims to use a query image to retrieve images of pedestrians with the same identity from an image gallery. Original pedestrian video images are first collected from multiple cameras, and other images of the same pedestrian are then confirmed through feature extraction and similarity measurement.
At present, the quality of the captured raw video images is not high due to factors such as camera angle and weather. Occluded and blurred images severely degrade the accuracy of pedestrian re-identification, so learning high-accuracy pedestrian re-identification models on low-quality images has become a research focus.
According to the loss type, pedestrian re-identification is divided into representation learning and metric learning. Representation learning treats the pedestrian re-identification problem as an image classification and verification problem, while metric learning maps image features into a high-dimensional feature space and measures the similarity of two images by distance. Since 2014, more robust features extracted by convolutional neural networks, combined with simpler distance measurement formulas to find more accurate pedestrian images, have greatly improved the accuracy and generalization ability of pedestrian re-identification models. Many scholars have proposed higher-quality algorithms in this research direction, and pedestrian re-identification research has seen explosive growth.
However, in real scenarios the same pedestrian captured by different cameras often shows large appearance differences due to illumination, pose, occlusion, resolution, and other factors, which brings many challenges to the research and application of pedestrian re-identification. Therefore, how to extract more discriminative pedestrian features and adopt efficient similarity measures to reduce the intra-class gap and enlarge the inter-class gap has become the key problem of pedestrian re-identification.
Summary of the Invention
Objective: To overcome the shortcomings of the prior art, the present invention provides a pedestrian re-identification method and device fusing global features and ladder-type local features. Based on the fusion of global features with ladder-type local features guided by block weights, it significantly improves the pedestrian re-identification effect without introducing excessive computation, and it addresses the low accuracy of pedestrian re-identification algorithms caused by image occlusion, changes in shooting angle, low resolution, and similar phenomena.
Technical solution: To solve the above technical problems, the present invention adopts the following technical solution:
In a first aspect, a pedestrian re-identification method is provided, including:
acquiring an image to be recognized and gallery images;
using a pre-trained pedestrian re-identification network model to extract pedestrian features from the image to be recognized and from the gallery images respectively;
matching the pedestrian features of the image to be recognized against the pedestrian features of the gallery images by similarity, and outputting the top-N most similar pedestrian images as the pedestrian re-identification result;
wherein the construction method of the pedestrian re-identification network model includes:
constructing a pedestrian re-identification network that includes a backbone network, an improved global feature branch, and a block-weight-guided ladder-type local feature extraction branch; the backbone network is ResNet50 loaded with pre-trained weights; the improved global feature branch is connected to backbone stage Conv5_x and includes a channel attention module, a multiple receptive field fusion module, a GeM pooling layer, and a fully connected layer, and is configured to extract pedestrian global features; the block-weight-guided ladder-type local feature extraction branch is connected after backbone stage Conv4_x and includes a ladder blocking layer, a pooling layer, a spatial attention module, and a fully connected layer, and is configured to extract pedestrian local features; the pedestrian global features and pedestrian local features are concatenated as the final pedestrian features;
training the pedestrian re-identification network on public datasets to obtain the trained pedestrian re-identification network model.
In some embodiments, the construction method of the improved global feature branch includes:
taking the feature map obtained from backbone stage Conv5_x as input; first extracting salient pedestrian information through the channel attention module; then obtaining and fusing pedestrian feature information under different receptive fields through the multiple receptive field fusion module; then applying GeM pooling through the GeM pooling layer to obtain a 2048-dimensional feature vector constrained by a hard-sample-mining triplet loss; meanwhile, connecting the feature vector to a fully connected layer for dimensionality reduction to obtain a 512-dimensional global feature constrained by a cross-entropy loss; the triplet loss and the cross-entropy loss are used for joint optimization training.
Further, in the channel attention module, the input feature map is processed by both maximum pooling and average pooling to obtain two one-dimensional vectors, which are then fed into a weight-sharing multi-layer perceptron; the outputs are added element-wise and passed through a Sigmoid activation to obtain the corresponding attention weights.
The GeM pooling layer formula is:

$$f^{(k)}=\left(\frac{1}{|X_k|}\sum_{x\in X_k}x^{p_k}\right)^{1/p_k}$$

where X is the input of the GeM pooling layer, f is the output of the GeM pooling layer, and p_k is a hyperparameter learned during back-propagation.
The multiple receptive field fusion module contains 3 branches: the input feature X is convolved by three branches with 3×3 kernels and dilation rates of 1, 2, and 3 respectively to obtain 3 feature maps, which are fused into the final output X′.
In some embodiments, the construction method of the block-weight-guided ladder-type local feature extraction branch includes:
taking the feature map obtained from backbone stage Conv4_x as input; obtaining 9 local feature maps through the ladder blocking layer; pooling the 9 local feature maps into 9 1024-dimensional feature vectors; reducing them through a first fully connected layer to 9 256-dimensional feature vectors; feeding all the 256-dimensional feature vectors into a second fully connected layer and performing classification learning with a cross-entropy loss;
meanwhile, calculating block weights from the 9 local feature maps obtained after the Conv4_x feature map passes through the spatial attention module and the ladder blocking layer, and using the block weights to guide the cross-entropy loss.
Further, the ladder blocking layer first divides the complete pedestrian feature map evenly into 12 horizontal strips; initially the 1st strip is the starting strip and every 4 consecutive strips form one local region; the starting strip is then shifted downward with a stride of 1 for ladder-type partitioning, finally yielding 9 local feature maps.
The spatial attention module first applies channel-wise maximum pooling and average pooling to the input H×W×C features to obtain two H×W×1 channel descriptions, which are concatenated along the channel dimension; then a 7×7 convolution layer with a Sigmoid activation yields the H×W×1 spatial attention weight coefficients.
The block weight calculation method includes: feeding the H×W×1 spatial attention weight coefficients output by the spatial attention module into the ladder blocking layer to obtain 9 local coefficient blocks, and dividing the coefficient sum of each local coefficient block by the total coefficient sum of the 9 coefficient blocks to obtain 9 block weights.
In some embodiments, the pedestrian re-identification network model is trained by dual-branch joint training, with the joint training loss $L_{total}=L_{global}+L_{local}$, where $L_{global}$ denotes the improved global feature branch loss and $L_{local}$ denotes the block-weight-guided ladder-type local feature branch loss.
Further, the improved global feature branch loss is $L_{global}=L_{Softmax}+L_{tri\_hard}$, where $L_{Softmax}$ is the cross-entropy loss and $L_{tri\_hard}$ is the hard-sample-mining triplet loss:

$$L_{Softmax}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}f_i+b_{y_i}}}{\sum_{k=1}^{H}e^{W_k^{T}f_i+b_k}}$$

where N is the batch size, H is the number of pedestrian identities, $f_i$ is the feature vector of image i, its true label is $y_i$, W is the weight, and b is the bias; $W_k^{T}$ is the transpose of the weight vector of the k-th pedestrian, and $b_k$ is the bias vector of the k-th pedestrian.

$$L_{tri\_hard}=\frac{1}{P\times K}\sum_{an\in A}\left[\max_{pos\in A}d_{an,pos}-\min_{neg\in B}d_{an,neg}+mar\right]_{+}$$

A hard-example-mining triplet loss function is used for training: the triplet loss selects an anchor an, a positive sample pos, and a negative sample neg to form a triplet; during training, P pedestrians are selected in each batch and K images per pedestrian, so all triplets come from the P×K images; the loss is computed by finding, via Euclidean distance, the positive sample farthest from the anchor and the negative sample nearest to it, where mar is a set hyperparameter, $d_{an,pos}$ is the anchor-positive distance, $d_{an,neg}$ is the anchor-negative distance, and A and B denote different sample sets within the P×K images, i.e., the selected positive and negative samples do not overlap; minimizing the loss maximizes the anchor-negative distance and minimizes the anchor-positive distance.
Further, the block-weight-guided ladder-type local feature branch loss is

$$L_{local}=\sum_{i=1}^{n}W_i\,L_{Softmax\_i}$$

where n is the number of local feature blocks, $L_{Softmax\_i}$ denotes the cross-entropy loss of the i-th local feature map, and $W_i$ is the block weight of the i-th local feature map.
In a second aspect, the present invention provides a pedestrian re-identification device, including a processor and a storage medium;
the storage medium is used to store instructions;
the processor is configured to operate according to the instructions to perform the steps of the method according to the first aspect.
In a third aspect, the present invention provides a storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method of the first aspect are implemented.
The goal of the present invention is to learn a more robust pedestrian feature representation to cope with complex pedestrian re-identification scenarios and achieve a good recognition effect. The present invention designs a pedestrian re-identification method fusing global features and ladder-type local features. The method uses ResNet50 as the backbone network to extract features from pedestrian images and then connects two branches: a global feature branch and a block-weight-guided ladder-type local feature branch. The global branch introduces a channel attention module to extract more salient information from the feature map, followed by a multiple receptive field fusion module that extracts and then fuses features of the same input under different receptive fields to fully capture pedestrian context information. The local branch introduces a ladder blocking layer that partitions the feature map horizontally in a ladder pattern and can extract more detailed pedestrian information; at the same time, block weights are calculated to guide the cross-entropy loss so that the trained model pays more attention to important pedestrian information. Finally, a dual-branch joint training strategy is used to train the model.
Beneficial effects: The pedestrian re-identification method and device fusing global features and ladder-type local features provided by the present invention have the following advantages:
(1) The present invention proposes a pedestrian re-identification method fusing global features with block-weight-guided ladder-type local features, which improves the accuracy of pedestrian re-identification. First, the ResNet50 network is used as the backbone to extract global features of pedestrian images; these are then sent into the designed branch networks to extract global and local features respectively; finally, the features of the two branches are fused, so the resulting features contain both more abstract global features and local features with more detailed information, and are therefore more robust.
(2) The present invention adopts generalized-mean pooling as the aggregation module; this pooling lies between maximum pooling and mean pooling and, through a unified pooling type, can better capture feature differences.
(3) The multiple receptive field fusion module can effectively aggregate features of different receptive fields, further improving pedestrian re-identification performance.
(4) Partitioning image regions in a ladder pattern through the ladder blocking layer strengthens the connections between local features and avoids the loss of important information during feature learning.
(5) Guiding the cross-entropy loss of the local branch with the designed block weights makes the model pay more attention to the key information of the image during training, and the trained model can better extract key features.
Brief Description of the Drawings
Figure 1 is a framework diagram of the pedestrian re-identification network according to an embodiment of the present invention;
Figure 2 is a schematic diagram of the channel attention module according to an embodiment of the present invention;
Figure 3 is a schematic diagram of the multiple receptive field fusion module according to an embodiment of the present invention;
Figure 4 is a schematic diagram of the spatial attention module according to an embodiment of the present invention.
Detailed Description of Embodiments
The present invention is further described below with reference to the drawings and embodiments. The following embodiments are only used to illustrate the technical solution of the present invention more clearly and cannot be used to limit its protection scope.
In the description of the present invention, "several" means one or more and "multiple" means two or more; "greater than", "less than", "exceeding", and the like are understood to exclude the stated number, while "above", "below", "within", and the like are understood to include it. Where "first" and "second" are described, they are only used to distinguish technical features and cannot be understood as indicating or implying relative importance, the number of the indicated technical features, or their order.
In the description of the present invention, references to "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of these terms do not necessarily refer to the same embodiment or example, and the described specific features, structures, materials, or characteristics may be combined in a suitable manner in any one or more embodiments or examples.
Embodiment 1
A pedestrian re-identification method includes:
acquiring an image to be recognized and gallery images;
using a pre-trained pedestrian re-identification network model to extract pedestrian features from the image to be recognized and from the gallery images respectively;
matching the pedestrian features of the image to be recognized against the pedestrian features of the gallery images by similarity, and outputting the top-N most similar pedestrian images as the pedestrian re-identification result;
wherein the construction method of the pedestrian re-identification network model includes:
constructing a pedestrian re-identification network that includes a backbone network, an improved global feature branch, and a block-weight-guided ladder-type local feature extraction branch; the backbone network is ResNet50 loaded with pre-trained weights; the improved global feature branch is connected to backbone stage Conv5_x and includes a channel attention module, a multiple receptive field fusion module, a GeM pooling layer, and a fully connected layer, and is configured to extract pedestrian global features; the block-weight-guided ladder-type local feature extraction branch is connected after backbone stage Conv4_x and includes a ladder blocking layer, a pooling layer, a spatial attention module, and a fully connected layer, and is configured to extract pedestrian local features; the pedestrian global features and pedestrian local features are concatenated as the final pedestrian features;
training the pedestrian re-identification network on public datasets to obtain the trained pedestrian re-identification network model.
In some embodiments, the construction method of the improved global feature branch includes:
taking the feature map obtained from backbone stage Conv5_x as input; first extracting salient pedestrian information through the channel attention module; then obtaining and fusing pedestrian feature information under different receptive fields through the multiple receptive field fusion module; then applying GeM pooling through the GeM pooling layer to obtain a 2048-dimensional feature vector constrained by a hard-sample-mining triplet loss; meanwhile, connecting the feature vector to a fully connected layer for dimensionality reduction to obtain a 512-dimensional global feature constrained by a cross-entropy loss; the triplet loss and the cross-entropy loss are used for joint optimization training.
Further, in the channel attention module, the input feature map is processed by both maximum pooling and average pooling to obtain two one-dimensional vectors, which are then fed into a weight-sharing multi-layer perceptron; the outputs are added element-wise and passed through a Sigmoid activation to obtain the corresponding attention weights.
The GeM pooling layer formula is:

$$f^{(k)}=\left(\frac{1}{|X_k|}\sum_{x\in X_k}x^{p_k}\right)^{1/p_k}$$

where X is the input of the GeM pooling layer, f is the output of the GeM pooling layer, and p_k is a hyperparameter learned during back-propagation.
The multiple receptive field fusion module contains 3 branches: the input feature X is convolved by three branches with 3×3 kernels and dilation rates of 1, 2, and 3 respectively to obtain 3 feature maps, which are fused into the final output X′.
In some embodiments, the construction method of the block-weight-guided ladder-type local feature extraction branch includes:
taking the feature map obtained from backbone stage Conv4_x as input; obtaining 9 local feature maps through the ladder blocking layer; pooling the 9 local feature maps into 9 1024-dimensional feature vectors; reducing them through a first fully connected layer to 9 256-dimensional feature vectors; feeding all the 256-dimensional feature vectors into a second fully connected layer and performing classification learning with a cross-entropy loss;
meanwhile, calculating block weights from the 9 local feature maps obtained after the Conv4_x feature map passes through the spatial attention module and the ladder blocking layer, and using the block weights to guide the cross-entropy loss.
Further, the ladder blocking layer first divides the complete pedestrian feature map evenly into 12 horizontal strips; initially the 1st strip is the starting strip and every 4 consecutive strips form one local region; the starting strip is then shifted downward with a stride of 1 for ladder-type partitioning, finally yielding 9 local feature maps.
The spatial attention module first applies channel-wise maximum pooling and average pooling to the input H×W×C features to obtain two H×W×1 channel descriptions, which are concatenated along the channel dimension; then a 7×7 convolution layer with a Sigmoid activation yields the H×W×1 spatial attention weight coefficients.
The block weight calculation method includes: feeding the H×W×1 spatial attention weight coefficients output by the spatial attention module into the ladder blocking layer to obtain 9 local coefficient blocks, and dividing the coefficient sum of each local coefficient block by the total coefficient sum of the 9 coefficient blocks to obtain 9 block weights.
In some embodiments, the pedestrian re-identification network model is trained by dual-branch joint training, with the joint training loss $L_{total}=L_{global}+L_{local}$, where $L_{global}$ denotes the improved global feature branch loss and $L_{local}$ denotes the block-weight-guided ladder-type local feature branch loss.
Further, the improved global feature branch loss is $L_{global}=L_{Softmax}+L_{tri\_hard}$, where $L_{Softmax}$ is the cross-entropy loss and $L_{tri\_hard}$ is the hard-sample-mining triplet loss:

$$L_{Softmax}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}f_i+b_{y_i}}}{\sum_{k=1}^{H}e^{W_k^{T}f_i+b_k}}$$

where N is the batch size, H is the number of pedestrian identities, $f_i$ is the feature vector of image i, its true label is $y_i$, W is the weight, and b is the bias; $W_k^{T}$ is the transpose of the weight vector of the k-th pedestrian, and $b_k$ is the bias vector of the k-th pedestrian.

$$L_{tri\_hard}=\frac{1}{P\times K}\sum_{an\in A}\left[\max_{pos\in A}d_{an,pos}-\min_{neg\in B}d_{an,neg}+mar\right]_{+}$$

A hard-example-mining triplet loss function is used for training: the triplet loss selects an anchor an, a positive sample pos, and a negative sample neg to form a triplet; during training, P pedestrians are selected in each batch and K images per pedestrian, so all triplets come from the P×K images; the loss is computed by finding, via Euclidean distance, the positive sample farthest from the anchor and the negative sample nearest to it, where mar is a set hyperparameter, $d_{an,pos}$ is the anchor-positive distance, $d_{an,neg}$ is the anchor-negative distance, and A and B denote different sample sets within the P×K images, i.e., the selected positive and negative samples do not overlap; minimizing the loss maximizes the anchor-negative distance and minimizes the anchor-positive distance.
Further, the block-weight-guided ladder-type local feature branch loss is

$$L_{local}=\sum_{i=1}^{n}W_i\,L_{Softmax\_i}$$

where n is the number of local feature blocks, $L_{Softmax\_i}$ denotes the cross-entropy loss of the i-th local feature map, and $W_i$ is the block weight of the i-th local feature map.
In some embodiments, the provided pedestrian re-identification method fusing global features with block-weight-guided ladder-type local features includes the following steps:
Step 1: Construct the pedestrian re-identification network, including the backbone network, the improved global feature branch, and the block-weight-guided ladder-type local feature extraction branch, as shown in Figure 1.
In this embodiment the backbone ResNet50 is divided into 5 stages, and the stride of the last convolution stage is changed from 2 to 1 so that the feature maps produced by Conv4_x and Conv5_x have the same size.
This embodiment adopts a dual-branch network for joint training: the global feature branch includes the channel attention module, the multiple receptive field fusion module, the GeM pooling layer, and FC layers; the block-weight-guided local feature branch includes the ladder blocking layer, the GeM pooling layer, the spatial attention module, and FC layers. The joint training formula is $L_{total}=L_{global}+L_{local}$, where $L_{global}$ denotes the improved global feature branch loss and $L_{local}$ denotes the block-weight-guided ladder-type local feature branch loss.
In this embodiment the global branch training formula is $L_{global}=L_{Softmax}+L_{tri\_hard}$, where $L_{Softmax}$ is the cross-entropy loss and $L_{tri\_hard}$ is the hard-sample-mining triplet loss; the two formulas are introduced below:
$$L_{Softmax}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}f_i+b_{y_i}}}{\sum_{k=1}^{H}e^{W_k^{T}f_i+b_k}}$$

where N is the batch size, H is the number of pedestrian identities, $f_i$ is the feature vector of image i, its true label is $y_i$, W is the weight, and b is the bias; $W_k^{T}$ is the transpose of the weight vector of the k-th pedestrian, and $b_k$ is the bias vector of the k-th pedestrian.

$$L_{tri\_hard}=\frac{1}{P\times K}\sum_{an\in A}\left[\max_{pos\in A}d_{an,pos}-\min_{neg\in B}d_{an,neg}+mar\right]_{+}$$

A hard-example-mining triplet loss function is used for training: the triplet loss selects an anchor an, a positive sample pos, and a negative sample neg to form a triplet; during training, P pedestrians are selected in each batch and K images per pedestrian, so all triplets come from the P×K images; the loss is computed by finding, via Euclidean distance, the positive sample farthest from the anchor and the negative sample nearest to it, where mar is a set hyperparameter, $d_{an,pos}$ is the anchor-positive distance, $d_{an,neg}$ is the anchor-negative distance, and A and B denote different sample sets within the P×K images, i.e., the selected positive and negative samples do not overlap; minimizing the loss maximizes the anchor-negative distance and minimizes the anchor-positive distance.
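For illustration, the following is a minimal PyTorch sketch of a batch-hard triplet loss consistent with the description above (farthest positive and nearest negative per anchor under Euclidean distance). The margin value of 0.3 and the mean reduction over anchors are assumptions for the sketch; the patent leaves mar as a hyperparameter to be set.

```python
import torch

def trihard_loss(features: torch.Tensor, labels: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """Batch-hard triplet loss over a P*K batch: for every anchor, pick the
    farthest positive and the nearest negative by Euclidean distance."""
    dist = torch.cdist(features, features)                       # (N, N) pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)            # (N, N) True where identities match
    d_pos = dist.masked_fill(~same, float('-inf')).amax(dim=1)   # farthest positive per anchor
    d_neg = dist.masked_fill(same, float('inf')).amin(dim=1)     # nearest negative per anchor
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()   # hinge, averaged over anchors
```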
In this embodiment the local branch training formula is:

$$L_{local}=\sum_{i=1}^{n}W_i\,L_{Softmax\_i}$$

where n is the number of local feature blocks, $L_{Softmax\_i}$ denotes the cross-entropy loss of the i-th local feature map, and $W_i$ is the block weight of the i-th local feature map; in this embodiment n = 9.
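A sketch of this weighted local-branch loss is given below. The patent states that each local map's cross-entropy loss is multiplied by its block weight and the results are summed; applying per-image weights to per-image cross-entropy terms before batch averaging is an assumed implementation detail.

```python
import torch
import torch.nn.functional as F

def local_branch_loss(logits_list, labels, block_w):
    """logits_list: 9 tensors of shape (B, num_ids), one per local feature map.
    block_w: (B, 9) block weights. Each block's cross-entropy term is scaled by
    its block weight; the nine weighted losses are summed."""
    total = torch.zeros((), device=labels.device)
    for i, logits in enumerate(logits_list):
        ce = F.cross_entropy(logits, labels, reduction='none')   # (B,) per-image CE loss
        total = total + (block_w[:, i] * ce).mean()              # weight, then batch-average
    return total
```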
The channel attention module in this embodiment is shown in Figure 2. In the channel attention module, the input feature map is processed by both maximum pooling and average pooling to obtain two one-dimensional vectors, which are then fed into a weight-sharing multi-layer perceptron; the outputs are added element-wise and passed through a Sigmoid activation to obtain the corresponding attention weights.
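For illustration, a minimal PyTorch sketch of such a channel attention module follows (max-pooled and average-pooled descriptors through a weight-sharing MLP, element-wise addition, Sigmoid gate). The reduction ratio of 16 in the hidden layer is an assumption, since the patent does not specify the perceptron's size.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: max-pooled and average-pooled descriptors pass through
    a weight-sharing MLP, are summed element-wise, and gated by a Sigmoid."""
    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is an assumed value
        super().__init__()
        self.mlp = nn.Sequential(                            # weight-sharing multi-layer perceptron
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                   # average-pooled 1-D vector
        mx = self.mlp(x.amax(dim=(2, 3)))                    # max-pooled 1-D vector
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)         # per-channel attention weights
        return x * w                                         # re-weight the input feature map
```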
In this embodiment the multiple receptive field fusion module contains 3 branches: the input pedestrian feature X is convolved by three branches with 3×3 kernels and dilation rates of 1, 2, and 3 respectively to obtain 3 feature maps, which are fused into the final output X′.
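A sketch of the three-branch fusion follows. Setting each branch's padding equal to its dilation rate keeps the three outputs the same size so they can be added element-wise, matching the additive fusion described in this embodiment; bias-free convolutions are an assumption.

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldFusion(nn.Module):
    """Three 3x3 convolution branches with dilation rates 1, 2 and 3; the three
    same-sized feature maps are fused by element-wise addition into X'."""
    def __init__(self, channels: int = 2048):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, dilation=d, padding=d, bias=False)
            for d in (1, 2, 3)   # padding=d keeps the spatial size identical across branches
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return sum(branch(x) for branch in self.branches)    # element-wise additive fusion
```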
The pooling formula of the GeM pooling layer in this embodiment is:

$$f^{(k)}=\left(\frac{1}{|X_k|}\sum_{x\in X_k}x^{p_k}\right)^{1/p_k}$$

where X is the input of the pooling layer, f is the output of the pooling layer, and $p_k$ is a hyperparameter learned during back-propagation.
In this embodiment the ladder blocking layer first divides the complete pedestrian feature map evenly into 12 horizontal strips; initially the 1st strip is the starting strip and every 4 consecutive strips form one local region; the starting strip is then shifted downward with a stride of 1 for ladder-type partitioning, finally yielding 9 local feature maps.
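The ladder partition can be sketched directly as tensor slicing: 12 horizontal strips, a window of 4 consecutive strips, and a stride of 1 give 12 - 4 + 1 = 9 overlapping local regions. The function name and signature are illustrative only; for example, a Conv4_x output of height 24 yields strips of height 2 and local regions of height 8.

```python
import torch

def ladder_blocks(feat: torch.Tensor, n_strips: int = 12, window: int = 4, stride: int = 1):
    """Split a (B, C, H, W) feature map into n_strips horizontal strips, then
    take windows of `window` consecutive strips with the given stride."""
    _, _, h, _ = feat.shape
    assert h % n_strips == 0, "feature height must divide evenly into strips"
    sh = h // n_strips                                       # height of one strip
    blocks = []
    for start in range(0, n_strips - window + 1, stride):    # 12 - 4 + 1 = 9 windows
        blocks.append(feat[:, :, start * sh:(start + window) * sh, :])
    return blocks                                            # list of 9 local feature maps
```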
In this embodiment the spatial attention module first applies channel-wise maximum pooling and average pooling to the H×W×C features output by Conv4_x to obtain two H×W×1 channel descriptions, which are concatenated along the channel dimension; then a 7×7 convolution layer with a Sigmoid activation yields the H×W×1 spatial attention weight coefficients.
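A minimal sketch of this spatial attention module follows; padding of 3 keeps the 7×7 convolution size-preserving.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise max and average pooling give two HxWx1 descriptions, which
    are concatenated and passed through a 7x7 convolution with a Sigmoid."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)                    # (B, 1, H, W) average description
        mx = x.amax(dim=1, keepdim=True)                     # (B, 1, H, W) max description
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W) weights
```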
In this embodiment the above H×W×1 spatial attention weight coefficients are fed into the ladder blocking layer to obtain 9 local coefficient blocks; the coefficient sum of each coefficient block is divided by the total coefficient sum of the 9 coefficient blocks to obtain 9 block weights.
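Combining the two sketches above, the block weights can be computed as below: the spatial attention map is sliced by ladder_blocks into 9 coefficient blocks, and each block's coefficient sum is normalized by the total over all 9 blocks. Computing one weight vector per image in the batch is an assumed detail.

```python
import torch

def block_weights(attn_map: torch.Tensor, n_strips: int = 12, window: int = 4) -> torch.Tensor:
    """attn_map: (B, 1, H, W) spatial attention coefficients. Returns (B, 9)
    weights: each ladder block's coefficient sum over the total of all blocks."""
    blocks = ladder_blocks(attn_map, n_strips=n_strips, window=window)     # 9 coefficient blocks
    sums = torch.stack([blk.sum(dim=(1, 2, 3)) for blk in blocks], dim=1)  # (B, 9) per-block sums
    return sums / sums.sum(dim=1, keepdim=True)                            # normalize to weights
```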
Step 2: Train the pedestrian re-identification network to obtain the trained pedestrian re-identification network model.
Training data are obtained from public data sources and preprocessed; the preprocessed image data are divided into a training set and a test set; the training set is fed into the pedestrian re-identification network for training to obtain the trained network; the trained network is tested on the test set, and if it meets the preset requirements training stops and the trained pedestrian re-identification network is obtained, otherwise the training process continues.
In this embodiment the data come from several public datasets, such as Market1501, DukeMTMC-ReID, and MSMT17; images extracted from the datasets are preprocessed by methods such as horizontal flipping and random erasing.
In this embodiment, a given input pedestrian image of size H×W×3 is first resized to 384×128×3 and then preprocessed by random erasing, image flipping, and similar methods.
This embodiment uses a loss function to measure the predictive ability of the deep learning model and supervises the model training process with the loss function, thereby narrowing the gap between ground-truth and predicted values.
First, for a given 384×128×3 pedestrian image, the initial feature map is obtained through the backbone ResNet50.
The features output by backbone stages Conv4_x and Conv5_x are sent to the local branch and the global branch respectively for further feature extraction.
For the global branch, the input is a feature map with 2048 channels; the channel attention module computes channel attention weight coefficients that are multiplied with the input to obtain attention-weighted features, and the output feature map still has 2048 channels.
The attention-weighted features are input into the multiple receptive field fusion module: the input pedestrian feature X is convolved by three branches, each with 3×3 kernels, 2048 kernels, and dilation rates of 1, 2, and 3 respectively, yielding 3 feature maps of the same size with 2048 channels each; the three feature maps are added and fused to form the final output.
The 2048-channel feature map is GeM-pooled into a 1×1×2048 feature vector constrained by the triplet loss; meanwhile, the feature vector is connected to a fully connected layer for dimensionality reduction to obtain a 512-dimensional feature vector, which is fed into a fully connected layer and learned for classification with a label-smoothed cross-entropy loss; the triplet loss and the cross-entropy loss are jointly optimized during training.
For the local branch, the input is a feature map with 1024 channels; the ladder blocking layer produces 9 local feature maps, which are pooled into 9 1024-dimensional feature vectors and then reduced by a fully connected layer to 9 256-dimensional feature vectors; all the 256-dimensional features are fed into fully connected layers and classification is learned with cross-entropy losses; meanwhile, the computed block weight is multiplied by the cross-entropy loss of each local feature map and the results are summed to obtain the final local branch loss.
The training phase jointly trains the local branch and the global branch with the training formula $L_{total}=L_{global}+L_{local}$; training stops when the total loss $L_{total}$ reaches its minimum.
Step 3: Extract the pedestrian features of the image to be recognized with the trained model, match the extracted features against the features of each gallery image, and output the top-N pedestrian images ranked by similarity to the image to be recognized.
The 512-dimensional feature vector from the global branch and the nine 256-dimensional feature vectors from the local branch are concatenated as the final feature; by computing the cosine similarity between the query image and the gallery images, the model performance evaluation results mAP, rank-1, rank-5, and rank-10 are obtained.
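For illustration, a sketch of this matching step follows: the concatenated final feature is 512 + 9 × 256 = 2816-dimensional, and gallery images are ranked by cosine similarity. L2-normalizing both sides before a dot product is one standard way to compute cosine similarity; the helper name and the top-N default are illustrative.

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_feat: torch.Tensor, gallery_feats: torch.Tensor, top_n: int = 10):
    """query_feat: (D,) concatenated final feature (512 global + 9*256 local);
    gallery_feats: (G, D). Returns indices of the top-N gallery images by
    cosine similarity to the query."""
    q = F.normalize(query_feat.unsqueeze(0), dim=1)          # L2-normalize: dot product = cosine
    g = F.normalize(gallery_feats, dim=1)
    sims = (q @ g.t()).squeeze(0)                            # (G,) cosine similarities
    return sims.topk(min(top_n, sims.numel())).indices       # indices of the top-N matches
```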
Embodiment 2
In a second aspect, this embodiment provides a pedestrian re-identification device, including a processor and a storage medium;
the storage medium is used to store instructions;
the processor is configured to operate according to the instructions to perform the steps of the method according to Embodiment 1.
Embodiment 3
In a third aspect, this embodiment provides a storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method of Embodiment 1 are implemented.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor produce an apparatus for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. A pedestrian re-identification method, characterized by including:
    acquiring an image to be recognized and gallery images;
    using a pre-trained pedestrian re-identification network model to extract pedestrian features from the image to be recognized and from the gallery images respectively;
    matching the pedestrian features of the image to be recognized against the pedestrian features of the gallery images by similarity, and outputting the top-N most similar pedestrian images as the pedestrian re-identification result;
    wherein the construction method of the pedestrian re-identification network model includes:
    constructing a pedestrian re-identification network that includes a backbone network, an improved global feature branch, and a block-weight-guided ladder-type local feature extraction branch; the backbone network is ResNet50 loaded with pre-trained weights; the improved global feature branch is connected to backbone stage Conv5_x and includes a channel attention module, a multiple receptive field fusion module, a GeM pooling layer, and a fully connected layer, and is configured to extract pedestrian global features; the block-weight-guided ladder-type local feature extraction branch is connected after backbone stage Conv4_x and includes a ladder blocking layer, a pooling layer, a spatial attention module, and a fully connected layer, and is configured to extract pedestrian local features; the pedestrian global features and pedestrian local features are concatenated as the final pedestrian features;
    training the pedestrian re-identification network on public datasets to obtain the trained pedestrian re-identification network model.
  2. The pedestrian re-identification method according to claim 1, characterized in that the construction method of the improved global feature branch includes:
    taking the feature map obtained from backbone stage Conv5_x as input; first extracting salient pedestrian information through the channel attention module; then obtaining and fusing pedestrian feature information under different receptive fields through the multiple receptive field fusion module; then applying GeM pooling through the GeM pooling layer to obtain a 2048-dimensional feature vector constrained by a hard-sample-mining triplet loss; meanwhile, connecting the feature vector to a fully connected layer for dimensionality reduction to obtain a 512-dimensional global feature constrained by a cross-entropy loss; and jointly optimizing the triplet loss and the cross-entropy loss during training.
  3. The pedestrian re-identification method according to claim 1 or 2, characterized in that, in the channel attention module, the input feature map is processed by both maximum pooling and average pooling to obtain two one-dimensional vectors, which are then fed into a weight-sharing multi-layer perceptron; the outputs are added element-wise and passed through a Sigmoid activation to obtain the corresponding attention weights;
    and/or, the GeM pooling layer formula is:

$$f^{(k)}=\left(\frac{1}{|X_k|}\sum_{x\in X_k}x^{p_k}\right)^{1/p_k}$$

    where X is the input of the GeM pooling layer, f is the output of the GeM pooling layer, and $p_k$ is a hyperparameter learned during back-propagation;
    and/or, the multiple receptive field fusion module contains 3 branches: the input feature X is convolved by three branches with 3×3 kernels and dilation rates of 1, 2, and 3 respectively to obtain 3 feature maps, which are fused into the final output X′.
  4. The pedestrian re-identification method according to claim 1, characterized in that the construction method of the block-weight-guided ladder-type local feature extraction branch includes:
    taking the feature map obtained from backbone stage Conv4_x as input; obtaining 9 local feature maps through the ladder blocking layer; pooling the 9 local feature maps into 9 1024-dimensional feature vectors; reducing them through a first fully connected layer to 9 256-dimensional feature vectors; and feeding all the 256-dimensional feature vectors into a second fully connected layer and performing classification learning with a cross-entropy loss;
    meanwhile, calculating block weights from the 9 local feature maps obtained after the Conv4_x feature map passes through the spatial attention module and the ladder blocking layer, and using the block weights to guide the cross-entropy loss.
  5. The pedestrian re-identification method according to claim 1 or 4, characterized in that the ladder blocking layer first divides the complete pedestrian feature map evenly into 12 horizontal strips; initially the 1st strip is the starting strip and every 4 consecutive strips form one local region; the starting strip is then shifted downward with a stride of 1 for ladder-type partitioning, finally yielding 9 local feature maps.
  6. The pedestrian re-identification method according to claim 4, characterized in that the spatial attention module first applies channel-wise maximum pooling and average pooling to the input H×W×C features to obtain two H×W×1 channel descriptions, which are concatenated along the channel dimension; then a 7×7 convolution layer with a Sigmoid activation yields the H×W×1 spatial attention weight coefficients;
    the block weight calculation method includes: feeding the H×W×1 spatial attention weight coefficients output by the spatial attention module into the ladder blocking layer to obtain 9 local coefficient blocks, and dividing the coefficient sum of each local coefficient block by the total coefficient sum of the 9 coefficient blocks to obtain 9 block weights.
  7. The pedestrian re-identification method according to claim 1, characterized in that the pedestrian re-identification network model is trained by dual-branch joint training, with the joint training loss $L_{total}=L_{global}+L_{local}$, where $L_{global}$ denotes the improved global feature branch loss and $L_{local}$ denotes the block-weight-guided ladder-type local feature branch loss.
  8. The pedestrian re-identification method according to claim 7, characterized in that the improved global feature branch loss is $L_{global}=L_{Softmax}+L_{tri\_hard}$, where $L_{Softmax}$ is the cross-entropy loss and $L_{tri\_hard}$ is the hard-sample-mining triplet loss,

$$L_{Softmax}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}f_i+b_{y_i}}}{\sum_{k=1}^{H}e^{W_k^{T}f_i+b_k}}$$

    where N is the batch size, H is the number of pedestrian identities, $f_i$ is the feature vector of image i, its true label is $y_i$, W is the weight, and b is the bias; $W_k^{T}$ is the transpose of the weight vector of the k-th pedestrian, and $b_k$ is the bias vector of the k-th pedestrian;

$$L_{tri\_hard}=\frac{1}{P\times K}\sum_{an\in A}\left[\max_{pos\in A}d_{an,pos}-\min_{neg\in B}d_{an,neg}+mar\right]_{+}$$

    a hard-example-mining triplet loss function is used for training: the triplet loss selects an anchor an, a positive sample pos, and a negative sample neg to form a triplet; during training, P pedestrians are selected in each batch and K images per pedestrian, so all triplets come from the P×K images; the loss is computed by finding, via Euclidean distance, the positive sample farthest from the anchor and the negative sample nearest to it, where mar is a set hyperparameter, $d_{an,pos}$ is the anchor-positive distance, $d_{an,neg}$ is the anchor-negative distance, and A and B denote different sample sets within the P×K images, i.e., the selected positive and negative samples do not overlap; minimizing the loss maximizes the anchor-negative distance and minimizes the anchor-positive distance.
  9. The pedestrian re-identification method according to claim 7, characterized in that the block-weight-guided ladder-type local feature branch loss is

$$L_{local}=\sum_{i=1}^{n}W_i\,L_{Softmax\_i}$$

    where n is the number of local feature blocks, $L_{Softmax\_i}$ denotes the cross-entropy loss of the i-th local feature map, and $W_i$ is the block weight of the i-th local feature map.
  10. A pedestrian re-identification device, characterized by including a processor and a storage medium;
    the storage medium is used to store instructions;
    the processor is configured to operate according to the instructions to perform the steps of the method according to any one of claims 1 to 9.
PCT/CN2022/133947 2022-07-29 2022-11-24 Pedestrian re-identification method and device fusing global features and ladder-type local features WO2024021394A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/094,880 US20230162522A1 (en) 2022-07-29 2023-01-09 Person re-identification method of integrating global features and ladder-shaped local features and device thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210906148.6A 2022-07-29 Pedestrian re-identification method and device fusing global features and ladder-type local features (全局特征与阶梯型局部特征融合的行人重识别方法及装置)
CN202210906148.6 2022-07-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/094,880 Continuation US20230162522A1 (en) 2022-07-29 2023-01-09 Person re-identification method of integrating global features and ladder-shaped local features and device thereof

Publications (1)

Publication Number Publication Date
WO2024021394A1 true WO2024021394A1 (zh) 2024-02-01

Family

ID=83476623

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/133947 WO2024021394A1 (zh) 2022-07-29 2022-11-24 全局特征与阶梯型局部特征融合的行人重识别方法及装置

Country Status (2)

Country Link
CN (1) CN115171165A (zh)
WO (1) WO2024021394A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117671396A (zh) * 2024-02-02 2024-03-08 新疆盛诚工程建设有限责任公司 施工进度的智能监控预警系统及方法
CN117764988A (zh) * 2024-02-22 2024-03-26 山东省计算中心(国家超级计算济南中心) 基于异核卷积多感受野网络的道路裂缝检测方法及系统
CN117876824A (zh) * 2024-03-11 2024-04-12 华东交通大学 多模态人群计数模型训练方法、系统、存储介质及设备
CN117876824B (zh) * 2024-03-11 2024-05-10 华东交通大学 多模态人群计数模型训练方法、系统、存储介质及设备

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115171165A (zh) * 2022-07-29 2022-10-11 南京邮电大学 全局特征与阶梯型局部特征融合的行人重识别方法及装置
CN115841683B (zh) * 2022-12-27 2023-06-20 石家庄铁道大学 一种联合多级特征的轻量行人重识别方法
CN116524602B (zh) * 2023-07-03 2023-09-19 华东交通大学 基于步态特征的换衣行人重识别方法及系统
CN116912889B (zh) * 2023-09-12 2024-01-05 深圳须弥云图空间科技有限公司 行人重识别方法及装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408492A (zh) * 2021-07-23 2021-09-17 四川大学 一种基于全局-局部特征动态对齐的行人重识别方法
CN113516012A (zh) * 2021-04-09 2021-10-19 湖北工业大学 一种基于多层级特征融合的行人重识别方法及系统
CN115171165A (zh) * 2022-07-29 2022-10-11 南京邮电大学 全局特征与阶梯型局部特征融合的行人重识别方法及装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516012A (zh) * 2021-04-09 2021-10-19 湖北工业大学 一种基于多层级特征融合的行人重识别方法及系统
CN113408492A (zh) * 2021-07-23 2021-09-17 四川大学 一种基于全局-局部特征动态对齐的行人重识别方法
CN115171165A (zh) * 2022-07-29 2022-10-11 南京邮电大学 全局特征与阶梯型局部特征融合的行人重识别方法及装置

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHI, YUEXIANG; ZHOU, YUE: "Person Re-identification Based on Stepped Feature Space Segmentation and Local Attention Mechanism", JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, vol. 44, no. 1, 31 January 2022 (2022-01-31), CN , pages 195 - 202, XP009552735, ISSN: 1009-5896, DOI: 10.11999/JEIT201006 *
ZHANG, XIAOHAN: "Improved Person Re-identification Based on Global Feature", COMPUTER SYSTEMS AND APPLICATIONS, vol. 31, no. 5, 11 April 2022 (2022-04-11), CN , pages 298 - 301, XP009552736, ISSN: 1003-3254, DOI: 10.15888/j.cnki.csa.008477 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117671396A (zh) * 2024-02-02 2024-03-08 新疆盛诚工程建设有限责任公司 施工进度的智能监控预警系统及方法
CN117671396B (zh) * 2024-02-02 2024-04-26 新疆盛诚工程建设有限责任公司 施工进度的智能监控预警系统及方法
CN117764988A (zh) * 2024-02-22 2024-03-26 山东省计算中心(国家超级计算济南中心) 基于异核卷积多感受野网络的道路裂缝检测方法及系统
CN117764988B (zh) * 2024-02-22 2024-04-30 山东省计算中心(国家超级计算济南中心) 基于异核卷积多感受野网络的道路裂缝检测方法及系统
CN117876824A (zh) * 2024-03-11 2024-04-12 华东交通大学 多模态人群计数模型训练方法、系统、存储介质及设备
CN117876824B (zh) * 2024-03-11 2024-05-10 华东交通大学 多模态人群计数模型训练方法、系统、存储介质及设备

Also Published As

Publication number Publication date
CN115171165A (zh) 2022-10-11

Similar Documents

Publication Publication Date Title
WO2024021394A1 (zh) 全局特征与阶梯型局部特征融合的行人重识别方法及装置
Zhang et al. Visual place recognition: A survey from deep learning perspective
Hou et al. Cross attention network for few-shot classification
CN107480261B (zh) 一种基于深度学习细粒度人脸图像快速检索方法
US20230162522A1 (en) Person re-identification method of integrating global features and ladder-shaped local features and device thereof
CN110209859A (zh) 地点识别及其模型训练的方法和装置以及电子设备
CN112766158A (zh) 基于多任务级联式人脸遮挡表情识别方法
Bazi et al. Bi-modal transformer-based approach for visual question answering in remote sensing imagery
CN111709311A (zh) 一种基于多尺度卷积特征融合的行人重识别方法
CN110349229A (zh) 一种图像描述方法及装置
CN112084895B (zh) 一种基于深度学习的行人重识别方法
Porav et al. Don’t worry about the weather: Unsupervised condition-dependent domain adaptation
CN108229432A (zh) 人脸标定方法及装置
Li et al. Multi-view-based siamese convolutional neural network for 3D object retrieval
US11908222B1 (en) Occluded pedestrian re-identification method based on pose estimation and background suppression
CN116597267B (zh) 图像识别方法、装置、计算机设备和存储介质
CN115222998B (zh) 一种图像分类方法
CN116386079A (zh) 基于元-图感知的领域泛化行人重识别方法及系统
CN116246305A (zh) 一种基于混合部件变换网络的行人检索方法
Tu et al. Toward automatic plant phenotyping: starting from leaf counting
CN113032612B (zh) 一种多目标图像检索模型的构建方法及检索方法和装置
CN110826726B (zh) 目标处理方法、目标处理装置、目标处理设备及介质
CN111931802A (zh) 基于Siamese网络结构融合中层特征的行人重识别方法
CN113128460B (zh) 基于知识蒸馏的多分辨率行人重识别方法
Murtaza et al. TAB: Temporally aggregated bag-of-discriminant-words for temporal action proposals

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22952829

Country of ref document: EP

Kind code of ref document: A1