CN115393901A - Cross-modal pedestrian re-identification method and computer readable storage medium - Google Patents

Cross-modal pedestrian re-identification method and computer readable storage medium

Info

Publication number
CN115393901A
CN115393901A (application CN202211110307.8A)
Authority
CN
China
Prior art keywords
layer
residual block
convolution
image
modal
Prior art date
Legal status
Pending
Application number
CN202211110307.8A
Other languages
Chinese (zh)
Inventor
钟志 (Zhong Zhi)
宋雨 (Song Yu)
王帮海 (Wang Banghai)
Current Assignee
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202211110307.8A
Publication of CN115393901A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a cross-modal pedestrian re-identification method and a computer readable storage medium. The cross-modal pedestrian re-identification method comprises the following steps: acquiring multiple images of pedestrians in different modalities to form a modal image set; preprocessing the modal image set to obtain image feature matrices classified by modality; adopting ResNet50 as the initial convolutional network model, inputting the image feature matrices into the initial convolutional network model, and optimizing it by training to obtain a trained convolutional network model; and acquiring an image of the pedestrian to be identified, inputting it into the trained convolutional network model, and outputting the pedestrian re-identification result. In the initial convolutional network model constructed by the invention, features of different layers and different scales are organically fused, which reduces the loss of pedestrian-related information during modality-shared feature extraction and improves the pedestrian identification effect; the method is particularly suitable for fields such as perimeter security and intelligent retrieval.

Description

Cross-modal pedestrian re-identification method and computer readable storage medium
Technical Field
The invention relates to the technical field of computer vision, in particular to a cross-modal pedestrian re-identification method and a computer readable storage medium.
Background
Pedestrian re-identification (also called person re-identification) uses computer vision techniques to determine whether a particular pedestrian appears in an image or video sequence. At present, pedestrian re-identification relies mainly on visible-light cameras; under poor illumination or in dark night-time environments, visible-light cameras cannot provide sufficient visual cues about a pedestrian's appearance. To solve this problem, cameras equipped with an infrared mode are being widely deployed.
The visible light images and infrared images collected in the visible-light and infrared modes are two different types of modal data: a visible light image lacks the thermal information of an infrared image, while an infrared image lacks the texture and color information of a visible light image. The large modal difference between the two, together with differences in camera viewing angle, illumination, and resolution, as well as image occlusion, strongly affects cross-modal pedestrian re-identification performance.
The prior art provides deep-learning-based methods for the above problems, including image migration methods and modality-shared feature learning methods. Image migration converts the cross-modal task into a single-modal task, but the generated pseudo-images are of unreliable quality and the approach depends heavily on training samples, so it cannot be applied to large-scale monitoring scenarios. Methods based on modality-shared feature learning, while extracting modal features and eliminating differences between modalities, also eliminate some important discriminative pedestrian features, which limits cross-modal pedestrian re-identification performance.
Disclosure of Invention
The invention provides a cross-modal pedestrian re-identification method and a computer-readable storage medium to overcome the prior art's failure to retain enough discriminative pedestrian features.
In order to solve the technical problems, the technical scheme of the invention is as follows:
in a first aspect, a cross-modal pedestrian re-identification method includes the following steps:
S1, acquiring multiple images of pedestrians in different modalities to form a modal image set; wherein the modal image set comprises visible light images and infrared images corresponding to pedestrian identities;
S2, preprocessing the modal image set to obtain image feature matrices classified by modality; wherein the image feature matrices comprise a visible light image feature matrix f_rgb and an infrared image feature matrix f_ir;
S3, adopting ResNet50 as the initial convolutional network model, wherein the initial convolutional network model comprises a plurality of feature extraction block layers and a first residual block layer_0, a second residual block layer_1, a third residual block layer_2 and a fourth residual block layer_3; inputting the image feature matrices into the initial convolutional network model and optimizing it by training to obtain a trained convolutional network model; wherein the hierarchical features extracted by the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2 are synchronously used for feature fusion compensation;
and S4, acquiring an image of the pedestrian to be recognized, inputting it into the trained convolutional network model, and outputting the pedestrian re-recognition result.
In this technical scheme, the initial convolutional network model is constructed based on ResNet50, and the hierarchical features extracted by the shallow network layers are organically fused as feature compensation, thereby reducing the loss of pedestrian-related information during modality-shared feature extraction and achieving a good pedestrian identification effect.
Preferably, in step S2, the preprocessing operation includes resolution adjustment and data enhancement. For data enhancement, one of a group of preset data enhancement operations is randomly selected each time, and a single image is processed with a randomly selected enhancement intensity.
Preferably, in step S3, the parameters of the plurality of feature extraction block layers are not shared, and each feature extraction block layer comprises a convolution layer, a batch normalization layer, a nonlinear activation function layer, and a maximum pooling layer.
Preferably, in the initial convolutional network model, the first residual block layer_0 comprises 3 residual blocks with the same structure, where each residual block comprises a first convolution unit, a second convolution unit and a third convolution unit connected in sequence; the first convolution unit comprises a convolution layer, a batch normalization layer, an instance normalization layer and a nonlinear activation function layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit comprises a convolution layer and a batch normalization layer;
the second residual block layer_1 comprises 4 residual blocks with the same structure, each comprising a first convolution unit, a second convolution unit and a third convolution unit connected in sequence; the third residual block layer_2 comprises 6 residual blocks with the same structure, each comprising a first convolution unit, a second convolution unit and a third convolution unit connected in sequence; the fourth residual block layer_3 comprises 3 residual blocks with the same structure, where each residual block comprises a fourth convolution unit, a second convolution unit and a third convolution unit connected in sequence, and the fourth convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer.
In this preferred scheme, the instance normalization operation is introduced into the residual blocks of the initial convolutional network model constructed based on ResNet50, effectively reducing the modal difference.
As a possible mode of the preferred embodiment, in step S3, the optimization training of the initial convolutional network model comprises the following steps:
S3.1, dividing the modal image set by pedestrian identity, in a preset proportion, into a training set for training the model and a test set for evaluating model performance; inputting the image feature matrices in the training set into the feature extraction block layers of the initial convolutional network model, and extracting the image features of each modality separately by modality category; wherein the number of feature extraction block layers is set to match the number of modality categories, the visible light image feature corresponding to the visible light modality is recorded as F_rgb, and the infrared image feature corresponding to the infrared modality is recorded as F_ir;
S3.2, splicing the image features corresponding to the same pedestrian identity, then passing them sequentially through the parameter-shared first residual block layer_0, second residual block layer_1, third residual block layer_2 and fourth residual block layer_3, and extracting the trunk feature F_4 and a hierarchical fusion feature;
specifically, the spliced image features are input into the first convolution unit of the first residual block layer_0; after the convolution layer they enter the batch normalization layer and the instance normalization layer in parallel, the outputs of the two normalization layers are added and passed to the nonlinear activation function layer, and its output passes sequentially through the second convolution unit and the third convolution unit to output a first feature map F_1;
the first feature map F_1 is input into the second residual block layer_1 to obtain a second feature map F_2;
the second feature map F_2 is input into the third residual block layer_2 to obtain a third feature map F_3;
the third feature map F_3 is input into the fourth residual block layer_3 to obtain the trunk feature F_4;
the first feature map F_1, the second feature map F_2 and the third feature map F_3 are organically fused to construct the hierarchical fusion feature;
S3.3, stretching the trunk feature F_4 and the hierarchical fusion feature into vectors, training each with a cross entropy loss function, splicing the vectors, and optimizing the initial convolutional network model by training with a weighted regularized triplet loss function to obtain the trained convolutional network model;
S3.4, testing the model performance with the test set: when the performance of the trained convolutional network model reaches a preset judgment condition, outputting the trained convolutional network model; when it cannot reach the preset judgment condition, re-dividing the training set and the test set and repeating steps S1 to S3.
Further, in step S3.2, the construction of the hierarchical fusion feature comprises the following steps:
S3.2.1, setting a first weight matrix α, a second weight matrix β and a third weight matrix γ whose sizes correspond to those of the first feature map F_1, the second feature map F_2 and the third feature map F_3, respectively;
S3.2.2, multiplying the first feature map F_1, the second feature map F_2 and the third feature map F_3 point by point by the first weight matrix α, the second weight matrix β and the third weight matrix γ, respectively; the parameters at each position of the first weight matrix α, the second weight matrix β and the third weight matrix γ are obtained through training;
S3.2.3, adding the three multiplied features to obtain the hierarchical fusion feature.
Preferably, in step S3.2.3, before fusing the first feature map F_1, the second feature map F_2 and the third feature map F_3, a down-sampling operation is performed on each of them so that the resolutions of the first feature map F_1 and the second feature map F_2 match that of the third feature map F_3, and the channel counts of the three feature maps are increased or decreased to a preset value.
Further, in step S3.2, a non-local attention mechanism block is introduced after each of the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2; it computes the interaction between any two positions in the feature map, directly capturing long-range dependencies. The non-local attention module enables the model to up-weight features useful for identifying pedestrians.
Further, in step S3.4, the preset judgment condition is set based on the cumulative matching characteristic (CMC) and the mean average precision (mAP).
In a second aspect, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the cross-modal pedestrian re-identification method proposed in any technical solution of the first aspect.
Compared with the prior art, the beneficial effects of the technical scheme of the invention are: the invention constructs an initial convolutional network model based on ResNet50 and fuses features of different layers and different scales, which reduces the loss of pedestrian-related information during modality-shared feature extraction, achieves a better pedestrian identification effect than the prior art, and is particularly suitable for fields such as perimeter security and intelligent retrieval.
Drawings
FIG. 1 is a flow chart of a cross-modal pedestrian re-identification method;
FIG. 2 is an overall framework diagram of the initial model of the convolutional network;
FIG. 3 is a diagram of a residual block model in an initial model of a convolutional network;
FIG. 4 is a schematic diagram of the non-local attention mechanism block structure.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides a cross-modal pedestrian re-identification method; fig. 1 shows the flowchart of the method of this embodiment.
The cross-modal pedestrian re-identification method provided by the embodiment comprises the following steps:
S1, acquiring multiple images of pedestrians in different modalities to form a modal image set; wherein the modal image set comprises visible light images and infrared images corresponding to pedestrian identities;
S2, preprocessing the modal image set to obtain image feature matrices classified by modality; wherein the image feature matrices comprise a visible light image feature matrix f_rgb and an infrared image feature matrix f_ir;
S3, adopting ResNet50 as the initial convolutional network model, wherein the initial convolutional network model comprises a plurality of feature extraction block layers and a first residual block layer_0, a second residual block layer_1, a third residual block layer_2 and a fourth residual block layer_3; inputting the image feature matrices into the initial convolutional network model and optimizing it by training to obtain a trained convolutional network model;
wherein the hierarchical features extracted by the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2 are synchronously used for feature fusion compensation;
and S4, acquiring an image of the pedestrian to be recognized, inputting it into the trained convolutional network model, and outputting the pedestrian re-recognition result.
The overall network model of this embodiment takes an improved ResNet50 as its backbone and organically fuses the hierarchical features extracted by the shallow network as feature compensation. This effectively avoids losing useful pedestrian-related features while images of different modalities pass through the residual blocks for modality-shared feature extraction, so a good pedestrian identification effect can be obtained.
In a specific implementation, the public cross-modal pedestrian re-identification data set SYSU-MM01, a large data set comprising 491 pedestrians captured by two infrared cameras and four visible light cameras, is selected and input into the initial convolutional network model for the experiments.
In an alternative embodiment, the preprocessing operation includes resolution adjustment and data enhancement: one of a group of preset data enhancement operations is randomly selected each time, and a single image is processed with a randomly selected enhancement intensity.
In a specific implementation, the preset group of data enhancement operations optionally includes automatic image contrast optimization, image flipping, image color adjustment, image cropping, image brightness adjustment, and image sharpness enhancement; the value range of the enhancement intensity is optionally preset to 0-30.
In one non-limiting example, the preprocessing operation is performed on the images in the data set SYSU-MM01: the resolution of all images is adjusted to 288 x 144, and each image is processed with a randomly selected data enhancement mode at a random enhancement intensity. After preprocessing, the enhanced data can be regarded as each image having been enhanced by every data enhancement operation with the results then uniformly sampled, which improves the robustness of the model.
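As a concrete illustration, the following is a minimal sketch of this augmentation scheme, assuming a PIL-based pipeline: one operation is drawn at random from a preset pool and applied at a random intensity in [0, 30]. The operation pool, the mapping from intensity to each enhancement factor, and the function names are illustrative assumptions, not the patent's reference code.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# Candidate operations; each takes a PIL image and a strength s in [0, 30].
# The mapping from s to each enhancement factor is an assumption.
def autocontrast(img, s):
    return ImageOps.autocontrast(img)           # strength not used here

def flip(img, s):
    return img.transpose(Image.FLIP_LEFT_RIGHT)

def color(img, s):
    return ImageEnhance.Color(img).enhance(1 + s / 30.0)

def brightness(img, s):
    return ImageEnhance.Brightness(img).enhance(1 + s / 30.0)

def sharpness(img, s):
    return ImageEnhance.Sharpness(img).enhance(1 + s / 30.0)

def crop(img, s):
    w, h = img.size
    m = int(s)                                   # crop margin in pixels
    return img.crop((m, m, w - m, h - m)).resize((w, h))

OPS = [autocontrast, flip, color, crop, brightness, sharpness]

def preprocess(img: Image.Image) -> Image.Image:
    img = img.resize((144, 288))                 # width x height = 144 x 288
    op = random.choice(OPS)                      # one operation per image
    strength = random.uniform(0, 30)             # random enhancement intensity
    return op(img, strength)
```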
In an optional embodiment, the parameters of the plurality of feature extraction block layers in step S3 are not shared, and each feature extraction block layer comprises a convolution layer, a batch normalization layer, a nonlinear activation function layer and a maximum pooling layer.
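For concreteness, a sketch of the two parameter-independent feature extraction block layers, one per modality, assuming each mirrors the standard ResNet50 stem; the kernel and channel sizes below are that stem's defaults, not values stated in the patent:

```python
import torch
import torch.nn as nn

def make_stem() -> nn.Sequential:
    # Feature extraction block layer: convolution, batch normalization,
    # nonlinear activation, maximum pooling.
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    )

# Parameters are not shared: one independent stem per modality.
stem_rgb, stem_ir = make_stem(), make_stem()

x_rgb = torch.randn(8, 3, 288, 144)            # visible-light batch
x_ir = torch.randn(8, 3, 288, 144)             # infrared batch
F_rgb, F_ir = stem_rgb(x_rgb), stem_ir(x_ir)   # modality-specific features
```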
Example 2
This embodiment provides a cross-modal pedestrian re-identification method. Fig. 2 shows the overall framework of the initial convolutional network model of this embodiment, where L_id denotes the cross-entropy loss and L_wrt denotes the weighted regularized triplet loss.
This embodiment builds on the cross-modal pedestrian re-identification method provided in Example 1 and further provides that, in the initial convolutional network model:
the first residual block layer 0 The residual block comprises 3 residual blocks with the same structure, wherein each residual block comprises a first convolution unit, a second convolution unit and a third convolution unit which are sequentially connected, and each first convolution unit comprises a convolution layer, a batch normalization layer, an example normalization layer and a nonlinear activation functionA layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit comprises a convolution layer and a batch normalization layer;
the second residual block layer 1 The residual block comprises 4 residual blocks with the same structure, wherein the residual block comprises a first convolution unit, a second convolution unit and a third convolution unit which are sequentially connected, and the first convolution unit comprises a convolution layer, a batch normalization layer, an example normalization layer and a nonlinear activation function layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit comprises a convolution layer and a batch normalization layer;
the third residual block layer 2 The residual block comprises 6 residual blocks with the same structure, wherein the residual block comprises a first convolution unit, a second convolution unit and a third convolution unit which are sequentially connected, and the first convolution unit comprises a convolution layer, a batch normalization layer, an example normalization layer and a nonlinear activation function layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit comprises a convolution layer and a batch normalization layer;
the fourth residual block layer 3 The residual block comprises 3 residual blocks with the same structure, wherein each residual block comprises a fourth convolution unit, a second convolution unit and a third convolution unit which are sequentially connected, and each fourth convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit includes a convolution layer and a batch normalization layer.
Fig. 3 shows the residual block models of this embodiment: fig. 3(a) is the residual block without an instance normalization layer, and fig. 3(b) is the residual block after the instance normalization layer is introduced. In this embodiment, introducing an instance normalization layer into the residual blocks of the initial convolutional network model makes the model less susceptible to pedestrian appearance changes and, compared with the prior art, reduces the modal difference between different modal images of the same pedestrian. In addition, unlike the prior art, the residual blocks in the fourth residual block layer of this embodiment omit the downsampling operation, which enlarges the output feature map.
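A sketch of the residual block of fig. 3(b) in PyTorch. The bottleneck channel widths, the shortcut projection, and the placement of the final activation follow standard ResNet50 practice, which the patent does not spell out, so treat them as assumptions:

```python
import torch
import torch.nn as nn

class INBottleneck(nn.Module):
    """Bottleneck residual block of fig. 3(b): the first convolution unit
    feeds a batch-norm branch and an instance-norm branch in parallel,
    and their outputs are added before the activation."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1, use_in=True):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.in1 = nn.InstanceNorm2d(mid_ch, affine=True) if use_in else None
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))

    def forward(self, x):
        # First convolution unit: conv -> (BN + IN, summed) -> ReLU.
        # With use_in=False this degenerates to the 'fourth convolution
        # unit' (conv -> BN -> ReLU) used in the fourth residual block layer.
        y = self.conv1(x)
        y = self.bn1(y) if self.in1 is None else self.bn1(y) + self.in1(y)
        y = self.relu(y)
        # Second convolution unit: conv -> BN -> ReLU.
        y = self.relu(self.bn2(self.conv2(y)))
        # Third convolution unit: conv -> BN, then residual addition.
        y = self.bn3(self.conv3(y))
        return self.relu(y + self.shortcut(x))
```

Stacking 3, 4, 6 and 3 such blocks yields layer_0 through layer_3, with use_in=False in layer_3.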
In an optional embodiment, in step S3, the optimization training of the initial convolutional network model comprises the following steps:
S3.1, dividing the modal image set by pedestrian identity, in a preset proportion, into a training set for training the model and a test set for evaluating model performance; inputting the image feature matrices in the training set into the feature extraction block layers of the initial convolutional network model, and extracting the image features of each modality separately by modality category; the number of feature extraction block layers is set to match the number of modality categories, the visible light image feature corresponding to the visible light modality is recorded as F_rgb, and the infrared image feature corresponding to the infrared modality is recorded as F_ir;
S3.2, splicing the image features corresponding to the same pedestrian identity, then passing them sequentially through the parameter-shared first residual block layer_0, second residual block layer_1, third residual block layer_2 and fourth residual block layer_3, and extracting the trunk feature F_4 and a hierarchical fusion feature;
the spliced image features are input into the first convolution unit of the first residual block layer_0; after convolution they enter the batch normalization layer and the instance normalization layer in parallel, the outputs of the two normalization layers are added and passed to the nonlinear activation function layer, and its output passes sequentially through the second convolution unit and the third convolution unit to output a first feature map F_1;
the first feature map F_1 is input into the second residual block layer_1 to obtain a second feature map F_2;
the second feature map F_2 is input into the third residual block layer_2 to obtain a third feature map F_3;
the third feature map F_3 is input into the fourth residual block layer_3 to obtain the trunk feature F_4;
the first feature map F_1, the second feature map F_2 and the third feature map F_3 are organically fused to construct the hierarchical fusion feature, as sketched below;
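Putting S3.2 together, a compact sketch of the forward pass, continuing the earlier sketches: stem_rgb/stem_ir are the modality-specific stems from Example 1, and the four residual block layers (3/4/6/3 blocks) are assembled from INBottleneck blocks above. The channel widths are ResNet50 defaults, assumed rather than stated in the patent:

```python
import torch
import torch.nn as nn

def make_layer(blocks, in_ch, mid_ch, out_ch, stride=1, use_in=True):
    # First block changes channels/resolution; the rest keep them.
    layers = [INBottleneck(in_ch, mid_ch, out_ch, stride, use_in)]
    layers += [INBottleneck(out_ch, mid_ch, out_ch, 1, use_in)
               for _ in range(blocks - 1)]
    return nn.Sequential(*layers)

layer0 = make_layer(3, 64, 64, 256)                    # 3 blocks
layer1 = make_layer(4, 256, 128, 512, stride=2)        # 4 blocks
layer2 = make_layer(6, 512, 256, 1024, stride=2)       # 6 blocks
layer3 = make_layer(3, 1024, 512, 2048, use_in=False)  # 3 blocks, no downsampling

x = torch.cat([F_rgb, F_ir], dim=0)   # splice the two modality batches

F1 = layer0(x)    # first feature map F_1
F2 = layer1(F1)   # second feature map F_2
F3 = layer2(F2)   # third feature map F_3
F4 = layer3(F3)   # trunk feature F_4 (same resolution as F_3)
```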
S3.3, stretching the trunk feature F_4 and the hierarchical fusion feature into vectors, training each with a cross entropy loss function, splicing the vectors, and optimizing the initial convolutional network model by training with a weighted regularized triplet loss function to obtain the trained convolutional network model;
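The patent does not reproduce the weighted regularized triplet loss formula; the sketch below assumes the common formulation from the AGW baseline, where the distances of each anchor to all its positives and negatives are combined with softmax/softmin weights and no margin hyperparameter is needed:

```python
import torch
import torch.nn.functional as F

def weighted_regularized_triplet(feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """One plausible reading of the patent's 'weighted regularized triplet
    loss': log(1 + exp(d_p - d_n)) with softmax-weighted distances."""
    dist = torch.cdist(feats, feats)                   # pairwise Euclidean distances
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-identity mask
    neg = ~pos
    pos = pos.float() - torch.eye(len(labels), device=feats.device)  # drop self-pairs

    # Far positives and close negatives get large weights.
    w_p = torch.softmax(dist.masked_fill(pos == 0, float('-inf')), dim=1)
    w_n = torch.softmax((-dist).masked_fill(neg == 0, float('-inf')), dim=1)

    d_p = (w_p * dist).nan_to_num(0).sum(dim=1)        # weighted positive distance
    d_n = (w_n * dist).nan_to_num(0).sum(dim=1)        # weighted negative distance
    return F.softplus(d_p - d_n).mean()                # log(1 + exp(d_p - d_n))

# Total objective per the text: a cross-entropy (identity) loss on each
# branch plus the weighted regularized triplet loss on the spliced vectors.
```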
S3.4, testing the model performance with the test set: when the performance of the trained convolutional network model reaches a preset judgment condition, the trained convolutional network model is output; when it cannot reach the preset judgment condition, the training set and the test set are re-divided and steps S1 to S3 are repeated.
In this optional embodiment, the organic fusion of hierarchical features can be regarded as an attention mechanism: only features related to pedestrian identity are selected for fusion, avoiding the negative optimization that would result from fusing too many unnecessary features.
Further, in step S3.2, the construction of the hierarchical fusion feature comprises the following steps:
S3.2.1, setting a first weight matrix α, a second weight matrix β and a third weight matrix γ whose sizes correspond to those of the first feature map F_1, the second feature map F_2 and the third feature map F_3, respectively;
S3.2.2, multiplying the first feature map F_1, the second feature map F_2 and the third feature map F_3 point by point by the first weight matrix α, the second weight matrix β and the third weight matrix γ, respectively; the parameters at each position of the first weight matrix α, the second weight matrix β and the third weight matrix γ are obtained through training;
S3.2.3, adding the three multiplied features to obtain the hierarchical fusion feature.
In this further embodiment, each weight matrix is multiplied point by point with its corresponding feature map to determine which pixels of the feature map are emphasized or suppressed, and the multiplied features are added to obtain the hierarchical fusion feature, i.e. αF_1 + βF_2 + γF_3, which contains more discriminative identity-related features. Fusing the hierarchical fusion feature with the trunk feature, i.e. F_4 + αF_1 + βF_2 + γF_3, minimizes the elimination of useful pedestrian-related features during feature extraction and improves the utilization of image features.
In a non-limiting example, the training of the parameter values at each position of the first weight matrix α, the second weight matrix β and the third weight matrix γ proceeds as follows: the gradient of each parameter is obtained by back-propagation, and the parameters are then updated with a gradient descent algorithm.
In a specific implementation using the images of the data set SYSU-MM01 as input for the optimization training, in step S3.2.3, before fusing the first feature map F_1, the second feature map F_2 and the third feature map F_3, a down-sampling operation is applied to each so that the resolutions of F_1 and F_2 match that of F_3, which reduces the number of parameters; the channel counts of F_1, F_2 and F_3 are reduced to 256, without limitation.
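A sketch of the hierarchical fusion under the dimensions implied above, with channels projected to 256 and F_1, F_2 pooled down to F_3's resolution. The 1x1-convolution channel projection and the average-pooling downsample are assumptions; the patent only states that a downsampling operation and a channel change to a preset value are applied:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalFusion(nn.Module):
    """Learnable element-wise weight matrices alpha, beta, gamma, one per
    hierarchical feature map. The default shape (256 channels, 18x9
    spatial, matching a 288x144 input) is an illustrative assumption."""
    def __init__(self, shape=(256, 18, 9)):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.ones(shape))
        self.gamma = nn.Parameter(torch.ones(shape))
        # 1x1 convolutions bring each map to the preset channel count (256).
        self.proj1 = nn.Conv2d(256, 256, 1)
        self.proj2 = nn.Conv2d(512, 256, 1)
        self.proj3 = nn.Conv2d(1024, 256, 1)

    def forward(self, F1, F2, F3):
        size = tuple(F3.shape[-2:])                       # target: F3's resolution
        F1 = F.adaptive_avg_pool2d(self.proj1(F1), size)  # downsample F1
        F2 = F.adaptive_avg_pool2d(self.proj2(F2), size)  # downsample F2
        F3 = self.proj3(F3)
        # Point-wise multiplication by the trained weights, then addition:
        # alpha*F1 + beta*F2 + gamma*F3.
        return self.alpha * F1 + self.beta * F2 + self.gamma * F3
```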
Fig. 4 is a schematic diagram of the non-local attention mechanism block structure. Further, in step S3.2, a non-local attention mechanism block is introduced after each of the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2; it computes the interaction between any two positions in the feature map, directly capturing long-range dependencies. The non-local attention mechanism block ("non-local" for short) lets the model raise the weight of features useful for identifying pedestrians without being limited to neighboring points; it is equivalent to a convolution kernel as large as the feature map itself, enlarging the receptive field while retaining more information.
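A sketch of the block in fig. 4, assuming the embedded-Gaussian form of the original non-local network (Wang et al., 2018); the patent's figure may differ in detail:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Non-local block: the response at each position is a weighted sum
    over all positions, capturing long-range dependencies beyond the
    convolutional receptive field."""
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # B x HW x C'
        k = self.phi(x).flatten(2)                     # B x C' x HW
        v = self.g(x).flatten(2).transpose(1, 2)       # B x HW x C'
        # Pairwise interaction between any two positions in the feature map.
        attn = torch.softmax(q @ k, dim=-1)            # B x HW x HW
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection
```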
Further, in step S3.4, the preset judgment condition is set based on the cumulative matching characteristic (CMC) and the mean average precision (mAP).
The cumulative matching characteristic CMC evaluates the accuracy of similarity ranking in a closed set: for each query pedestrian to be re-identified in the candidate library, it measures the proportion of queries whose first n retrieval results contain a correct match.
In a specific implementation, the model is trained for 150 epochs and evaluated on the test set every 2 epochs, i.e. the cumulative matching characteristic CMC and the mean average precision mAP of the test set are computed, and the model with the best test-set CMC is selected.
In a non-limiting example, for a given pedestrian image to be retrieved in the test set, its features are extracted by the model, and the candidate-library images are sorted by increasing distance between the query's features and each candidate's features. Rank-n denotes the proportion of queries whose n most-confident retrieval results contain a correct image, and is calculated as:
Rank-n = M / G
where M represents the number of queries whose first n recognition results contain a correct image, and G denotes the number of pedestrian images to be queried.
In one non-limiting example, the mean average precision mAP is expressed as:
mAP = (1/G) · Σ_{i=1}^{G} AP_i
where G represents the number of queries and AP represents the average precision, obtained by averaging the precision over a query's correct matches.
The precision and average precision expressions are as follows:
Precision = TP / (TP + FP)
AP = (1/m) · Σ_{k=1}^{m} Precision(k)
where TP represents the number of correctly predicted positive samples, FP represents the number of incorrectly predicted positive samples, and m represents the number of images in the candidate library that share the query image's label.
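To make the evaluation concrete, a sketch computing Rank-n and mAP from a query-by-gallery distance matrix; it omits the camera-based filtering of the full SYSU-MM01 protocol:

```python
import numpy as np

def evaluate(dist: np.ndarray, q_ids: np.ndarray, g_ids: np.ndarray, n: int = 10):
    """Rank-n (one CMC point) and mAP from a Q x G distance matrix."""
    order = np.argsort(dist, axis=1)            # gallery sorted near -> far
    matches = g_ids[order] == q_ids[:, None]    # True where identity matches
    G = len(q_ids)                              # number of query images

    # Rank-n = M / G: fraction of queries with a correct match in the top n.
    M = matches[:, :n].any(axis=1).sum()
    rank_n = M / G

    # mAP = mean over queries of AP, the average precision at each hit.
    aps = []
    for row in matches:
        hits = np.where(row)[0]                 # ranks of correct matches
        if hits.size == 0:
            continue
        precision = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precision.mean())
    return rank_n, float(np.mean(aps))
```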
Example 3
This embodiment provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program causes the processor to execute the cross-modal pedestrian re-identification method proposed in Example 1 or Example 2 above.
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the terms "first," "second," and the like as used in the description and in the claims, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Also, the use of the terms "a" or "an" and the like do not denote a limitation of quantity, but rather denote the presence of at least one.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A cross-modal pedestrian re-identification method, characterized by comprising the following steps:
S1, acquiring multiple images of pedestrians in different modalities to form a modal image set; wherein the modal image set comprises visible light images and infrared images corresponding to pedestrian identities;
S2, preprocessing the modal image set to obtain image feature matrices classified by modality; wherein the image feature matrices comprise a visible light image feature matrix f_rgb and an infrared image feature matrix f_ir;
S3, adopting ResNet50 as the initial convolutional network model, wherein the initial convolutional network model comprises a plurality of feature extraction block layers and a first residual block layer_0, a second residual block layer_1, a third residual block layer_2 and a fourth residual block layer_3; inputting the image feature matrices into the initial convolutional network model and optimizing it by training to obtain a trained convolutional network model;
wherein the hierarchical features extracted by the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2 are synchronously used for feature fusion compensation;
and S4, acquiring an image of the pedestrian to be recognized, inputting it into the trained convolutional network model, and outputting the pedestrian re-recognition result.
2. The cross-modal pedestrian re-identification method according to claim 1, wherein in step S2 the preprocessing operation includes resolution adjustment and data enhancement; for data enhancement, one of a group of preset data enhancement operations is randomly selected each time, and a single image is processed with a randomly selected enhancement intensity.
3. The cross-modal pedestrian re-identification method according to claim 1, wherein the parameters of the plurality of feature extraction block layers in step S3 are not shared, and each feature extraction block layer comprises a convolution layer, a batch normalization layer, a nonlinear activation function layer and a maximum pooling layer.
4. The cross-modal pedestrian re-identification method according to claim 1, wherein in the initial convolutional network model, the first residual block layer_0 comprises 3 residual blocks with the same structure, where each residual block comprises a first convolution unit, a second convolution unit and a third convolution unit connected in sequence; the first convolution unit comprises a convolution layer, a batch normalization layer, an instance normalization layer and a nonlinear activation function layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit comprises a convolution layer and a batch normalization layer;
the second residual block layer_1 comprises 4 residual blocks with the same structure, each comprising a first convolution unit, a second convolution unit and a third convolution unit connected in sequence;
the third residual block layer_2 comprises 6 residual blocks with the same structure, each comprising a first convolution unit, a second convolution unit and a third convolution unit connected in sequence;
the fourth residual block layer_3 comprises 3 residual blocks with the same structure, where each residual block comprises a fourth convolution unit, a second convolution unit and a third convolution unit connected in sequence, and the fourth convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer.
5. The cross-modal pedestrian re-identification method according to claim 4, wherein in step S3, the optimization training of the initial convolutional network model comprises the following steps:
S3.1, dividing the modal image set by pedestrian identity, in a preset proportion, into a training set for training the model and a test set for evaluating model performance; inputting the image feature matrices in the training set into the feature extraction block layers of the initial convolutional network model, and extracting the image features of each modality separately by modality category; wherein the number of feature extraction block layers is set to match the number of modality categories, the visible light image feature corresponding to the visible light modality is recorded as F_rgb, and the infrared image feature corresponding to the infrared modality is recorded as F_ir;
S3.2, splicing the image features corresponding to the same pedestrian identity, then passing them sequentially through the parameter-shared first residual block layer_0, second residual block layer_1, third residual block layer_2 and fourth residual block layer_3, and extracting the trunk feature F_4 and a hierarchical fusion feature;
the spliced image features are input into the first convolution unit of the first residual block layer_0; after convolution they enter the batch normalization layer and the instance normalization layer in parallel, the outputs of the two normalization layers are added and passed to the nonlinear activation function layer, and its output passes sequentially through the second convolution unit and the third convolution unit to output a first feature map F_1;
the first feature map F_1 is input into the second residual block layer_1 to obtain a second feature map F_2;
the second feature map F_2 is input into the third residual block layer_2 to obtain a third feature map F_3;
the third feature map F_3 is input into the fourth residual block layer_3 to obtain the trunk feature F_4;
the first feature map F_1, the second feature map F_2 and the third feature map F_3 are organically fused to construct the hierarchical fusion feature;
S3.3, stretching the trunk feature F_4 and the hierarchical fusion feature into vectors, training each with a cross entropy loss function, splicing the vectors, and optimizing the initial convolutional network model by training with a weighted regularized triplet loss function to obtain the trained convolutional network model;
S3.4, testing the model performance with the test set: when the performance of the trained convolutional network model reaches a preset judgment condition, outputting the trained convolutional network model; when it cannot reach the preset judgment condition, re-dividing the training set and the test set and repeating steps S1 to S3.
6. The cross-modal pedestrian re-identification method according to claim 5, wherein in step S3.2, the construction of the hierarchical fusion feature comprises the following steps:
S3.2.1, setting a first weight matrix α, a second weight matrix β and a third weight matrix γ whose sizes correspond to those of the first feature map F_1, the second feature map F_2 and the third feature map F_3, respectively;
S3.2.2, multiplying the first feature map F_1, the second feature map F_2 and the third feature map F_3 point by point by the first weight matrix α, the second weight matrix β and the third weight matrix γ, respectively; the parameters at each position of the first weight matrix α, the second weight matrix β and the third weight matrix γ are obtained through training;
S3.2.3, adding the three multiplied features to obtain the hierarchical fusion feature.
7. The cross-modal pedestrian re-identification method according to claim 6, wherein in step S3.2.3, before fusing the first feature map F_1, the second feature map F_2 and the third feature map F_3, a down-sampling operation is performed on each of them so that the resolutions of the first feature map F_1 and the second feature map F_2 match that of the third feature map F_3, and the channel counts of the three feature maps are increased or decreased to a preset value.
8. The cross-modal pedestrian re-identification method according to claim 5, wherein in step S3.2, a non-local attention mechanism block is introduced after each of the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2, which computes the interaction between any two positions in the feature map, directly capturing long-range dependencies.
9. The cross-modal pedestrian re-identification method according to claim 5, wherein in step S3.4 the preset judgment condition is set based on the cumulative matching characteristic (CMC) and the mean average precision (mAP).
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the cross-modal pedestrian re-identification method according to any one of claims 1 to 9.
CN202211110307.8A 2022-09-13 2022-09-13 Cross-modal pedestrian re-identification method and computer readable storage medium Pending CN115393901A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211110307.8A CN115393901A (en) 2022-09-13 2022-09-13 Cross-modal pedestrian re-identification method and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211110307.8A CN115393901A (en) 2022-09-13 2022-09-13 Cross-modal pedestrian re-identification method and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115393901A true CN115393901A (en) 2022-11-25

Family

ID=84125698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211110307.8A Pending CN115393901A (en) 2022-09-13 2022-09-13 Cross-modal pedestrian re-identification method and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115393901A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117994822A (en) * 2024-04-07 2024-05-07 南京信息工程大学 Cross-mode pedestrian re-identification method based on auxiliary mode enhancement and multi-scale feature fusion

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN111539370B (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN108399362B (en) Rapid pedestrian detection method and device
CN108052911B (en) Deep learning-based multi-mode remote sensing image high-level feature fusion classification method
CN111767882A (en) Multi-mode pedestrian detection method based on improved YOLO model
CN111178183B (en) Face detection method and related device
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN113077491B (en) RGBT target tracking method based on cross-modal sharing and specific representation form
CN112767645B (en) Smoke identification method and device and electronic equipment
CN112507853B (en) Cross-modal pedestrian re-recognition method based on mutual attention mechanism
CN114972976B (en) Night target detection and training method and device based on frequency domain self-attention mechanism
CN116311254B (en) Image target detection method, system and equipment under severe weather condition
CN115311186B (en) Cross-scale attention confrontation fusion method and terminal for infrared and visible light images
CN115131640A (en) Target detection method and system utilizing illumination guide and attention mechanism
Malav et al. DHSGAN: An end to end dehazing network for fog and smoke
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN115393901A (en) Cross-modal pedestrian re-identification method and computer readable storage medium
CN114170422A (en) Coal mine underground image semantic segmentation method
Zhang et al. Deep joint neural model for single image haze removal and color correction
CN111178370B (en) Vehicle searching method and related device
CN116542865A (en) Multi-scale real-time defogging method and device based on structural re-parameterization
CN112633089B (en) Video pedestrian re-identification method, intelligent terminal and storage medium
CN116958615A (en) Picture identification method, device, equipment and medium
CN114581769A (en) Method for identifying houses under construction based on unsupervised clustering
CN113642353B (en) Training method of face detection model, storage medium and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination