CN115393901A - Cross-modal pedestrian re-identification method and computer readable storage medium - Google Patents

Cross-modal pedestrian re-identification method and computer readable storage medium

Info

Publication number
CN115393901A
CN115393901A (application CN202211110307.8A)
Authority
CN
China
Prior art keywords
layer
residual block
convolution
image
modal
Prior art date
Legal status
Pending
Application number
CN202211110307.8A
Other languages
Chinese (zh)
Inventor
钟志 (Zhong Zhi)
宋雨 (Song Yu)
王帮海 (Wang Banghai)
Current Assignee
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202211110307.8A
Publication of CN115393901A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a cross-modal pedestrian re-identification method and a computer readable storage medium. The cross-modal pedestrian re-identification method comprises the following steps: acquiring multiple images of pedestrians in different modalities to form a modal image set; preprocessing the modal image set to obtain image feature matrices classified by modality; adopting ResNet50 as the initial convolutional network model, inputting the image feature matrices into the initial convolutional network model, and optimizing it by training to obtain a trained convolutional network model; and acquiring an image of the pedestrian to be identified, inputting it into the trained convolutional network model, and outputting the pedestrian re-identification result. In the initial convolutional network model constructed by the invention, features of different layers and different scales are organically fused, which reduces the loss of pedestrian-related information during modality-shared feature extraction and improves the pedestrian identification effect; the method is particularly suitable for fields such as perimeter security and intelligent retrieval.

Description

Cross-modal pedestrian re-identification method and computer readable storage medium
Technical Field
The invention relates to the technical field of computer vision, in particular to a cross-modal pedestrian re-identification method and a computer readable storage medium.
Background
Pedestrian re-identification (also called person re-identification) uses computer vision techniques to determine whether a particular pedestrian appears in an image or video sequence. At present, pedestrian re-identification relies mainly on visible-light cameras; under poor illumination or in dark night-time environments, visible-light cameras cannot provide sufficient visual cues about a pedestrian's appearance. To solve this problem, cameras equipped with an infrared mode are being widely deployed.
The visible light images and infrared images collected in the visible-light and infrared modes are two different types of modal data: a visible light image lacks the thermal information of an infrared image, while an infrared image lacks the texture and color information of a visible light image. The large modal difference between the two, together with differences in camera viewing angle, illumination, and resolution, as well as image occlusion, strongly affects cross-modal pedestrian re-identification performance.
The prior art provides deep-learning-based methods for the above problems, including image migration methods and modality-shared feature learning methods. Image migration converts the cross-modal task into a single-modal task, but the generated pseudo-images are of unreliable quality and the approach depends heavily on training samples, so it cannot be applied to large-scale monitoring scenarios. Methods based on modality-shared feature learning, while extracting modal features and eliminating differences between modalities, also eliminate some important discriminative pedestrian features, which limits cross-modal pedestrian re-identification performance.
Disclosure of Invention
The invention provides a cross-modal pedestrian re-identification method and a computer-readable storage medium to overcome the prior art's failure to retain enough discriminative pedestrian features.
In order to solve the technical problems, the technical scheme of the invention is as follows:
in a first aspect, a cross-modal pedestrian re-identification method includes the following steps:
S1, acquiring multiple images of pedestrians in different modalities to form a modal image set; wherein the modal image set comprises visible light images and infrared images corresponding to pedestrian identities;
S2, preprocessing the modal image set to obtain image feature matrices classified by modality; wherein the image feature matrices comprise a visible light image feature matrix f_rgb and an infrared image feature matrix f_ir;
S3, adopting ResNet50 as the initial convolutional network model, wherein the initial convolutional network model comprises a plurality of feature extraction block layers and a first residual block layer_0, a second residual block layer_1, a third residual block layer_2 and a fourth residual block layer_3; inputting the image feature matrices into the initial convolutional network model and optimizing it by training to obtain a trained convolutional network model; wherein the hierarchical features extracted by the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2 are synchronously used for feature fusion compensation;
and S4, acquiring an image of the pedestrian to be recognized, inputting it into the trained convolutional network model, and outputting the pedestrian re-recognition result.
In this technical scheme, the initial convolutional network model is constructed based on ResNet50, and the hierarchical features extracted by the shallow network layers are organically fused as feature compensation, thereby reducing the loss of pedestrian-related information during modality-shared feature extraction and achieving a good pedestrian identification effect.
Preferably, in step S2, the preprocessing operation includes resolution adjustment and data enhancement. For data enhancement, one of a group of preset data enhancement operations is randomly selected each time, and a single image is processed with a randomly selected enhancement intensity.
Preferably, in step S3, the parameters of the plurality of feature extraction block layers are not shared, and each feature extraction block layer comprises a convolution layer, a batch normalization layer, a nonlinear activation function layer, and a maximum pooling layer.
Preferably, in the initial convolutional network model, the first residual block layer_0 comprises 3 residual blocks with the same structure, where each residual block comprises a first convolution unit, a second convolution unit and a third convolution unit connected in sequence; the first convolution unit comprises a convolution layer, a batch normalization layer, an instance normalization layer and a nonlinear activation function layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit comprises a convolution layer and a batch normalization layer;
the second residual block layer_1 comprises 4 residual blocks with the same structure, each comprising a first convolution unit, a second convolution unit and a third convolution unit connected in sequence; the third residual block layer_2 comprises 6 residual blocks with the same structure, each comprising a first convolution unit, a second convolution unit and a third convolution unit connected in sequence; the fourth residual block layer_3 comprises 3 residual blocks with the same structure, where each residual block comprises a fourth convolution unit, a second convolution unit and a third convolution unit connected in sequence, and the fourth convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer.
In this preferred scheme, the instance normalization operation is introduced into the residual blocks of the initial convolutional network model constructed based on ResNet50, effectively reducing the modal difference.
As a possible mode of the preferred embodiment, in step S3, the optimization training of the initial convolutional network model comprises the following steps:
S3.1, dividing the modal image set by pedestrian identity, in a preset proportion, into a training set for training the model and a test set for evaluating model performance; inputting the image feature matrices in the training set into the feature extraction block layers of the initial convolutional network model, and extracting the image features of each modality separately by modality category; wherein the number of feature extraction block layers is set to match the number of modality categories, the visible light image feature corresponding to the visible light modality is recorded as F_rgb, and the infrared image feature corresponding to the infrared modality is recorded as F_ir;
S3.2, splicing the image features corresponding to the same pedestrian identity, then passing them sequentially through the parameter-shared first residual block layer_0, second residual block layer_1, third residual block layer_2 and fourth residual block layer_3, and extracting the trunk feature F_4 and a hierarchical fusion feature;
specifically, the spliced image features are input into the first convolution unit of the first residual block layer_0; after the convolution layer they enter the batch normalization layer and the instance normalization layer in parallel, the outputs of the two normalization layers are added and passed to the nonlinear activation function layer, and its output passes sequentially through the second convolution unit and the third convolution unit to output a first feature map F_1;
the first feature map F_1 is input into the second residual block layer_1 to obtain a second feature map F_2;
the second feature map F_2 is input into the third residual block layer_2 to obtain a third feature map F_3;
the third feature map F_3 is input into the fourth residual block layer_3 to obtain the trunk feature F_4;
the first feature map F_1, the second feature map F_2 and the third feature map F_3 are organically fused to construct the hierarchical fusion feature;
S3.3, stretching the trunk feature F_4 and the hierarchical fusion feature into vectors, training each with a cross entropy loss function, splicing the vectors, and optimizing the initial convolutional network model by training with a weighted regularized triplet loss function to obtain the trained convolutional network model;
S3.4, testing the model performance with the test set: when the performance of the trained convolutional network model reaches a preset judgment condition, outputting the trained convolutional network model; when it cannot reach the preset judgment condition, re-dividing the training set and the test set and repeating steps S1 to S3.
Further, in step S3.2, the construction of the hierarchical fusion feature comprises the following steps:
S3.2.1, setting a first weight matrix α, a second weight matrix β and a third weight matrix γ whose sizes correspond to those of the first feature map F_1, the second feature map F_2 and the third feature map F_3, respectively;
S3.2.2, multiplying the first feature map F_1, the second feature map F_2 and the third feature map F_3 point by point by the first weight matrix α, the second weight matrix β and the third weight matrix γ, respectively; the parameters at each position of the first weight matrix α, the second weight matrix β and the third weight matrix γ are obtained through training;
S3.2.3, adding the three multiplied features to obtain the hierarchical fusion feature.
Preferably, in step S3.2.3, before fusing the first feature map F_1, the second feature map F_2 and the third feature map F_3, a down-sampling operation is performed on each of them so that the resolutions of the first feature map F_1 and the second feature map F_2 match that of the third feature map F_3, and the channel counts of the three feature maps are increased or decreased to a preset value.
Further, in step S3.2, a non-local attention mechanism block is introduced after each of the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2; it computes the interaction between any two positions in the feature map, directly capturing long-range dependencies. The non-local attention module enables the model to up-weight features useful for identifying pedestrians.
Further, in step S3.4, the preset judgment condition is set based on the cumulative matching characteristic (CMC) and the mean average precision (mAP).
In a second aspect, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the cross-modal pedestrian re-identification method proposed in any technical solution of the first aspect.
Compared with the prior art, the beneficial effects of the technical scheme of the invention are: the invention constructs an initial convolutional network model based on ResNet50 and fuses features of different layers and different scales, which reduces the loss of pedestrian-related information during modality-shared feature extraction, achieves a better pedestrian identification effect than the prior art, and is particularly suitable for fields such as perimeter security and intelligent retrieval.
Drawings
FIG. 1 is a flow chart of a cross-modal pedestrian re-identification method;
FIG. 2 is an overall framework diagram of the initial model of the convolutional network;
FIG. 3 is a diagram of a residual block model in an initial model of a convolutional network;
FIG. 4 is a schematic diagram of the non-local attention mechanism block structure.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides a cross-modal pedestrian re-identification method; fig. 1 shows the flowchart of the method of this embodiment.
The cross-modal pedestrian re-identification method provided by the embodiment comprises the following steps:
S1, acquiring multiple images of pedestrians in different modalities to form a modal image set; wherein the modal image set comprises visible light images and infrared images corresponding to pedestrian identities;
S2, preprocessing the modal image set to obtain image feature matrices classified by modality; wherein the image feature matrices comprise a visible light image feature matrix f_rgb and an infrared image feature matrix f_ir;
S3, adopting ResNet50 as the initial convolutional network model, wherein the initial convolutional network model comprises a plurality of feature extraction block layers and a first residual block layer_0, a second residual block layer_1, a third residual block layer_2 and a fourth residual block layer_3; inputting the image feature matrices into the initial convolutional network model and optimizing it by training to obtain a trained convolutional network model;
wherein the hierarchical features extracted by the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2 are synchronously used for feature fusion compensation;
and S4, acquiring an image of the pedestrian to be recognized, inputting it into the trained convolutional network model, and outputting the pedestrian re-recognition result.
The overall network model of this embodiment takes an improved ResNet50 as its backbone and organically fuses the hierarchical features extracted by the shallow network as feature compensation. This effectively avoids losing useful pedestrian-related features while images of different modalities pass through the residual blocks for modality-shared feature extraction, so a good pedestrian identification effect can be obtained.
In a specific implementation, the public cross-modal pedestrian re-identification data set SYSU-MM01, a large data set comprising 491 pedestrians captured by two infrared cameras and four visible light cameras, is selected and input into the initial convolutional network model for the experiments.
In an alternative embodiment, the preprocessing operation includes resolution adjustment and data enhancement: one of a group of preset data enhancement operations is randomly selected each time, and a single image is processed with a randomly selected enhancement intensity.
In a specific implementation, the preset group of data enhancement operations optionally includes automatic image contrast optimization, image flipping, image color adjustment, image cropping, image brightness adjustment, and image sharpness enhancement; the value range of the enhancement intensity is optionally preset to 0-30.
In one non-limiting example, the preprocessing operation is performed on the images in the data set SYSU-MM01: the resolution of all images is adjusted to 288 x 144, and each image is processed with a randomly selected data enhancement mode at a random enhancement intensity. After preprocessing, the enhanced data can be regarded as each image having been enhanced by every data enhancement operation with the results then uniformly sampled, which improves the robustness of the model.
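As a concrete illustration, the following is a minimal sketch of this augmentation scheme, assuming a PIL-based pipeline: one operation is drawn at random from a preset pool and applied at a random intensity in [0, 30]. The operation pool, the mapping from intensity to each enhancement factor, and the function names are illustrative assumptions, not the patent's reference code.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# Candidate operations; each takes a PIL image and a strength s in [0, 30].
# The mapping from s to each enhancement factor is an assumption.
def autocontrast(img, s):
    return ImageOps.autocontrast(img)           # strength not used here

def flip(img, s):
    return img.transpose(Image.FLIP_LEFT_RIGHT)

def color(img, s):
    return ImageEnhance.Color(img).enhance(1 + s / 30.0)

def brightness(img, s):
    return ImageEnhance.Brightness(img).enhance(1 + s / 30.0)

def sharpness(img, s):
    return ImageEnhance.Sharpness(img).enhance(1 + s / 30.0)

def crop(img, s):
    w, h = img.size
    m = int(s)                                   # crop margin in pixels
    return img.crop((m, m, w - m, h - m)).resize((w, h))

OPS = [autocontrast, flip, color, crop, brightness, sharpness]

def preprocess(img: Image.Image) -> Image.Image:
    img = img.resize((144, 288))                 # width x height = 144 x 288
    op = random.choice(OPS)                      # one operation per image
    strength = random.uniform(0, 30)             # random enhancement intensity
    return op(img, strength)
```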
In an optional embodiment, the parameters of the plurality of feature extraction block layers in step S3 are not shared, and each feature extraction block layer comprises a convolution layer, a batch normalization layer, a nonlinear activation function layer and a maximum pooling layer.
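For concreteness, a sketch of the two parameter-independent feature extraction block layers, one per modality, assuming each mirrors the standard ResNet50 stem; the kernel and channel sizes below are that stem's defaults, not values stated in the patent:

```python
import torch
import torch.nn as nn

def make_stem() -> nn.Sequential:
    # Feature extraction block layer: convolution, batch normalization,
    # nonlinear activation, maximum pooling.
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    )

# Parameters are not shared: one independent stem per modality.
stem_rgb, stem_ir = make_stem(), make_stem()

x_rgb = torch.randn(8, 3, 288, 144)            # visible-light batch
x_ir = torch.randn(8, 3, 288, 144)             # infrared batch
F_rgb, F_ir = stem_rgb(x_rgb), stem_ir(x_ir)   # modality-specific features
```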
Example 2
This embodiment provides a cross-modal pedestrian re-identification method. Fig. 2 shows the overall framework of the initial convolutional network model of this embodiment, where L_id denotes the cross-entropy loss and L_wrt denotes the weighted regularized triplet loss.
This embodiment builds on the cross-modal pedestrian re-identification method provided in Example 1 and further provides that, in the initial convolutional network model:
the first residual block layer 0 The residual block comprises 3 residual blocks with the same structure, wherein each residual block comprises a first convolution unit, a second convolution unit and a third convolution unit which are sequentially connected, and each first convolution unit comprises a convolution layer, a batch normalization layer, an example normalization layer and a nonlinear activation functionA layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit comprises a convolution layer and a batch normalization layer;
the second residual block layer 1 The residual block comprises 4 residual blocks with the same structure, wherein the residual block comprises a first convolution unit, a second convolution unit and a third convolution unit which are sequentially connected, and the first convolution unit comprises a convolution layer, a batch normalization layer, an example normalization layer and a nonlinear activation function layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit comprises a convolution layer and a batch normalization layer;
the third residual block layer 2 The residual block comprises 6 residual blocks with the same structure, wherein the residual block comprises a first convolution unit, a second convolution unit and a third convolution unit which are sequentially connected, and the first convolution unit comprises a convolution layer, a batch normalization layer, an example normalization layer and a nonlinear activation function layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit comprises a convolution layer and a batch normalization layer;
the fourth residual block layer 3 The residual block comprises 3 residual blocks with the same structure, wherein each residual block comprises a fourth convolution unit, a second convolution unit and a third convolution unit which are sequentially connected, and each fourth convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit includes a convolution layer and a batch normalization layer.
Fig. 3 shows the residual block models of this embodiment: fig. 3(a) is the residual block without an instance normalization layer, and fig. 3(b) is the residual block after the instance normalization layer is introduced. In this embodiment, introducing an instance normalization layer into the residual blocks of the initial convolutional network model makes the model less susceptible to pedestrian appearance changes and, compared with the prior art, reduces the modal difference between different modal images of the same pedestrian. In addition, unlike the prior art, the residual blocks in the fourth residual block layer of this embodiment omit the downsampling operation, which enlarges the output feature map.
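A sketch of the residual block of fig. 3(b) in PyTorch. The bottleneck channel widths, the shortcut projection, and the placement of the final activation follow standard ResNet50 practice, which the patent does not spell out, so treat them as assumptions:

```python
import torch
import torch.nn as nn

class INBottleneck(nn.Module):
    """Bottleneck residual block of fig. 3(b): the first convolution unit
    feeds a batch-norm branch and an instance-norm branch in parallel,
    and their outputs are added before the activation."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1, use_in=True):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.in1 = nn.InstanceNorm2d(mid_ch, affine=True) if use_in else None
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))

    def forward(self, x):
        # First convolution unit: conv -> (BN + IN, summed) -> ReLU.
        # With use_in=False this degenerates to the 'fourth convolution
        # unit' (conv -> BN -> ReLU) used in the fourth residual block layer.
        y = self.conv1(x)
        y = self.bn1(y) if self.in1 is None else self.bn1(y) + self.in1(y)
        y = self.relu(y)
        # Second convolution unit: conv -> BN -> ReLU.
        y = self.relu(self.bn2(self.conv2(y)))
        # Third convolution unit: conv -> BN, then residual addition.
        y = self.bn3(self.conv3(y))
        return self.relu(y + self.shortcut(x))
```

Stacking 3, 4, 6 and 3 such blocks yields layer_0 through layer_3, with use_in=False in layer_3.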
In an optional embodiment, in step S3, the optimization training of the initial convolutional network model comprises the following steps:
S3.1, dividing the modal image set by pedestrian identity, in a preset proportion, into a training set for training the model and a test set for evaluating model performance; inputting the image feature matrices in the training set into the feature extraction block layers of the initial convolutional network model, and extracting the image features of each modality separately by modality category; the number of feature extraction block layers is set to match the number of modality categories, the visible light image feature corresponding to the visible light modality is recorded as F_rgb, and the infrared image feature corresponding to the infrared modality is recorded as F_ir;
S3.2, splicing the image features corresponding to the same pedestrian identity, then passing them sequentially through the parameter-shared first residual block layer_0, second residual block layer_1, third residual block layer_2 and fourth residual block layer_3, and extracting the trunk feature F_4 and a hierarchical fusion feature;
the spliced image features are input into the first convolution unit of the first residual block layer_0; after convolution they enter the batch normalization layer and the instance normalization layer in parallel, the outputs of the two normalization layers are added and passed to the nonlinear activation function layer, and its output passes sequentially through the second convolution unit and the third convolution unit to output a first feature map F_1;
the first feature map F_1 is input into the second residual block layer_1 to obtain a second feature map F_2;
the second feature map F_2 is input into the third residual block layer_2 to obtain a third feature map F_3;
the third feature map F_3 is input into the fourth residual block layer_3 to obtain the trunk feature F_4;
the first feature map F_1, the second feature map F_2 and the third feature map F_3 are organically fused to construct the hierarchical fusion feature, as sketched below;
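Putting S3.2 together, a compact sketch of the forward pass, continuing the earlier sketches: stem_rgb/stem_ir are the modality-specific stems from Example 1, and the four residual block layers (3/4/6/3 blocks) are assembled from INBottleneck blocks above. The channel widths are ResNet50 defaults, assumed rather than stated in the patent:

```python
import torch
import torch.nn as nn

def make_layer(blocks, in_ch, mid_ch, out_ch, stride=1, use_in=True):
    # First block changes channels/resolution; the rest keep them.
    layers = [INBottleneck(in_ch, mid_ch, out_ch, stride, use_in)]
    layers += [INBottleneck(out_ch, mid_ch, out_ch, 1, use_in)
               for _ in range(blocks - 1)]
    return nn.Sequential(*layers)

layer0 = make_layer(3, 64, 64, 256)                    # 3 blocks
layer1 = make_layer(4, 256, 128, 512, stride=2)        # 4 blocks
layer2 = make_layer(6, 512, 256, 1024, stride=2)       # 6 blocks
layer3 = make_layer(3, 1024, 512, 2048, use_in=False)  # 3 blocks, no downsampling

x = torch.cat([F_rgb, F_ir], dim=0)   # splice the two modality batches

F1 = layer0(x)    # first feature map F_1
F2 = layer1(F1)   # second feature map F_2
F3 = layer2(F2)   # third feature map F_3
F4 = layer3(F3)   # trunk feature F_4 (same resolution as F_3)
```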
S3.3, stretching the trunk feature F_4 and the hierarchical fusion feature into vectors, training each with a cross entropy loss function, splicing the vectors, and optimizing the initial convolutional network model by training with a weighted regularized triplet loss function to obtain the trained convolutional network model;
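The patent does not reproduce the weighted regularized triplet loss formula; the sketch below assumes the common formulation from the AGW baseline, where the distances of each anchor to all its positives and negatives are combined with softmax/softmin weights and no margin hyperparameter is needed:

```python
import torch
import torch.nn.functional as F

def weighted_regularized_triplet(feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """One plausible reading of the patent's 'weighted regularized triplet
    loss': log(1 + exp(d_p - d_n)) with softmax-weighted distances."""
    dist = torch.cdist(feats, feats)                   # pairwise Euclidean distances
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-identity mask
    neg = ~pos
    pos = pos.float() - torch.eye(len(labels), device=feats.device)  # drop self-pairs

    # Far positives and close negatives get large weights.
    w_p = torch.softmax(dist.masked_fill(pos == 0, float('-inf')), dim=1)
    w_n = torch.softmax((-dist).masked_fill(neg == 0, float('-inf')), dim=1)

    d_p = (w_p * dist).nan_to_num(0).sum(dim=1)        # weighted positive distance
    d_n = (w_n * dist).nan_to_num(0).sum(dim=1)        # weighted negative distance
    return F.softplus(d_p - d_n).mean()                # log(1 + exp(d_p - d_n))

# Total objective per the text: a cross-entropy (identity) loss on each
# branch plus the weighted regularized triplet loss on the spliced vectors.
```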
S3.4, testing the model performance with the test set: when the performance of the trained convolutional network model reaches a preset judgment condition, the trained convolutional network model is output; when it cannot reach the preset judgment condition, the training set and the test set are re-divided and steps S1 to S3 are repeated.
In this optional embodiment, the organic fusion of hierarchical features can be regarded as an attention mechanism: only features related to pedestrian identity are selected for fusion, avoiding the negative optimization that would result from fusing too many unnecessary features.
Further, in step S3.2, the construction of the hierarchical fusion feature comprises the following steps:
S3.2.1, setting a first weight matrix α, a second weight matrix β and a third weight matrix γ whose sizes correspond to those of the first feature map F_1, the second feature map F_2 and the third feature map F_3, respectively;
S3.2.2, multiplying the first feature map F_1, the second feature map F_2 and the third feature map F_3 point by point by the first weight matrix α, the second weight matrix β and the third weight matrix γ, respectively; the parameters at each position of the first weight matrix α, the second weight matrix β and the third weight matrix γ are obtained through training;
S3.2.3, adding the three multiplied features to obtain the hierarchical fusion feature.
In this further embodiment, each weight matrix is multiplied point by point with its corresponding feature map to determine which pixels of the feature map are emphasized or suppressed, and the multiplied features are added to obtain the hierarchical fusion feature, i.e. αF_1 + βF_2 + γF_3, which contains more discriminative identity-related features. Fusing the hierarchical fusion feature with the trunk feature, i.e. F_4 + αF_1 + βF_2 + γF_3, minimizes the elimination of useful pedestrian-related features during feature extraction and improves the utilization of image features.
In a non-limiting example, the training of the parameter values at each position of the first weight matrix α, the second weight matrix β and the third weight matrix γ proceeds as follows: the gradient of each parameter is obtained by back-propagation, and the parameters are then updated with a gradient descent algorithm.
In a specific implementation using the images of the data set SYSU-MM01 as input for the optimization training, in step S3.2.3, before fusing the first feature map F_1, the second feature map F_2 and the third feature map F_3, a down-sampling operation is applied to each so that the resolutions of F_1 and F_2 match that of F_3, which reduces the number of parameters; the channel counts of F_1, F_2 and F_3 are reduced to 256, without limitation.
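A sketch of the hierarchical fusion under the dimensions implied above, with channels projected to 256 and F_1, F_2 pooled down to F_3's resolution. The 1x1-convolution channel projection and the average-pooling downsample are assumptions; the patent only states that a downsampling operation and a channel change to a preset value are applied:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalFusion(nn.Module):
    """Learnable element-wise weight matrices alpha, beta, gamma, one per
    hierarchical feature map. The default shape (256 channels, 18x9
    spatial, matching a 288x144 input) is an illustrative assumption."""
    def __init__(self, shape=(256, 18, 9)):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.ones(shape))
        self.gamma = nn.Parameter(torch.ones(shape))
        # 1x1 convolutions bring each map to the preset channel count (256).
        self.proj1 = nn.Conv2d(256, 256, 1)
        self.proj2 = nn.Conv2d(512, 256, 1)
        self.proj3 = nn.Conv2d(1024, 256, 1)

    def forward(self, F1, F2, F3):
        size = tuple(F3.shape[-2:])                       # target: F3's resolution
        F1 = F.adaptive_avg_pool2d(self.proj1(F1), size)  # downsample F1
        F2 = F.adaptive_avg_pool2d(self.proj2(F2), size)  # downsample F2
        F3 = self.proj3(F3)
        # Point-wise multiplication by the trained weights, then addition:
        # alpha*F1 + beta*F2 + gamma*F3.
        return self.alpha * F1 + self.beta * F2 + self.gamma * F3
```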
Fig. 4 is a schematic diagram of the non-local attention mechanism block structure. Further, in step S3.2, a non-local attention mechanism block is introduced after each of the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2; it computes the interaction between any two positions in the feature map, directly capturing long-range dependencies. The non-local attention mechanism block ("non-local" for short) lets the model raise the weight of features useful for identifying pedestrians without being limited to neighboring points; it is equivalent to a convolution kernel as large as the feature map itself, enlarging the receptive field while retaining more information.
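A sketch of the block in fig. 4, assuming the embedded-Gaussian form of the original non-local network (Wang et al., 2018); the patent's figure may differ in detail:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Non-local block: the response at each position is a weighted sum
    over all positions, capturing long-range dependencies beyond the
    convolutional receptive field."""
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # B x HW x C'
        k = self.phi(x).flatten(2)                     # B x C' x HW
        v = self.g(x).flatten(2).transpose(1, 2)       # B x HW x C'
        # Pairwise interaction between any two positions in the feature map.
        attn = torch.softmax(q @ k, dim=-1)            # B x HW x HW
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection
```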
Further, in step S3.4, the preset judgment condition is set based on the cumulative matching characteristic (CMC) and the mean average precision (mAP).
The cumulative matching characteristic CMC evaluates the accuracy of similarity ranking in a closed set: for each query pedestrian to be re-identified in the candidate library, it measures the proportion of queries whose first n retrieval results contain a correct match.
In a specific implementation, the model is trained for 150 epochs and evaluated on the test set every 2 epochs, i.e. the cumulative matching characteristic CMC and the mean average precision mAP of the test set are computed, and the model with the best test-set CMC is selected.
In a non-limiting example, for a given pedestrian image to be retrieved in the test set, its features are extracted by the model, and the candidate-library images are sorted by increasing distance between the query's features and each candidate's features. Rank-n denotes the proportion of queries whose n most-confident retrieval results contain a correct image, and is calculated as:
Rank-n = M / G
where M represents the number of queries whose first n recognition results contain a correct image, and G denotes the number of pedestrian images to be queried.
In one non-limiting example, the mean average precision mAP is expressed as:
mAP = (1/G) · Σ_{i=1}^{G} AP_i
where G represents the number of queries and AP represents the average precision, obtained by averaging the precision over a query's correct matches.
The precision and average precision expressions are as follows:
Precision = TP / (TP + FP)
AP = (1/m) · Σ_{k=1}^{m} Precision(k)
where TP represents the number of correctly predicted positive samples, FP represents the number of incorrectly predicted positive samples, and m represents the number of images in the candidate library that share the query image's label.
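To make the evaluation concrete, a sketch computing Rank-n and mAP from a query-by-gallery distance matrix; it omits the camera-based filtering of the full SYSU-MM01 protocol:

```python
import numpy as np

def evaluate(dist: np.ndarray, q_ids: np.ndarray, g_ids: np.ndarray, n: int = 10):
    """Rank-n (one CMC point) and mAP from a Q x G distance matrix."""
    order = np.argsort(dist, axis=1)            # gallery sorted near -> far
    matches = g_ids[order] == q_ids[:, None]    # True where identity matches
    G = len(q_ids)                              # number of query images

    # Rank-n = M / G: fraction of queries with a correct match in the top n.
    M = matches[:, :n].any(axis=1).sum()
    rank_n = M / G

    # mAP = mean over queries of AP, the average precision at each hit.
    aps = []
    for row in matches:
        hits = np.where(row)[0]                 # ranks of correct matches
        if hits.size == 0:
            continue
        precision = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precision.mean())
    return rank_n, float(np.mean(aps))
```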
Example 3
This embodiment provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program causes the processor to execute the cross-modal pedestrian re-identification method proposed in Example 1 or Example 2 above.
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the terms "first," "second," and the like as used in the description and in the claims, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Also, the use of the terms "a" or "an" and the like do not denote a limitation of quantity, but rather denote the presence of at least one.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A cross-modal pedestrian re-identification method, characterized by comprising the following steps:
S1, acquiring multiple images of pedestrians in different modalities to form a modal image set; wherein the modal image set comprises visible light images and infrared images corresponding to pedestrian identities;
S2, preprocessing the modal image set to obtain image feature matrices classified by modality; wherein the image feature matrices comprise a visible light image feature matrix f_rgb and an infrared image feature matrix f_ir;
S3, adopting ResNet50 as the initial convolutional network model, wherein the initial convolutional network model comprises a plurality of feature extraction block layers and a first residual block layer_0, a second residual block layer_1, a third residual block layer_2 and a fourth residual block layer_3; inputting the image feature matrices into the initial convolutional network model and optimizing it by training to obtain a trained convolutional network model;
wherein the hierarchical features extracted by the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2 are synchronously used for feature fusion compensation;
and S4, acquiring an image of the pedestrian to be recognized, inputting it into the trained convolutional network model, and outputting the pedestrian re-recognition result.
2. The cross-modal pedestrian re-identification method according to claim 1, wherein in step S2 the preprocessing operation includes resolution adjustment and data enhancement; for data enhancement, one of a group of preset data enhancement operations is randomly selected each time, and a single image is processed with a randomly selected enhancement intensity.
3. The cross-modal pedestrian re-identification method according to claim 1, wherein the parameters of the plurality of feature extraction block layers in step S3 are not shared, and each feature extraction block layer comprises a convolution layer, a batch normalization layer, a nonlinear activation function layer and a maximum pooling layer.
4. The cross-modal pedestrian re-identification method according to claim 1, wherein in the initial convolutional network model, the first residual block layer_0 comprises 3 residual blocks with the same structure, where each residual block comprises a first convolution unit, a second convolution unit and a third convolution unit connected in sequence; the first convolution unit comprises a convolution layer, a batch normalization layer, an instance normalization layer and a nonlinear activation function layer; the second convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer; the third convolution unit comprises a convolution layer and a batch normalization layer;
the second residual block layer_1 comprises 4 residual blocks with the same structure, each comprising a first convolution unit, a second convolution unit and a third convolution unit connected in sequence;
the third residual block layer_2 comprises 6 residual blocks with the same structure, each comprising a first convolution unit, a second convolution unit and a third convolution unit connected in sequence;
the fourth residual block layer_3 comprises 3 residual blocks with the same structure, where each residual block comprises a fourth convolution unit, a second convolution unit and a third convolution unit connected in sequence, and the fourth convolution unit comprises a convolution layer, a batch normalization layer and a nonlinear activation function layer.
5. The cross-modal pedestrian re-identification method according to claim 4, wherein in step S3, the optimization training of the initial convolutional network model comprises the following steps:
S3.1, dividing the modal image set by pedestrian identity, in a preset proportion, into a training set for training the model and a test set for evaluating model performance; inputting the image feature matrices in the training set into the feature extraction block layers of the initial convolutional network model, and extracting the image features of each modality separately by modality category; wherein the number of feature extraction block layers is set to match the number of modality categories, the visible light image feature corresponding to the visible light modality is recorded as F_rgb, and the infrared image feature corresponding to the infrared modality is recorded as F_ir;
S3.2, splicing the image features corresponding to the same pedestrian identity, then passing them sequentially through the parameter-shared first residual block layer_0, second residual block layer_1, third residual block layer_2 and fourth residual block layer_3, and extracting the trunk feature F_4 and a hierarchical fusion feature;
the spliced image features are input into the first convolution unit of the first residual block layer_0; after convolution they enter the batch normalization layer and the instance normalization layer in parallel, the outputs of the two normalization layers are added and passed to the nonlinear activation function layer, and its output passes sequentially through the second convolution unit and the third convolution unit to output a first feature map F_1;
the first feature map F_1 is input into the second residual block layer_1 to obtain a second feature map F_2;
the second feature map F_2 is input into the third residual block layer_2 to obtain a third feature map F_3;
the third feature map F_3 is input into the fourth residual block layer_3 to obtain the trunk feature F_4;
the first feature map F_1, the second feature map F_2 and the third feature map F_3 are organically fused to construct the hierarchical fusion feature;
S3.3, stretching the trunk feature F_4 and the hierarchical fusion feature into vectors, training each with a cross entropy loss function, splicing the vectors, and optimizing the initial convolutional network model by training with a weighted regularized triplet loss function to obtain the trained convolutional network model;
S3.4, testing the model performance with the test set: when the performance of the trained convolutional network model reaches a preset judgment condition, outputting the trained convolutional network model; when it cannot reach the preset judgment condition, re-dividing the training set and the test set and repeating steps S1 to S3.
6. The cross-modal pedestrian re-identification method according to claim 5, wherein in step S3.2, the construction of the hierarchical fusion feature comprises the following steps:
S3.2.1, setting a first weight matrix α, a second weight matrix β and a third weight matrix γ whose sizes correspond to those of the first feature map F_1, the second feature map F_2 and the third feature map F_3, respectively;
S3.2.2, multiplying the first feature map F_1, the second feature map F_2 and the third feature map F_3 point by point by the first weight matrix α, the second weight matrix β and the third weight matrix γ, respectively; the parameters at each position of the first weight matrix α, the second weight matrix β and the third weight matrix γ are obtained through training;
S3.2.3, adding the three multiplied features to obtain the hierarchical fusion feature.
7. The cross-modal pedestrian re-identification method according to claim 6, wherein in step S3.2.3, before fusing the first feature map F_1, the second feature map F_2 and the third feature map F_3, a down-sampling operation is performed on each of them so that the resolutions of the first feature map F_1 and the second feature map F_2 match that of the third feature map F_3, and the channel counts of the three feature maps are increased or decreased to a preset value.
8. The cross-modal pedestrian re-identification method according to claim 5, wherein in step S3.2, a non-local attention mechanism block is introduced after each of the first residual block layer_0, the second residual block layer_1 and the third residual block layer_2, which computes the interaction between any two positions in the feature map, directly capturing long-range dependencies.
9. The cross-modal pedestrian re-identification method according to claim 5, wherein in step S3.4 the preset judgment condition is set based on the cumulative matching characteristic (CMC) and the mean average precision (mAP).
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the cross-modal pedestrian re-identification method according to any one of claims 1 to 9.
CN202211110307.8A 2022-09-13 2022-09-13 Cross-modal pedestrian re-identification method and computer readable storage medium Pending CN115393901A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211110307.8A CN115393901A (en) 2022-09-13 2022-09-13 Cross-modal pedestrian re-identification method and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211110307.8A CN115393901A (en) 2022-09-13 2022-09-13 Cross-modal pedestrian re-identification method and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115393901A true CN115393901A (en) 2022-11-25

Family

ID=84125698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211110307.8A Pending CN115393901A (en) 2022-09-13 2022-09-13 Cross-modal pedestrian re-identification method and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115393901A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117994822A (en) * 2024-04-07 2024-05-07 南京信息工程大学 Cross-mode pedestrian re-identification method based on auxiliary mode enhancement and multi-scale feature fusion

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN111539370B (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN108399362B (en) Rapid pedestrian detection method and device
CN108052911B (en) Deep learning-based multi-mode remote sensing image high-level feature fusion classification method
CN111767882A (en) Multi-mode pedestrian detection method based on improved YOLO model
CN111178183B (en) Face detection method and related device
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN113077491B (en) RGBT target tracking method based on cross-modal sharing and specific representation form
CN112767645B (en) Smoke identification method and device and electronic equipment
CN112507853B (en) Cross-modal pedestrian re-recognition method based on mutual attention mechanism
CN114972976B (en) Night target detection and training method and device based on frequency domain self-attention mechanism
CN116311254B (en) Image target detection method, system and equipment under severe weather condition
CN115311186B (en) Cross-scale attention confrontation fusion method and terminal for infrared and visible light images
CN115131640A (en) Target detection method and system utilizing illumination guide and attention mechanism
Malav et al. DHSGAN: An end to end dehazing network for fog and smoke
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN115393901A (en) Cross-modal pedestrian re-identification method and computer readable storage medium
CN114170422A (en) Coal mine underground image semantic segmentation method
Zhang et al. Deep joint neural model for single image haze removal and color correction
CN111178370B (en) Vehicle searching method and related device
CN116542865A (en) Multi-scale real-time defogging method and device based on structural re-parameterization
CN112633089B (en) Video pedestrian re-identification method, intelligent terminal and storage medium
CN116958615A (en) Picture identification method, device, equipment and medium
CN114581769A (en) Method for identifying houses under construction based on unsupervised clustering
CN113642353B (en) Training method of face detection model, storage medium and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination