CN114882525A - Cross-modal pedestrian re-identification method based on modal specific memory network - Google Patents

Cross-modal pedestrian re-identification method based on modal specific memory network

Info

Publication number
CN114882525A
Authority
CN
China
Prior art keywords
infrared
visible light
modal
pedestrian
reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210426984.4A
Other languages
Chinese (zh)
Other versions
CN114882525B (en)
Inventor
张天柱
刘翔
张勇东
李昱霖
吴枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210426984.4A priority Critical patent/CN114882525B/en
Publication of CN114882525A publication Critical patent/CN114882525A/en
Application granted granted Critical
Publication of CN114882525B publication Critical patent/CN114882525B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06F18/253: Pattern recognition; fusion techniques of extracted features
    • G06N3/045: Neural networks; combinations of networks

Abstract

The invention provides a cross-modal pedestrian re-identification method based on a modality-specific memory network, comprising: acquiring a pedestrian image to be re-identified and a re-identification type; and, according to the re-identification type, processing the pedestrian image to be re-identified with a cross-modal pedestrian re-identification model based on a modality-specific memory network to obtain a re-identification result. The invention also provides an electronic device, a storage medium, and a computer program product for implementing the cross-modal pedestrian re-identification method based on the modality-specific memory network.

Description

Cross-modal pedestrian re-identification method based on modal specific memory network
Technical Field
The invention relates to the field of computer vision, in particular to a cross-modal pedestrian re-identification method, a re-identification device, electronic equipment and a storage medium based on a modal specific memory network.
Background
Pedestrian re-identification is a technique for matching pedestrian images across different camera views. Combined with pedestrian detection and pedestrian tracking, it is widely applied in video surveillance, intelligent security, criminal investigation, and the like.
However, existing pedestrian re-identification methods either cannot fully utilize cross-modal pedestrian information for identification, or, when they do operate across modalities, suffer from problems such as low identification accuracy and poor identification performance.
Disclosure of Invention
In view of the foregoing, the present invention provides a cross-modal model training method based on a modality-specific memory network, together with an electronic device, a storage medium, and a computer program product, so as to solve at least one of the above problems.
According to an embodiment of the present invention, a cross-modal pedestrian re-identification method based on a modal specific memory network is provided, including:
acquiring a pedestrian image to be re-identified and a re-identification type;
according to the re-recognition type, processing the pedestrian image to be re-recognized by using a cross-modal model based on a modal specific memory network to obtain a re-recognition result, wherein the cross-modal model based on the modal specific memory network is obtained by training in the following method:
respectively processing a visible light image and an infrared image of a pedestrian by using a feature extraction module to obtain a visible light image feature map and an infrared image feature map;
carrying out average pooling on each segmentation part in the visible light image feature map to obtain visible light features, and carrying out average pooling on each segmentation part in the infrared image feature map to obtain infrared features;
reconstructing visible light characteristics and infrared characteristics of the pedestrian by using a mode specific memory network module to obtain visible light reconstruction characteristics and infrared reconstruction characteristics of the pedestrian, wherein the mode specific memory network module is used for storing and transmitting the visible light reconstruction characteristics and the infrared reconstruction characteristics of the pedestrian;
processing visible light characteristics, infrared characteristics, visible light reconstruction characteristics and infrared reconstruction characteristics of the pedestrians by using the unified characteristic alignment module to obtain multi-modal unified characteristics of the pedestrians, wherein the multi-modal unified characteristics comprise the visible light unified characteristics and the infrared unified characteristics;
and optimizing a cross-modal model according to a preset loss function by utilizing the visible light characteristic, the infrared characteristic, the visible light reconstruction characteristic, the infrared reconstruction characteristic and the multi-modal unified representation of the pedestrian until the value of the preset loss function meets a preset condition, and obtaining the trained cross-modal model based on the modal specific memory network.
According to an embodiment of the present invention, the reconstructing the visible light feature and the infrared feature of the pedestrian by using the mode specific memory network module to obtain the visible light reconstruction feature and the infrared reconstruction feature of the pedestrian includes:
respectively processing the visible light characteristics and the infrared characteristics by using a modal specific memory network to obtain visible light memory items and infrared memory items;
calculating the cosine similarity of the visible light characteristics and the visible light memory term to obtain the cosine similarity of the visible light;
carrying out normalization processing on the visible light cosine similarity to obtain a visible light normalization vector;
acquiring infrared reconstruction characteristics according to the infrared memory term and the visible light normalization vector;
calculating the cosine similarity of the infrared features and the infrared memory items to obtain the infrared cosine similarity;
carrying out normalization processing on the infrared cosine similarity to obtain an infrared normalized vector;
and obtaining visible light reconstruction characteristics according to the visible light memory term and the infrared normalized vector.
According to an embodiment of the present invention, the above-mentioned visible light cosine similarity is determined by formula (1):

$d(f_k^V, m_{k,n}^V) = \frac{(f_k^V)^\top m_{k,n}^V}{\|f_k^V\|\,\|m_{k,n}^V\|} \quad (1)$

wherein $f_k^V$ denotes the visible light feature and $m_{k,n}^V$ denotes a visible light memory item;

wherein the infrared reconstruction feature is determined by formula (2):

$\hat{f}_k^I = \sum_{n=1}^{N} a_{k,n}^V\, m_{k,n}^I \quad (2)$

wherein $m_{k,n}^I$ denotes an infrared memory item and $a_{k,n}^V$ denotes the n-th entry of the N-dimensional visible light normalization vector, determined by formula (3):

$a_{k,n}^V = \frac{\exp\big(d(f_k^V, m_{k,n}^V)/\tau\big)}{\sum_{n'=1}^{N} \exp\big(d(f_k^V, m_{k,n'}^V)/\tau\big)} \quad (3)$

where τ denotes the visible light temperature coefficient.
According to an embodiment of the present invention, the infrared cosine similarity is determined by formula (4):

$d(f_k^I, m_{k,n}^I) = \frac{(f_k^I)^\top m_{k,n}^I}{\|f_k^I\|\,\|m_{k,n}^I\|} \quad (4)$

wherein $f_k^I$ denotes the infrared feature and $m_{k,n}^I$ denotes an infrared memory item;

wherein the visible light reconstruction feature is determined by formula (5):

$\hat{f}_k^V = \sum_{n=1}^{N} a_{k,n}^I\, m_{k,n}^V \quad (5)$

wherein $m_{k,n}^V$ denotes a visible light memory item and $a_{k,n}^I$ denotes the n-th entry of the N-dimensional infrared normalization vector, determined by formula (6):

$a_{k,n}^I = \frac{\exp\big(d(f_k^I, m_{k,n}^I)/\tau\big)}{\sum_{n'=1}^{N} \exp\big(d(f_k^I, m_{k,n'}^I)/\tau\big)} \quad (6)$

where τ denotes the infrared temperature coefficient.
According to the embodiment of the present invention, the processing of the visible light feature, the infrared feature, the visible light reconstruction feature and the infrared reconstruction feature of the pedestrian by using the unified feature alignment module to obtain the multi-modal unified characterization of the pedestrian includes:
fusing the visible light characteristic and the infrared reconstruction characteristic by using a unified characteristic alignment module to obtain a visible light unified characteristic;
and fusing the infrared characteristic and the visible light reconstruction characteristic by using the unified characteristic alignment module to obtain an infrared unified characterization.
According to an embodiment of the present invention, the preset loss function is determined by formula (7):

$\mathcal{L} = \mathcal{L}_{uni} + \mathcal{L}_{feat} + \mathcal{L}_{ctri} + \mathcal{L}_{rc} + \lambda_{align}\,\mathcal{L}_{align} + \lambda_{dis}\,\mathcal{L}_{dis} + \lambda_{rec}\,\mathcal{L}_{rec} \quad (7)$

wherein $\mathcal{L}_{uni}$ denotes the modal unified characterization classification loss function, $\mathcal{L}_{feat}$ denotes the modal feature classification loss function, $\mathcal{L}_{ctri}$ denotes the central triplet loss function, $\mathcal{L}_{rc}$ denotes the reconstruction consistency loss function, $\mathcal{L}_{align}$ denotes the modality-specific memory term loss function, $\mathcal{L}_{dis}$ denotes the modality-specific memory term discriminant loss function, $\mathcal{L}_{rec}$ denotes the reconstruction loss function, and $\lambda_{align}$, $\lambda_{dis}$ and $\lambda_{rec}$ denote the weighting coefficients of the modality-specific memory term loss function, the modality-specific memory term discriminant loss function and the reconstruction loss function, respectively.
According to an embodiment of the present invention, the modal unified characterization classification loss function is determined by formula (8), a cross-entropy identity classification loss $\mathcal{L}_{ce}$ over the unified characterizations:

$\mathcal{L}_{uni} = \mathcal{L}_{ce}(u^V, y^V) + \mathcal{L}_{ce}(u^I, y^I) \quad (8)$

wherein the modal feature classification loss function is determined by formula (9):

$\mathcal{L}_{feat} = \mathcal{L}_{ce}(f^V, y^V) + \mathcal{L}_{ce}(f^I, y^I) \quad (9)$

wherein the reconstruction consistency loss function is determined by formula (10), with two modality discriminators $D^V$ and $D^I$ classifying the reconstructed modal features and y denoting the pedestrian identity label of the corresponding input image:

$\mathcal{L}_{rc} = \mathcal{L}_{ce}(D^V(\hat{f}^V), y) + \mathcal{L}_{ce}(D^I(\hat{f}^I), y) \quad (10)$

wherein the reconstruction loss function is determined by formula (11), where $\tilde{f}^{*} = \sum_{n=1}^{N} a_{k,n}^{*}\, m_{k,n}^{*}$ denotes the input feature reconstructed from the memory items of its own modality:

$\mathcal{L}_{rec} = \|f^V - \tilde{f}^V\|_2 + \|f^I - \tilde{f}^I\|_2 \quad (11)$

wherein the modality-specific memory term loss function is determined by formula (12):

$\mathcal{L}_{align} = D_{KL}(A^V \| A^I) + D_{KL}(A^I \| A^V) \quad (12)$

wherein the modality-specific memory term discriminant loss function $\mathcal{L}_{dis}$ is determined by formula (13), which appears only as an image in the original publication and serves to make the multi-modal memory items distinguishable;

wherein $y^V$ denotes the visible light image label of the pedestrian, $y^I$ denotes the infrared image label of the pedestrian, $f^V$ denotes the visible light feature, $f^I$ denotes the infrared feature, $\hat{f}^V$ denotes the visible light reconstruction feature, $\hat{f}^I$ denotes the infrared reconstruction feature, $* \in \{V, I\}$ indexes the visible light or infrared modality, $m^{*}$ denotes a memory item, $A^V$ denotes the visible light normalization vector, and $A^I$ denotes the infrared normalization vector.
According to an embodiment of the present invention, there is provided an electronic apparatus including:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform a cross-modal pedestrian re-identification method based on a modal-specific memory network as described above.
According to an embodiment of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions, which when executed by a processor, cause the processor to execute the above-mentioned cross-modal pedestrian re-identification method based on a modal-specific memory network.
According to an embodiment of the present invention, there is provided a computer program product, which includes a computer program, and when the computer program is executed by a processor, the method for cross-modal pedestrian re-identification based on a modal-specific memory network is implemented.
In the cross-modal pedestrian re-identification method provided by the invention, the cross-modal features of a pedestrian are processed by a pre-trained cross-modal pedestrian re-identification model based on a modality-specific memory network, thereby establishing the correspondence between the pedestrian's visible light modal features and infrared modal features and achieving cross-modal pedestrian re-identification with high identification accuracy and good identification efficiency.
Drawings
FIG. 1 is a flowchart of a cross-modal pedestrian re-identification method based on a modal-specific memory network according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of training a cross-modal model based on a modal-specific memory network according to an embodiment of the invention;
FIG. 3 is a flow chart for obtaining multi-modal reconstruction features of a pedestrian according to an embodiment of the invention;
FIG. 4 is a flow diagram for obtaining a multi-modal unified characterization of a pedestrian according to an embodiment of the present invention;
FIG. 5 is a diagram of a training framework for a cross-modal model based on a modal-specific memory network, according to an embodiment of the present invention;
fig. 6 schematically shows a block diagram of an electronic device adapted for a cross-modal pedestrian re-identification method based on a modal-specific memory network and a training method of a cross-modal model based on a modal-specific memory network according to an embodiment of the invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
Existing pedestrian re-identification methods mainly focus on retrieval among visible light pedestrian images captured by ordinary cameras in daytime scenes, which can be regarded as a single-modality image matching problem. However, in environments with poor lighting conditions, such as at night, ordinary cameras can hardly capture effective appearance information of pedestrians. To overcome this limitation, some surveillance cameras switch freely between visible light and infrared modes as lighting conditions change. It is therefore necessary to design an effective model for pedestrian retrieval between visible light and infrared images, i.e., to solve the cross-modal pedestrian re-identification problem.
Current cross-modal pedestrian re-identification methods generally fall into two categories: modality-shared feature learning methods and modality information completion methods. Modality-shared feature learning methods attempt to embed images of different modalities into a shared feature space. However, since the visual appearance of visible light and infrared images differs greatly, directly embedding images of different modalities into a shared feature space remains difficult. Furthermore, because such methods treat modality-specific information, such as the color of visible light images, as redundant, the discriminativeness of their feature representations is limited. To address this, modality information completion methods have been proposed, which aim to complete the information of the missing modality from the information of the input modality. However, since such models take only a single modality as input, it is difficult for them to fill in the missing modality information and thus resolve the modality-difference problem.
In view of the above, the present application provides a cross-modal model training method based on a modality-specific memory network, a pedestrian re-identification method, and an electronic device. In the pedestrian re-identification method, a cross-modal model based on a modality-specific memory network is obtained through the training method, so that missing modality information is completed, the modality-difference problem in cross-modal pedestrian re-identification is alleviated, and it can be judged whether pedestrian images of different modalities belong to the same pedestrian.
In the technical solution of the present invention, the acquisition, storage, and application of the pedestrian information involved all comply with relevant laws and regulations, necessary confidentiality measures are taken, and public order and good customs are not violated.
Fig. 1 is a flowchart of a cross-modal pedestrian re-identification method based on a modal-specific memory network according to an embodiment of the present invention.
As shown in fig. 1, the pedestrian re-identification method includes operations S110 to S120.
In operation S110, a pedestrian image to be re-identified and a re-identification type are acquired.
In operation S120, according to the re-identification type, the pedestrian image to be re-identified is processed using the cross-modal model based on the modality-specific memory network to obtain a re-identification result.
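By way of illustration, the following is a minimal retrieval sketch in PyTorch (the method name unified_feature and the calling convention are assumptions for illustration; the patent does not prescribe an API). Given a query image of one modality, its unified representation is compared against the unified representations of gallery images of the other modality, and the gallery is ranked by cosine similarity:

import torch
import torch.nn.functional as F

def rank_gallery(model, query_img, gallery_imgs, query_is_infrared=True):
    """Rank gallery images of the opposite modality against one query image.

    `model` is assumed to expose unified_feature(img_batch, modality),
    returning the multi-modal unified representation of operation S120;
    this method name is illustrative, not taken from the patent.
    """
    model.eval()
    with torch.no_grad():
        q_mod = "infrared" if query_is_infrared else "visible"
        g_mod = "visible" if query_is_infrared else "infrared"
        q = model.unified_feature(query_img.unsqueeze(0), q_mod)        # (1, C)
        g = torch.cat([model.unified_feature(img.unsqueeze(0), g_mod)
                       for img in gallery_imgs], dim=0)                 # (G, C)
        sims = F.cosine_similarity(q, g)                                # (G,)
    return sims.argsort(descending=True)  # best-matching gallery indices first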
FIG. 2 is a flowchart of a training method for obtaining a cross-modal model based on a modal-specific memory network, according to an embodiment of the invention.
As shown in fig. 2, the method for training the cross-modal model based on the modal-specific memory network includes operations S210 to S250.
In operation S210, the visible light image and the infrared image of the pedestrian are respectively processed by the feature extraction module to obtain a visible light image feature map and an infrared image feature map.
The feature extraction module preferably employs a dual-stream convolutional neural network in which the first two convolutional blocks are modality-specific (e.g., dedicated to processing visible light) so as to capture modality-specific low-level features (low-level features have higher resolution and contain more location and detail information but, having passed through fewer convolutions, carry less semantics and more noise), while the parameters of the deeper convolutional blocks are shared by both modalities (visible light and infrared); a sketch of one possible realization follows.
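A sketch only: the patent does not fix the backbone, so the ResNet-50 and the exact modality-specific/shared split below are assumptions. The dual-stream network duplicates the shallow stages per modality and shares the deep stages:

import torch
import torch.nn as nn
from torchvision.models import resnet50

class DualStreamExtractor(nn.Module):
    """Two-stream CNN: modality-specific shallow blocks, shared deep blocks."""
    def __init__(self):
        super().__init__()
        def shallow():
            r = resnet50(weights="IMAGENET1K_V1")
            # stem and the first two residual stages are modality-specific
            return nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                 r.layer1, r.layer2)
        self.visible_stream = shallow()    # visible-specific low-level features
        self.infrared_stream = shallow()   # infrared-specific low-level features
        r = resnet50(weights="IMAGENET1K_V1")
        self.shared = nn.Sequential(r.layer3, r.layer4)  # shared deep blocks

    def forward(self, x, modality):
        stream = self.visible_stream if modality == "visible" else self.infrared_stream
        return self.shared(stream(x))  # feature map F^V or F^I, shape (B, C, H, W)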
In operation S220, each of the segments in the visible-light image feature map is averaged and pooled to obtain visible-light features, and each of the segments in the infrared image feature map is averaged and pooled to obtain infrared features.
In operation S230, the visible light feature and the infrared feature of the pedestrian are reconstructed by using the modality specific memory network module, so as to obtain the visible light reconstruction feature and the infrared reconstruction feature of the pedestrian.
The modality-specific memory network module is used to store prototype features of each modality (visible light or infrared); it is also used to store and transfer the visible light reconstruction features and infrared reconstruction features of the pedestrian.
In operation S240, the visible light feature, the infrared feature, the visible light reconstruction feature, and the infrared reconstruction feature of the pedestrian are processed by using the unified feature alignment module, so as to obtain a multi-modal unified representation of the pedestrian.
In operation S250, the cross-modal model is optimized according to a preset loss function using the visible light features, infrared features, visible light reconstruction features, infrared reconstruction features, and multi-modal unified representation of the pedestrian, until the value of the preset loss function satisfies a preset condition, yielding the trained cross-modal model based on the modality-specific memory network.
According to the cross-modal model training method based on the modality-specific memory network, the visible light and infrared images of a pedestrian are processed to obtain visible light and infrared image features; these features are reconstructed with the modality-specific memory network to obtain the pedestrian's visible light and infrared reconstruction features; the unified alignment module processes the original and reconstructed features to obtain the pedestrian's unified visible light and infrared characterizations; and the multi-modal features together with a preset loss function are then used to train and optimize the cross-modal model. The model is refined through iterative training, yielding a cross-modal model based on the modality-specific memory network with high recognition accuracy and a good recognition effect.
The above-described acquisition of the visible light characteristic and the infrared characteristic of the pedestrian is described in detail below with reference to specific embodiments.
For a given image (such as a visible light image or an infrared image of a pedestrian), a visible light image feature map $F^V \in \mathbb{R}^{H \times W \times C}$ and an infrared image feature map $F^I \in \mathbb{R}^{H \times W \times C}$ can be extracted, where H, W and C respectively denote the height, width and number of channels of the feature map. $F^V$ and $F^I$ are then divided into K parts along the horizontal direction, and each part is pooled to obtain local feature vectors $f_k^V \in \mathbb{R}^C$ and $f_k^I \in \mathbb{R}^C$, where k = 1, 2, …, K.
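A sketch of this partition-and-pool step (PyTorch; the number of parts K is a hyperparameter whose value the patent does not disclose, so K=6 below is only a placeholder):

import torch

def part_pool(feature_map, K=6):
    """Split a (B, C, H, W) feature map into K horizontal parts and
    average-pool each part into a C-dimensional local feature vector."""
    parts = feature_map.chunk(K, dim=2)  # K strips along the height axis
    # each strip: (B, C, H/K, W) -> mean over spatial dims -> (B, C)
    return torch.stack([p.mean(dim=(2, 3)) for p in parts], dim=1)  # (B, K, C)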
FIG. 3 is a flow chart for obtaining multi-modal reconstruction features of pedestrians according to an embodiment of the invention.
As shown in fig. 3, the processing of the multi-modal characteristics of the pedestrian by using the modality-specific memory network module to obtain the multi-modal reconstructed characteristics of the pedestrian includes operations S310 to S370.
In operation S310, the visible light feature and the infrared feature are respectively processed by using the modality specific memory network, so as to obtain a visible light memory item and an infrared memory item.
The memory items are the individual entries of the modality-specific memory network; specifically, representative samples are stored in the memory network.
In operation S320, the cosine similarity between the visible light feature and the visible light memory term is calculated to obtain the visible light cosine similarity.
In operation S330, the visible light cosine similarity is normalized to obtain a visible light normalization vector.
The visible light normalization vector is denoted $A^V = [a_{k,1}^V, a_{k,2}^V, \ldots, a_{k,N}^V]$.
In operation S340, an infrared reconstruction feature is obtained according to the infrared memory term and the visible light normalization vector.
In operation S350, the cosine similarity between the infrared feature and the infrared memory term is calculated to obtain the infrared cosine similarity.
In operation S360, the infrared cosine similarity is normalized to obtain an infrared normalized vector.
The infrared normalization vector is denoted $A^I = [a_{k,1}^I, a_{k,2}^I, \ldots, a_{k,N}^I]$.
In operation S370, a visible light reconstruction feature is obtained according to the visible light memory term and the infrared normalization vector.
The above-described multi-modal reconstruction features for pedestrians are described in further detail below with reference to specific embodiments.
The modality-specific memory network module is used to accurately store and transfer information between the visible light modality and the infrared modality and to obtain a unified representation. Given an input image (e.g., a visible light image or an infrared image), the memory network can be read to reconstruct the missing modal features; for example, given a visible light image, its infrared features can be reconstructed. To achieve this goal, modality-specific memory items $\{m_{k,n}^V\}_{n=1}^{N}$ and $\{m_{k,n}^I\}_{n=1}^{N}$ are introduced, where N denotes the number of memory items each part uses to model local variation. The modality-specific memory items are arranged in pairs, each item corresponding to a prototype feature of the visible light or infrared modality.
According to an embodiment of the present invention, the above-mentioned visible light cosine similarity is determined by formula (1):

$d(f_k^V, m_{k,n}^V) = \frac{(f_k^V)^\top m_{k,n}^V}{\|f_k^V\|\,\|m_{k,n}^V\|} \quad (1)$

wherein $f_k^V$ denotes the visible light feature and $m_{k,n}^V$ denotes a visible light memory item;

wherein the infrared reconstruction feature is determined by formula (2):

$\hat{f}_k^I = \sum_{n=1}^{N} a_{k,n}^V\, m_{k,n}^I \quad (2)$

wherein $m_{k,n}^I$ denotes an infrared memory item and $a_{k,n}^V$ denotes the n-th entry of the N-dimensional visible light normalization vector, determined by formula (3):

$a_{k,n}^V = \frac{\exp\big(d(f_k^V, m_{k,n}^V)/\tau\big)}{\sum_{n'=1}^{N} \exp\big(d(f_k^V, m_{k,n'}^V)/\tau\big)} \quad (3)$

where τ denotes the visible light temperature coefficient.
According to an embodiment of the present invention, the infrared cosine similarity is determined by formula (4):

$d(f_k^I, m_{k,n}^I) = \frac{(f_k^I)^\top m_{k,n}^I}{\|f_k^I\|\,\|m_{k,n}^I\|} \quad (4)$

wherein $f_k^I$ denotes the infrared feature and $m_{k,n}^I$ denotes an infrared memory item;

wherein the visible light reconstruction feature is determined by formula (5):

$\hat{f}_k^V = \sum_{n=1}^{N} a_{k,n}^I\, m_{k,n}^V \quad (5)$

wherein $m_{k,n}^V$ denotes a visible light memory item and $a_{k,n}^I$ denotes the n-th entry of the N-dimensional infrared normalization vector, determined by formula (6):

$a_{k,n}^I = \frac{\exp\big(d(f_k^I, m_{k,n}^I)/\tau\big)}{\sum_{n'=1}^{N} \exp\big(d(f_k^I, m_{k,n'}^I)/\tau\big)} \quad (6)$

where τ denotes the infrared temperature coefficient.
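Formulas (1) through (6) translate directly into the following sketch (PyTorch; modeling the memory banks as learnable tensors of shape (K, N, C), and the default sizes shown, are assumptions consistent with the description rather than disclosed settings):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalitySpecificMemory(nn.Module):
    """Paired visible/infrared memory items with the cross-modal read of
    formulas (1)-(6): cosine similarity, temperature softmax, weighted sum."""
    def __init__(self, K=6, N=100, C=2048, tau=0.1):
        super().__init__()
        self.mem_v = nn.Parameter(torch.randn(K, N, C))  # visible items m^V
        self.mem_i = nn.Parameter(torch.randn(K, N, C))  # infrared items m^I
        self.tau = tau                                   # temperature coefficient

    def read(self, f, own_mem, other_mem):
        # f: (B, K, C) local features of the input modality
        # formulas (1)/(4): cosine similarity to same-modality memory items
        sim = torch.einsum("bkc,knc->bkn",
                           F.normalize(f, dim=-1),
                           F.normalize(own_mem, dim=-1))
        # formulas (3)/(6): softmax normalization with temperature tau
        a = F.softmax(sim / self.tau, dim=-1)            # addressing weights A
        # formulas (2)/(5): reconstruct the missing modality from paired items
        recon = torch.einsum("bkn,knc->bkc", a, other_mem)
        return recon, a

    def forward(self, f_v, f_i):
        f_i_hat, a_v = self.read(f_v, self.mem_v, self.mem_i)  # infrared recon
        f_v_hat, a_i = self.read(f_i, self.mem_i, self.mem_v)  # visible recon
        return f_v_hat, f_i_hat, a_v, a_i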
The visible light reconstruction features and infrared reconstruction features of the pedestrian can be calculated by formulas (1) to (6), respectively. The multi-modal reconstruction features computed in this way enable mutual mapping and comparison between modalities during cross-modal recognition, improving cross-modal recognition efficiency.
FIG. 4 is a flow diagram for obtaining a multi-modal unified characterization of a pedestrian according to an embodiment of the present invention.
As shown in fig. 4, the processing of the multi-modal features of the pedestrian and the multi-modal reconstructed features of the pedestrian by the unified feature alignment module to obtain the multi-modal unified representation of the pedestrian includes operations S410 to S420.
In operation S410, the visible light feature and the infrared reconstruction feature are fused by using the unified feature alignment module to obtain a unified visible light feature.
In operation S420, the infrared feature and the visible light reconstruction feature are fused by using the unified feature alignment module, so as to obtain an infrared unified representation.
After the reconstructed missing modal features of the pedestrian are obtained, they are fused with the input features to obtain the unified feature representations:

$u^V = h(f^V, \hat{f}^I), \qquad u^I = h(f^I, \hat{f}^V)$

wherein $u^V$ denotes the visible light unified characterization, $u^I$ denotes the infrared unified characterization, and h(·) is a fusion layer consisting of a linear layer and a batch normalization layer. By fusing the original features and the reconstructed modality features, the visible light and infrared images are naturally embedded into a common feature space.
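A sketch of the fusion layer h(·) (PyTorch; concatenating the original and reconstructed features before the linear layer, and sharing one fusion layer across both modalities, are assumptions, since the patent states only that h(·) consists of a linear layer and a batch normalization layer):

import torch
import torch.nn as nn

class UnifiedAlign(nn.Module):
    """Fusion layer h(.): linear layer + batch normalization, producing a
    unified representation from an original feature and its cross-modal
    reconstruction (per part). Input concatenation is assumed here."""
    def __init__(self, C=2048):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * C, C), nn.BatchNorm1d(C))

    def forward(self, f, f_recon):
        # f, f_recon: (B, C) -> unified representation u: (B, C)
        return self.fuse(torch.cat([f, f_recon], dim=-1))

In use, one instance of the layer maps (f^V, f̂^I) to u^V and (f^I, f̂^V) to u^I.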
According to an embodiment of the present invention, the preset loss function is determined by formula (7):

$\mathcal{L} = \mathcal{L}_{uni} + \mathcal{L}_{feat} + \mathcal{L}_{ctri} + \mathcal{L}_{rc} + \lambda_{align}\,\mathcal{L}_{align} + \lambda_{dis}\,\mathcal{L}_{dis} + \lambda_{rec}\,\mathcal{L}_{rec} \quad (7)$

wherein $\mathcal{L}_{uni}$ denotes the modal unified characterization classification loss function, $\mathcal{L}_{feat}$ denotes the modal feature classification loss function, $\mathcal{L}_{ctri}$ denotes the central triplet loss function, $\mathcal{L}_{rc}$ denotes the reconstruction consistency loss function, $\mathcal{L}_{align}$ denotes the modality-specific memory term loss function, $\mathcal{L}_{dis}$ denotes the modality-specific memory term discriminant loss function, $\mathcal{L}_{rec}$ denotes the reconstruction loss function, and $\lambda_{align}$, $\lambda_{dis}$ and $\lambda_{rec}$ denote the weighting coefficients of the modality-specific memory term loss function, the modality-specific memory term discriminant loss function and the reconstruction loss function, respectively.
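Formula (7) amounts to a weighted sum of the individual terms; a sketch follows (the loss terms are assumed to be computed elsewhere, and the default lambda values below are placeholders, not settings disclosed in the patent):

def total_loss(l_uni, l_feat, l_ctri, l_rc, l_align, l_dis, l_rec,
               lam_align=1.0, lam_dis=1.0, lam_rec=1.0):
    """Formula (7): unweighted classification, central-triplet and
    reconstruction-consistency terms plus weighted memory-alignment,
    memory-discriminant and reconstruction terms."""
    return (l_uni + l_feat + l_ctri + l_rc
            + lam_align * l_align + lam_dis * l_dis + lam_rec * l_rec)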
Through the various loss functions, the optimization efficiency and the optimization effect of the cross-modal model based on the modal specific memory network can be improved.
According to an embodiment of the present invention, the above-mentioned modal unified characterization classification loss function is determined by formula (8):

$\mathcal{L}_{uni} = \mathcal{L}_{ce}(u^V, y^V) + \mathcal{L}_{ce}(u^I, y^I) \quad (8)$

where $\mathcal{L}_{ce}$ denotes a cross-entropy classification loss. The modal unified characterization classification loss function is used to predict the identity of a pedestrian from the unified characterizations.
wherein the modal feature classification loss function is determined by formula (9):

$\mathcal{L}_{feat} = \mathcal{L}_{ce}(f^V, y^V) + \mathcal{L}_{ce}(f^I, y^I) \quad (9)$

The modal feature classification loss function is used to make the local features from both modalities (visible light and infrared) discriminative.
wherein the reconstruction consistency loss function is determined by formula (10):

$\mathcal{L}_{rc} = \mathcal{L}_{ce}(D^V(\hat{f}^V), y) + \mathcal{L}_{ce}(D^I(\hat{f}^I), y) \quad (10)$

The reconstruction consistency loss function is used to ensure that the features reconstructed by the memory network are consistent with the features extracted by the backbone network: two modality discriminators $D^V$ and $D^I$ are used to classify the reconstructed modal features $\hat{f}^V$ and $\hat{f}^I$, where y denotes the pedestrian identity label of the corresponding input image.
wherein the reconstruction loss function is determined by formula (11):

$\mathcal{L}_{rec} = \|f^V - \tilde{f}^V\|_2 + \|f^I - \tilde{f}^I\|_2 \quad (11)$

The reconstruction loss function is used to ensure that the input features can be reconstructed from the memory items of the same modality. First, the reconstructed input features $\tilde{f}^{*} = \sum_{n=1}^{N} a_{k,n}^{*}\, m_{k,n}^{*}$ are obtained; then the Euclidean distance between the input features and the reconstructed input features is minimized.
wherein the modality-specific memory term loss function is determined by formula (12):

$\mathcal{L}_{align} = D_{KL}(A^V \| A^I) + D_{KL}(A^I \| A^V) \quad (12)$

The modality-specific memory term loss function is used to align the correspondence between the memory items of the visible light and infrared modalities, where $D_{KL}(\cdot \| \cdot)$ denotes the KL divergence.
wherein the modality-specific memory term discriminant loss function $\mathcal{L}_{dis}$ is determined by formula (13), which appears only as an image in the original publication. Since the memory items store prototypical features of each modality, they should have sufficient discriminative power to represent the various patterns of pedestrian images; the modality-specific memory term discriminant loss function is used to make the multi-modal memory items distinguishable.
wherein $y^V$ denotes the visible light image label of the pedestrian, $y^I$ denotes the infrared image label of the pedestrian, $f^V$ denotes the visible light feature, $f^I$ denotes the infrared feature, $\hat{f}^V$ denotes the visible light reconstruction feature, $\hat{f}^I$ denotes the infrared reconstruction feature, $* \in \{V, I\}$ indexes the visible light or infrared modality, $m^{*}$ denotes a memory item, $A^V$ denotes the visible light normalization vector, and $A^I$ denotes the infrared normalization vector.
FIG. 5 is a diagram of a training framework for a cross-modal model based on a modal-specific memory network, according to an embodiment of the invention.
The training process of the above model is described in further detail below with reference to fig. 5.
As shown in fig. 5, the inputs to the above model are visible light images and infrared images of a pedestrian. First, the feature extraction module processes the visible light image and the infrared image separately to obtain the pedestrian's visible light features and infrared features; in this process, the related loss functions (such as those of the discriminators $D^V$ and $D^I$) can be used to optimize the feature extraction results. Second, the visible light features and infrared features are input to the modality-specific memory network module (i.e., a network dedicated to one modality, such as a network dedicated to processing the visible light modality), which yields the specific memory items of each modality (such as the visible light memory items) and produces the multi-modal reconstruction features. Finally, the unified feature alignment module fuses the multi-modal reconstruction features with the multi-modal features to obtain the multi-modal unified representation. The training framework requires no image generation process, and the whole network can be trained end to end. By completing the missing modal features through the modality-specific memory network, the modality-difference problem is alleviated: the missing modal features can be completed using only single-modality input, and a unified feature space is obtained by aggregating the original and completed modal features.
According to the pedestrian re-identification method, the trained cross-modal model based on the modality-specific memory network is obtained through the above training method and is then used to re-identify pedestrians: missing modal information can be completed from a single-modality input image, so that it can be judged whether pedestrian images of different modalities belong to the same pedestrian, improving re-identification accuracy. The method can be widely applied in scenarios such as security systems and smart cities; it can be installed as software on front-end devices to match visible light and near-infrared pedestrian images in real time, or deployed on a back-end server to provide large-scale visible light/near-infrared pedestrian image retrieval and matching.
Fig. 6 schematically shows a block diagram of an electronic device adapted for a cross-modal pedestrian re-identification method based on a modal-specific memory network and a training method of a cross-modal model based on a modal-specific memory network according to an embodiment of the invention.
As shown in fig. 6, an electronic device 600 according to an embodiment of the present invention includes a processor 601 which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. Processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 601 may also include onboard memory for caching purposes. The processor 601 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present invention.
In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. The processor 601 performs various operations of the method flow according to the embodiments of the present invention by executing programs in the ROM 602 and/or RAM 603. It is to be noted that the program may also be stored in one or more memories other than the ROM 602 and the RAM 603. The processor 601 may also perform various operations of method flows according to embodiments of the present invention by executing programs stored in one or more memories.
Electronic device 600 may also include input/output (I/O) interface 605, where input/output (I/O) interface 605 is also connected to bus 604, according to an embodiment of the invention. The electronic device 600 may also include one or more of the following components connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
The present invention also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the present invention.
According to embodiments of the present invention, the computer readable storage medium may be a non-volatile computer readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to an embodiment of the present invention, a computer-readable storage medium may include the ROM 602 and/or the RAM 603 described above and/or one or more memories other than the ROM 602 and the RAM 603.
Embodiments of the invention also include a computer program product comprising a computer program comprising program code for performing the method illustrated in the flow chart. When the computer program product runs in a computer system, the program code is used for causing the computer system to realize the cross-modal pedestrian re-identification method based on the modal-specific memory network and the training method based on the cross-modal model of the modal-specific memory network, which are provided by the embodiment of the invention.
The computer program, when executed by the processor 601, performs the above-described functions defined in the system/apparatus of the embodiment of the present invention. According to embodiments of the present invention, the systems, devices, modules, units, etc. described above may be implemented by computer program modules.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal on a network medium, downloaded and installed through the communication section 609, and/or installed from the removable medium 611. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The above embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above embodiments are only examples of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A cross-modal pedestrian re-identification method based on a modal specific memory network comprises the following steps:
acquiring a pedestrian image to be re-identified and a re-identification type;
according to the re-recognition type, processing the pedestrian image to be re-recognized by using the cross-modal model based on the modal specific memory network to obtain a re-recognition result, wherein the cross-modal model based on the modal specific memory network is obtained by training in the following method:
respectively processing a visible light image and an infrared image of a pedestrian by using a feature extraction module to obtain a visible light image feature map and an infrared image feature map;
performing average pooling on each segmentation part in the visible light image feature map to obtain visible light features, and performing average pooling on each segmentation part in the infrared image feature map to obtain infrared features;
reconstructing the visible light characteristics and the infrared characteristics of the pedestrian by using a mode specific memory network module to obtain visible light reconstruction characteristics and infrared reconstruction characteristics of the pedestrian, wherein the mode specific memory network module is used for storing and transmitting the visible light reconstruction characteristics and the infrared reconstruction characteristics of the pedestrian;
processing the visible light feature, the infrared feature, the visible light reconstruction feature and the infrared reconstruction feature of the pedestrian by using a unified feature alignment module to obtain a multi-modal unified representation of the pedestrian, wherein the multi-modal unified representation comprises a visible light unified representation and an infrared unified representation;
and optimizing a cross-modal model according to a preset loss function by using the visible light characteristic, the infrared characteristic, the visible light reconstruction characteristic, the infrared reconstruction characteristic and the multi-modal unified representation of the pedestrian until the value of the preset loss function meets a preset condition, and obtaining the trained cross-modal model based on the modal specific memory network.
2. The method of claim 1, wherein the reconstructing the visible light features and the infrared features of the pedestrian using a modality-specific memory network module, the obtaining visible light reconstructed features and infrared reconstructed features of the pedestrian comprises:
respectively processing the visible light characteristics and the infrared characteristics by using the mode specific memory network to obtain visible light memory items and infrared memory items;
calculating the cosine similarity of the visible light characteristics and the visible light memory term to obtain the cosine similarity of the visible light;
normalizing the visible light cosine similarity to obtain a visible light normalized vector;
acquiring the infrared reconstruction characteristics according to the infrared memory items and the visible light normalized vector;
calculating the cosine similarity of the infrared features and the infrared memory items to obtain the infrared cosine similarity;
carrying out normalization processing on the infrared cosine similarity to obtain an infrared normalized vector;
and obtaining the visible light reconstruction characteristics according to the visible light memory term and the infrared normalized vector.
3. The method of claim 2, wherein the visible light cosine similarity is determined by equation (1):
$d(f_k^V, m_{k,n}^V) = \frac{(f_k^V)^\top m_{k,n}^V}{\|f_k^V\|\,\|m_{k,n}^V\|} \quad (1)$

wherein $f_k^V$ denotes the visible light feature and $m_{k,n}^V$ denotes a visible light memory item;

wherein the infrared reconstruction feature is determined by equation (2):

$\hat{f}_k^I = \sum_{n=1}^{N} a_{k,n}^V\, m_{k,n}^I \quad (2)$

wherein $m_{k,n}^I$ denotes an infrared memory item and $a_{k,n}^V$ denotes the n-th entry of the N-dimensional visible light normalization vector, determined by equation (3):

$a_{k,n}^V = \frac{\exp\big(d(f_k^V, m_{k,n}^V)/\tau\big)}{\sum_{n'=1}^{N} \exp\big(d(f_k^V, m_{k,n'}^V)/\tau\big)} \quad (3)$

where τ denotes the visible light temperature coefficient.
4. The method of claim 2, wherein the infrared cosine similarity is determined by equation (4):
$d(f_k^I, m_{k,n}^I) = \frac{(f_k^I)^\top m_{k,n}^I}{\|f_k^I\|\,\|m_{k,n}^I\|} \quad (4)$

wherein $f_k^I$ denotes the infrared feature and $m_{k,n}^I$ denotes an infrared memory item;

wherein the visible light reconstruction feature is determined by equation (5):

$\hat{f}_k^V = \sum_{n=1}^{N} a_{k,n}^I\, m_{k,n}^V \quad (5)$

wherein $m_{k,n}^V$ denotes a visible light memory item and $a_{k,n}^I$ denotes the n-th entry of the N-dimensional infrared normalization vector, determined by equation (6):

$a_{k,n}^I = \frac{\exp\big(d(f_k^I, m_{k,n}^I)/\tau\big)}{\sum_{n'=1}^{N} \exp\big(d(f_k^I, m_{k,n'}^I)/\tau\big)} \quad (6)$

where τ denotes the infrared temperature coefficient.
5. The method of claim 1, wherein the processing the visible light features, the infrared features, the visible light reconstruction features, and the infrared reconstruction features of the pedestrian with a unified feature alignment module to obtain a multi-modal unified characterization of the pedestrian comprises:
fusing the visible light characteristic and the infrared reconstruction characteristic by using a unified characteristic alignment module to obtain a visible light unified characteristic;
and fusing the infrared characteristic and the visible light reconstruction characteristic by using a unified characteristic alignment module to obtain an infrared unified characteristic.
6. The method of claim 1, wherein the preset loss function is determined by equation (7):
$\mathcal{L} = \mathcal{L}_{uni} + \mathcal{L}_{feat} + \mathcal{L}_{ctri} + \mathcal{L}_{rc} + \lambda_{align}\,\mathcal{L}_{align} + \lambda_{dis}\,\mathcal{L}_{dis} + \lambda_{rec}\,\mathcal{L}_{rec} \quad (7)$

wherein $\mathcal{L}_{uni}$ denotes the modal unified characterization classification loss function, $\mathcal{L}_{feat}$ denotes the modal feature classification loss function, $\mathcal{L}_{ctri}$ denotes the central triplet loss function, $\mathcal{L}_{rc}$ denotes the reconstruction consistency loss function, $\mathcal{L}_{align}$ denotes the modality-specific memory term loss function, $\mathcal{L}_{dis}$ denotes the modality-specific memory term discriminant loss function, $\mathcal{L}_{rec}$ denotes the reconstruction loss function, and $\lambda_{align}$, $\lambda_{dis}$ and $\lambda_{rec}$ denote the weighting coefficients of the modality-specific memory term loss function, the modality-specific memory term discriminant loss function and the reconstruction loss function, respectively.
7. The method according to claim 6, wherein the modal unified characterization classification loss function is determined by equation (8):
$\mathcal{L}_{uni} = \mathcal{L}_{ce}(u^V, y^V) + \mathcal{L}_{ce}(u^I, y^I) \quad (8)$

wherein $\mathcal{L}_{ce}$ denotes a cross-entropy identity classification loss and $u^V$, $u^I$ denote the visible light and infrared unified characterizations;

wherein the modal feature classification loss function is determined by equation (9):

$\mathcal{L}_{feat} = \mathcal{L}_{ce}(f^V, y^V) + \mathcal{L}_{ce}(f^I, y^I) \quad (9)$

wherein the reconstruction consistency loss function is determined by equation (10), with two modality discriminators $D^V$ and $D^I$ classifying the reconstructed modal features and y denoting the pedestrian identity label of the corresponding input image:

$\mathcal{L}_{rc} = \mathcal{L}_{ce}(D^V(\hat{f}^V), y) + \mathcal{L}_{ce}(D^I(\hat{f}^I), y) \quad (10)$

wherein the reconstruction loss function is determined by equation (11), where $\tilde{f}^{*} = \sum_{n=1}^{N} a_{k,n}^{*}\, m_{k,n}^{*}$ denotes the input feature reconstructed from the memory items of its own modality:

$\mathcal{L}_{rec} = \|f^V - \tilde{f}^V\|_2 + \|f^I - \tilde{f}^I\|_2 \quad (11)$

wherein the modality-specific memory term loss function is determined by equation (12):

$\mathcal{L}_{align} = D_{KL}(A^V \| A^I) + D_{KL}(A^I \| A^V) \quad (12)$

wherein the modality-specific memory term discriminant loss function $\mathcal{L}_{dis}$ is determined by equation (13), which appears only as an image in the original publication and serves to make the multi-modal memory items distinguishable;

wherein $y^V$ denotes the visible light image label of the pedestrian, $y^I$ denotes the infrared image label of the pedestrian, $f^V$ denotes the visible light feature, $f^I$ denotes the infrared feature, $\hat{f}^V$ denotes the visible light reconstruction feature, $\hat{f}^I$ denotes the infrared reconstruction feature, $* \in \{V, I\}$ indexes the visible light or infrared modality, $m^{*}$ denotes a memory item, $A^V$ denotes the visible light normalization vector, and $A^I$ denotes the infrared normalization vector.
8. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.
9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 7.
10. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 7.
CN202210426984.4A 2022-04-21 2022-04-21 Cross-modal pedestrian re-identification method based on modal specific memory network Active CN114882525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210426984.4A CN114882525B (en) 2022-04-21 2022-04-21 Cross-modal pedestrian re-identification method based on modal specific memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210426984.4A CN114882525B (en) 2022-04-21 2022-04-21 Cross-modal pedestrian re-identification method based on modal specific memory network

Publications (2)

Publication Number Publication Date
CN114882525A (en) 2022-08-09
CN114882525B CN114882525B (en) 2024-04-02

Family

ID=82671510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210426984.4A Active CN114882525B (en) 2022-04-21 2022-04-21 Cross-modal pedestrian re-identification method based on modal specific memory network

Country Status (1)

Country Link
CN (1) CN114882525B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018120936A1 (en) * 2016-12-27 2018-07-05 Zhejiang Dahua Technology Co., Ltd. Systems and methods for fusing infrared image and visible light image
CN112016401A (en) * 2020-08-04 2020-12-01 杰创智能科技股份有限公司 Cross-modal-based pedestrian re-identification method and device
CN114220124A (en) * 2021-12-16 2022-03-22 华南农业大学 Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system
CN114241517A (en) * 2021-12-02 2022-03-25 河南大学 Cross-modal pedestrian re-identification method based on image generation and shared learning network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018120936A1 (en) * 2016-12-27 2018-07-05 Zhejiang Dahua Technology Co., Ltd. Systems and methods for fusing infrared image and visible light image
CN112016401A (en) * 2020-08-04 2020-12-01 杰创智能科技股份有限公司 Cross-modal-based pedestrian re-identification method and device
WO2022027986A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Cross-modal person re-identification method and device
CN114241517A (en) * 2021-12-02 2022-03-25 河南大学 Cross-modal pedestrian re-identification method based on image generation and shared learning network
CN114220124A (en) * 2021-12-16 2022-03-22 华南农业大学 Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FENG Min; ZHANG Zhicheng; LYU Jin; YU Lei; HAN Bin: "Research on Cross-Modal Person Re-identification Based on Generative Adversarial Networks", Modern Information Technology, no. 04, 25 February 2020 (2020-02-25) *
PAN Lei; YIN Yilong; LI Xuzhou: "Score-Based Fusion Algorithm for Near-Infrared and Visible Light Images", Computer Engineering, no. 04, 15 April 2013 (2013-04-15) *

Also Published As

Publication number Publication date
CN114882525B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US20180089534A1 (en) Cross-modiality image matching method
CN111104867B (en) Recognition model training and vehicle re-recognition method and device based on part segmentation
Kang et al. Deep learning-based weather image recognition
Varghese et al. An efficient algorithm for detection of vacant spaces in delimited and non-delimited parking lots
WO2023173599A1 (en) Method and apparatus for classifying fine-granularity images based on image block scoring
Tao et al. Smoke vehicle detection based on multi-feature fusion and hidden Markov model
CN111373393B (en) Image retrieval method and device and image library generation method and device
CN111067522A (en) Brain addiction structural map assessment method and device
CN114170516A (en) Vehicle weight recognition method and device based on roadside perception and electronic equipment
CN111078940A (en) Image processing method, image processing device, computer storage medium and electronic equipment
Liu et al. Registration of infrared and visible light image based on visual saliency and scale invariant feature transform
CN115862055A (en) Pedestrian re-identification method and device based on comparison learning and confrontation training
CN115620090A (en) Model training method, low-illumination target re-recognition method and device and terminal equipment
CN111783654A (en) Vehicle weight identification method and device and electronic equipment
Dai Uncertainty-aware accurate insulator fault detection based on an improved YOLOX model
Ying et al. Tyre pattern image retrieval–current status and challenges
CN112990152B (en) Vehicle weight identification method based on key point detection and local feature alignment
CN114168768A (en) Image retrieval method and related equipment
CN112861776A (en) Human body posture analysis method and system based on dense key points
CN116994332A (en) Cross-mode pedestrian re-identification method and system based on contour map guidance
CN112633089B (en) Video pedestrian re-identification method, intelligent terminal and storage medium
CN114882525B (en) Cross-modal pedestrian re-identification method based on modal specific memory network
CN112380369B (en) Training method, device, equipment and storage medium of image retrieval model
Hadjkacem et al. Multi-shot human re-identification using a fast multi-scale video covariance descriptor
CN116052220B (en) Pedestrian re-identification method, device, equipment and medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant