CN114882525B - Cross-modal pedestrian re-identification method based on modal specific memory network - Google Patents


Info

Publication number: CN114882525B
Application number: CN202210426984.4A
Authority: CN (China)
Prior art keywords: infrared; visible light; pedestrian; reconstruction
Legal status: Active (the legal status is an assumption and is not a legal conclusion; no legal analysis has been performed)
Other languages: Chinese (zh)
Other versions: CN114882525A
Inventors: 张天柱, 刘翔, 张勇东, 李昱霖, 吴枫
Current and original assignee: University of Science and Technology of China (USTC)
Application filed by University of Science and Technology of China (USTC); application granted; publication of CN114882525B


Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F18/00: Pattern recognition
            • G06F18/20: Analysing
              • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F18/22: Matching criteria, e.g. proximity measures
              • G06F18/25: Fusion techniques
                • G06F18/253: Fusion techniques of extracted features
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00: Computing arrangements based on biological models
            • G06N3/02: Neural networks
              • G06N3/04: Architecture, e.g. interconnection topology
                • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a cross-modal pedestrian re-identification method based on a modality-specific memory network, comprising the following steps: acquiring a pedestrian image to be re-identified and a re-identification type; and processing the pedestrian image to be re-identified, according to the re-identification type, with a cross-modal pedestrian re-identification model based on the modality-specific memory network to obtain a re-identification result. The invention also provides an electronic device, a storage medium, and a computer program product implementing the cross-modal pedestrian re-identification method based on the modality-specific memory network.

Description

Cross-modal pedestrian re-identification method based on modal specific memory network
Technical Field
The invention relates to the field of computer vision, and in particular to a cross-modal pedestrian re-identification method based on a modality-specific memory network, together with a re-identification apparatus, an electronic device, and a storage medium.
Background
Pedestrian re-identification is the task of matching images of the same pedestrian across different camera views. Combined with pedestrian detection and tracking, it is widely applied in video surveillance, intelligent security, criminal investigation, and related fields.
However, prior-art re-identification methods either fail to fully exploit cross-modal pedestrian information, or, when they do operate across modalities, suffer from low recognition accuracy and poor recognition performance.
Disclosure of Invention
In view of the foregoing, the present invention provides a training method for a cross-modal model based on a modality-specific memory network, together with an electronic device, a storage medium, and a computer program product, so as to solve at least one of the foregoing problems.
According to an embodiment of the invention, a cross-modal pedestrian re-identification method based on a modality-specific memory network is provided, comprising the following steps:
acquiring a pedestrian image to be re-identified and a re-identification type;
processing the pedestrian image to be re-identified, according to the re-identification type, with a cross-modal model based on a modality-specific memory network to obtain a re-identification result, wherein the cross-modal model based on the modality-specific memory network is trained as follows:
processing a visible light image and an infrared image of a pedestrian, respectively, with a feature extraction module to obtain a visible light image feature map and an infrared image feature map;
average-pooling each divided part of the visible light image feature map to obtain visible light features, and average-pooling each divided part of the infrared image feature map to obtain infrared features;
reconstructing the visible light features and infrared features of the pedestrian with a modality-specific memory network module to obtain visible light reconstruction features and infrared reconstruction features of the pedestrian, wherein the modality-specific memory network module is used to store and transfer the visible light reconstruction features and infrared reconstruction features;
processing the visible light features, infrared features, visible light reconstruction features, and infrared reconstruction features of the pedestrian with a unified feature alignment module to obtain a multi-modal unified representation of the pedestrian, the multi-modal unified representation comprising a visible light unified representation and an infrared unified representation;
and optimizing the cross-modal model according to a preset loss function using the visible light features, infrared features, visible light reconstruction features, infrared reconstruction features, and multi-modal unified representation of the pedestrian, until the value of the preset loss function satisfies a preset condition, thereby obtaining the trained cross-modal model based on the modality-specific memory network.
According to an embodiment of the present invention, reconstructing the visible light features and infrared features of the pedestrian with the modality-specific memory network module to obtain the visible light reconstruction features and infrared reconstruction features comprises:
processing the visible light features and infrared features, respectively, with the modality-specific memory network to obtain visible light memory items and infrared memory items;
calculating the cosine similarity between the visible light features and the visible light memory items to obtain the visible light cosine similarity;
normalizing the visible light cosine similarity to obtain a visible light normalization vector;
obtaining the infrared reconstruction features from the infrared memory items and the visible light normalization vector;
calculating the cosine similarity between the infrared features and the infrared memory items to obtain the infrared cosine similarity;
normalizing the infrared cosine similarity to obtain an infrared normalization vector;
and obtaining the visible light reconstruction features from the visible light memory items and the infrared normalization vector.
According to an embodiment of the present invention, the visible light cosine similarity is determined by formula (1):

d\big(f_k^V, m_{k,n}^V\big) = \frac{(f_k^V)^\top m_{k,n}^V}{\lVert f_k^V \rVert \, \lVert m_{k,n}^V \rVert} \tag{1}

where f_k^V denotes the k-th visible light local feature and m_{k,n}^V denotes the n-th visible light memory item of the k-th part.

The infrared reconstruction feature is determined by formula (2):

\hat{f}_k^I = \sum_{n=1}^{N} a_{k,n}^V \, m_{k,n}^I \tag{2}

where m_{k,n}^I denotes an infrared memory item and a_{k,n}^V denotes the n-th entry of the N-dimensional visible light normalization vector, determined by formula (3):

a_{k,n}^V = \frac{\exp\big(d(f_k^V, m_{k,n}^V)/\tau\big)}{\sum_{n'=1}^{N} \exp\big(d(f_k^V, m_{k,n'}^V)/\tau\big)} \tag{3}

where \tau denotes the visible light temperature coefficient.
According to an embodiment of the present invention, the infrared cosine similarity is determined by formula (4):

d\big(f_k^I, m_{k,n}^I\big) = \frac{(f_k^I)^\top m_{k,n}^I}{\lVert f_k^I \rVert \, \lVert m_{k,n}^I \rVert} \tag{4}

where f_k^I denotes the k-th infrared local feature and m_{k,n}^I denotes the n-th infrared memory item of the k-th part.

The visible light reconstruction feature is determined by formula (5):

\hat{f}_k^V = \sum_{n=1}^{N} a_{k,n}^I \, m_{k,n}^V \tag{5}

where m_{k,n}^V denotes a visible light memory item and a_{k,n}^I denotes the n-th entry of the N-dimensional infrared normalization vector, determined by formula (6):

a_{k,n}^I = \frac{\exp\big(d(f_k^I, m_{k,n}^I)/\tau\big)}{\sum_{n'=1}^{N} \exp\big(d(f_k^I, m_{k,n'}^I)/\tau\big)} \tag{6}

where \tau denotes the infrared temperature coefficient.
According to an embodiment of the present invention, processing the visible light features, infrared features, visible light reconstruction features, and infrared reconstruction features of the pedestrian with the unified feature alignment module to obtain the multi-modal unified representation comprises:
fusing the visible light features and the infrared reconstruction features with the unified feature alignment module to obtain the visible light unified representation;
and fusing the infrared features and the visible light reconstruction features with the unified feature alignment module to obtain the infrared unified representation.
According to an embodiment of the present invention, the preset loss function is determined by formula (7):

\mathcal{L} = \mathcal{L}_{\mathrm{uni}} + \mathcal{L}_{\mathrm{feat}} + \mathcal{L}_{\mathrm{ctri}} + \mathcal{L}_{\mathrm{con}} + \lambda_{\mathrm{align}} \mathcal{L}_{\mathrm{align}} + \lambda_{\mathrm{dis}} \mathcal{L}_{\mathrm{dis}} + \lambda_{\mathrm{rec}} \mathcal{L}_{\mathrm{rec}} \tag{7}

where \mathcal{L}_{\mathrm{uni}} denotes the unified-representation classification loss function, \mathcal{L}_{\mathrm{feat}} the modal feature classification loss function, \mathcal{L}_{\mathrm{ctri}} the center triplet loss function, \mathcal{L}_{\mathrm{con}} the reconstruction consistency loss function, \mathcal{L}_{\mathrm{align}} the modality-specific memory item alignment loss function, \mathcal{L}_{\mathrm{dis}} the modality-specific memory item discrimination loss function, \mathcal{L}_{\mathrm{rec}} the reconstruction loss function, and \lambda_{\mathrm{align}}, \lambda_{\mathrm{dis}}, \lambda_{\mathrm{rec}} the weighting coefficients of the last three terms, respectively.
According to an embodiment of the present invention, the unified-representation classification loss function is determined by formula (8), the modal feature classification loss function by formula (9), the reconstruction consistency loss function by formula (10), the reconstruction loss function by formula (11), the modality-specific memory item alignment loss function by formula (12), and the modality-specific memory item discrimination loss function by formula (13), each of which is described further below;
where y^V denotes the visible light image label of the pedestrian, y^I the infrared image label of the pedestrian, f^V the visible light features, f^I the infrared features, \hat{f}^V the visible light reconstruction features, \hat{f}^I the infrared reconstruction features, * \in \{V, I\} indexes the visible light or infrared modality, m_* the memory items, A^V the visible light normalization vector, and A^I the infrared normalization vector.
According to an embodiment of the present invention, there is provided an electronic device comprising:
one or more processors; and
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the cross-modal pedestrian re-identification method based on the modality-specific memory network as described above.
According to an embodiment of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the cross-modal pedestrian re-identification method based on the modality-specific memory network as described above.
According to an embodiment of the present invention, there is provided a computer program product comprising a computer program that, when executed by a processor, implements the cross-modal pedestrian re-identification method based on the modality-specific memory network as described above.
The cross-modal pedestrian re-identification method provided by the invention processes cross-modal pedestrian features with a pre-trained cross-modal re-identification model based on a modality-specific memory network, thereby establishing the correspondence between the visible light modality features and the infrared modality features of a pedestrian and achieving cross-modal pedestrian re-identification with high recognition accuracy and good recognition efficiency.
Drawings
FIG. 1 is a flow chart of a cross-modality pedestrian re-identification method based on a modality specific memory network according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of training a cross-modal model based on a modal specific memory network in accordance with an embodiment of the invention;
FIG. 3 is a flow chart for acquiring a pedestrian multi-modal reconstruction feature in accordance with an embodiment of the invention;
FIG. 4 is a flow chart of acquiring a multimodal unified characterization of a pedestrian in accordance with an embodiment of the invention;
FIG. 5 is a training framework diagram of a cross-modal model based on a modal specific memory network in accordance with an embodiment of the present invention;
FIG. 6 schematically illustrates a block diagram of an electronic device suitable for the cross-modal pedestrian re-identification method based on a modality-specific memory network and the training method of the cross-modal model based on the modality-specific memory network, according to an embodiment of the invention.
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
Existing pedestrian re-identification methods focus mainly on retrieval among visible light pedestrian images captured by ordinary cameras in daytime scenes, which can be regarded as a single-modality image matching problem. However, in poorly lit environments such as nighttime, an ordinary camera can hardly capture effective appearance information of pedestrians. To overcome this limitation, some surveillance cameras switch freely between visible light and infrared modes as lighting conditions change. It is therefore necessary to design an effective model for pedestrian retrieval between visible light and infrared images, i.e., the cross-modal pedestrian re-identification problem.
Current cross-modal pedestrian re-identification methods fall broadly into two categories: modality-shared feature learning methods and modality information compensation methods. Modality-shared feature learning methods attempt to embed images of different modalities into a shared feature space. However, because the appearance of visible light and infrared images differs greatly, directly embedding images of different modalities into a shared feature space remains challenging. Moreover, since such methods treat modality information, such as the color of a visible light image, as redundant, the discriminability of their feature representations is limited. Modality information compensation methods were proposed to address this problem; their goal is to compensate for the information of one modality using the input of the other. However, because such models take only a single-modality input, it is difficult for them to fill in the missing modality information and thereby resolve the modality gap.
In view of the above, the present application provides a training method for a cross-modal model based on a modality-specific memory network, a pedestrian re-identification method, and an electronic device. The pedestrian re-identification method obtains the cross-modal model through the training method, thereby compensating for missing modality information, alleviating the modality gap in cross-modal pedestrian re-identification, and determining whether pedestrian images of different modalities belong to the same pedestrian.
In the technical solution of the invention, the acquisition, storage, and use of the pedestrian information involved all comply with the relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
FIG. 1 is a flow chart of a cross-modality pedestrian re-identification method based on a modality specific memory network according to an embodiment of the present invention.
As shown in fig. 1, the pedestrian re-identification method includes operations S110 to S120.
In operation S110, a pedestrian image to be re-identified and a re-identification type are acquired.
In operation S120, the pedestrian image to be re-identified is processed, according to the re-identification type, with the cross-modal model based on the modality-specific memory network to obtain a re-identification result.
FIG. 2 is a flow chart of a training method to obtain a cross-modal model based on a modal specific memory network, according to an embodiment of the invention.
As shown in fig. 2, the training method of the cross-modal model based on the modality-specific memory network includes operations S210 to S250.
In operation S210, the visible light image and the infrared image of the pedestrian are processed separately by the feature extraction module to obtain a visible light image feature map and an infrared image feature map.
The feature extraction module preferably employs a dual-stream convolutional neural network. Its first two convolutional blocks are modality-specific (e.g., a block dedicated to processing visible light) and capture modality-specific low-level features; low-level features have higher resolution and contain more positional detail, but carry less semantic information and more noise because they have passed through fewer convolutions. The parameters of the deep convolutional blocks are modality-shared, i.e., common to both the visible light and infrared streams.
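The parameter layout of such a dual-stream extractor can be sketched as follows. This is a structural illustration only: plain linear maps stand in for the convolutional blocks, and all dimensions are arbitrary assumptions, not the patented architecture.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class DualStreamExtractor:
    """Structural sketch: the shallow block has separate parameters per
    modality ("V" for visible light, "I" for infrared), while the deep
    block shares one set of parameters between both modalities."""

    def __init__(self, rng, d_in=16, d_mid=32, d_out=64):
        # Modality-specific shallow parameters, one matrix per modality.
        self.shallow = {"V": rng.normal(size=(d_in, d_mid)) * 0.1,
                        "I": rng.normal(size=(d_in, d_mid)) * 0.1}
        # Modality-shared deep parameters.
        self.deep = rng.normal(size=(d_mid, d_out)) * 0.1

    def extract(self, x, modality):
        h = relu(x @ self.shallow[modality])  # modality-specific stem
        return h @ self.deep                  # shared deep block
```

Either modality is routed through its own stem but the same deep block, which is the parameter-sharing pattern the text describes.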
In operation S220, each divided part of the visible light image feature map is average-pooled to obtain the visible light features, and each divided part of the infrared image feature map is average-pooled to obtain the infrared features.
In operation S230, the visible light features and infrared features of the pedestrian are reconstructed with the modality-specific memory network module to obtain the visible light reconstruction features and infrared reconstruction features of the pedestrian.
The modality-specific memory network module stores prototype features of each modality (visible light or infrared) and is used to store and transfer the visible light reconstruction features and infrared reconstruction features of pedestrians.
In operation S240, the visible light features, infrared features, visible light reconstruction features, and infrared reconstruction features of the pedestrian are processed with the unified feature alignment module to obtain the multi-modal unified representation of the pedestrian.
In operation S250, the cross-modal model is optimized according to the preset loss function using the visible light features, infrared features, visible light reconstruction features, infrared reconstruction features, and multi-modal unified representation of the pedestrian, until the value of the preset loss function satisfies the preset condition, yielding the trained cross-modal model based on the modality-specific memory network.
In the training method described above, the visible light image and the infrared image of a pedestrian are processed to obtain visible light and infrared image features; the modality-specific memory network reconstructs these features to obtain the pedestrian's visible light and infrared reconstruction features; the unified alignment module processes the reconstruction features to obtain unified visible light and infrared representations; and the cross-modal model is then trained and optimized with the multi-modal features and the preset loss function. Iterative training yields a cross-modal model, based on the modality-specific memory network, with high recognition accuracy and good recognition performance.
The cross-modal pedestrian re-identification method provided by the invention processes cross-modal pedestrian features with this pre-trained cross-modal re-identification model based on the modality-specific memory network, thereby establishing the correspondence between the visible light modality features and the infrared modality features of a pedestrian and achieving cross-modal pedestrian re-identification with high recognition accuracy and good recognition efficiency.
The acquisition of the visible light features and infrared features of the pedestrian is described in detail below with reference to specific embodiments.
For a given pair of images (a visible light image and an infrared image of a pedestrian), a visible light feature map F^V \in \mathbb{R}^{H \times W \times C} and an infrared feature map F^I \in \mathbb{R}^{H \times W \times C} can be extracted, where H, W, and C denote the height, width, and number of channels of the feature map, respectively. F^V and F^I are then divided horizontally into K parts, and each part is average-pooled to obtain local feature vectors f_k^V and f_k^I, where k = 1, 2, …, K.
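The horizontal partition and average pooling step can be sketched in a few lines of numpy. The sketch assumes a channels-first (C, H, W) layout for convenience and that the height H divides evenly into K stripes.

```python
import numpy as np

def part_pool(feature_map: np.ndarray, K: int) -> np.ndarray:
    """Split a (C, H, W) feature map into K horizontal stripes and
    average-pool each stripe into a C-dimensional local feature.

    Returns a (K, C) array: one local feature vector per stripe."""
    C, H, W = feature_map.shape
    assert H % K == 0, "sketch assumes the height divides evenly by K"
    # Reshape the height axis into (K stripes, H // K rows per stripe).
    stripes = feature_map.reshape(C, K, H // K, W)
    # Average over the spatial extent of each stripe, then put K first.
    return stripes.mean(axis=(2, 3)).T
```

Applied once per modality, this yields the local features f_k^V and f_k^I used throughout the rest of the method.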
FIG. 3 is a flow chart for acquiring pedestrian multi-modal reconstruction features in accordance with an embodiment of the invention.
As shown in fig. 3, processing the multi-modal features of the pedestrian with the modality-specific memory network module to obtain the multi-modal reconstruction features includes operations S310 to S370.
In operation S310, the visible light characteristic and the infrared characteristic are processed by using the modality specific memory network, respectively, to obtain a visible light memory item and an infrared memory item.
A memory item is an entry of the modality-specific memory network; concretely, the memory network stores a set of representative prototype samples.
In operation S320, the cosine similarity between the visible light characteristic and the visible light memory term is calculated, so as to obtain the visible light cosine similarity.
In operation S330, the visible light cosine similarity is normalized to obtain a visible light normalized vector.
The visible light normalization vector A^V = (a_{k,1}^V, …, a_{k,N}^V) is given by formula (3).
In operation S340, an infrared reconstruction feature is obtained from the infrared memory term and the visible light normalization vector.
In operation S350, the cosine similarity between the infrared feature and the infrared memory term is calculated to obtain the infrared cosine similarity.
In operation S360, the infrared cosine similarity is normalized to obtain an infrared normalized vector.
The infrared normalization vector A^I = (a_{k,1}^I, …, a_{k,N}^I) is given by formula (6).
In operation S370, a visible light reconstruction feature is obtained from the visible light memory term and the infrared normalization vector.
The above-described multi-modal reconstruction feature of obtaining a pedestrian is described in further detail below in connection with the detailed description.
The modality-specific memory network module accurately stores and transfers information between the visible light and infrared modalities and yields a unified feature representation. Given an input image (e.g., a visible light or infrared image), the memory network can be read to reconstruct the missing modality features; for example, given a visible light image, its infrared features can be reconstructed. To this end, modality-specific memory items M^V = \{m_{k,n}^V\} and M^I = \{m_{k,n}^I\} are introduced, where N denotes the number of memory items each part uses to model local variation. The modality-specific memory items are arranged in pairs, each pair corresponding to prototype features of the visible light and infrared modalities.
According to an embodiment of the present invention, the visible light cosine similarity is determined by formula (1):

d\big(f_k^V, m_{k,n}^V\big) = \frac{(f_k^V)^\top m_{k,n}^V}{\lVert f_k^V \rVert \, \lVert m_{k,n}^V \rVert} \tag{1}

where f_k^V denotes the k-th visible light local feature and m_{k,n}^V denotes the n-th visible light memory item of the k-th part.

The infrared reconstruction feature is determined by formula (2):

\hat{f}_k^I = \sum_{n=1}^{N} a_{k,n}^V \, m_{k,n}^I \tag{2}

where m_{k,n}^I denotes an infrared memory item and a_{k,n}^V denotes the n-th entry of the N-dimensional visible light normalization vector, determined by formula (3):

a_{k,n}^V = \frac{\exp\big(d(f_k^V, m_{k,n}^V)/\tau\big)}{\sum_{n'=1}^{N} \exp\big(d(f_k^V, m_{k,n'}^V)/\tau\big)} \tag{3}

where \tau denotes the visible light temperature coefficient.
According to an embodiment of the present invention, the infrared cosine similarity is determined by formula (4):

d\big(f_k^I, m_{k,n}^I\big) = \frac{(f_k^I)^\top m_{k,n}^I}{\lVert f_k^I \rVert \, \lVert m_{k,n}^I \rVert} \tag{4}

where f_k^I denotes the k-th infrared local feature and m_{k,n}^I denotes the n-th infrared memory item of the k-th part.

The visible light reconstruction feature is determined by formula (5):

\hat{f}_k^V = \sum_{n=1}^{N} a_{k,n}^I \, m_{k,n}^V \tag{5}

where m_{k,n}^V denotes a visible light memory item and a_{k,n}^I denotes the n-th entry of the N-dimensional infrared normalization vector, determined by formula (6):

a_{k,n}^I = \frac{\exp\big(d(f_k^I, m_{k,n}^I)/\tau\big)}{\sum_{n'=1}^{N} \exp\big(d(f_k^I, m_{k,n'}^I)/\tau\big)} \tag{6}

where \tau denotes the infrared temperature coefficient.
The visible light reconstruction features and infrared reconstruction features of pedestrians are computed through formulas (1)-(6). During cross-modal recognition, these reconstruction features allow the two modalities to be mapped onto and compared with each other, which improves cross-modal recognition efficiency.
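The read-and-reconstruct step defined by formulas (1)-(6) can be sketched in numpy as follows. Feature dimensions, the temperature value, and the epsilon guard against zero norms are illustrative assumptions; the function covers both directions, since formulas (4)-(6) mirror (1)-(3) with the modalities swapped.

```python
import numpy as np

def reconstruct_missing_modality(f, M_same, M_other, tau=0.1):
    """Reconstruct the missing-modality feature for one part.

    f:       (C,)   local feature of the input modality
    M_same:  (N, C) memory items of the input modality
    M_other: (N, C) paired memory items of the missing modality
    """
    # Formulas (1)/(4): cosine similarity between f and each memory item.
    sims = (M_same @ f) / (np.linalg.norm(M_same, axis=1)
                           * np.linalg.norm(f) + 1e-12)
    # Formulas (3)/(6): temperature-scaled softmax -> normalization vector.
    logits = sims / tau
    a = np.exp(logits - logits.max())
    a /= a.sum()
    # Formulas (2)/(5): weighted sum over the paired memory items of the
    # other modality yields the reconstructed missing-modality feature.
    return a @ M_other
```

With a small temperature, an input that matches one memory item closely reads back essentially the paired item of the other modality, which is the transfer behavior the module relies on.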
FIG. 4 is a flow chart for obtaining a multimodal unified characterization of a pedestrian in accordance with an embodiment of the invention.
As shown in fig. 4, processing the multi-modal features and multi-modal reconstruction features of the pedestrian with the unified feature alignment module to obtain the multi-modal unified representation includes operations S410 to S420.
In operation S410, the visible light features and the infrared reconstruction features are fused with the unified feature alignment module to obtain the visible light unified representation.
In operation S420, the infrared features and the visible light reconstruction features are fused with the unified feature alignment module to obtain the infrared unified representation.
After the reconstructed missing-modality features of the pedestrian are obtained, they are fused with the input features to obtain the unified feature representations: \tilde{f}^V = h(f^V, \hat{f}^I) and \tilde{f}^I = h(f^I, \hat{f}^V), where \tilde{f}^V denotes the visible light unified representation, \tilde{f}^I denotes the infrared unified representation, and h(\cdot) is a fusion layer consisting of a linear layer and a batch normalization layer. By fusing the original features with the reconstructed modality features, the visible light and infrared images are naturally embedded into a common feature space.
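A minimal numpy sketch of the fusion layer h(.) follows. It assumes the two features are combined by element-wise addition before the linear layer (the text says the reconstructed features are "added into" the input features), and it uses batch normalization without learnable affine parameters; both choices are assumptions about details the text leaves open.

```python
import numpy as np

def fuse_unified(F, F_rec, W, b, eps=1e-5):
    """Sketch of the fusion layer h(.): combine, linear layer, batch norm.

    F, F_rec: (B, C) original and reconstructed missing-modality features
    W: (C, D) linear-layer weight; b: (D,) bias
    """
    x = (F + F_rec) @ W + b                  # combine, then linear layer
    mu, var = x.mean(axis=0), x.var(axis=0)  # batch statistics
    return (x - mu) / np.sqrt(var + eps)     # batch normalization
```

Because both modalities pass through the same h(.), their unified representations land in one common feature space, per-dimension centered by the batch normalization.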
According to an embodiment of the present invention, the preset loss function is determined by formula (7):

\mathcal{L} = \mathcal{L}_{\mathrm{uni}} + \mathcal{L}_{\mathrm{feat}} + \mathcal{L}_{\mathrm{ctri}} + \mathcal{L}_{\mathrm{con}} + \lambda_{\mathrm{align}} \mathcal{L}_{\mathrm{align}} + \lambda_{\mathrm{dis}} \mathcal{L}_{\mathrm{dis}} + \lambda_{\mathrm{rec}} \mathcal{L}_{\mathrm{rec}} \tag{7}

where \mathcal{L}_{\mathrm{uni}} denotes the unified-representation classification loss function, \mathcal{L}_{\mathrm{feat}} the modal feature classification loss function, \mathcal{L}_{\mathrm{ctri}} the center triplet loss function, \mathcal{L}_{\mathrm{con}} the reconstruction consistency loss function, \mathcal{L}_{\mathrm{align}} the modality-specific memory item alignment loss function, \mathcal{L}_{\mathrm{dis}} the modality-specific memory item discrimination loss function, \mathcal{L}_{\mathrm{rec}} the reconstruction loss function, and \lambda_{\mathrm{align}}, \lambda_{\mathrm{dis}}, \lambda_{\mathrm{rec}} the weighting coefficients of the last three terms, respectively.
These loss functions improve the optimization efficiency and effectiveness of training the cross-modal model based on the modality-specific memory network.
According to an embodiment of the present invention, the unified-representation classification loss function is determined by formula (8); it is used to predict the identity of pedestrians from the unified representations.
The modal feature classification loss function is determined by formula (9); it is used to discriminate the local features of the two modalities (visible light and infrared).
The reconstruction consistency loss function is determined by formula (10); it makes the features reconstructed by the memory network consistent with the features extracted by the backbone network, using two modality discriminators D^V and D^I to classify the reconstructed modality features \hat{f}^V and \hat{f}^I.
Wherein the reconstruction loss function is determined by equation (11):
the reconstruction loss function described above is used to ensure that input features can be reconstructed from memory terms from the same modality. Firstly, obtaining a reconstructed input characteristic:the euclidean distance between the input feature and the reconstructed input feature is then minimized.
Wherein the modality specific memory term loss function is determined by equation (12):
the mode-specific memory term loss function is used for aligning the corresponding relation between the memory terms of the visible light and infrared modes, wherein D KL (. Cndot.) represents the KL divergence.
Wherein the mode specific memory term discrimination loss function is determined by equation (13):
since the memory items store the prototype features of each modality, they should have sufficient recognition to represent the various modes of the pedestrian image. The mode specific memory term discrimination loss function is used for enabling the multi-mode memory term to have the resolution.
Wherein $y^{V}$ represents the visible-light image label of the pedestrian, $y^{I}$ represents the infrared image label of the pedestrian, $f^{V}$ represents the visible-light feature, $f^{I}$ represents the infrared feature, $\tilde{f}^{V}$ represents the visible-light reconstruction feature, $\tilde{f}^{I}$ represents the infrared reconstruction feature, $E \in \{V, I\}$ indexes the visible-light or infrared modality, $m^{*}$ represents the memory terms, $A^{V}$ represents the visible-light normalization vector, and $A^{I}$ represents the infrared normalization vector.
FIG. 5 is a training framework diagram of a cross-modal model based on a modality specific memory network according to an embodiment of the present invention.
The training process of the model is described in further detail below in conjunction with fig. 5.
As shown in fig. 5, the inputs to the model are a visible-light image and an infrared image of a pedestrian. First, the feature extraction module of the model processes the visible-light and infrared images respectively to capture the visible-light and infrared features of the pedestrian; during this process, the associated loss functions (e.g., those of the discriminators $D_{V}$ and $D_{I}$) can be used to optimize the output of the feature extraction module. Second, the visible-light features and infrared features are input into the modality-specific memory network module (a modality-specific memory network module is a neural network dedicated to a particular modality, such as a network dedicated to processing the visible-light modality); this module obtains the specific memory terms of the different modalities (e.g., the visible-light memory terms) and produces the multi-modality reconstruction features. Finally, the unified feature alignment module fuses the multi-modality reconstruction features with the multi-modality features to obtain the multi-modality unified characterization. The training framework requires no image generation process, and the whole network can be trained end to end. The method alleviates the modality-discrepancy problem by supplementing the missing-modality features through the modality-specific memory network: the missing-modality features can be supplemented from a single-modality input alone, and a unified feature space is obtained by aggregating the original and supplemented modality features.
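The cross-modal reconstruction step of this pipeline can be sketched end to end in NumPy; all sizes, the random features standing in for backbone outputs, and the function names below are hypothetical:

```python
import numpy as np

def softmax(x, tau=0.1):
    e = np.exp((x - x.max()) / tau)
    return e / e.sum()

def cross_modal_read(f, M_same, M_other, tau=0.1):
    """Address the SAME-modality memory with the input feature, then read
    the OTHER modality's memory with the resulting weights, supplementing
    the missing modality from a single-modality input."""
    M_norm = M_same / np.linalg.norm(M_same, axis=1, keepdims=True)
    sim = M_norm @ (f / np.linalg.norm(f))  # cosine similarities to memory items
    A = softmax(sim, tau)                   # normalization (address) vector
    return A @ M_other                      # reconstructed missing-modality feature

rng = np.random.default_rng(3)
n, d = 16, 8                                 # hypothetical memory size / feature dim
M_v = rng.normal(size=(n, d))                # visible-light memory items
M_i = rng.normal(size=(n, d))                # infrared memory items
f_v = rng.normal(size=d)                     # visible-light input feature
f_i_recon = cross_modal_read(f_v, M_v, M_i)  # supplemented infrared feature
u_v = np.concatenate([f_v, f_i_recon])       # pre-fusion unified representation
```

Because the address vector is computed from the input's own modality, only a single-modality image is needed at inference time, which is the property the paragraph above emphasizes.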
According to the pedestrian re-identification method above, a trained cross-modal model based on the modality-specific memory network is obtained, and pedestrians are re-identified with it. The missing-modality information of a pedestrian can thus be complemented from a single-modality input image, so that whether pedestrian images of different modalities belong to the same pedestrian can be judged, improving the accuracy of pedestrian re-identification. The method can be widely applied in scenarios such as security systems and smart cities; it can also be installed on front-end devices in software form to provide real-time visible-light/near-infrared pedestrian image matching, or deployed on a company's back-end server to provide large-scale visible-light/near-infrared pedestrian image search and matching.
FIG. 6 schematically illustrates a block diagram of an electronic device adapted for a method of cross-modal pedestrian re-recognition based on a modal specific memory network and a method of training based on a cross-modal model of the modal specific memory network, in accordance with an embodiment of the invention.
As shown in fig. 6, an electronic device 600 according to an embodiment of the present invention includes a processor 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 601 may also include on-board memory for caching purposes. Processor 601 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the invention.
In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. The processor 601 performs various operations of the method flow according to an embodiment of the present invention by executing programs in the ROM 602 and/or the RAM 603. Note that the program may be stored in one or more memories other than the ROM 602 and the RAM 603. The processor 601 may also perform various operations of the method flow according to embodiments of the present invention by executing programs stored in one or more memories.
According to an embodiment of the invention, the electronic device 600 may also include an input/output (I/O) interface 605, the input/output (I/O) interface 605 also being connected to the bus 604. The electronic device 600 may also include one or more of the following components connected to the I/O interface 605: an input portion 606 including a keyboard, mouse, etc.; an output portion 607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The drive 610 is also connected to the I/O interface 605 as needed. Removable media 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on drive 610 so that a computer program read therefrom is installed as needed into storage section 608.
The present invention also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present invention.
According to embodiments of the present invention, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the invention, the computer-readable storage medium may include ROM 602 and/or RAM 603 and/or one or more memories other than ROM 602 and RAM 603 described above.
Embodiments of the present invention also include a computer program product comprising a computer program containing program code for performing the method shown in the flowcharts. When the computer program product runs in a computer system, the program code is used for enabling the computer system to realize the cross-mode pedestrian re-identification method based on the mode specific memory network and the training method based on the cross-mode model of the mode specific memory network.
The above-described functions defined in the system/apparatus of the embodiment of the present invention are performed when the computer program is executed by the processor 601. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the invention.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed through the communication section 609, and/or installed from the removable medium 611. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to wireless and wired media, or any suitable combination of the foregoing.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention and are not intended to limit the scope of the invention thereto.

Claims (10)

1. A cross-mode pedestrian re-identification method based on a mode specific memory network comprises the following steps:
acquiring a pedestrian image to be re-identified and a re-identification type;
according to the re-recognition type, the pedestrian image to be re-recognized is processed by using the cross-modal model based on the modal specific memory network to obtain a re-recognition result, wherein the cross-modal model based on the modal specific memory network is trained by the following method:
the method comprises the steps that a feature extraction module is utilized to respectively process a visible light image and an infrared image of a pedestrian to obtain a visible light image feature map and an infrared image feature map;
carrying out average pooling on each divided part in the visible light image characteristic graph to obtain visible light characteristics, and carrying out average pooling on each divided part in the infrared image characteristic graph to obtain infrared characteristics;
reconstructing the visible light features and the infrared features of the pedestrian by using a mode specific memory network module to obtain visible light reconstruction features and infrared reconstruction features of the pedestrian, wherein the mode specific memory network module is used for storing and transmitting the visible light reconstruction features and the infrared reconstruction features of the pedestrian;
processing the visible light features, the infrared features, the visible light reconstruction features and the infrared reconstruction features of the pedestrians by using a unified feature alignment module to obtain multi-mode unified characterization of the pedestrians, wherein the multi-mode unified characterization comprises visible light unified characterization and infrared unified characterization;
and optimizing a cross-modal model according to a preset loss function by utilizing the visible light characteristic, the infrared characteristic, the visible light reconstruction characteristic, the infrared reconstruction characteristic and the multi-modal unified characterization of the pedestrian until the value of the preset loss function meets a preset condition, so as to obtain the trained cross-modal model based on the modal specific memory network.
2. The method of claim 1, wherein the reconstructing the visible light features and the infrared features of the pedestrian using a modality specific memory network module, resulting in visible light reconstructed features and infrared reconstructed features of the pedestrian, comprises:
respectively processing the visible light characteristic and the infrared characteristic by using the mode specific memory network to obtain a visible light memory item and an infrared memory item;
calculating cosine similarity of the visible light characteristics and the visible light memory term to obtain visible light cosine similarity;
carrying out normalization processing on the visible light cosine similarity to obtain a visible light normalization vector;
obtaining the infrared reconstruction characteristics according to the infrared memory items and the visible light normalization vector;
calculating cosine similarity of the infrared characteristics and the infrared memory term to obtain infrared cosine similarity;
normalizing the infrared cosine similarity to obtain an infrared normalized vector;
and obtaining the visible light reconstruction feature according to the visible light memory term and the infrared normalization vector.
3. The method of claim 2, wherein the visible-light cosine similarity is determined by formula (1):

$d(f^{V}, m_{k}^{V}) = \dfrac{(f^{V})^{\top} m_{k}^{V}}{\|f^{V}\| \, \|m_{k}^{V}\|}$

wherein $f^{V}$ represents the visible-light feature, and $m_{k}^{V}$ represents a visible-light memory term;

wherein the infrared reconstruction feature is determined by formula (2):

$\tilde{f}^{I} = \sum_{k=1}^{n} A_{k}^{V} m_{k}^{I}$

wherein $m_{k}^{I}$ represents an infrared memory term, $A_{k}^{V}$ represents the k-th value of the n-dimensional visible-light normalization vector, and $A_{k}^{V}$ is determined by formula (3):

$A_{k}^{V} = \dfrac{\exp\!\big(d(f^{V}, m_{k}^{V})/\tau\big)}{\sum_{j=1}^{n} \exp\!\big(d(f^{V}, m_{j}^{V})/\tau\big)}$

wherein τ represents the visible-light temperature coefficient.
4. The method of claim 2, wherein the infrared cosine similarity is determined by formula (4):

$d(f^{I}, m_{k}^{I}) = \dfrac{(f^{I})^{\top} m_{k}^{I}}{\|f^{I}\| \, \|m_{k}^{I}\|}$

wherein $f^{I}$ represents the infrared feature, and $m_{k}^{I}$ represents an infrared memory term;

wherein the visible-light reconstruction feature is determined by formula (5):

$\tilde{f}^{V} = \sum_{k=1}^{n} A_{k}^{I} m_{k}^{V}$

wherein $m_{k}^{V}$ represents a visible-light memory term, $A_{k}^{I}$ represents the k-th value of the n-dimensional infrared normalization vector, and $A_{k}^{I}$ is determined by formula (6):

$A_{k}^{I} = \dfrac{\exp\!\big(d(f^{I}, m_{k}^{I})/\tau\big)}{\sum_{j=1}^{n} \exp\!\big(d(f^{I}, m_{j}^{I})/\tau\big)}$

wherein τ represents the infrared temperature coefficient.
5. The method of claim 1, wherein the processing the visible light features, the infrared features, the visible light reconstruction features, and the infrared reconstruction features of the pedestrian with a unified feature alignment module, resulting in a multimodal unified characterization of the pedestrian comprises:
the visible light features and the infrared reconstruction features are fused by utilizing a unified feature alignment module, so that visible light unified characterization is obtained;
and fusing the infrared features and the visible light reconstruction features by using a unified feature alignment module to obtain infrared unified characterization.
6. The method of claim 1, wherein the preset loss function is determined by equation (7):
$\mathcal{L} = \mathcal{L}_{id}^{u} + \mathcal{L}_{id}^{f} + \mathcal{L}_{ctri} + \mathcal{L}_{cons} + \lambda_{align}\mathcal{L}_{align} + \lambda_{dis}\mathcal{L}_{dis} + \lambda_{rec}\mathcal{L}_{rec}$

wherein $\mathcal{L}_{id}^{u}$ represents the modality unified characterization classification loss function, $\mathcal{L}_{id}^{f}$ represents the modality feature classification loss function, $\mathcal{L}_{ctri}$ represents the central triplet loss function, $\mathcal{L}_{cons}$ represents the reconstruction consistency loss function, $\mathcal{L}_{align}$ represents the modality-specific memory term loss function, $\mathcal{L}_{dis}$ represents the modality-specific memory term discrimination loss function, $\mathcal{L}_{rec}$ represents the reconstruction loss function, $\lambda_{align}$ represents the weighting coefficient of the modality-specific memory term loss function, $\lambda_{dis}$ represents the weighting coefficient of the modality-specific memory term discrimination loss function, and $\lambda_{rec}$ represents the weighting coefficient of the reconstruction loss function.
7. The method of claim 6, wherein the modality unified characterization classification loss function is determined by equation (8):
wherein the modal feature classification loss function is determined by equation (9):
wherein the reconstruction consistent loss function is determined by equation (10):
wherein the reconstruction loss function is determined by equation (11):
wherein the modality specific memory term loss function is determined by equation (12):
wherein the mode specific memory term discrimination loss function is determined by equation (13):
wherein $y^{V}$ represents the visible-light image label of the pedestrian, $y^{I}$ represents the infrared image label of the pedestrian, $f^{V}$ represents the visible-light feature, $f^{I}$ represents the infrared feature, $\tilde{f}^{V}$ represents the visible-light reconstruction feature, $\tilde{f}^{I}$ represents the infrared reconstruction feature, $E \in \{V, I\}$ indexes the visible-light or infrared modality, $m^{*}$ represents the memory terms, $A^{V}$ represents the visible-light normalization vector, and $A^{I}$ represents the infrared normalization vector.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.
9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-7.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202210426984.4A 2022-04-21 2022-04-21 Cross-modal pedestrian re-identification method based on modal specific memory network Active CN114882525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210426984.4A CN114882525B (en) 2022-04-21 2022-04-21 Cross-modal pedestrian re-identification method based on modal specific memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210426984.4A CN114882525B (en) 2022-04-21 2022-04-21 Cross-modal pedestrian re-identification method based on modal specific memory network

Publications (2)

Publication Number Publication Date
CN114882525A CN114882525A (en) 2022-08-09
CN114882525B true CN114882525B (en) 2024-04-02

Family

ID=82671510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210426984.4A Active CN114882525B (en) 2022-04-21 2022-04-21 Cross-modal pedestrian re-identification method based on modal specific memory network

Country Status (1)

Country Link
CN (1) CN114882525B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018120936A1 (en) * 2016-12-27 2018-07-05 Zhejiang Dahua Technology Co., Ltd. Systems and methods for fusing infrared image and visible light image
CN112016401A (en) * 2020-08-04 2020-12-01 杰创智能科技股份有限公司 Cross-modal-based pedestrian re-identification method and device
CN114220124A (en) * 2021-12-16 2022-03-22 华南农业大学 Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system
CN114241517A (en) * 2021-12-02 2022-03-25 河南大学 Cross-modal pedestrian re-identification method based on image generation and shared learning network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018120936A1 (en) * 2016-12-27 2018-07-05 Zhejiang Dahua Technology Co., Ltd. Systems and methods for fusing infrared image and visible light image
CN112016401A (en) * 2020-08-04 2020-12-01 杰创智能科技股份有限公司 Cross-modal-based pedestrian re-identification method and device
WO2022027986A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Cross-modal person re-identification method and device
CN114241517A (en) * 2021-12-02 2022-03-25 河南大学 Cross-modal pedestrian re-identification method based on image generation and shared learning network
CN114220124A (en) * 2021-12-16 2022-03-22 华南农业大学 Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Score-based fusion algorithm for near-infrared and visible-light images; Pan Lei, Yin Yilong, Li Xuzhou; Computer Engineering; 2013-04-15 (04); full text *
Research on cross-modal pedestrian re-identification based on generative adversarial networks; Feng Min, Zhang Zhicheng, Lv Jin, Yu Lei, Han Bin; Modern Information Technology; 2020-02-25 (04); full text *

Also Published As

Publication number Publication date
CN114882525A (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN112016401B (en) Cross-mode pedestrian re-identification method and device
US20170213080A1 (en) Methods and systems for automatically and accurately detecting human bodies in videos and/or images
US10445602B2 (en) Apparatus and method for recognizing traffic signs
CN111539255B (en) Cross-modal pedestrian re-identification method based on multi-modal image style conversion
WO2019001481A1 (en) Vehicle appearance feature identification and vehicle search method and apparatus, storage medium, and electronic device
CN111797653A (en) Image annotation method and device based on high-dimensional image
Lee et al. Place recognition using straight lines for vision-based SLAM
Varghese et al. An efficient algorithm for detection of vacant spaces in delimited and non-delimited parking lots
CN111814690B (en) Target re-identification method, device and computer readable storage medium
CN107766864B (en) Method and device for extracting features and method and device for object recognition
CN103383732A (en) Image processing method and device
CN111373393B (en) Image retrieval method and device and image library generation method and device
CN112861695A (en) Pedestrian identity re-identification method and device, electronic equipment and storage medium
CN112990152A (en) Vehicle weight identification method based on key point detection and local feature alignment
Gu et al. Embedded and real-time vehicle detection system for challenging on-road scenes
CN115620090A (en) Model training method, low-illumination target re-recognition method and device and terminal equipment
Ying et al. Tyre pattern image retrieval–current status and challenges
CN113449676B (en) Pedestrian re-identification method based on two-way interaction-based disentanglement learning
CN116229406B (en) Lane line detection method, system, electronic equipment and storage medium
CN112861776A (en) Human body posture analysis method and system based on dense key points
CN114550220B (en) Training method of pedestrian re-recognition model and pedestrian re-recognition method
CN114882525B (en) Cross-modal pedestrian re-identification method based on modal specific memory network
CN112633089B (en) Video pedestrian re-identification method, intelligent terminal and storage medium
CN112380369B (en) Training method, device, equipment and storage medium of image retrieval model
Genovese et al. Driver attention assistance by pedestrian/cyclist distance estimation from a single RGB image: A CNN-based semantic segmentation approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant