CN114882525A

CN114882525A - Cross-modal pedestrian re-identification method based on modal specific memory network

Info

Publication number: CN114882525A
Application number: CN202210426984.4A
Authority: CN
Inventors: 张天柱; 刘翔; 张勇东; 李昱霖; 吴枫
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2022-04-21
Filing date: 2022-04-21
Publication date: 2022-08-09
Anticipated expiration: 2042-04-21
Also published as: CN114882525B

Abstract

The invention provides a cross-modal pedestrian re-identification method based on a modal specific memory network, which comprises the following steps: acquiring a pedestrian image to be re-identified and a re-identification type; and processing the pedestrian image to be re-identified by using a cross-modal pedestrian re-identification model based on a modal specific memory network according to the re-identification type to obtain a re-identification result. The invention also provides electronic equipment, a storage medium and a computer program product for realizing the cross-modal pedestrian re-identification method based on the modal specific memory network.

Description

Cross-modal pedestrian re-identification method based on modal specific memory network

Technical Field

The invention relates to the field of computer vision, in particular to a cross-modal pedestrian re-identification method, a re-identification device, electronic equipment and a storage medium based on a modal specific memory network.

Background

Pedestrian re-identification is a technique to match pedestrian images at different camera perspectives. The pedestrian re-identification technology can be combined with pedestrian detection and pedestrian tracking technologies, and is widely applied to video monitoring, intelligent security, criminal investigation and the like.

However, the method for re-identifying the pedestrian in the prior art cannot fully utilize the cross-modal information of the pedestrian for identification, or the cross-modal identification method has the problems of low identification accuracy, poor identification effect and the like.

Disclosure of Invention

In view of the foregoing, the present invention provides a method, an electronic device, a storage medium, and a computer program product for cross-modal model training based on a modality-specific memory network, so as to solve at least one of the above problems.

According to an embodiment of the present invention, a cross-modal pedestrian re-identification method based on a modal specific memory network is provided, including:

acquiring a pedestrian image to be re-identified and a re-identification type;

according to the re-recognition type, processing the pedestrian image to be re-recognized by using a cross-modal model based on a modal specific memory network to obtain a re-recognition result, wherein the cross-modal model based on the modal specific memory network is obtained by training in the following method:

respectively processing a visible light image and an infrared image of a pedestrian by using a feature extraction module to obtain a visible light image feature map and an infrared image feature map;

carrying out average pooling on each segmentation part in the visible light image feature map to obtain visible light features, and carrying out average pooling on each segmentation part in the infrared image feature map to obtain infrared features;

reconstructing visible light characteristics and infrared characteristics of the pedestrian by using a mode specific memory network module to obtain visible light reconstruction characteristics and infrared reconstruction characteristics of the pedestrian, wherein the mode specific memory network module is used for storing and transmitting the visible light reconstruction characteristics and the infrared reconstruction characteristics of the pedestrian;

processing visible light characteristics, infrared characteristics, visible light reconstruction characteristics and infrared reconstruction characteristics of the pedestrians by using the unified characteristic alignment module to obtain multi-modal unified characteristics of the pedestrians, wherein the multi-modal unified characteristics comprise the visible light unified characteristics and the infrared unified characteristics;

and optimizing a cross-modal model according to a preset loss function by utilizing the visible light characteristic, the infrared characteristic, the visible light reconstruction characteristic, the infrared reconstruction characteristic and the multi-modal unified representation of the pedestrian until the value of the preset loss function meets a preset condition, and obtaining the trained cross-modal model based on the modal specific memory network.

According to an embodiment of the present invention, the reconstructing the visible light feature and the infrared feature of the pedestrian by using the mode specific memory network module to obtain the visible light reconstruction feature and the infrared reconstruction feature of the pedestrian includes:

respectively processing the visible light characteristics and the infrared characteristics by using a modal specific memory network to obtain visible light memory items and infrared memory items;

calculating the cosine similarity of the visible light characteristics and the visible light memory term to obtain the cosine similarity of the visible light;

carrying out normalization processing on the visible light cosine similarity to obtain a visible light normalization vector;

acquiring infrared reconstruction characteristics according to the infrared memory term and the visible light normalization vector;

calculating the cosine similarity of the infrared features and the infrared memory items to obtain the infrared cosine similarity;

carrying out normalization processing on the infrared cosine similarity to obtain an infrared normalized vector;

and obtaining visible light reconstruction characteristics according to the visible light memory term and the infrared normalized vector.

According to the embodiment of the present invention, the above-mentioned visible light cosine similarity is determined by formula (1):

wherein the content of the first and second substances,

the characteristics of the visible light are represented,

representing a visible-light memory item;

wherein the infrared reconstruction characteristic is determined by formula (2):

wherein the content of the first and second substances,

the infrared memory item is represented by a character string,

a k-th value representing an n-dimensional visible light normalization vector,

determined by equation (3):

where τ represents the visible light temperature coefficient.

According to the embodiment of the present invention, the infrared cosine similarity is determined by formula (4):

wherein the content of the first and second substances,

the characteristic of the infrared light is represented,

representing an infrared memory item;

wherein the visible light reconstruction characteristic is determined by equation (5):

wherein the content of the first and second substances,

the visual-light memory item is represented,

a k-th value representing an n-dimensional infrared normalized vector,

determined by equation (6):

wherein τ represents an infrared temperature coefficient.

According to the embodiment of the present invention, the processing of the visible light feature, the infrared feature, the visible light reconstruction feature and the infrared reconstruction feature of the pedestrian by using the unified feature alignment module to obtain the multi-modal unified characterization of the pedestrian includes:

fusing the visible light characteristic and the infrared reconstruction characteristic by using a unified characteristic alignment module to obtain a visible light unified characteristic;

and fusing the infrared characteristic and the visible light reconstruction characteristic by using the unified characteristic alignment module to obtain an infrared unified characterization.

According to the embodiment of the present invention, the preset loss function is determined by equation (7):

wherein the content of the first and second substances,

the representation modalities uniformly characterize the classification loss function,

a function representing a modal characteristic classification penalty function,

a central triplet of loss functions is represented,

a reconstructed uniform loss function is represented that is,

representing a mode-specific memory term loss function,

representing a mode-specific memory term discriminant loss function,

representing a reconstruction loss function, λ _align Weighting factor, lambda, representing a loss function of a mode-specific memory term _dis Weighting factor, lambda, representing the discriminant loss function of the mode-specific memory term _rec Representing the weighting coefficients of the reconstruction loss function.

According to the embodiment of the present invention, the above-mentioned modal unified characterization classification loss function is determined by formula (8):

wherein the modal feature classification loss function is determined by equation (9):

wherein the reconstructed uniform loss function is determined by equation (10):

wherein the reconstruction loss function is determined by equation (11):

wherein the mode-specific memory term loss function is determined by equation (12):

wherein the mode specific memory term discriminant loss function is determined by equation (13):

wherein, y ^V Visible light image tag, y, representing a pedestrian ^I Infrared image tag of pedestrian, f ^V Representing a characteristic of visible light, f ^I The characteristic of the infrared light is represented,

the visible light reconstruction characteristic is represented and,

representing infrared reconstruction characteristics, E { V, I } representing visible light characteristics or infrared characteristicsSymbol, m ^* Representing a memory item, A ^V Normalized vector of visible light, A ^I Representing the infrared normalized vector.

According to an embodiment of the present invention, there is provided an electronic apparatus including:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform a cross-modal pedestrian re-identification method based on a modal-specific memory network as described above.

According to an embodiment of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions, which when executed by a processor, cause the processor to execute the above-mentioned cross-modal pedestrian re-identification method based on a modal-specific memory network.

According to an embodiment of the present invention, there is provided a computer program product, which includes a computer program, and when the computer program is executed by a processor, the method for cross-modal pedestrian re-identification based on a modal-specific memory network is implemented.

The cross-modal pedestrian re-identification method provided by the invention is based on the modal specific memory network, and the cross-modal characteristics of the pedestrian are processed through a pre-trained cross-modal pedestrian re-identification model based on the modal specific memory network, so that the corresponding relation between the visible light modal characteristics and the infrared modal characteristics of the pedestrian is established, and the cross-modal pedestrian re-identification with higher identification accuracy and good identification efficiency is realized.

Drawings

FIG. 1 is a flowchart of a cross-modal pedestrian re-identification method based on a modal-specific memory network according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method of training a cross-modal model based on a modal-specific memory network according to an embodiment of the invention;

FIG. 3 is a flow chart for obtaining multi-modal reconstruction features of a pedestrian according to an embodiment of the invention;

FIG. 4 is a flow diagram for obtaining a multi-modal unified characterization of a pedestrian according to an embodiment of the present invention;

FIG. 5 is a diagram of a training framework for a cross-modal model based on a modal-specific memory network, according to an embodiment of the present invention;

fig. 6 schematically shows a block diagram of an electronic device adapted for a cross-modal pedestrian re-identification method based on a modal-specific memory network and a training method of a cross-modal model based on a modal-specific memory network according to an embodiment of the invention.

Detailed Description

In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.

The existing pedestrian re-identification method mainly focuses on searching among visible light pedestrian images shot by a common camera under a daytime scene, and can be regarded as a problem of single-mode image matching. However, in an environment with poor lighting conditions such as at night, it is difficult for a general camera to capture effective appearance information of pedestrians. To overcome this limitation, some surveillance cameras can freely switch between visible and infrared modes as lighting conditions change. Therefore, it is necessary to design an effective model to realize pedestrian retrieval between visible light and infrared images, i.e., to solve the problem of cross-modal pedestrian reconstruction.

Current cross-modal pedestrian re-identification methods can be generally categorized into two categories: a modality sharing feature learning class method and a modality information completion class method. Modality-shared feature learning class methods attempt to embed images of different modalities into a shared feature space. However, since the visual appearance of visible and infrared images is very different, how to embed images of different morphologies directly into a shared feature space remains a difficult problem. Further, since the modality information such as the color of the visible light image is regarded as redundant information by this type of method, the discriminativity of the feature representation by the modality sharing feature learning type method is limited. To solve this problem, a method of modality information completion class is proposed, the object of which is to complete modality information of another with information of an input modality. However, since the model only uses a single modality for input, it is difficult to fill in the missing modality information to solve the modality difference problem.

In view of the above, the present application provides a cross-modal model training method based on a modal-specific memory network, a pedestrian re-identification method, and an electronic device. According to the pedestrian re-identification method, the cross-modal model based on the modal specific memory network is obtained through the cross-modal model training method based on the modal specific memory network, so that missing modal information completion is achieved, the problem of modal difference in cross-modal pedestrian re-identification is solved, and whether pedestrian images in different modalities belong to the same pedestrian or not is judged.

In the technical scheme of the invention, the acquisition, storage, application and the like of the related pedestrian information all accord with the regulations of related laws and regulations, necessary confidentiality measures are taken, and the customs of the public order is not violated.

Fig. 1 is a flowchart of a cross-modal pedestrian re-identification method based on a modal-specific memory network according to an embodiment of the present invention.

As shown in fig. 1, the pedestrian re-identification method includes operations S110 to S120.

In operation S110, acquiring a pedestrian image to be re-identified and a re-identification type;

in operation S120, according to the re-recognition type, the pedestrian image to be re-recognized is processed by using the cross-modal model based on the modal specific memory network, so as to obtain a re-recognition result.

FIG. 2 is a flowchart of a training method for obtaining a cross-modal model based on a modal-specific memory network, according to an embodiment of the invention.

As shown in fig. 2, the method for training the cross-modal model based on the modal-specific memory network includes operations S210 to S250.

In operation S210, the visible light image and the infrared image of the pedestrian are respectively processed by the feature extraction module to obtain a visible light image feature map and an infrared image feature map.

The feature extraction module preferably employs a dual-stream convolutional neural network whose first two convolutional blocks are modality-specific (e.g., those dedicated to processing visible light) to capture modality-specific low-level features (higher resolution of low-level features, containing more location, detail information, but less semantic, more noisy due to fewer convolutions passed) patterns, while the parameters of the deep convolutional blocks are shared by both modalities (both visible and infrared).

In operation S220, each of the segments in the visible-light image feature map is averaged and pooled to obtain visible-light features, and each of the segments in the infrared image feature map is averaged and pooled to obtain infrared features.

In operation S230, the visible light feature and the infrared feature of the pedestrian are reconstructed by using the modality specific memory network module, so as to obtain the visible light reconstruction feature and the infrared reconstruction feature of the pedestrian.

The mode specific memory network module is used for storing prototype characteristics of each mode (visible light or infrared), and simultaneously, the mode specific memory network module can be used for storing and transmitting the visible light reconstruction characteristics and the infrared reconstruction characteristics of the pedestrian

In operation S240, the visible light feature, the infrared feature, the visible light reconstruction feature, and the infrared reconstruction feature of the pedestrian are processed by using the unified feature alignment module, so as to obtain a multi-modal unified representation of the pedestrian.

In operation S250, a cross-modal model based on a modal specific memory network is obtained by optimizing the cross-modal model according to a preset loss function by using the visible light feature, the infrared feature, the visible light reconstruction feature, the infrared reconstruction feature, and the multi-modal unified representation of the pedestrian until a value of the preset loss function satisfies a preset condition.

According to the trans-modal model training method based on the modal specific memory network, the visible light image characteristics and the infrared image characteristics of pedestrians are obtained by processing the visible light images and the infrared images of the pedestrians, the visible light image characteristics and the infrared image characteristics are reconstructed by using the modal specific memory network, so that the visible light and infrared reconstruction characteristics of the pedestrians are obtained, meanwhile, the reconstruction characteristics are processed by using the uniform alignment module, the visible light and infrared uniform characteristics of the pedestrians are obtained, and then the multi-modal characteristics and the preset loss function are used for training and optimizing the trans-modal model based on the modal specific memory network; the model is optimized through iterative training, and the cross-modal model based on the modal specific memory network with high recognition accuracy and good recognition effect is obtained.

The above-described acquisition of the visible light characteristic and the infrared characteristic of the pedestrian is described in detail below with reference to specific embodiments.

For a given image (such as a visible light image of a pedestrian or an infrared image of a pedestrian), a visible light image feature map can be extracted

And infrared image feature map

Wherein, H, W, C respectively represent the height, width and number of channels of the characteristic diagram. Then F is mixed ^V And F ^I Dividing the horizontal direction into K parts, and pooling each part to obtain local feature vectors

And

wherein K is 1,2, …, K.

FIG. 3 is a flow chart for obtaining multi-modal reconstruction features of pedestrians according to an embodiment of the invention.

As shown in fig. 3, the processing of the multi-modal characteristics of the pedestrian by using the modality-specific memory network module to obtain the multi-modal reconstructed characteristics of the pedestrian includes operations S310 to S370.

In operation S310, the visible light feature and the infrared feature are respectively processed by using the modality specific memory network, so as to obtain a visible light memory item and an infrared memory item.

The memory items are each item in the modality specific memory network, and specifically, representative samples are stored in the memory network.

In operation S320, the cosine similarity between the visible light feature and the visible light memory term is calculated to obtain the visible light cosine similarity.

In operation S330, the visible light cosine similarity is normalized to obtain a visible light normalization vector.

The above formula for normalizing the vector of the visible light

And (4) showing.

In operation S340, an infrared reconstruction feature is obtained according to the infrared memory term and the visible light normalization vector.

In operation S350, the cosine similarity between the infrared feature and the infrared memory term is calculated to obtain the infrared cosine similarity.

In operation S360, the infrared cosine similarity is normalized to obtain an infrared normalized vector.

The above infrared normalized vector is represented by the formula

And (4) showing.

In operation S370, a visible light reconstruction feature is obtained according to the visible light memory term and the infrared normalization vector.

The above-described multi-modal reconstruction features for pedestrians are described in further detail below with reference to specific embodiments.

The mode specific memory network module is used for accurately storing and transmitting information between a visible light mode and an infrared mode and obtaining unificationIs shown. Given an input image (e.g., a visible light image or an infrared image), it may be read from the memory network to reconstruct missing modal features. For example, given a visible light image, its infrared signature can be reconstructed. To achieve this goal, modality-specific memory items are introduced

And

here, N denotes the number of memory items each part uses to model a local change. Modality-specific memory items (specific memory items such as those specific to visible light) are arranged in pairs, each item corresponding to a prototype feature of a visible or infrared modality.

wherein the content of the first and second substances,

the characteristics of the visible light are represented,

representing a visible-light memory item;

wherein the content of the first and second substances,

the infrared memory item is represented by a character string,

a k-th value representing an n-dimensional visible light normalization vector,

determined by equation (3):

where τ represents the visible light temperature coefficient.

wherein the content of the first and second substances,

the characteristic of the infrared light is represented,

representing an infrared memory item;

wherein the content of the first and second substances,

the visual-light memory item is represented,

a k-th value representing an n-dimensional infrared normalized vector,

determined by equation (6):

wherein τ represents an infrared temperature coefficient.

The visible light reconstruction characteristics and the infrared reconstruction characteristics of pedestrians can be respectively calculated through the formulas (1) to (6), and the multi-mode reconstruction characteristics calculated and processed according to the formulas can play a role in mutual mapping and comparison in the cross-mode recognition process, so that the cross-mode recognition efficiency is improved.

FIG. 4 is a flow diagram for obtaining a multi-modal unified characterization of a pedestrian according to an embodiment of the present invention.

As shown in fig. 4, the processing of the multi-modal features of the pedestrian and the multi-modal reconstructed features of the pedestrian by the unified feature alignment module to obtain the multi-modal unified representation of the pedestrian includes operations S410 to S420.

In operation S410, the visible light feature and the infrared reconstruction feature are fused by using the unified feature alignment module to obtain a unified visible light feature.

In operation S420, the infrared feature and the visible light reconstruction feature are fused by using the unified feature alignment module, so as to obtain an infrared unified representation.

After the reconstructed missing modal characteristics of the pedestrian are obtained, the reconstructed missing modal characteristics are added into the input characteristics to obtain unified characteristic representation:

wherein the content of the first and second substances,

it is shown that the visible light is uniformly characterized,

representing the infrared unified characterization, h (-) is a fusion layer consisting of a linear layer and a batch normalization layer. By fusing the original features and the reconstructed modality features, the visible and infrared images are naturally embedded into a common feature space.

wherein the content of the first and second substances,

a function representing a modal characteristic classification penalty function,

a central triplet of loss functions is represented,

a reconstructed uniform loss function is represented that is,

representing a mode-specific memory term loss function,

representing a mode-specific memory term discriminant loss function,

Through the various loss functions, the optimization efficiency and the optimization effect of the cross-modal model based on the modal specific memory network can be improved.

the above-described modalities collectively characterize a classification loss function for predicting the identity of a pedestrian.

the modal feature classification loss function described above is used to make local features from both modalities (visible and infrared) discriminative.

Wherein the reconstructed uniform loss function is determined by equation (10):

the reconstruction consistency loss function is used for ensuring that the characteristics of the memory network reconstruction are consistent with the characteristics extracted by the backbone network, and two mode discriminators are utilized

And

to reconstructed modal characteristics

And

and (6) classifying.

Wherein the reconstruction loss function is determined by equation (11):

the above reconstruction loss function is used to ensure that the free phase can be usedThe homomodal memory terms reconstruct the input features. First, the reconstructed input features are obtained:

the euclidean distance between the input features and the reconstructed input features is then minimized.

the above-mentioned mode-specific memory term loss function is used for aligning the correspondence between the memory terms of the visible light and infrared modes, wherein D _KL (. -) represents the KL divergence.

since the memory terms store prototypical features of each modality, they should have sufficient discriminative power to represent various patterns of pedestrian images. The above-mentioned modality-specific memory term discriminant loss function is used to make the multi-modal memory term distinguishable.

a visible light reconstruction characteristic is represented and,

representing infrared reconstruction characteristics, ∈ { V, I } representing visible light characteristics or infrared characteristics, m ^* Representing a memory item, A ^V Normalized vector of visible light, A ^I Representing the infrared normalized vector.

FIG. 5 is a diagram of a training framework for a cross-modal model based on a modal-specific memory network, according to an embodiment of the invention.

The training process of the above model is described in further detail below with reference to fig. 5.

As shown in fig. 5, the inputs to the above model are visible light images and infrared images of the pedestrian. Firstly, the feature extraction module of the model processes the visible light image and the infrared image respectively for obtaining the visible light feature and the infrared feature of the pedestrian, and in the process, the related loss function (such as the discriminator D) ^V And D ^I ) Can be used to optimize the results of the feature extraction model; secondly, inputting the visible light characteristics and the infrared characteristics into a mode-specific memory network module (the mode-specific memory network module refers to a neural network specially used for a certain mode, such as a neural network specially used for processing the visible light modes), and obtaining specific memory items (such as visible light memory items) of different modes in the multiple modes and obtaining multi-mode reconstruction characteristics in the module; and finally, fusing the multi-modal reconstruction features and the multi-modal features by using a unified feature alignment module to obtain a multi-modal unified representation. The training framework does not need an image generation process, and the whole network can be trained end to end; the method relieves the modal difference problem by complementing the modal loss characteristics through the modal specific memory network, can complement the loss modal characteristics only by using single-modal input, obtains a uniform characteristic space by aggregating the original and loss modal characteristics, and can well relieve the modal difference problem.

According to the pedestrian re-identification method, the trained trans-modal model based on the modal specific memory network is obtained through the training method of the trans-modal model based on the modal specific memory network, the trained trans-modal model based on the modal specific memory network is utilized to re-identify pedestrians, missing modal information of the pedestrians can be completed according to an input single-modal pedestrian image, whether the pedestrian images in different modalities belong to the same pedestrian or not is further judged, accuracy of pedestrian re-identification is improved, meanwhile, the method can be widely applied to scenes such as security systems and smart cities, can also be installed on front-end equipment in a software mode, real-time visible light-near infrared pedestrian images are matched or deployed on a background server of a company, and large-batch visible light-near infrared pedestrian image retrieval and matching results are provided.

As shown in fig. 6, an electronic device 600 according to an embodiment of the present invention includes a processor 601 which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. Processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 601 may also include onboard memory for caching purposes. The processor 601 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present invention.

In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. The processor 601 performs various operations of the method flow according to the embodiments of the present invention by executing programs in the ROM 602 and/or RAM 603. It is to be noted that the program may also be stored in one or more memories other than the ROM 602 and the RAM 603. The processor 601 may also perform various operations of method flows according to embodiments of the present invention by executing programs stored in one or more memories.

Electronic device 600 may also include input/output (I/O) interface 605, where input/output (I/O) interface 605 is also connected to bus 604, according to an embodiment of the invention. The electronic device 600 may also include one or more of the following components connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.

The present invention also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the present invention.

According to embodiments of the present invention, the computer readable storage medium may be a non-volatile computer readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to an embodiment of the present invention, a computer-readable storage medium may include the ROM 602 and/or the RAM 603 described above and/or one or more memories other than the ROM 602 and the RAM 603.

Embodiments of the invention also include a computer program product comprising a computer program comprising program code for performing the method illustrated in the flow chart. When the computer program product runs in a computer system, the program code is used for causing the computer system to realize the cross-modal pedestrian re-identification method based on the modal-specific memory network and the training method based on the cross-modal model of the modal-specific memory network, which are provided by the embodiment of the invention.

The computer program performs the above-described functions defined in the system/apparatus of the embodiment of the present invention when executed by the processor 601. The above described systems, devices, modules, units, etc. may be implemented by computer program modules according to embodiments of the present invention.

In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed in the form of a signal on a network medium, downloaded and installed through the communication section 609, and/or installed from the removable medium 611. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program, when executed by the processor 601, performs the above-described functions defined in the system of the embodiment of the present invention. The above described systems, devices, apparatuses, modules, units, etc. may be implemented by computer program modules according to embodiments of the present invention.

The above embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above embodiments are only examples of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A cross-modal pedestrian re-identification method based on a modal specific memory network comprises the following steps:

acquiring a pedestrian image to be re-identified and a re-identification type;

according to the re-recognition type, processing the pedestrian image to be re-recognized by using the cross-modal model based on the modal specific memory network to obtain a re-recognition result, wherein the cross-modal model based on the modal specific memory network is obtained by training in the following method:

performing average pooling on each segmentation part in the visible light image feature map to obtain visible light features, and performing average pooling on each segmentation part in the infrared image feature map to obtain infrared features;

reconstructing the visible light characteristics and the infrared characteristics of the pedestrian by using a mode specific memory network module to obtain visible light reconstruction characteristics and infrared reconstruction characteristics of the pedestrian, wherein the mode specific memory network module is used for storing and transmitting the visible light reconstruction characteristics and the infrared reconstruction characteristics of the pedestrian;

processing the visible light feature, the infrared feature, the visible light reconstruction feature and the infrared reconstruction feature of the pedestrian by using a unified feature alignment module to obtain a multi-modal unified representation of the pedestrian, wherein the multi-modal unified representation comprises a visible light unified representation and an infrared unified representation;

and optimizing a cross-modal model according to a preset loss function by using the visible light characteristic, the infrared characteristic, the visible light reconstruction characteristic, the infrared reconstruction characteristic and the multi-modal unified representation of the pedestrian until the value of the preset loss function meets a preset condition, and obtaining the trained cross-modal model based on the modal specific memory network.

2. The method of claim 1, wherein the reconstructing the visible light features and the infrared features of the pedestrian using a modality-specific memory network module, the obtaining visible light reconstructed features and infrared reconstructed features of the pedestrian comprises:

respectively processing the visible light characteristics and the infrared characteristics by using the mode specific memory network to obtain visible light memory items and infrared memory items;

normalizing the visible light cosine similarity to obtain a visible light normalized vector;

acquiring the infrared reconstruction characteristics according to the infrared memory items and the visible light normalized vector;

and obtaining the visible light reconstruction characteristics according to the visible light memory term and the infrared normalized vector.

3. The method of claim 2, wherein the visible light cosine similarity is determined by equation (1):

wherein the content of the first and second substances,

the visible light characteristic is represented by a visible light,

representing a visible memory item;

wherein the infrared reconstruction characteristic is determined by equation (2):

wherein the content of the first and second substances,

the infrared memory item is represented by a character string,

a k-th value representing an n-dimensional visible light normalization vector,

determined by equation (3):

where τ represents the visible light temperature coefficient.

4. The method of claim 2, wherein the infrared cosine similarity is determined by equation (4):

wherein the content of the first and second substances,

the infrared characteristics are represented by a representation of the infrared,

representing an infrared memory item;

wherein the content of the first and second substances,

the visual-light memory item is represented,

a k-th value representing an n-dimensional infrared normalized vector,

determined by equation (6):

wherein τ represents an infrared temperature coefficient.

5. The method of claim 1, wherein the processing the visible light features, the infrared features, the visible light reconstruction features, and the infrared reconstruction features of the pedestrian with a unified feature alignment module to obtain a multi-modal unified characterization of the pedestrian comprises:

and fusing the infrared characteristic and the visible light reconstruction characteristic by using a unified characteristic alignment module to obtain an infrared unified characteristic.

6. The method of claim 1, wherein the preset loss function is determined by equation (7):

wherein the content of the first and second substances,

display moduleA state-feature classification loss function that,

a central triplet of loss functions is represented,

a reconstructed uniform loss function is represented that is,

representing a mode-specific memory term loss function,

representing a mode-specific memory term discriminant loss function,

7. The method according to claim 6, wherein the modal unified characterization classification loss function is determined by equation (8):

wherein the reconstructed uniform loss function is determined by equation (10):

wherein the reconstruction loss function is determined by equation (11):

wherein the modality-specific memory term loss function is determined by equation (12):

the visible light reconstruction characteristic is represented and,

8. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.

9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 7.

10. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 7.