WO2023279935A1

WO2023279935A1 - Target re-recognition model training method and device, and target re-recognition method and device

Info

Publication number: WO2023279935A1
Application number: PCT/CN2022/099257
Authority: WO
Inventors: 刘武; 梅涛
Original assignee: 京东科技信息技术有限公司
Priority date: 2021-07-06
Filing date: 2022-06-16
Publication date: 2023-01-12
Also published as: CN113408472A; CN113408472B

Abstract

Disclosed are a target re-recognition model training method and device, and a target re-recognition method and device. The target re-recognition model training method comprises: acquiring a plurality of images, the plurality of images respectively having a plurality of corresponding modes and a plurality of corresponding annotated target categories; acquiring a plurality of convolution feature maps respectively corresponding to the plurality of modes, and acquiring a plurality of edge feature maps respectively corresponding to the plurality of modes; acquiring a plurality of pieces of feature distance information respectively corresponding to the plurality of modes; and training an initial re-recognition model according to the plurality of images, the plurality of convolution feature maps, the plurality of edge feature maps, the plurality of pieces of feature distance information, and the plurality of annotated target categories to obtain a target re-recognition model.

Description

Target re-identification model training method, target re-identification method and device

Cross References to Related Applications

This application is based on the Chinese patent application with application number 202110763047.3 and the filing date is July 6, 2021, and claims the priority of this Chinese patent application. The entire content of this Chinese patent application is hereby incorporated by reference into this application.

technical field

The present disclosure relates to the technical field of image recognition, and in particular, to a training method of an object re-identification model, an object re-identification method, a device, electronic equipment, a storage medium, a computer program product, and a computer program.

Background technique

As people pay more attention to safety, video surveillance cameras are placed in various environmental scenes of life and work. Common cameras use color video during the day and infrared video at night to record information around the clock.

The cross-modal target re-identification aims to match the targets in the three primary color images (Red Green Blue, RGB) collected by the visible light camera and the infrared images (Infrared Radiation, IR) collected by the infrared camera. Since images of different modalities (RGB and IR) are heterogeneous, modality differences degrade the performance of matching.

When the network model in the related art performs cross-modal target re-identification, the feature mining in RGB images and IR images is not sufficient, and the stability of the model training process is not strong, thus affecting the performance of cross-modal target re-identification. Effect.

Contents of the invention

Embodiments of the present disclosure propose a training method for a target re-identification model, a target re-identification method, a device, an electronic device, a storage medium, a computer program product, and a computer program, aiming to solve technical problems in related technologies at least to a certain extent one.

The embodiment of the first aspect of the present disclosure proposes a training method for a target re-identification model, including: acquiring multiple images, the multiple images respectively have corresponding multiple modalities and corresponding multiple labeled target categories; Multiple convolutional feature maps corresponding to the modalities respectively, and obtain multiple edge feature maps corresponding to the multiple modalities respectively; obtain various feature distance information corresponding to the multiple modalities respectively; and according to multiple images, multiple Convolutional feature maps, multiple edge feature maps, multiple feature distance information, and multiple labeled target categories are used to train the initial re-identification model to obtain the target re-identification model.

In some embodiments, the training initial The re-identification model to obtain the target re-identification model, including:

processing the plurality of images using the initial re-identification model to obtain an initial loss value;

processing the plurality of convolutional feature maps and the plurality of edge feature maps using the initial re-identification model to obtain perceptual edge loss values;

Using the initial re-identification model to process the various feature distance information to obtain a cross-modal center comparison loss value;

The initial re-identification model is trained according to the initial loss value, the perceptual edge loss value, and the cross-modal center comparison loss value, so as to obtain the target re-identification model.

In some embodiments, the initial re-identification model includes: a first network structure for identifying perceptual loss values between the convolutional feature map and the edge feature map.

In some embodiments, the processing of the plurality of convolutional feature maps and the plurality of edge feature maps using the initial re-identification model to obtain perceptual edge loss values includes:

inputting the plurality of convolutional feature maps and the plurality of edge feature maps into the first network structure to obtain a plurality of convolution loss feature maps respectively corresponding to the plurality of convolutional feature maps, And obtain a plurality of edge loss feature maps respectively corresponding to the plurality of edge feature maps;

determining a plurality of convolution feature map parameters respectively corresponding to the plurality of convolution loss feature maps, and determining a plurality of edge feature map parameters respectively corresponding to the plurality of edge loss feature maps;

Processing the corresponding plurality of convolution loss feature maps according to the plurality of convolution feature map parameters to obtain a plurality of first perceptual edge loss values;

processing the corresponding plurality of edge loss feature maps according to the plurality of edge feature map parameters to obtain a plurality of second perceptual edge loss values; and

The perceptual edge loss value is generated based on the plurality of first perceptual edge loss values and the plurality of second perceptual edge loss values.

In some embodiments, the initial re-identification model includes: a batch normalization layer, and the acquisition of various feature distance information corresponding to the various modalities includes:

Inputting the plurality of images into the batch normalization layer, respectively, to obtain a plurality of feature vectors respectively corresponding to the plurality of images output by the batch normalization layer;

determining, according to the plurality of feature vectors, feature center points of a plurality of targets respectively corresponding to the plurality of images;

determining a first distance between feature center points of different targets, and determining a second distance between feature center points of the same target corresponding to different modes, the first distance and the second The distances together constitute the various kinds of characteristic distance information.

In some embodiments, the process of using the initial re-identification model to process the various feature distance information to obtain a cross-modal center comparison loss value includes:

Using the initial re-identification model to determine a first target distance from multiple first distances, the first target distance is the first distance with the smallest median value among the multiple first distances;

The cross-modal center comparison loss value is calculated according to the first target distance, multiple second distances, and the number of targets.

In some embodiments, the initial re-identification model includes: a sequentially connected fully connected layer and an output layer, and the processing of the multiple images using the initial re-identification model to obtain an initial loss value includes:

Sequentially input the plurality of images into the fully connected layer and the output layer to obtain a plurality of category feature vectors output by the output layer corresponding to the plurality of images;

determining a plurality of encoding vectors respectively corresponding to the plurality of labeling target categories;

An identity loss value is generated according to the plurality of category feature vectors and the corresponding encoding vectors, and the identity loss value is used as the initial loss value.

In some embodiments, the processing of the plurality of images using the initial re-identification model to obtain an initial loss value includes:

performing image division on the plurality of images with reference to the plurality of labeled target categories to obtain a triplet sample set, the triplet sample set including: the plurality of images, the plurality of first images, and the plurality of second images, the multiple first images correspond to the same labeled target category, and the multiple second images correspond to different labeled target categories;

determining a first Euclidean distance between a feature vector of the image and a feature vector of the first image, the feature vector output by the batch normalization layer;

determining a second Euclidean distance between the feature vector of the image and the feature vector of the second image; and

A ternary loss value is determined according to the plurality of first Euclidean distances and the plurality of second Euclidean distances, and the ternary loss value is used as the initial loss value.

In some embodiments, the initial re-identification model is trained according to the initial loss value, the perceptual edge loss value, and the cross-modal center comparison loss value to obtain the target re-identification model, include:

generating a target loss value based on the initial loss value, the perceptual edge loss value, and the cross-modal center comparison loss value;

If the target loss value satisfies the set condition, the re-identification model obtained through training is used as the target re-identification model.

In some embodiments, the plurality of modalities include: a color image modality and an infrared image modality.

The embodiment of the second aspect of the present disclosure proposes a target re-identification method, including: acquiring a reference image and an image to be recognized. The modalities of the reference image and the image to be recognized are different. The reference image includes: a reference category; The recognition images are respectively input into the target re-recognition model trained by the above-mentioned target re-recognition model training method to obtain the target corresponding to the image to be recognized output by the target re-recognition model. The target has a corresponding target category, and the target category and The reference category matches.

The embodiment of the third aspect of the present disclosure proposes a training device for a target re-identification model, including: a first acquisition module, configured to acquire multiple images, each of which has corresponding multiple modalities and corresponding multiple annotations The target category; the second acquisition module is used to obtain multiple convolutional feature maps corresponding to multiple modalities, and to obtain multiple edge feature maps corresponding to multiple modalities; the third acquisition module is used to obtain A variety of feature distance information corresponding to a variety of modalities; and a training module, used for multiple images, multiple convolution feature maps, multiple edge feature maps, multiple feature distance information, and multiple labeling target categories Train the initial re-identification model to obtain the target re-identification model.

In some embodiments, the training module includes:

The first processing submodule is used to process the plurality of images using the initial re-identification model to obtain an initial loss value;

The second processing submodule is used to process the plurality of convolutional feature maps and the plurality of edge feature maps using the initial re-identification model to obtain a perceptual edge loss value;

The third processing submodule is used to process the various feature distance information using the initial re-identification model to obtain a cross-modal center comparison loss value;

The training submodule is configured to train the initial re-identification model according to the initial loss value, the perceptual edge loss value, and the cross-modal center comparison loss value, so as to obtain the target re-identification model.

In some embodiments, the second processing submodule is specifically used for:

In some embodiments, the initial re-identification model includes: a batch normalization layer, and the third acquisition module includes:

A normalization processing submodule, configured to input the plurality of images into the batch normalization layer respectively, so as to obtain a plurality of feature vectors respectively corresponding to the plurality of images output by the batch normalization layer;

A central point determination submodule, configured to determine, according to the multiple feature vectors, the feature center points of multiple targets corresponding to the multiple images;

A distance determining submodule, configured to determine a first distance between feature center points of different targets, and determine a second distance between feature center points corresponding to different modes of the same target, the first The first distance and the second distance together constitute the various kinds of characteristic distance information.

In some embodiments, the third processing submodule is specifically configured to: use the initial re-identification model to determine a first target distance from multiple first distances, and the first target distance is the multiple first said first distance having the smallest distance from the median;

The cross-modal center contrast loss value is calculated according to the first target distance and a plurality of the second distances, and the number of targets.

In some embodiments, the initial re-identification model includes: a sequentially connected fully connected layer and an output layer, and the first processing submodule is specifically used for:

In some embodiments, the first processing submodule is specifically used for:

In some embodiments, the training submodule is specifically used for:

The embodiment of the fourth aspect of the present disclosure proposes a target re-identification device, including: a fourth acquisition module, configured to acquire a reference image and an image to be recognized. The modalities of the reference image and the image to be recognized are different, and the reference image includes: Category; recognition module, for respectively inputting the reference image and the image to be recognized into the target re-recognition model trained by the above-mentioned target re-recognition model training method, so as to obtain the output corresponding to the image to be recognized by the target re-recognition model A target, a target has a corresponding target class, and the target class matches the reference class.

The embodiment of the fifth aspect of the present disclosure provides an electronic device, including: at least one processor; and a memory connected to the at least one processor in communication; wherein, the memory stores information that can be executed by the at least one processor. instructions, the instructions are executed by the at least one processor, so that the at least one processor can execute the method for training a target re-identification model described in any one of the embodiments of the present disclosure, or execute any of the embodiments of the present disclosure A method for object re-identification described herein.

The embodiment of the sixth aspect of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions are used to make the computer execute the object re-identification model described in any one of the embodiments of the present disclosure. A training method, or execute the object re-identification method described in any one of the embodiments of the present disclosure.

The embodiment of the seventh aspect of the present disclosure provides a computer program product, the computer program product includes computer program code, when the computer program code is run on the computer, to execute any one of the embodiments of the present disclosure. A method for training a target re-identification model, or execute the target re-identification method described in any one of the embodiments of the present disclosure.

The embodiment of the eighth aspect of the present disclosure provides a computer program, the computer program includes computer program code, when the computer program code is run on the computer, so that the computer executes the object described in any one of the embodiments of the present disclosure A training method for a re-identification model, or execute the target re-identification method described in any one of the embodiments of the present disclosure.

Description of drawings

The above and/or additional aspects and advantages of the present disclosure will become apparent and understandable from the following description of the embodiments in conjunction with the accompanying drawings, wherein:

FIG. 1 is a schematic flowchart of a method for training a target re-identification model according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a network structure providing a re-identification model according to an embodiment of the present disclosure;

3 is a schematic flowchart of a method for training a target re-identification model according to another embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a first network structure provided according to an embodiment of the present disclosure;

Fig. 5 is a schematic diagram of a feature space structure of a target provided according to an embodiment of the present disclosure;

6 is a schematic flowchart of a method for training a target re-identification model according to another embodiment of the present disclosure;

Fig. 7 is a training flowchart of a target re-identification model provided according to an embodiment of the present disclosure;

FIG. 8 is a schematic flowchart of a method for re-identifying a target according to another embodiment of the present disclosure;

FIG. 9 is a schematic diagram of a training device for a target re-identification model provided according to another embodiment of the present disclosure;

FIG. 10 is a schematic diagram of a training device for a target re-identification model provided according to another embodiment of the present disclosure;

Fig. 11 is a schematic diagram of a target re-identification device provided according to another embodiment of the present disclosure; and

Figure 12 shows a block diagram of an exemplary computer device suitable for implementing embodiments of the present disclosure.

detailed description

Embodiments of the present disclosure are described in detail below, examples of which are illustrated in the drawings, in which the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary only for explaining the present disclosure and should not be construed as limiting the present disclosure. On the contrary, the embodiments of the present disclosure include all changes, modifications and equivalents coming within the spirit and scope of the appended claims.

In view of the technical problem that the network model in the related art mentioned in the background technology is not sufficient for feature mining in multi-modal images, which affects the effect of cross-modal target re-identification, the technical solution of the embodiment of the present disclosure provides a The training method of the target re-identification model will be described below in conjunction with specific embodiments.

Wherein, it should be noted that the execution subject of the training method of the target re-identification model in the embodiment of the present disclosure may be the training device of the target re-recognition model, and the device may be realized by software and/or hardware, and the device may be configured in In the electronic equipment, the electronic equipment may include but not limited to a terminal, a server, and the like.

FIG. 1 is a schematic flowchart of a method for training an object re-identification model according to an embodiment of the present disclosure. Referring to Fig. 1, the method includes step S101 to step S104.

S101: Acquire multiple images, each of which has multiple corresponding modalities and multiple corresponding labeled target categories.

Wherein, the multiple images may be images collected by an image collection device in any possible scene, or may also be images obtained from the Internet, which is not limited.

Multiple images have multiple modes, such as: color image mode, infrared image mode, and any other possible image modes, where the color image mode can be RGB mode, infrared image mode It can be an IR mode, and there is no limitation regarding the various modes.

That is to say, multiple images in the embodiments of the present disclosure may have RGB modality and IR modality. In practical applications, an image acquisition device (such as a camera) can be used to collect color images or video frames (RGB mode) during the day, and infrared images or video frames (IR mode) at night, so that multiple modes can be obtained. multiple images of .

There may be multiple target objects in multiple images, for example: pedestrians, vehicles, and any other possible target objects. More specifically, multiple target objects may also be pedestrian 1, pedestrian 2, vehicle 1, vehicle 2, etc., different Pedestrians or vehicles may correspond to different categories, that is to say, the embodiments of the present disclosure may collect multiple images of various modalities for different target objects.

The information used to label the category of the target object can be called the labeling target category, where the labeling target category can be in the form of a score, for example, and different scores represent different types of target objects. By labeling the target category, you can Differentiate target objects in multiple images.

In addition, multiple images can also be divided into a training set (train set) and a test set (test set), which includes the image and the labeled target category corresponding to the image.

S102: Acquire multiple convolutional feature maps corresponding to multiple modalities, and acquire multiple edge feature maps respectively corresponding to multiple modalities.

After the multiple images are acquired above, multiple convolution feature maps and multiple edge feature maps respectively corresponding to multiple modalities are further acquired.

Among them, the feature map obtained by performing convolution operations on images of various modalities may be called a convolution feature map. Embodiments of the present disclosure can use any one or more convolutional layers in the neural network to perform convolution operations on images of various modalities, for example: use the ResNet Layer0 layer of the residual neural network to extract the multiple convolutional feature maps, or Multiple convolutional feature maps can also be obtained in any other possible way, without limitation.

The edge feature map can represent the edge contour information of the target object in images of various modalities. In the embodiment of the present disclosure, for example, a Sobel operator (Sobel operator) can be used to perform convolution operations on multiple images to extract the target object edge information to obtain multiple edge feature maps, or obtain multiple edge feature maps in any other possible way, without limitation.

That is to say, in order to solve the feature difference between the RGB mode and the IR mode, the embodiment of the present disclosure can use the edge contour information of the target object as a guide during the model training process, and optimize the characteristic feature space, thereby realizing Mining of common features between modalities.

S103: Acquire various feature distance information respectively corresponding to multiple modalities.

After the multiple convolutional feature maps and multiple edge feature maps are acquired above, multiple feature distance information corresponding to multiple modalities is further acquired.

Among them, the various feature distance information can be the distance between the feature center points of targets of different marked target categories, and/or the distance between the feature center points of the same target corresponding to different modalities, or any other possible The feature distance information of , without limitation.

For example, in the process of determining various feature distance information, multiple feature vectors corresponding to multiple images can be determined first, and further the feature center point can be determined according to the multiple feature vectors, so that the multiple feature vectors can be determined according to the feature center point. Please refer to the following embodiments for the specific manner of calculating multiple kinds of feature distance information.

S104: Train an initial re-identification model according to multiple images, multiple convolutional feature maps, multiple edge feature maps, multiple feature distance information, and multiple labeled target categories to obtain a target re-identification model.

Wherein, the re-identification model in the embodiment of the present disclosure may be based on a convolutional neural network structure, specifically, a residual neural network ResNet50 may be used as the backbone network of the re-identification model.

Fig. 2 is a schematic diagram of a network structure providing a re-identification model according to an embodiment of the present disclosure. As shown in Figure 2, the embodiment of the present disclosure can divide ResNet50 into two parts, wherein, the convolutional layer (ResNet Layer0) of the initial stage can adopt a dual-stream design, and the convolutional layer (ResNet Layer1-4) of the next four stages ) can use the strategy of dual-stream shared weights to uniformly extract the information of the two modalities.

During the training process, the parameters of the initial re-identification model (ResNet50) can be adjusted according to the relationship between multiple images, multiple convolutional feature maps, multiple edge feature maps, multiple feature distance information, and multiple labeled target categories. Optimize and adjust until the model converges to obtain the target re-identification model.

In the embodiment of the present disclosure, by acquiring multiple images, the multiple images respectively have corresponding multiple modalities and corresponding multiple labeled target categories, and multiple convolutional feature maps corresponding to the multiple modalities are acquired, And obtain multiple edge feature maps corresponding to multiple modalities, and obtain multiple feature distance information corresponding to multiple modalities, and according to multiple images, multiple convolution feature maps, and multiple edge feature maps , a variety of feature distance information, and multiple labeled target categories to train the initial re-identification model to obtain the target re-identification model. Therefore, the trained re-identification model can fully mine the features in various modal images and can enhance different modalities. Improve the accuracy of image matching, thereby improving the effect of cross-modal target re-identification. Furthermore, it solves the technical problem that the network model existing in the related technology is not sufficient for feature mining in multi-modal images, which affects the effect of cross-modal target re-identification.

Fig. 3 is a schematic flowchart of a method for training an object re-identification model according to another embodiment of the present disclosure. Referring to Fig. 3, the method includes step S301 to step S307.

S301: Acquire multiple images, each of which has multiple corresponding modalities and multiple corresponding labeled target categories.

S302: Acquire multiple convolutional feature maps corresponding to multiple modalities, and acquire multiple edge feature maps corresponding to multiple modalities.

S303: Acquire various feature distance information respectively corresponding to multiple modalities.

For specific descriptions of S301-S303, reference may be made to the foregoing embodiments, and details are not repeated here.

S304: Process multiple images using the initial re-identification model to obtain an initial loss value.

In the operation of training the initial re-identification model, first use the initial re-identification model to process multiple images to obtain the initial loss value, for example: the identity loss function (Id Loss) can be used to calculate the initial loss of the initial re-identification model value, or other loss functions can be used to determine the initial loss value, which is not limited.

In some embodiments, as shown in Figure 2, the initial re-identification model may include sequentially connected fully connected layers (fully connected layers, FC) and output layer (for example: Softmax classifier), when using the initial re-identification model In the process of processing multiple images to obtain the initial loss value, firstly, multiple images can be sequentially input into the fully connected layer and the output layer to obtain multiple category feature vectors corresponding to multiple images output by the output layer .

For example, rgb and ir can be used to represent multiple images of various modalities, and let X ^m ={x ^m |x ^m ∈R ^H×W×3 } represent multiple input image sets (training set or test set set), where m∈{rgb,ir}, H and W represent the height and width of the image, respectively, and 3 represents the number of channels of the image (RGB image contains three channels of R\G\B, and IR image repeats its single channel 3 times converted to 3 channels). For example: in the training process, a batch (Batch) contains B pictures, so that

Represents one of the RGB or IR images, then i∈{1,2,...,B}.

As shown in Figure 2, the input image

After the operation of the network model to the final fully connected (FC) layer and the output layer (Softmax), the obtained vector can be called a category feature vector, and the category feature vector can be represented by p _i , for example, then multiple categories corresponding to multiple images The eigenvectors are expressed as

where j ∈ {1,2,...,N}, N is the number of object categories in multiple images.

In some embodiments, a plurality of encoding vectors respectively corresponding to a plurality of labeled object categories are determined, for example, one-hot encoding (one-hot) can be used to encode a plurality of labeled object categories to obtain an encoded vector, and the encoded vector can be represented by y _i Indicates that multiple encoded vectors can be expressed as

In some embodiments, identity loss values are generated according to multiple category feature vectors and corresponding multiple encoding vectors, that is to say, the embodiments of the present disclosure can use the identity loss function (Id Loss) to compare multiple category feature vectors and The corresponding multiple encoding vectors are calculated to obtain the identity loss value, and the identity loss value is used as the initial loss value.

Among them, the identity loss function Id Loss can be expressed as:

It can be understood that, the above example is just an illustration by using the identity loss value as the initial loss value, and other loss functions may also be used to determine the initial loss value in practical applications, which is not limited.

In the embodiment of the present disclosure, the identity loss value is used as the initial loss value, which can make the model have a good pedestrian re-identification effect.

S305: Using the initial re-identification model to process multiple convolutional feature maps and multiple edge feature maps to obtain perceptual edge loss values.

In some embodiments, the initial re-identification model may include a first network structure, and FIG. 4 is a schematic structural diagram of the first network structure provided according to an embodiment of the present disclosure. As shown in FIG. 4, the first network structure can be, for example, a deep convolutional neural network VGGNet-16, which can identify perceptual loss values between edge convolutional feature maps and edge feature maps. Using VGGNet-16 as the first network structure can deeply identify the loss between the convolutional feature map and the edge feature map, thereby improving the accuracy of the perceived loss value.

In some embodiments, as shown in Figure 4, multiple convolutional feature maps extracted by ResNet Layer0 and multiple edge feature maps extracted by Sobel operator can be input into VGGNet-16, wherein VGGNet-16 network uses φ={φ ₁ ,φ ₂ ,φ ₃ ,φ ₄ } represents four stages, through which multiple convolution feature maps can obtain corresponding multiple convolution loss feature maps, and multiple edge feature maps pass through the four stages Four stages can get multiple edge loss feature maps.

In some embodiments, a plurality of convolution feature map parameters respectively corresponding to the plurality of convolution loss feature maps are determined, and a plurality of edge feature map parameters respectively corresponding to the plurality of edge loss feature maps are determined.

Among them, let φ _t (z) denote multiple convolution loss feature maps and multiple edge loss feature maps extracted by the first network structure of the 0-t stage, assuming convolution loss feature maps and edge loss feature maps The shape is C _t ×H _t ×W _t , then C _t ×H _t ×W _t can be used as the feature map parameters of the convolution loss feature map and the edge loss feature map.

Among them, the calculation formula of perceptual edge loss value is as follows:

Among them, z and

denote the input convolutional feature map and edge loss feature map, respectively.

In some embodiments, the corresponding plurality of convolution loss feature maps are processed according to the plurality of convolution feature map parameters to obtain a plurality of first perceptual edge loss values, and the corresponding plurality of edges are processed according to the plurality of edge feature map parameters Loss feature map to get multiple second perceptual edge loss values.

Among them, the first perceptual edge loss value can be expressed as:

The second perceptual edge loss value can be expressed as:

one of them

and

Represents the convolutional feature maps extracted by ResNet Layer0 of the two modalities respectively,

and

represent the edge feature maps of the corresponding modalities, respectively.

In some embodiments, the perceptual edge loss value is generated according to multiple first perceptual edge loss values and multiple second perceptual edge loss values, for example: the sum of the first perceptual edge loss value and the second perceptual edge loss value, as the perceptual edge loss value.

The perceptual edge loss value is expressed as

In the embodiment of the present disclosure, combined with the perceptual edge loss (PEF Loss), the edge information of the image can be used as a guide to mine the common information in the modality characteristic space, reducing the difference between different modalities, thereby improving the The effect of modal object re-identification.

S306: Using the initial re-identification model to process various feature distance information to obtain a cross-modal center comparison loss value.

Embodiments of the present disclosure may also use an initial re-identification model to process various feature distance information, so as to obtain cross-modal center comparison loss values.

Fig. 5 is a schematic diagram of a feature space structure of an object provided according to an embodiment of the present disclosure. As shown in Figure 5, the cross-modal center contrast loss can act on the common feature space of the modality. In the embodiment of the present disclosure, the initial re-identification model can be used to process various feature distance information, for example: to process the features of different types of targets The distance between the center points, or the distance between the center points of the features corresponding to different modalities for the same category of targets, to obtain the cross-modal center comparison loss value.

S307: Train an initial re-identification model according to the initial loss value, perceptual edge loss value, and cross-modal center comparison loss value, so as to obtain a target re-identification model.

In some embodiments, the target loss value may be firstly generated according to the initial loss value, perceptual edge loss value, and cross-modal center comparison loss value. The target loss value may be, for example, the initial loss value, perceptual edge loss value, and cross-modal loss value. The sum of state center comparison loss values, the target loss value can be expressed as:

in,

represents the perceptual edge loss value,

represents the initial loss value,

Can represent cross-modal center contrast loss values.

In some embodiments, the initial re-identification model is trained according to the target loss value, that is, the parameters of the re-identification model are adjusted according to the target loss value until the target loss value meets the set conditions, for example: if the model convergence condition is met, the training The obtained re-identification model is used as the target re-identification model. Therefore, in the process of model training, combined with multi-task loss (that is, multiple loss values), the modal characteristic feature space and common feature space are optimized and adjusted, which enhances the cross-modal feature extraction ability of the model, and It can make the model extract more discriminative features, and can meet the requirements of cross-modal target re-identification for features, thereby improving the effect of target re-identification.

In the embodiment of the present disclosure, by acquiring multiple images, the multiple images respectively have corresponding multiple modalities and corresponding multiple labeled target categories, and multiple convolutional feature maps corresponding to the multiple modalities are acquired, And obtain multiple edge feature maps corresponding to multiple modalities, and obtain multiple feature distance information corresponding to multiple modalities, and according to multiple images, multiple convolution feature maps, and multiple edge feature maps , a variety of feature distance information, and multiple labeled target categories to train the initial re-identification model to obtain the target re-identification model. Therefore, the trained re-identification model can fully mine the features in various modal images and can enhance different modalities. Improve the accuracy of image matching, thereby improving the effect of cross-modal target re-identification. Furthermore, it solves the technical problem that the network model existing in the related technology is not sufficient for feature mining in multi-modal images, which affects the effect of cross-modal target re-identification. In addition, using the identity loss value as the initial loss value can make the model have a better person re-identification effect. Using VGGNet-16 as the first network structure can deeply identify the loss between the convolutional feature map and the edge feature map, thereby improving the accuracy of the perceived loss value. And in the process of model training, the multi-task loss (that is, multiple loss values) is combined to optimize and adjust the modal characteristic feature space and common feature space, which enhances the cross-modal feature extraction ability of the model, and can The model extracts more discriminative features, which can meet the feature requirements of cross-modal target re-identification, thereby improving the effect of target re-identification.

Fig. 6 is a schematic flowchart of a method for training an object re-identification model according to another embodiment of the present disclosure. Referring to Fig. 6, the method includes step S601 to step S610.

S601: Acquire multiple images, each of which has corresponding multiple modalities and multiple corresponding labeled target categories.

S602: Acquire multiple convolutional feature maps corresponding to multiple modalities, and acquire multiple edge feature maps corresponding to multiple modalities.

For specific descriptions of S601-S602, reference may be made to the foregoing embodiments, and details are not repeated here.

S603: Input multiple images into the batch normalization layer respectively, so as to obtain multiple feature vectors respectively corresponding to the multiple images output by the batch normalization layer.

In some embodiments, as shown in FIG. 2, the initial re-identification model also includes a batch normalization layer (Batch Normalization, BN). In the operation of obtaining various feature distance information corresponding to various modalities, firstly A plurality of images are respectively input into the batch normalization layer, so as to obtain a plurality of feature vectors respectively corresponding to the plurality of images output by the BN layer (for example represented by f _i ^m ).

S604: Determine, according to the multiple feature vectors, feature center points of multiple targets respectively corresponding to the multiple images.

For example, there are P categories of targets in a batch (Batch), each category contains K RGB images and K IR images, that is, B=2×P×K, assuming

Represents the feature center point of different modes of the k-th class target, then the feature center point can be expressed as:

Among them, m∈{rgb,ir} can be calculated by this formula

and

Then the feature center point of the kth class target

S605: Determine the first distance between the feature center points of different targets, and determine the second distance between the feature center points corresponding to different modalities of the same target, the first distance and the second distance together constitute a variety of feature distance information .

In some embodiments, to determine the first distance between feature center points of different objects, that is, to determine the distance between centers of object features of different categories, the first distance may be represented by d _inter . And, it is also possible to determine the second distance between the center points of the features corresponding to different modalities of the same target, that is, to determine the distance between the centers of the features of the two modalities of the target of the same category, which can be represented by d _intra distance, and the first distance and the second distance jointly constitute a variety of characteristic distance information. Therefore, determining a variety of feature distance information through the relationship between the feature center points of the target can constrain the relationship between the mode center and the category center, and can well adjust the feature extraction ability of the model.

It can be understood that the above example is only an exemplary description of obtaining various feature distance information, and in practical applications, any other possible ways can also be used to obtain, which is not limited here.

S606: Process multiple images using the initial re-identification model to obtain an initial loss value.

In some embodiments, in the operation of determining the initial loss value, multiple images may also be divided into multiple images with reference to multiple labeled target categories to obtain a triplet sample set, which may include: multiple images (using

), multiple first images (indicated by

), and multiple second images (with

express),

Multiple first images in the set correspond to the same labeled target category,

A plurality of second images in the set correspond to different labeled target categories,

and

can constitute a positive sample pair,

and

Negative sample pairs can be formed.

In some embodiments, the first Euclidean distance between the feature vector of the image and the feature vector of the first image is determined, and the feature vector is output by the batch normalization layer, that is to say, the batch normalization layer (BN) can be used to The distance between the eigenvectors and the eigenvectors of the first image is calculated to obtain the first Euclidean distance.

Moreover, a second Euclidean distance between the feature vector of the image and the feature vector of the second image may also be determined, and the first Euclidean distance and the second Euclidean distance may be represented by d, for example.

In some embodiments, the ternary loss value is determined according to a plurality of first Euclidean distances and a plurality of second Euclidean distances, and the ternary loss value is used as an initial loss value, and the calculation formula of the initial loss value is as follows:

in,

d _ii+ means the first Euclidean distance, d _ii- means the second Euclidean distance,

and respectively

Represents the set of positive sample pairs and negative sample pairs. Therefore, in the process of model training, the weighted ternary loss function (WRT Loss) can also be combined to introduce the concept of positive and negative samples, so that the classification prediction results are more aggregated, and the classification can be further separated.

S607: Using the initial re-identification model to process multiple convolutional feature maps and multiple edge feature maps to obtain perceptual edge loss values.

For a specific description of S607, reference may be made to the foregoing embodiments, and details are not repeated here.

S608: Using the initial re-identification model to determine a first target distance from multiple first distances, where the first target distance is the first distance with the smallest median among the multiple first distances.

Wherein, the first distance with the smallest median among the multiple first distances may be called the first target distance, for example:

Represents the minimum value of all d _inter , then

can be used as the first target distance.

S609: Calculate and obtain a cross-modal center comparison loss value according to the first target distance, multiple second distances, and the number of targets.

In some embodiments, the cross-modal center contrast loss value is calculated according to the first target distance and multiple second distances, and the number of targets. The cross-modal center contrast loss value (may be referred to as CMCC loss) is calculated as follows:

In the embodiment of the present disclosure, the distance between different modalities of the same category can be shortened through the CMCC loss, and the distance between features of different categories can be shortened at the same time, thereby optimizing the distribution state of the feature f _i ^m extracted by the model, It is convenient to use the features of this layer for target re-identification matching in the later stage.

S610: Train an initial re-identification model according to the initial loss value, perceptual edge loss value, and cross-modal center comparison loss value, so as to obtain a target re-identification model.

For example: a target loss value is generated according to an initial loss value, a perceptual edge loss value, and a cross-modal center comparison loss value, and the target loss value may be, for example, a combination of an initial loss value, a perception edge loss value, and a cross-modal center comparison loss value and, the target loss value can be expressed as:

in,

represents the perceptual edge loss value,

and

represents the initial loss value,

Can represent cross-modal center contrast loss values. In some embodiments, an initial re-identification model is trained based on a target loss value.

In the embodiment of the present disclosure, by acquiring multiple images, the multiple images respectively have corresponding multiple modalities and corresponding multiple labeled target categories, and multiple convolutional feature maps corresponding to the multiple modalities are acquired, And obtain multiple edge feature maps corresponding to multiple modalities, and obtain multiple feature distance information corresponding to multiple modalities, and according to multiple images, multiple convolution feature maps, and multiple edge feature maps , a variety of feature distance information, and multiple labeled target categories to train the initial re-identification model to obtain the target re-identification model. Therefore, the trained re-identification model can fully mine the features in various modal images and can enhance different modalities. Improve the accuracy of image matching, thereby improving the effect of cross-modal target re-identification. Furthermore, it solves the technical problem that the network model existing in the related technology is not sufficient for feature mining in multi-modal images, which affects the effect of cross-modal target re-identification. In addition, by determining a variety of feature distance information through the relationship between the feature center points of the target, the relationship between the mode center and the category center can be constrained, and the feature extraction ability of the model can be well adjusted. Moreover, the distance between different modes of the same category can be shortened through CMCC loss, and the distance between features of different categories can be shortened at the same time, thereby optimizing the distribution state of the feature f _i ^m extracted by the model, which is convenient for later use of this layer The features are matched for target re-identification.

In practical applications, as shown in Figure 2, the backbone network of the target re-identification model is a convolutional neural network (ResNet50 is used here). Specifically, for the input of two modalities of color images and infrared images, the present disclosure Divide ResNet50 into two parts. The convolutional layer (ResNet Layer0) in the initial stage adopts a dual-stream design, and the convolutional layer (ResNet Layer1-4) in the next four stages uses a dual-stream shared weight strategy. Extract the information of the two modes, and then perform pooling operations on the feature maps obtained by the convolutional layer (Generalized-mean (GeM) Pooling is used in the embodiment of the present disclosure), and then through the processing of batch regularization (Batch Normalization\ BN) to obtain the feature vector corresponding to each image extraction (used for image re-identification matching in the test application process), the feature vector will continue to pass through the fully connected (FC) layer and Softmax operation during the training process to obtain the target object category scores.

In the process of model training, a multi-task loss function is used, as shown in Equation 1, which incorporates four loss functions, namely identity loss (Id Loss), weighted ternary loss (WRT Loss), perceptual edge loss ( PEF Loss) and cross-modal center contrast loss (CMCC Loss). Among them, the first two losses are loss functions commonly used in existing methods, and the latter two losses (PEF Loss and CMCC Loss) are loss functions newly proposed in this disclosure. The first two losses are briefly introduced below, and then the key points Explain the latter two loss functions.

Assuming that rgb and ir represent RGB image modality and IR image modality respectively, let

Represents the input RGB image and IR image dataset, where m∈{rgb, ir}, H and W represent the height and width of the image, respectively, and 3 represents the number of channels of the image (RGB image contains three channels of R\G\B, IR images were converted to 3-channel by repeating their single-channel 3 times). Assuming that a batch (Batch) contains B pictures during the training process, let

Represents one of the RGB or IR images, then i∈{1, 2,..., B}.

(1) Identity loss (Id Loss) and weighted ternary loss (WRT Loss)

(1.1) Identity Loss (Id Loss):

As shown in Figure 1(a), the input image

The final fully connected (FC) layer and the vector after the Softmax operation are obtained through the network model, which is represented by p _i here, and the one-hot encoding of the corresponding label is represented by y _i :

Where j ∈ {1, 2, ..., N}, N is the number of categories of target objects in the data training set, then Id Loss can be expressed as:

(1.2) Weighted ternary loss (WRT Loss):

As shown in Figure 1(a), WRT loss

It is calculated by the model batch regularization (BN) layer and the feature vector obtained after the L2-Norm operation. The calculation formula of the loss function is as follows:

in

Represents a ternary sample set, which includes samples

samples of the same class

and samples of different classes

and

Constitute a positive sample pair,

and

Constitute a negative sample pair, d represents the Euclidean distance between feature vectors,

and respectively

Represents the set of positive sample pairs and negative sample pairs.

(2) Perceptual edge loss (PEF Loss)

As shown in Figure 1(a) and (b), the perceptual edge loss acts on the characteristic feature space of the modality. This part of the feature is generated by the unshared ResNet Layer0. In order to solve the feature between the RGB modality and the IR modality Difference, the PEF loss is directly optimized for the characteristic feature space using the edge profile information of the target as a guide, thus enabling the mining of common features among modalities.

Specifically, as shown in Figure 1(b), here we take the loss calculation of one of the modes as an example. The calculation of PEF loss includes two inputs: one is the convolution feature map extracted by ResNet Layer0; the other branch is It is to use the Sobel operator to perform convolution operation on the image input by the original modality, extract its edge information, and obtain the edge feature map. After that, the perceptual loss between the edge feature map and the convolutional feature map is calculated in PEF, and the VGGNet-16 model trained on ImageNet is used as the perceptual network, here we use φ={φ ₁ ,φ ₂ ,φ ₃ ,φ ₄ } represents the four stages, let φ _t (z) represent the feature map extracted by the perception network of the 0-t stage, assuming its shape is C _t ×H _t ×W _t , the calculation formula of PEF loss As follows:

where z and

Representing the input convolutional feature map and edge feature map, respectively, the calculation of the PEF loss of the two modes of RGB and IR is as follows:

one of them

and

Respectively represent the convolutional feature maps extracted by the ResNet Layer0 of the two modalities respectively,

and

represent the edge feature maps of the corresponding modalities respectively, and the final loss is the sum of the losses of the two modalities.

In the Perceptual Edge Loss (PEF Loss), the edge contour information of prior knowledge is used as the guidance of the common features of the modes, which makes the mode characteristic features extracted by the unshared Layer-0 more consistent and helps to reduce the mode The difference between them, so as to better realize the cross-modal target re-identification task.

(3) Cross-modal center contrast loss (CMCC Loss)

The embodiment of the present disclosure proposes a new cross-modal central contrast loss, which acts on the common feature space of the modality, that is, the feature vector (assumed to be represented by f _i ^m ) after the BN layer in Figure 1(a) is located Space. Assuming that there are P categories of target objects in a batch, each category contains K RGB images and K IR images, that is, B=2×P×K, and d _inter represents the distance between the centers of object features of different categories , d _intra represents the distance between the centers of the features of the two modalities of objects of the same category, assuming that the

Indicates the feature center of different modes of the kth object, and its calculation formula is:

Where m∈{rgb,ir} can be calculated by formula 8

and

Then the center of the kth class target object feature

CMCC loss can be obtained after

The calculation formula of is as follows:

in

Represents the minimum value of all d _inter . By optimizing the loss function, the distance between different modalities of the same category can be shortened, and the distance between features of different categories can be shortened, thereby optimizing the feature f extracted by the model The distribution state of _i ^m is convenient for the matching work of target re-identification using the features of this layer in the later stage.

Fig. 7 is a flow chart of training a target re-identification model according to an embodiment of the present disclosure. As shown in Figure 7, the following steps are included:

(1) Input image preprocessing stage

Step 1-1: Read the cross-modal target re-identification image data set, and obtain the original image and the category information of the corresponding target object;

Among them, the data set includes: training set (train set) and test set (test set), including the original image and the object category label corresponding to the image. During the training process, the image input model is used, and then the loss function is calculated in combination with the category label. During the test, the test set is divided into a set to be queried (query) and a set to be matched (gallery), which is used to test the re-identification performance of the model;

Algorithm model hyperparameters: including the size of the input image during model training, batch size, target objects and numbers of different modalities in the batch, image data enhancement methods, number of training iterations (Epoch), learning rate (learning rate) adjustment strategy, the type of optimizer (optimizer) used, as follows.

Input image size during model training: 288*144;

Batch size: 64 (includes 8 objects, 4 images of one object per modality);

Image data enhancement methods: random cropping, horizontal flipping;

The number of training iterations is: 200;

Optimizer: using Adam optimizer, weight decay (weight decay) is 0.0005;

Learning rate adjustment strategy:

The learning rate increases linearly from 0.0005 to 0.005 during the first 10 epochs, maintains 0.005 for 10-20 epochs, and then decays to one-tenth of the original every 5 epochs, and maintains 0.000005 until the 35th epoch to the end of training .

Step 1-2: According to the set batch size, the number of categories in the batch and the number of images under each category, organize the data of RGB and IR into a batch (Batch);

Step 1-3: Standardize the image, then adjust the image to the set width and height dimensions, and perform specified data enhancement transformation on it, and then load batches of data into GPU memory for later input into training In the model, and use the corresponding label to participate in the calculation of the later loss.

(2) Feature extraction stage

Step 2-1: Input the image data of the two modalities respectively along the dual-stream feature extraction network (structure shown in Figure 2), and send the data of each modality to their respective entry branches;

Step 2-2: The input data is transferred layer by layer, and the calculation of the corresponding layer is performed, and the modal characteristic part and the modal common part are sequentially passed;

Step 2-3: Through the forward propagation of step 2-2, the intermediate features and the final classification prediction score can be obtained, which will be used for the multi-task loss calculation in the next stage.

(3) Multi-task loss calculation stage

Step 3-1: For the input data of a batch, according to the calculation method of the above formula 1-9, you can get

and

Step 3-2: Add the four losses to get the final multi-task loss

value.

(4) Model iterative optimization stage

Step 4-1: The implementation code of this disclosure uses the automatic differentiation PyTorch deep learning framework, which supports the backpropagation of the entire algorithm model directly from the calculated multi-task loss value, and calculates the gradient value of the learnable parameters ;

Step 4-2: Use the set optimizer to update and optimize the learnable parameters of the model algorithm using the gradient calculated in step 4-1;

Step 4-3: Repeat all the above steps, and continuously update the model parameters in the process until the set number of training rounds is reached, and then stop the training process of the algorithm model.

(5) Model testing and evaluation stage

Step 5-1: Divide the test set, use the IR image as the query set (query), and the RGB image as the matching set (gallery). The test method is to use the IR image of the object as the query, in the RGB image set Match the image of the object to test the cross-modal object re-identification performance of the model;

Step 5-2: During the test, read the images of the test set (including images of query and gallery), input the data of both modalities into the test model, and obtain The feature vector of each image (the feature vector after the BN layer in Figure 2);

Step 5-3: use cosine distance to measure the similarity between the query image and all gallery images, and then sort according to the distance to obtain a list of gallery images (RGB images) matched by each query image (IR image);

Step 5-4: Calculate the evaluation indicators Rank-n and mAP commonly used in the target re-identification task, and evaluate the model performance by observing the indicator values;

Step 5-4: If the evaluation result does not meet the set requirements, you can adjust the hyperparameters of the model, restart from the first step of the process step, and continue to train the algorithm model. If the evaluation indicators meet the requirements, then Save the model weights, and the weights and model codes are the final cross-modal target re-identification solution.

In the technical scheme of the embodiment of the present disclosure:

1. The multi-task loss is used to optimize and adjust the modal feature space and common feature space, and complete the cross-modal target re-identification task end-to-end.

2. The perceptual edge loss is proposed, which can use the edge information of the image as a guide to mine the common information in the modality feature space, reducing the differences between different modalities.

3. A cross-modal center comparison loss is proposed, which acts on the common feature space. By constraining the relationship between the modal center and the category center, the feature extraction ability of the model can be well adjusted, so that the model can achieve excellent performance.

Through this scheme, the feature space can be optimized, the division of characteristic feature space and common feature space is proposed, and targeted adjustment and optimization are carried out, so as to realize an efficient end-to-end cross-modal target re-identification method. In the embodiment, the proposed perceptual edge loss can directly constrain the features of different modalities, introduce prior knowledge into the model feature extraction process, and enhance the cross-modal feature extraction capability of the model; the proposed cross-modal center comparison loss can make The model extracts more discriminative features, which effectively reduces the difference between the modalities of similar objects and increases the characteristic differences of different types of objects, which is conducive to the correct re-identification of cross-modal data by the model.

Fig. 8 is a schematic flowchart of a method for object re-identification provided according to another embodiment of the present disclosure. Referring to Fig. 8, the method includes step S801 to step S802.

S801: Acquire a reference image and an image to be recognized, where the modes of the reference image and the image to be recognized are different, and the reference image includes: a reference category.

Wherein, the reference image and the image to be recognized may be images collected in any scene, and the modalities of the reference image and the image to be recognized are different.

In some embodiments, the reference image can be an image of RGB modality, and the image to be recognized can be an image of IR modality; or the reference image can be an image of IR modality, and the image to be recognized can be an image of RGB modality, for This is not limited.

Moreover, the reference image also corresponds to a reference category, wherein the reference category is used to describe the category of the target object in the reference image, for example: the category of the target object is a vehicle, a pedestrian, or any other possible category, which is not limited.

S802: Input the reference image and the image to be recognized into the target re-recognition model trained by the above-mentioned target re-recognition model training method to obtain the target corresponding to the image to be recognized output by the target re-recognition model, and the target has a corresponding The target category of , the target category matches the reference category.

After the reference image and the image to be recognized are obtained above, the reference image and the image to be recognized are further input into the target re-identification model trained in the above embodiment, and the target corresponding to the image to be recognized and the corresponding target can be output through the target re-recognition model category, where the target category matches the reference category, e.g. the target category and the reference category are the same vehicle.

That is to say, through the target re-identification model, the same object as the target object in the reference image is recognized from the image to be recognized, so as to achieve the purpose of cross-modal target re-identification.

In the embodiment of the present disclosure, by acquiring the reference image and the image to be recognized, the modalities of the reference image and the image to be recognized are different, the reference image includes: a reference category, and the reference image and the image to be recognized are respectively input into the target re-identification model In the target re-identification model trained by the training method, the target corresponding to the image to be recognized outputted by the target re-identification model is obtained, the target has a corresponding target category, and the target category matches the reference category. Since the target re-recognition model trained by the above-mentioned target re-recognition model training method recognizes the image to be recognized, the features of the image to be recognized can be fully mined, the accuracy of image matching under different modalities can be enhanced, and the cross-modal The effect of target re-identification.

Fig. 9 is a schematic diagram of a training device for a target re-identification model according to another embodiment of the present disclosure. Referring to Fig. 9, the training device 90 of the target re-identification model includes:

The first acquiring module 901 is configured to acquire multiple images, and the multiple images respectively have corresponding multiple modalities and corresponding multiple labeled target categories;

The second acquisition module 902 is configured to acquire multiple convolutional feature maps corresponding to multiple modalities, and multiple edge feature maps corresponding to multiple modalities respectively;

The third acquiring module 903 is configured to acquire various feature distance information respectively corresponding to various modalities; and

The training module 904 is configured to train an initial re-identification model according to multiple images, multiple convolutional feature maps, multiple edge feature maps, multiple feature distance information, and multiple labeled target categories to obtain a target re-identification model.

In some embodiments, FIG. 10 is a schematic diagram of a training device for a target re-identification model provided according to another embodiment of the present disclosure. As shown in Figure 10, the training module 904 includes:

The first processing sub-module 9041 is used to process multiple images using an initial re-identification model to obtain an initial loss value;

The second processing sub-module 9042 is used to process multiple convolutional feature maps and multiple edge feature maps using the initial re-identification model to obtain perceptual edge loss values;

The third processing sub-module 9043 is used to process various feature distance information using the initial re-identification model to obtain cross-modal center comparison loss values;

The training sub-module 9044 is used to train the initial re-identification model according to the initial loss value, perceptual edge loss value, and cross-modal center comparison loss value, so as to obtain the target re-identification model.

In some embodiments, the initial re-identification model includes: a first network structure, and the first network structure is used to identify the perceptual loss value between the convolutional feature map and the edge feature map.

In some embodiments, the second processing submodule 9042 is specifically used to:

Input multiple convolution feature maps and multiple edge feature maps into the first network structure to obtain multiple convolution loss feature maps corresponding to the multiple convolution feature maps, and obtain multiple edge feature maps Multiple edge loss feature maps corresponding to each;

Determining a plurality of convolution feature map parameters respectively corresponding to a plurality of convolution loss feature maps, and determining a plurality of edge feature map parameters respectively corresponding to a plurality of edge loss feature maps;

Processing a plurality of corresponding convolution loss feature maps according to a plurality of convolution feature map parameters to obtain a plurality of first perceptual edge loss values;

Processing a plurality of corresponding edge loss feature maps according to a plurality of edge feature map parameters to obtain a plurality of second perceptual edge loss values; and

Perceptual edge loss values are generated based on the plurality of first perceptual edge loss values and the plurality of second perceptual edge loss values.

In some embodiments, as shown in FIG. 10, the initial re-identification model includes: a batch normalization layer, and a third acquisition module 903, including:

The normalization processing sub-module 9031 is configured to input multiple images into the batch normalization layer respectively, so as to obtain multiple feature vectors respectively corresponding to the multiple images output by the batch normalization layer;

The center point determination submodule 9032 is used to determine the feature center points of multiple targets respectively corresponding to multiple images according to multiple feature vectors;

The distance determination sub-module 9033 is used to determine the first distance between the feature center points of different targets, and determine the second distance between the feature center points of the same target corresponding to different modalities, the first distance and the second distance are common A variety of feature distance information is formed.

In some embodiments, the third processing submodule 9043 is specifically used to:

Determining a first target distance from multiple first distances by using an initial re-identification model, where the first target distance is the first distance with the smallest median value among the multiple first distances;

A cross-modal center comparison loss value is calculated according to the first target distance, multiple second distances, and the number of targets.

In some embodiments, the initial re-identification model includes: a sequentially connected fully connected layer and an output layer, and the first processing submodule 9041 is specifically used for:

Sequentially input multiple images into the fully connected layer and the output layer to obtain multiple category feature vectors output by the output layer corresponding to the multiple images;

determining a plurality of encoding vectors respectively corresponding to a plurality of labeled target categories;

An identity loss value is generated according to multiple category feature vectors and corresponding multiple encoding vectors, and the identity loss value is used as an initial loss value.

In some embodiments, the first processing submodule 9041 is specifically used to:

Image division is performed on multiple images with reference to multiple labeled target categories to obtain a triplet sample set, which includes: multiple images, multiple first images, and multiple second images, and the multiple first images correspond to The same tagged target category, multiple second images corresponding to different tagged target categories;

determining the first Euclidean distance between the feature vector of the image and the feature vector of the first image, the feature vector being output by the batch normalization layer;

determining a second Euclidean distance between the eigenvectors of the image and the eigenvectors of the second image; and

According to the plurality of first Euclidean distances and the plurality of second Euclidean distances, a ternary loss value is determined, and the ternary loss value is used as an initial loss value.

In some embodiments, the training submodule 9044 is specifically used for:

Generate a target loss value based on the initial loss value, perceptual edge loss value, and cross-modal center comparison loss value;

If the target loss value satisfies the set conditions, the trained re-identification model is used as the target re-identification model.

In some embodiments, the plurality of modalities includes: a color image modality and an infrared image modality.

It should be noted that the foregoing explanations on the training method of the target re-identification model are also applicable to the device of the embodiment of the present disclosure, and will not be repeated here.

Fig. 11 is a schematic diagram of an object re-identification device according to another embodiment of the present disclosure. Referring to Figure 11, the target re-identification device 100 includes:

The fourth acquisition module 1001 is used to acquire a reference image and an image to be recognized, the modalities of the reference image and the image to be recognized are different, and the reference image includes: a reference category;

The recognition module 1002 is configured to respectively input the reference image and the image to be recognized into the target re-recognition model trained by the above-mentioned target re-recognition model training method, so as to obtain the target corresponding to the image to be recognized output by the target re-recognition model, A target has a corresponding target class, and the target class matches the reference class.

In the embodiment of the present disclosure, the object re-identification model trained by the above-mentioned object re-identification model training method can be used to recognize the image to be recognized, and determine the object corresponding to the image to be recognized. Therefore, the features of the image to be recognized can be fully exploited, the accuracy of image matching in different modalities can be enhanced, and the effect of cross-modal target re-recognition can be improved.

According to the embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, a computer program product, and a computer program.

In order to achieve the above-mentioned embodiments, an embodiment of the present disclosure proposes an electronic device, including: at least one processor; and a memory connected to the at least one processor in communication; wherein, the memory stores information that can be processed by the at least one processor. Instructions executed by the processor, the instructions are executed by the at least one processor, so that the at least one processor can execute the target re-identification model training method of the embodiment of the present disclosure, or perform the target re-identification of the embodiment of the present disclosure method.

In order to realize the above-mentioned embodiments, the embodiments of the present disclosure propose a non-transitory computer-readable storage medium storing computer instructions, the computer instructions are used to enable the computer to execute the target re-identification model training method of the embodiments of the present disclosure, Or execute the target re-identification method of the embodiment of the present disclosure.

In order to achieve the above embodiments, the embodiments of the present disclosure also propose a computer program product, when the instruction processor in the computer program product executes, execute the method for training the target re-identification model described in any one of the embodiments of the present disclosure, or Execute the object re-identification method described in any one of the embodiments of the present disclosure.

In order to realize the above-mentioned embodiments, the embodiments of the present disclosure propose a computer program, the computer program includes computer program code, when the computer program code is run on the computer, so that the computer executes any one of the embodiments of the present disclosure. The training method of the target re-identification model, or execute the target re-identification method described in any one of the embodiments of the present disclosure.

Figure 12 shows a block diagram of an exemplary computer device suitable for implementing embodiments of the present disclosure. The computer device 12 shown in FIG. 12 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.

As shown in FIG. 12, computer device 12 takes the form of a general-purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors or processing units 16 , system memory 28 , bus 18 connecting various system components including system memory 28 and processing unit 16 .

Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures. For example, these architectures include but are not limited to Industry Standard Architecture (Industry Standard Architecture; hereinafter referred to as: ISA) bus, Micro Channel Architecture (Micro Channel Architecture; hereinafter referred to as: MAC) bus, enhanced ISA bus, video electronics Standards Association (Video Electronics Standards Association; hereinafter referred to as: VESA) local bus and Peripheral Component Interconnection (hereinafter referred to as: PCI) bus.

Computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12 and include both volatile and nonvolatile media, removable and non-removable media.

The memory 28 may include a computer system readable medium in the form of a volatile memory, such as a random access memory (Random Access Memory; hereinafter referred to as: RAM) 30 and/or a cache memory 32 . Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read and write to non-removable, non-volatile magnetic media (not shown in FIG. 12, commonly referred to as a "hard drive").

Although not shown in FIG. 12, a disk drive for reading and writing to a removable nonvolatile disk (such as a "floppy disk") may be provided, as well as a removable nonvolatile disk (such as a Compact Disk ROM (Compact Disk). Disc Read Only Memory; hereinafter referred to as: CD-ROM), Digital Video Disc Read Only Memory (hereinafter referred to as: DVD-ROM) or other optical media). In these cases, each drive may be connected to bus 18 via one or more data media interfaces. Memory 28 may include at least one program product having a set (eg, at least one) of program modules configured to perform the functions of various embodiments of the present disclosure.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including but not limited to an operating system, one or more application programs, other program modules, and program data , each or some combination of these examples may include implementations of network environments. The program modules 42 generally perform the functions and/or methods of the embodiments described in this disclosure.

The computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, pointing device, display 24, etc.), and with one or more devices that enable a user to interact with the computer device 12, and/or with Any device (eg, network card, modem, etc.) that enables the computing device 12 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interface 22 . Moreover, the computer device 12 can also communicate with one or more networks (such as a local area network (Local Area Network; hereinafter referred to as: LAN), a wide area network (Wide Area Network; hereinafter referred to as: WAN) and/or public networks, such as the Internet, through the network adapter 20. ) communication. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18 . It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives And data backup storage system, etc.

The processing unit 16 executes various functional applications and training of the object re-identification model by running the program stored in the system memory 28 , for example, realizing the training method of the object re-identification model mentioned in the foregoing embodiments.

Other embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any modification, use or adaptation of the present disclosure. These modifications, uses or adaptations follow the general principles of the present disclosure and include common knowledge or conventional technical means in the technical field not disclosed in the present disclosure. . The specification and examples are to be considered exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It should be understood that the present disclosure is not limited to the precise constructions which have been described above and shown in the drawings, and various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

It should be noted that, in the description of the present disclosure, terms such as "first" and "second" are used for description purposes only, and should not be understood as indicating or implying relative importance. In addition, in the description of the present disclosure, unless otherwise specified, "plurality" means two or more.

Any process or method descriptions in flowcharts or otherwise described herein may be understood to represent modules, segments or portions of code comprising one or more executable instructions for implementing specific logical functions or steps of the process , and the scope of preferred embodiments of the present disclosure includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, which shall It is understood by those skilled in the art to which the embodiments of the present disclosure pertain.

It should be understood that various parts of the present disclosure may be implemented in hardware, software, firmware or a combination thereof. In the embodiments described above, various steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented by any one or combination of the following techniques known in the art: Discrete logic circuits, ASICs with suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), etc.

Those of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments can be completed by instructing related hardware through a program, and the program can be stored in a computer-readable storage medium. During execution, one or a combination of the steps of the method embodiments is included.

In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing module, each unit may exist separately physically, or two or more units may be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules. If the integrated modules are realized in the form of software function modules and sold or used as independent products, they can also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk or an optical disk, and the like.

In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific examples", or "some examples" mean that specific features described in connection with the embodiment or example , structure, material or characteristic is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although the embodiments of the present disclosure have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limitations on the present disclosure, and those skilled in the art can understand the above-mentioned embodiments within the scope of the present disclosure. The embodiments are subject to changes, modifications, substitutions and variations.

All the embodiments of the present disclosure can be implemented independently or in combination with other embodiments, which are all regarded as the scope of protection required by the present disclosure.

Claims

A training method for a target re-identification model, the method comprising:

Acquiring multiple images, the multiple images respectively have corresponding multiple modes and corresponding multiple labeled target categories;

Obtaining a plurality of convolutional feature maps respectively corresponding to the various modalities, and obtaining a plurality of edge feature maps respectively corresponding to the various modalities;

Acquiring various kinds of feature distance information respectively corresponding to the multiple modalities; and

According to the plurality of images, the plurality of convolutional feature maps, the plurality of edge feature maps, the plurality of feature distance information, and the plurality of labeled object categories, train an initial re-identification model to obtain an object re-identification model.
The method according to claim 1, wherein said multiple image features, said multiple convolutional feature maps, said multiple edge feature maps, said multiple feature distance information, and said multiple annotations The target category trains the initial re-identification model to obtain the target re-identification model, including:

processing the plurality of images using the initial re-identification model to obtain an initial loss value;

processing the plurality of convolutional feature maps and the plurality of edge feature maps using the initial re-identification model to obtain perceptual edge loss values;

Using the initial re-identification model to process the various feature distance information to obtain a cross-modal center comparison loss value;

The initial re-identification model is trained according to the initial loss value, the perceptual edge loss value, and the cross-modal center comparison loss value, so as to obtain the target re-identification model.
The method according to claim 1 or 2, wherein the initial re-identification model includes: a first network structure, the first network structure is used to identify the convolutional feature map and the edge feature map between Perceived loss value.
The method according to claim 3, wherein the processing of the plurality of convolutional feature maps and the plurality of edge feature maps using the initial re-identification model to obtain a perceptual edge loss value comprises:

inputting the plurality of convolutional feature maps and the plurality of edge feature maps into the first network structure to obtain a plurality of convolution loss feature maps respectively corresponding to the plurality of convolutional feature maps, And obtain a plurality of edge loss feature maps respectively corresponding to the plurality of edge feature maps;

determining a plurality of convolution feature map parameters respectively corresponding to the plurality of convolution loss feature maps, and determining a plurality of edge feature map parameters respectively corresponding to the plurality of edge loss feature maps;

Processing the corresponding plurality of convolution loss feature maps according to the plurality of convolution feature map parameters to obtain a plurality of first perceptual edge loss values;

processing the corresponding plurality of edge loss feature maps according to the plurality of edge feature map parameters to obtain a plurality of second perceptual edge loss values; and

The perceptual edge loss value is generated based on the plurality of first perceptual edge loss values and the plurality of second perceptual edge loss values.
The method according to any one of claims 1 to 4, wherein the initial re-identification model includes: a batch normalization layer, and the acquisition of various feature distance information corresponding to the various modalities respectively includes:

Inputting the plurality of images into the batch normalization layer, respectively, to obtain a plurality of feature vectors respectively corresponding to the plurality of images output by the batch normalization layer;

determining, according to the plurality of feature vectors, feature center points of a plurality of targets respectively corresponding to the plurality of images;

determining a first distance between feature center points of different targets, and determining a second distance between feature center points of the same target corresponding to different modes, the first distance and the second The distances together constitute the various kinds of characteristic distance information.
The method according to any one of claims 2 to 5, wherein said use of said initial re-identification model to process said multiple feature distance information to obtain cross-modal center contrast loss values comprises:

Using the initial re-identification model to determine a first target distance from multiple first distances, the first target distance is the first distance with the smallest median value among the multiple first distances;

The cross-modal center comparison loss value is calculated according to the first target distance, multiple second distances, and the number of targets.
The method according to any one of claims 2 to 6, wherein the initial re-identification model includes: a sequentially connected fully connected layer and an output layer, and the use of the initial re-identification model to process the multiple Image to get the initial loss value, including:

Sequentially input the plurality of images into the fully connected layer and the output layer to obtain a plurality of category feature vectors output by the output layer corresponding to the plurality of images;

determining a plurality of encoding vectors respectively corresponding to the plurality of labeling target categories;

An identity loss value is generated according to the plurality of category feature vectors and the corresponding encoding vectors, and the identity loss value is used as the initial loss value.
The method according to any one of claims 5 to 7, wherein said processing said plurality of images using said initial re-identification model to obtain an initial loss value comprises:

performing image division on the plurality of images with reference to the plurality of labeled target categories to obtain a triplet sample set, the triplet sample set including: the plurality of images, the plurality of first images, and the plurality of second images, the multiple first images correspond to the same labeled target category, and the multiple second images correspond to different labeled target categories;

determining a first Euclidean distance between a feature vector of the image and a feature vector of the first image, the feature vector output by the batch normalization layer;

determining a second Euclidean distance between the feature vector of the image and the feature vector of the second image; and

A ternary loss value is determined according to the plurality of first Euclidean distances and the plurality of second Euclidean distances, and the ternary loss value is used as the initial loss value.
The method according to any one of claims 2 to 8, wherein said initial re-identification model is trained according to said initial loss value, said perceptual edge loss value, and said cross-modal center contrast loss value , to obtain the target re-identification model, including:

generating a target loss value based on the initial loss value, the perceptual edge loss value, and the cross-modal center comparison loss value;

If the target loss value satisfies the set condition, the re-identification model obtained through training is used as the target re-identification model.
The method according to any one of claims 1 to 9, wherein the plurality of modalities include: a color image modality and an infrared image modality.
A target re-identification method, comprising:

Obtaining a reference image and an image to be recognized, the modes of the reference image and the image to be recognized are different, and the reference image includes: a reference category;

The reference image and the image to be recognized are respectively input into the target re-identification model trained by the target re-identification model training method according to any one of claims 1-10, so as to obtain the target re-identification The target corresponding to the image to be recognized outputted by the model has a corresponding target category, and the target category matches the reference category.
A training device for a target re-identification model, comprising:

The first acquisition module is configured to acquire a plurality of images, and the plurality of images respectively have corresponding multiple modalities and corresponding multiple labeled target categories;

The second acquisition module is configured to acquire a plurality of convolutional feature maps respectively corresponding to the various modalities, and obtain a plurality of edge feature maps respectively corresponding to the various modalities;

A third acquiring module, configured to acquire various kinds of feature distance information respectively corresponding to the various modalities; and

A training module, configured to train initial re-identification according to the plurality of images, the plurality of convolutional feature maps, the plurality of edge feature maps, the plurality of feature distance information, and the plurality of labeled target categories model to get the target re-identification model.
The apparatus of claim 12, wherein said training module comprises:

The first processing submodule is used to process the plurality of images using the initial re-identification model to obtain an initial loss value;

The second processing submodule is used to process the plurality of convolutional feature maps and the plurality of edge feature maps using the initial re-identification model to obtain a perceptual edge loss value;

The third processing submodule is used to process the various feature distance information using the initial re-identification model to obtain a cross-modal center comparison loss value;

The training submodule is configured to train the initial re-identification model according to the initial loss value, the perceptual edge loss value, and the cross-modal center comparison loss value, so as to obtain the target re-identification model.
The device according to claim 12 or 13, wherein the initial re-identification model includes: a first network structure, the first network structure is used to identify the convolutional feature map and the edge feature map between Perceived loss value.
The device according to claim 13 or 14, wherein the second processing submodule is specifically used for:

inputting the plurality of convolutional feature maps and the plurality of edge feature maps into the first network structure to obtain a plurality of convolution loss feature maps respectively corresponding to the plurality of convolutional feature maps, And obtain a plurality of edge loss feature maps respectively corresponding to the plurality of edge feature maps;

determining a plurality of convolution feature map parameters respectively corresponding to the plurality of convolution loss feature maps, and determining a plurality of edge feature map parameters respectively corresponding to the plurality of edge loss feature maps;

Processing the corresponding plurality of convolution loss feature maps according to the plurality of convolution feature map parameters to obtain a plurality of first perceptual edge loss values;

processing the corresponding plurality of edge loss feature maps according to the plurality of edge feature map parameters to obtain a plurality of second perceptual edge loss values; and

The perceptual edge loss value is generated based on the plurality of first perceptual edge loss values and the plurality of second perceptual edge loss values.
The device according to any one of claims 12 to 15, wherein the initial re-identification model comprises: a batch normalization layer, and the third acquisition module comprises:

A normalization processing submodule, configured to input the plurality of images into the batch normalization layer respectively, so as to obtain a plurality of feature vectors respectively corresponding to the plurality of images output by the batch normalization layer;

A central point determination submodule, configured to determine, according to the multiple feature vectors, the feature center points of multiple targets corresponding to the multiple images;

A distance determining submodule, configured to determine a first distance between feature center points of different targets, and determine a second distance between feature center points corresponding to different modes of the same target, the first The first distance and the second distance together constitute the various kinds of characteristic distance information.
The device according to claim 16, wherein the third processing submodule is specifically used for:

Using the initial re-identification model to determine a first target distance from multiple first distances, the first target distance is the first distance with the smallest median value among the multiple first distances;

The cross-modal center comparison loss value is calculated according to the first target distance, multiple second distances, and the number of targets.
The device according to any one of claims 13 to 17, wherein the initial re-identification model includes: a sequentially connected fully connected layer and an output layer, and the first processing submodule is specifically used for:

Sequentially input the plurality of images into the fully connected layer and the output layer to obtain a plurality of category feature vectors output by the output layer corresponding to the plurality of images;

determining a plurality of encoding vectors respectively corresponding to the plurality of labeling target categories;

An identity loss value is generated according to the plurality of category feature vectors and the corresponding encoding vectors, and the identity loss value is used as the initial loss value.
The device according to any one of claims 16 to 18, wherein the first processing submodule is specifically used for:

performing image division on the plurality of images with reference to the plurality of labeled target categories to obtain a triplet sample set, the triplet sample set including: the plurality of images, the plurality of first images, and the plurality of second images, the multiple first images correspond to the same labeled target category, and the multiple second images correspond to different labeled target categories;

determining a first Euclidean distance between a feature vector of the image and a feature vector of the first image, the feature vector output by the batch normalization layer;

determining a second Euclidean distance between the feature vector of the image and the feature vector of the second image; and

A ternary loss value is determined according to the plurality of first Euclidean distances and the plurality of second Euclidean distances, and the ternary loss value is used as the initial loss value.
The device according to any one of claims 13 to 19, wherein the training submodule is specifically used for:

generating a target loss value based on the initial loss value, the perceptual edge loss value, and the cross-modal center comparison loss value;

If the target loss value satisfies the set condition, the re-identification model obtained through training is used as the target re-identification model.
The device according to any one of claims 12 to 20, wherein said plurality of modalities comprises: a color image modality and an infrared image modality.
A target re-identification device, comprising:

The fourth acquisition module is used to acquire a reference image and an image to be recognized, the modalities of the reference image and the image to be recognized are different, and the reference image includes: a reference category;

A recognition module, configured to input the reference image and the image to be recognized into the target re-recognition model trained by the target re-recognition model training device according to any one of claims 12-21, so as to obtain The object corresponding to the image to be recognized outputted by the object re-identification model has a corresponding object category, and the object category matches the reference category.
An electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can perform any one of claims 1-10. method, or perform the method described in claim 11.
A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to make the computer execute the method according to any one of claims 1-10, or execute the method described in claim 11 Methods.
A computer program product, wherein the computer program product includes computer program code, when the computer program code is run on a computer, to perform the method according to any one of claims 1-10, or to perform the right The method described in claim 11.
A computer program, wherein the computer program includes computer program code, when the computer program code is run on the computer, so that the computer executes the method according to any one of claims 1-10, or executes the claim 11 as described in the method.