CN116664833A - Method for improving target re-identification model capability and target re-identification method - Google Patents

Method for improving target re-identification model capability and target re-identification method

Info

Publication number
CN116664833A
Authority
CN
China
Prior art keywords
network
reid
semantic segmentation
convolution
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310512340.1A
Other languages
Chinese (zh)
Inventor
孙宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuncong Technology Group Co Ltd
Original Assignee
Yuncong Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuncong Technology Group Co Ltd filed Critical Yuncong Technology Group Co Ltd
Priority to CN202310512340.1A
Publication of CN116664833A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of computer vision, and in particular to a method for improving the capability of a target re-identification model and a target re-identification method, aiming to solve the problems of low accuracy and poor generalization in existing target re-identification models. To this end, the method for improving the capability of a target re-identification model comprises the following steps: inputting a sample image into a backbone network to obtain a ReID feature map; and inputting the ReID feature map into both a ReID detection head network and a semantic segmentation network used to extract the region of interest of the sample image, and jointly training the two networks, wherein the total loss function used in joint training is a weighted sum of a first loss function used by the ReID detection head network and a second loss function used by the semantic segmentation network, thereby obtaining an optimized ReID detection head network.

Description

Method for improving target re-identification model capability and target re-identification method
Technical Field
The invention relates to the field of computer vision, and in particular provides a method for improving the capability of a target re-identification model and a target re-identification method.
Background
With the continuous development of modern technology, camera surveillance has become an essential safety guarantee in modern life. Pedestrian re-identification uses image processing techniques from the field of computer vision to retrieve images of a given pedestrian across scenes and across cameras, breaking through the field-of-view limitation of a single camera. The technology has a wide range of applications; for example, in video surveillance for public security, a single picture suffices to locate the trajectory of a target person in real time, which greatly assists the cross-region capture of suspects.
Because of the significant domain gaps between different datasets, cross-camera and cross-domain matching remains a major difficulty in pedestrian re-identification. For example, among the existing public datasets, Market-1501 was collected on a domestic campus in summer, while DukeMTMC-ReID was collected on a foreign campus in winter; this pronounced environmental difference creates a domain gap between the two datasets. Moreover, much of the training data currently used comes from surveillance of streets, stations, shops, retail stores, airports, and so on, where scene styles differ, and factors such as camera angle and lighting widen the domain gap further. As a result, a model fitted on the source domain suffers a significant drop in test performance on the target domain.
Accordingly, there is a need in the art for a method for improving the capability of a target re-identification model, and a target re-identification method, that address the above problems.
Disclosure of Invention
In order to overcome the above drawbacks, the present invention provides a method for improving the capability of a target re-identification model and a target re-identification method, which solve, or at least partially solve, the technical problems of low re-identification accuracy and poor generalization in existing target re-identification models.
In a first aspect, the present invention provides a method for improving the capability of a target re-identification model, comprising the following steps:
inputting the sample image into a backbone network to obtain a ReID feature map;
and inputting the ReID feature map into both a ReID detection head network and a semantic segmentation network used to extract the region of interest of the sample image, and jointly training the ReID detection head network and the semantic segmentation network, wherein the total loss function used in joint training is a weighted sum of a first loss function used by the ReID detection head network and a second loss function used by the semantic segmentation network, thereby obtaining an optimized backbone network and ReID detection head network.
In a specific embodiment, the method further comprises:
inputting the sample image into a segmentation model to obtain a first mask map corresponding to the sample image, wherein the first mask map is used to distinguish the region of interest from the background region.
In one specific embodiment, in training the semantic segmentation network, the method comprises:
generating a semantic segmentation feature map corresponding to the ReID feature map through the semantic segmentation network;
acquiring a second mask map aligned with the semantic segmentation feature map according to the first mask map;
inputting the second mask map and the semantic segmentation feature map into the second loss function, and optimizing the semantic segmentation network so that the probability values in the semantic segmentation map corresponding to the region of interest in the second mask map are driven high and those corresponding to the background region are driven low.
In one embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network is a 4-layer feature pyramid that upsamples progressively from top to bottom: the 1st pyramid layer applies a 1×1 convolution to the output of the 5th convolution layer and upsamples it by 2×, and the first fused feature is obtained by fusing this output with the 1×1-convolved output of the 4th convolution layer; the 2nd pyramid layer upsamples the first fused feature by 2× and fuses it with the 1×1-convolved output of the 3rd convolution layer to obtain the second fused feature; the 3rd pyramid layer upsamples the second fused feature by 2× and fuses it with the 1×1-convolved output of the 2nd convolution layer to obtain the third fused feature; and the 4th pyramid layer produces the semantic segmentation feature map from the third fused feature.
In another embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses the first two of the 5 convolution layers, and the 2nd convolution layer outputs the semantic segmentation feature map.
In a further embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses all 5 convolution layers, and the 5th convolution layer outputs the semantic segmentation feature map.
In a specific embodiment, the obtaining, according to the first mask map, a second mask map aligned with the semantic segmentation feature map includes:
pooling the first mask map by max pooling to obtain the second mask map; or
pooling the first mask map by voting to obtain the second mask map.
In a specific embodiment, in training the ReID detection head network, the method includes:
sequentially passing the ReID feature map through a pooling layer and a convolution layer, and obtaining the distance between positive and negative sample images through a triplet loss function; and
passing the convolved features sequentially through a normalization layer and a fully connected layer, and obtaining a classification result through a cross-entropy function.
In a specific embodiment, the method comprises:
and coordinating the convergence speed of the ReID detection head network and the semantic segmentation network by dynamically adjusting the weighting coefficient of the second loss function.
In a specific embodiment, the total loss function is:
L_total = L_reid + α · (cur_iter / total_iter) · L_seg
where L_reid is the first loss function, L_seg is the second loss function, α is a hyperparameter, cur_iter is the current iteration index, and total_iter is the total number of iterations.
In a specific embodiment, before the joint training of the ReID detection head network and the semantic segmentation network, the method further comprises:
training the backbone network and the ReID detection head network in advance for a preset number of iterations.
In a second aspect, the present invention provides a target re-identification method, comprising:
performing target re-identification using a ReID detection head network optimized according to the method of any embodiment of the first aspect.
In a specific embodiment, the target comprises one or more of a pedestrian and a vehicle.
One or more of the above technical solutions of the present invention has at least the following beneficial effects:
in the technical solution of the present invention, a semantic segmentation branch is added to the target re-identification model to constrain the network to attend to foreground information, thereby reducing the influence of the background, improving the domain generalization of pedestrian features, and improving the re-identification accuracy of existing target re-identification models.
Drawings
The present disclosure will become more readily understood with reference to the accompanying drawings. As will be readily appreciated by those skilled in the art: the drawings are for illustrative purposes only and are not intended to limit the scope of the present invention. Moreover, like numerals in the figures are used to designate like parts, wherein:
FIG. 1 is a flow chart of the main steps of a method for improving the capability of a target re-identification model according to one embodiment of the invention;
FIG. 2 is a schematic diagram of the joint training of the ReID detection head network and the semantic segmentation network according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for training a semantic segmentation network according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of a semantic segmentation feature map corresponding to a ReID feature map generated by a semantic segmentation network according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of obtaining, from the first mask map, a second mask map aligned with the semantic segmentation feature map, according to one embodiment of the invention;
FIG. 6 is a schematic diagram of a process of training a ReID detection head network according to one embodiment of the invention.
Detailed Description
Some embodiments of the invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
In the description of the present invention, a "module," "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, memory, or software components, such as program code, or a combination of software and hardware. The processor may be a central processor, a microprocessor, an image processor, a digital signal processor, or any other suitable processor. The processor has data and/or signal processing functions. The processor may be implemented in software, hardware, or a combination of both. Non-transitory computer readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random access memory, and the like. The term "a and/or B" means all possible combinations of a and B, such as a alone, B alone or a and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include A alone, B alone or A and B. The singular forms "a", "an" and "the" include plural referents.
As used herein, directional terms such as "front", "front side", "rear side", and "rear" refer to the fore-aft direction of a vehicle in which components are mounted. Likewise, "longitudinal" and "longitudinal section" refer to the fore-aft direction of a component after installation in a vehicle, while "transverse" and "cross section" are taken relative to that longitudinal direction.
In order to solve the above technical problems, namely the low accuracy and poor generalization of existing target re-identification models, the present invention provides a method for improving the capability of a target re-identification model, and a target re-identification method.
In a first aspect, the present invention provides a method for improving the capability of a target re-identification model; referring to fig. 1, the method comprises the following steps S1-S2:
S1, inputting the sample image into a backbone network to obtain a ReID feature map.
In one example, the sample image is a batch of pedestrian images; after data augmentation such as random image padding, image cropping, image flipping, and image erasing is applied to the pedestrian images, they are input into the backbone network to generate 2048×12×6 ReID feature maps.
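By way of illustration only, such an augmentation and feature-extraction pipeline might be sketched as follows. PyTorch and torchvision are assumed here (the patent does not name a framework), as is the 384×192 input size implied by the 12×6 output:

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50

# Hypothetical augmentation pipeline covering the operations listed above:
# random image filling (padding), cropping, flipping, and erasing.
# It would be applied to each PIL image during data loading.
transform = T.Compose([
    T.Resize((384, 192)),           # assumed pedestrian-crop size
    T.Pad(10),                      # random image filling
    T.RandomCrop((384, 192)),       # image cropping
    T.RandomHorizontalFlip(p=0.5),  # image flipping
    T.ToTensor(),
    T.RandomErasing(p=0.5),         # image erasing
])

# A ResNet-50 trunk without its pooling/classification head yields a
# 2048-channel feature map; with a 384x192 input the spatial size is 12x6.
backbone = torch.nn.Sequential(*list(resnet50(weights=None).children())[:-2])
x = torch.randn(1, 3, 384, 192)     # stand-in for one augmented sample image
reid_feat = backbone(x)             # -> [1, 2048, 12, 6]
```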
S2, inputting the ReID feature map into both a ReID detection head network and a semantic segmentation network used to extract the region of interest of the sample image, and jointly training the two networks, wherein the total loss function used in joint training is a weighted sum of a first loss function used by the ReID detection head network and a second loss function used by the semantic segmentation network, thereby obtaining an optimized backbone network and ReID detection head network.
In one example, as shown in fig. 2, the ReID detection head network and the semantic segmentation network share the same encoder (i.e., the backbone network) with shared weights. The ReID feature map is input into the ReID detection head network and the semantic segmentation network respectively for joint training; the loss functions of the two branches are obtained and summed with weights, so that the model parameters of the backbone network and the ReID detection head network are optimized and the capability of the target re-identification model is improved.
It should be noted that training the two network branches on the same batch of sample images improves the accuracy and generalization of the target re-identification model, compared with training the ReID detection head branch on ReID data and the semantic segmentation branch on a separate segmentation dataset.
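A minimal sketch of one such joint-training step is given below, under the assumption of a PyTorch-style API; all module and loss names are placeholders rather than the patent's notation:

```python
import torch

def joint_training_step(backbone, reid_head, seg_head, images, labels, mask,
                        reid_loss_fn, seg_loss_fn, optimizer, seg_weight):
    """One joint-training step: both branches read the same shared backbone
    features, and the total loss is the weighted sum of the branch losses."""
    feat = backbone(images)            # shared ReID feature map
    reid_out = reid_head(feat)         # ReID detection head branch
    seg_out = seg_head(feat)           # semantic segmentation branch

    loss_reid = reid_loss_fn(reid_out, labels)   # first loss function
    loss_seg = seg_loss_fn(seg_out, mask)        # second loss function
    total = loss_reid + seg_weight * loss_seg    # weighted summation

    optimizer.zero_grad()
    total.backward()                   # gradients reach both heads and the
    optimizer.step()                   # shared backbone (weight sharing)
    return total.item()
```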
In a specific embodiment, the method further comprises:
inputting the sample image into a segmentation model to obtain a first mask map corresponding to the sample image, wherein the first mask map is used to distinguish the region of interest from the background region.
For example, using Swin-B as the initial model, the model weights are pre-trained on the COCO component segmentation dataset; a portion of high-quality images from the sample images is selected as a training subset, the region of interest and the background region are manually labeled, and further data are then added to the training in batches. During training, pedestrians and their personal belongings are continually corrected manually to serve as the region of interest, with everything else as the background region; by iterating the cycle of training the model, running inference, and correcting the results, a segmentation model of higher accuracy is finally obtained.
Further, a first mask map distinguishing the region of interest from the background region can be obtained by inputting the sample image into the segmentation model; the first mask map is a binary 0-1 matrix in which 1 denotes the region of interest and 0 denotes the background region.
It will be appreciated by those skilled in the art that the first mask map is not input into the backbone network; it is used only when computing the loss function of the semantic segmentation network.
In one embodiment, referring to fig. 3, in training the semantic segmentation network, the method includes the following steps S20-S22:
S20, generating a semantic segmentation feature map corresponding to the ReID feature map through the semantic segmentation network.
In one embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network is a 4-layer feature pyramid that upsamples progressively from top to bottom: the 1st pyramid layer applies a 1×1 convolution to the output of the 5th convolution layer and upsamples it by 2×, and the first fused feature is obtained by fusing this output with the 1×1-convolved output of the 4th convolution layer; the 2nd pyramid layer upsamples the first fused feature by 2× and fuses it with the 1×1-convolved output of the 3rd convolution layer to obtain the second fused feature; the 3rd pyramid layer upsamples the second fused feature by 2× and fuses it with the 1×1-convolved output of the 2nd convolution layer to obtain the third fused feature; and the 4th pyramid layer produces the semantic segmentation feature map from the third fused feature.
For example, as shown in fig. 4, the sample image passes through the 5 convolution layers of the ResNet to produce the ReID feature map C5 (size [1,2048,12,6]); a 1×1 convolution kernel compresses its channel count (size [1,64,12,6]), and bilinear interpolation upsamples it by 2× (size [1,64,24,12]). The C4 feature on the left has size [1,1024,24,12]; a 1×1 convolution kernel compresses its channel count to [1,64,24,12], and the two feature maps are added to produce the first fused feature. This proceeds in the same way until the final layer outputs the semantic segmentation feature map (size [1,64,96,48]).
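A sketch of this pyramid branch follows; it reproduces the tensor sizes of the example above (three 2× bilinear upsampling steps fusing the 1×1-convolved outputs of the 4th, 3rd, and 2nd convolution layers), while the class name and any detail not fixed by the text are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegFPN(nn.Module):
    """Top-down pyramid: 1x1 lateral convolutions compress each backbone
    stage to 64 channels; each step upsamples by 2x (bilinear) and adds
    the lateral feature of the next higher-resolution stage."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), mid=64):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, mid, kernel_size=1) for c in in_channels)

    def forward(self, c2, c3, c4, c5):
        # c2..c5: outputs of the backbone's 2nd..5th convolution layers
        p = self.laterals[3](c5)                       # [N, 64, 12, 6]
        for lateral, c in zip(self.laterals[2::-1], (c4, c3, c2)):
            p = F.interpolate(p, scale_factor=2.0, mode='bilinear',
                              align_corners=False)     # 2x upsampling
            p = p + lateral(c)                         # fuse by addition
        return p                                       # [N, 64, 96, 48]

seg_fpn = SegFPN()
c2, c3 = torch.randn(1, 256, 96, 48), torch.randn(1, 512, 48, 24)
c4, c5 = torch.randn(1, 1024, 24, 12), torch.randn(1, 2048, 12, 6)
print(seg_fpn(c2, c3, c4, c5).shape)  # torch.Size([1, 64, 96, 48])
```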
In another embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses the first two of the 5 convolution layers, and the 2nd convolution layer outputs the semantic segmentation feature map.
In a further embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses all 5 convolution layers, and the 5th convolution layer outputs the semantic segmentation feature map.
S21, acquiring a second mask map aligned with the semantic segmentation feature map according to the first mask map.
For example, the first mask map is downsampled to the resolution of the semantic segmentation feature map so that the two can be compared pixel by pixel, thereby obtaining the second mask map.
In a preferred example, before the pixel-by-pixel comparison, the first mask map may undergo data augmentation operations such as random image padding, image cropping, image flipping, and image erasing.
S22, inputting the second mask map and the semantic segmentation feature map into the second loss function, and optimizing the semantic segmentation network so that the probability values in the semantic segmentation map corresponding to the region of interest in the second mask map are driven high and those corresponding to the background region are driven low.
In a preferred embodiment, the second loss function of the semantic segmentation network is the binary cross-entropy loss:
L_seg = -(y·log(p(x)) + (1-y)·log(1-p(x)))
where y is the value (1 or 0) of each pixel in the second mask map, x is the value of the corresponding pixel in the segmentation feature map, and p(·) is the Sigmoid activation function, which normalizes the feature value to a probability in [0, 1]. When y = 1, L_seg = -log(p(x)); when y = 0, L_seg = -log(1-p(x)).
Further, the optimization drives the probability values in the semantic segmentation map corresponding to the region of interest (y = 1) in the second mask map toward 1, and those corresponding to the background region (y = 0) toward 0.
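In PyTorch terms this loss might be sketched as follows, with the Sigmoid folded into the loss for numerical stability; the 1×1 projection from the 64-channel feature map to a single logit per pixel is an assumption, since the text only specifies a per-pixel probability:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

to_logit = nn.Conv2d(64, 1, kernel_size=1)   # hypothetical per-pixel head

seg_feat = torch.randn(1, 64, 96, 48)        # semantic segmentation feature map
mask2 = torch.randint(0, 2, (1, 1, 96, 48)).float()  # second mask map (0/1)

logits = to_logit(seg_feat)
# Computes -(y*log(p(x)) + (1-y)*log(1-p(x))) with p = Sigmoid,
# averaged over all pixels.
loss_seg = F.binary_cross_entropy_with_logits(logits, mask2)
```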
In a specific embodiment, the obtaining, according to the first mask map, a second mask map aligned with the semantic segmentation feature map includes:
pooling the first mask map by max pooling to obtain the second mask map; or
pooling the first mask map by voting to obtain the second mask map.
For example, as shown in fig. 5, a first mask map of 4×4 pixels is pooled into a second mask map of 2×2 pixels, with each of the four 2×2 bins containing anywhere from zero to four 1s. With max pooling, a bin pools to 1 as long as it contains a single 1. With voting, a bin pools to whichever value is in the majority (a tie defaults to 1, or alternatively to 0), so the downsampled second mask map is more accurate than with max pooling.
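A sketch of both alignment modes, assuming a 2× stride as in the figure (the tie-breaking rule for voting, here defaulting to 1, is an assumption as noted above):

```python
import torch
import torch.nn.functional as F

def downsample_mask(mask, stride=2, mode="vote"):
    """Align a 0/1 mask map with a lower-resolution feature map.
    mode="max":  a bin pools to 1 if it contains any 1 (max pooling).
    mode="vote": a bin pools to the majority value of its pixels."""
    m = mask.float()
    if mode == "max":
        return F.max_pool2d(m, kernel_size=stride)
    # Voting: average each bin, then threshold at 0.5 (ties default to 1).
    return (F.avg_pool2d(m, kernel_size=stride) >= 0.5).float()

first_mask = torch.tensor([[[[1., 0., 0., 0.],
                             [0., 0., 0., 0.],
                             [1., 1., 0., 1.],
                             [1., 0., 1., 1.]]]])
print(downsample_mask(first_mask, mode="max"))   # any 1 in a bin -> 1
print(downsample_mask(first_mask, mode="vote"))  # only majority-1 bins -> 1
```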
In one embodiment, referring to fig. 6, in training the ReID detection head network, the method includes the following steps S23-S24:
S23, sequentially passing the ReID feature map through a pooling layer and a convolution layer, and obtaining the distance between positive and negative sample images through a triplet loss function;
S24, passing the convolved features sequentially through a normalization layer and a fully connected layer, and obtaining a classification result through a cross-entropy function.
In a specific embodiment, the method comprises:
and coordinating the convergence speed of the ReID detection head network and the semantic segmentation network by dynamically adjusting the weighting coefficient of the second loss function.
Further, this dynamic weight adjustment makes the weight of the segmentation loss function increase with the number of training iterations; compared with a direct weighted summation with fixed coefficients, it markedly improves the convergence speed of the network model.
In a specific embodiment, the total loss function is:
L_total = L_reid + α · (cur_iter / total_iter) · L_seg
where L_reid is the first loss function, L_seg is the second loss function, α is a hyperparameter, cur_iter is the current iteration index, and total_iter is the total number of iterations.
In a specific embodiment, before the joint training of the ReID detection head network and the semantic segmentation network, the method further comprises:
training the backbone network and the ReID detection head network in advance for a preset number of iterations.
In one example, the ReID feature map is first input into the ReID detection head network for several rounds of training, after which the semantic segmentation branch is added to the training; the semantic segmentation network and the ReID detection head network share the same encoder (i.e., the backbone network) with shared weights. By delaying the training of the semantic segmentation branch, segmentation feature maps of higher semantic quality can be extracted, which improves the model as a whole.
In a second aspect, the present invention further provides a target re-identification method, comprising:
performing target re-identification using a ReID detection head network optimized by the method according to any embodiment of the first aspect of the invention.
It should be noted that the semantic segmentation network serves only as an auxiliary network that imposes a constraint on the ReID detection head network, so the model attends more to the region of interest and less to the background region, which markedly improves its generalization. In the test/inference stage, the semantic segmentation branch can be removed, reducing the network's computation without affecting inference speed.
In a specific embodiment, the target comprises one or more of a pedestrian and a vehicle.
Although the embodiments of the present invention take pedestrians as the re-identification target, other targets such as vehicles may be used in practical applications; the present invention is not limited in this respect.
It will be appreciated by those skilled in the art that all or part of the above-described methods of the embodiments may be implemented by instructing relevant hardware through a computer program, which may be stored in a computer readable storage medium; when executed by a processor, the computer program implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, and so on. The computer readable storage medium may include: any entity or device capable of carrying the computer program code, a medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer readable storage medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, the computer readable storage medium does not include electrical carrier signals and telecommunications signals.
Further, the invention also provides a control device. In one control-device embodiment according to the present invention, the control device comprises a processor and a storage device; the storage device may be configured to store a program for executing the method for improving the capability of a target re-identification model and the target re-identification method of the above method embodiments, and the processor may be configured to execute that program. For convenience of explanation, only the portions relevant to the embodiments of the present invention are shown; for specific technical details that are not disclosed, please refer to the method portions of the embodiments. The control device may be formed of various electronic devices.
Further, the invention also provides a computer readable storage medium. In one embodiment of the computer readable storage medium according to the present invention, the medium may be configured to store a program for executing the method for improving the capability of a target re-identification model and the target re-identification method of the above method embodiments; the program can be loaded and executed by a processor to implement those methods. For convenience of explanation, only the portions relevant to the embodiments of the present invention are shown; for specific technical details that are not disclosed, please refer to the method portions of the embodiments. The computer readable storage medium may be a storage device comprising various electronic devices; optionally, the computer readable storage medium in the embodiments of the present invention is a non-transitory computer readable storage medium.
Further, it should be understood that, since the modules are merely illustrated as functional units of the apparatus of the present invention, the physical devices corresponding to these modules may be the processor itself, or parts of software, hardware, or combinations thereof within the processor. Accordingly, the number of individual modules in the figures is merely illustrative.
Those skilled in the art will appreciate that the various modules in the apparatus may be adaptively split or combined. Such splitting or combining of specific modules does not cause the technical solution to deviate from the principle of the present invention, and therefore, the technical solution after splitting or combining falls within the protection scope of the present invention.
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions of related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modified or substituted solutions will fall within the scope of the present invention.

Claims (13)

1. A method for improving the capability of a target re-identification model, comprising:
inputting the sample image into a backbone network to obtain a ReID feature map;
and inputting the ReID feature map into both a ReID detection head network and a semantic segmentation network used to extract the region of interest of the sample image, and jointly training the ReID detection head network and the semantic segmentation network, wherein the total loss function used in joint training is a weighted sum of a first loss function used by the ReID detection head network and a second loss function used by the semantic segmentation network, thereby obtaining an optimized backbone network and ReID detection head network.
2. The method according to claim 1, wherein the method further comprises:
inputting the sample image into a segmentation model to obtain a first mask map corresponding to the sample image, wherein the first mask map is used to distinguish the region of interest from the background region.
3. The method of claim 2, wherein during training of the semantic segmentation network, the method comprises:
generating a semantic segmentation feature map corresponding to the ReID feature map through the semantic segmentation network;
acquiring a second mask map aligned with the semantic segmentation feature map according to the first mask map;
inputting the second mask map and the semantic segmentation feature map into the second loss function, and optimizing the semantic segmentation network so that the probability values in the semantic segmentation map corresponding to the region of interest in the second mask map are driven high and those corresponding to the background region are driven low.
4. The method of claim 3, wherein
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network is a 4-layer feature pyramid that upsamples progressively from top to bottom: the 1st pyramid layer applies a 1×1 convolution to the output of the 5th convolution layer and upsamples it by 2×, and the first fused feature is obtained by fusing this output with the 1×1-convolved output of the 4th convolution layer; the 2nd pyramid layer upsamples the first fused feature by 2× and fuses it with the 1×1-convolved output of the 3rd convolution layer to obtain the second fused feature; the 3rd pyramid layer upsamples the second fused feature by 2× and fuses it with the 1×1-convolved output of the 2nd convolution layer to obtain the third fused feature; and the 4th pyramid layer produces the semantic segmentation feature map from the third fused feature.
5. The method of claim 3, wherein
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses the first two of the 5 convolution layers, and the 2nd convolution layer outputs the semantic segmentation feature map.
6. The method of claim 3, wherein
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses all 5 convolution layers, and the 5th convolution layer outputs the semantic segmentation feature map.
7. The method according to any of claims 3-6, wherein obtaining, from the first mask map, a second mask map aligned with the semantic segmentation feature map comprises:
pooling the first mask map by max pooling to obtain the second mask map; or
pooling the first mask map by voting to obtain the second mask map.
8. The method according to claim 1, wherein in training the ReID detection head network, the method comprises:
sequentially passing the ReID feature map through a pooling layer and a convolution layer, and obtaining the distance between positive and negative sample images through a triplet loss function; and
passing the convolved features sequentially through a normalization layer and a fully connected layer, and obtaining a classification result through a cross-entropy function.
9. The method according to claim 1, characterized in that the method comprises:
and coordinating the convergence speed of the ReID detection head network and the semantic segmentation network by dynamically adjusting the weighting coefficient of the second loss function.
10. The method of claim 9, wherein the total loss function is:
L_total = L_reid + α · (cur_iter / total_iter) · L_seg
where L_reid is the first loss function, L_seg is the second loss function, α is a hyperparameter, cur_iter is the current iteration index, and total_iter is the total number of iterations.
11. The method of claim 1, wherein, prior to the joint training of the ReID detection head network and the semantic segmentation network, the method further comprises:
training the backbone network and the ReID detection head network in advance for a preset number of iterations.
12. A method of target re-identification, comprising:
performing target re-identification using a ReID detection head network optimized according to the method of any one of claims 1-11.
13. The method of claim 12, wherein the target comprises one or more of a pedestrian and a vehicle.
CN202310512340.1A 2023-05-08 2023-05-08 Method for improving target re-identification model capacity and target re-identification method Pending CN116664833A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310512340.1A CN116664833A (en) 2023-05-08 2023-05-08 Method for improving target re-identification model capacity and target re-identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310512340.1A CN116664833A (en) 2023-05-08 2023-05-08 Method for improving target re-identification model capacity and target re-identification method

Publications (1)

Publication Number Publication Date
CN116664833A true CN116664833A (en) 2023-08-29

Family

ID=87712739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310512340.1A Pending CN116664833A (en) 2023-05-08 2023-05-08 Method for improving target re-identification model capacity and target re-identification method

Country Status (1)

Country Link
CN (1) CN116664833A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392181A (en) * 2023-12-13 2024-01-12 安徽蔚来智驾科技有限公司 Motion information prediction method, computer equipment, storage medium and intelligent equipment

Similar Documents

Publication Publication Date Title
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
Hu et al. Underwater image restoration based on convolutional neural network
CN111292264A (en) Image high dynamic range reconstruction method based on deep learning
US20100232685A1 (en) Image processing apparatus and method, learning apparatus and method, and program
CN113592736A (en) Semi-supervised image deblurring method based on fusion attention mechanism
CN111079764B (en) Low-illumination license plate image recognition method and device based on deep learning
CN113762209A (en) Multi-scale parallel feature fusion road sign detection method based on YOLO
CN112149476B (en) Target detection method, device, equipment and storage medium
CN113065645A (en) Twin attention network, image processing method and device
CN112784834A (en) Automatic license plate identification method in natural scene
CN116664833A (en) Method for improving target re-identification model capacity and target re-identification method
CN114463218A (en) Event data driven video deblurring method
CN114359669A (en) Picture analysis model adjusting method and device and computer readable storage medium
CN115880177A (en) Full-resolution low-illumination image enhancement method for aggregating context and enhancing details
CN117409083B (en) Cable terminal identification method and device based on infrared image and improved YOLOV5
CN110942097A (en) Imaging-free classification method and system based on single-pixel detector
CN113379861B (en) Color low-light-level image reconstruction method based on color recovery block
CN113393385B (en) Multi-scale fusion-based unsupervised rain removing method, system, device and medium
CN113628143A (en) Weighted fusion image defogging method and device based on multi-scale convolution
CN117011160A (en) Single image rain removing method based on dense circulation network convergence attention mechanism
KR102514531B1 (en) Apparatus and method for improving images in poor weather
CN114565597B (en) Night road pedestrian detection method based on YOLO v3-tiny-DB and transfer learning
CN114119428B (en) Image deblurring method and device
Abbasi et al. Fog-aware adaptive yolo for object detection in adverse weather
Zhou et al. Multi-scale and attention residual network for single image dehazing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination