CN116664833A - Method for improving target re-identification model capability and target re-identification method - Google Patents

Method for improving target re-identification model capability and target re-identification method

Info

Publication number
CN116664833A
Authority
CN
China
Prior art keywords
network
reid
semantic segmentation
convolution
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310512340.1A
Other languages
Chinese (zh)
Inventor
孙宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuncong Technology Group Co Ltd
Original Assignee
Yuncong Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuncong Technology Group Co Ltd filed Critical Yuncong Technology Group Co Ltd
Priority to CN202310512340.1A
Publication of CN116664833A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of computer vision, and in particular to a method for improving the capability of a target re-identification model and a target re-identification method, aiming to solve the problems of low accuracy and poor generalization in existing target re-identification models. To this end, the method for improving the capability of a target re-identification model comprises the following steps: inputting a sample image into a backbone network to obtain a ReID feature map; and inputting the ReID feature map into both a ReID detection head network and a semantic segmentation network used to extract the region of interest of the sample image, and jointly training the two networks, wherein the total loss function used in joint training is a weighted sum of a first loss function used by the ReID detection head network and a second loss function used by the semantic segmentation network, thereby obtaining an optimized ReID detection head network.

Description

Method for improving target re-identification model capability and target re-identification method
Technical Field
The invention relates to the field of computer vision, and in particular provides a method for improving the capability of a target re-identification model and a target re-identification method.
Background
With the continuous development of modern technology, camera surveillance has become an essential safety guarantee in modern life. Pedestrian re-identification uses image processing techniques from the field of computer vision to retrieve images of a given pedestrian across scenes and across cameras, breaking through the field-of-view limitation of a single camera. The technology has a wide range of applications; for example, in video surveillance for public security, a single picture suffices to locate the trajectory of a target person in real time, which greatly assists the cross-region capture of suspects.
Because of the significant domain gaps between different datasets, cross-camera and cross-domain matching remains a major difficulty in pedestrian re-identification. For example, among the existing public datasets, Market-1501 was collected on a domestic campus in summer, while DukeMTMC-ReID was collected on a foreign campus in winter; this pronounced environmental difference creates a domain gap between the two datasets. Moreover, much of the training data currently used comes from surveillance of streets, stations, shops, retail stores, airports, and so on, where scene styles differ, and factors such as camera angle and lighting widen the domain gap further. As a result, a model fitted on the source domain suffers a significant drop in test performance on the target domain.
Accordingly, there is a need in the art for a method for improving the capability of a target re-identification model, and a target re-identification method, that address the above problems.
Disclosure of Invention
In order to overcome the above drawbacks, the present invention provides a method for improving the capability of a target re-identification model and a target re-identification method, which solve, or at least partially solve, the technical problems of low re-identification accuracy and poor generalization in existing target re-identification models.
In a first aspect, the present invention provides a method for improving the capability of a target re-identification model, comprising the following steps:
inputting the sample image into a backbone network to obtain a ReID feature map;
and inputting the ReID feature map into both a ReID detection head network and a semantic segmentation network used to extract the region of interest of the sample image, and jointly training the ReID detection head network and the semantic segmentation network, wherein the total loss function used in joint training is a weighted sum of a first loss function used by the ReID detection head network and a second loss function used by the semantic segmentation network, thereby obtaining an optimized backbone network and ReID detection head network.
In a specific embodiment, the method further comprises:
inputting the sample image into a segmentation model to obtain a first mask map corresponding to the sample image, wherein the first mask map is used to distinguish the region of interest from the background region.
In one specific embodiment, in training the semantic segmentation network, the method comprises:
generating a semantic segmentation feature map corresponding to the ReID feature map through the semantic segmentation network;
acquiring a second mask map aligned with the semantic segmentation feature map according to the first mask map;
inputting the second mask map and the semantic segmentation feature map into the second loss function, and optimizing the semantic segmentation network so that the probability values in the semantic segmentation map corresponding to the region of interest in the second mask map are driven high and those corresponding to the background region are driven low.
In one embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network is a 4-layer feature pyramid that upsamples progressively from top to bottom: the 1st pyramid layer applies a 1×1 convolution to the output of the 5th convolution layer and upsamples it by 2×, and the first fused feature is obtained by fusing this output with the 1×1-convolved output of the 4th convolution layer; the 2nd pyramid layer upsamples the first fused feature by 2× and fuses it with the 1×1-convolved output of the 3rd convolution layer to obtain the second fused feature; the 3rd pyramid layer upsamples the second fused feature by 2× and fuses it with the 1×1-convolved output of the 2nd convolution layer to obtain the third fused feature; and the 4th pyramid layer produces the semantic segmentation feature map from the third fused feature.
In another embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses the first two of the 5 convolution layers, and the 2nd convolution layer outputs the semantic segmentation feature map.
In a further embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses all 5 convolution layers, and the 5th convolution layer outputs the semantic segmentation feature map.
In a specific embodiment, the obtaining, according to the first mask map, a second mask map aligned with the semantic segmentation feature map includes:
pooling the first mask map by max pooling to obtain the second mask map; or
pooling the first mask map by voting to obtain the second mask map.
In a specific embodiment, in training the ReID detection head network, the method includes:
sequentially passing the ReID feature map through a pooling layer and a convolution layer, and obtaining the distance between positive and negative sample images through a triplet loss function; and
passing the convolved features sequentially through a normalization layer and a fully connected layer, and obtaining a classification result through a cross-entropy function.
In a specific embodiment, the method comprises:
and coordinating the convergence speed of the ReID detection head network and the semantic segmentation network by dynamically adjusting the weighting coefficient of the second loss function.
In a specific embodiment, the total loss function is:
L_total = L_reid + α · (cur_iter / total_iter) · L_seg
where L_reid is the first loss function, L_seg is the second loss function, α is a hyperparameter, cur_iter is the current iteration index, and total_iter is the total number of iterations.
In a specific embodiment, before the joint training of the ReID detection head network and the semantic segmentation network, the method further comprises:
training the backbone network and the ReID detection head network in advance for a preset number of iterations.
In a second aspect, the present invention provides a target re-identification method, comprising:
performing target re-identification using a ReID detection head network optimized according to the method of any embodiment of the first aspect.
In a specific embodiment, the target comprises one or more of a pedestrian and a vehicle.
One or more of the above technical solutions of the present invention has at least the following beneficial effects:
in the technical solution of the present invention, a semantic segmentation branch is added to the target re-identification model to constrain the network to attend to foreground information, thereby reducing the influence of the background, improving the domain generalization of pedestrian features, and improving the re-identification accuracy of existing target re-identification models.
Drawings
The present disclosure will become more readily understood with reference to the accompanying drawings. As will be readily appreciated by those skilled in the art: the drawings are for illustrative purposes only and are not intended to limit the scope of the present invention. Moreover, like numerals in the figures are used to designate like parts, wherein:
FIG. 1 is a flow chart of the main steps of a method for improving the capability of a target re-identification model according to one embodiment of the invention;
FIG. 2 is a schematic diagram of the joint training of the ReID detection head network and the semantic segmentation network according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for training a semantic segmentation network according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of a semantic segmentation feature map corresponding to a ReID feature map generated by a semantic segmentation network according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of obtaining, from the first mask map, a second mask map aligned with the semantic segmentation feature map, according to one embodiment of the invention;
FIG. 6 is a schematic diagram of a process of training a ReID detection head network according to one embodiment of the invention.
Detailed Description
Some embodiments of the invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
In the description of the present invention, a "module," "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, memory, or software components, such as program code, or a combination of software and hardware. The processor may be a central processor, a microprocessor, an image processor, a digital signal processor, or any other suitable processor. The processor has data and/or signal processing functions. The processor may be implemented in software, hardware, or a combination of both. Non-transitory computer readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random access memory, and the like. The term "a and/or B" means all possible combinations of a and B, such as a alone, B alone or a and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include A alone, B alone or A and B. The singular forms "a", "an" and "the" include plural referents.
As used herein, directional terms such as "front", "front side", "rear side", and "rear" refer to the fore-aft direction of a vehicle in which components are mounted. Likewise, "longitudinal" and "longitudinal section" refer to the fore-aft direction of a component after installation in a vehicle, while "transverse" and "cross section" are taken relative to that longitudinal direction.
In order to solve the above technical problems, namely the low accuracy and poor generalization of existing target re-identification models, the present invention provides a method for improving the capability of a target re-identification model, and a target re-identification method.
In a first aspect, the present invention provides a method for improving the capability of a target re-identification model; referring to fig. 1, the method comprises the following steps S1-S2:
S1, inputting the sample image into a backbone network to obtain a ReID feature map.
In one example, the sample image is a batch of pedestrian images; after data augmentation such as random image padding, image cropping, image flipping, and image erasing is applied to the pedestrian images, they are input into the backbone network to generate 2048×12×6 ReID feature maps.
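By way of illustration only, such an augmentation and feature-extraction pipeline might be sketched as follows. PyTorch and torchvision are assumed here (the patent does not name a framework), as is the 384×192 input size implied by the 12×6 output:

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50

# Hypothetical augmentation pipeline covering the operations listed above:
# random image filling (padding), cropping, flipping, and erasing.
# It would be applied to each PIL image during data loading.
transform = T.Compose([
    T.Resize((384, 192)),           # assumed pedestrian-crop size
    T.Pad(10),                      # random image filling
    T.RandomCrop((384, 192)),       # image cropping
    T.RandomHorizontalFlip(p=0.5),  # image flipping
    T.ToTensor(),
    T.RandomErasing(p=0.5),         # image erasing
])

# A ResNet-50 trunk without its pooling/classification head yields a
# 2048-channel feature map; with a 384x192 input the spatial size is 12x6.
backbone = torch.nn.Sequential(*list(resnet50(weights=None).children())[:-2])
x = torch.randn(1, 3, 384, 192)     # stand-in for one augmented sample image
reid_feat = backbone(x)             # -> [1, 2048, 12, 6]
```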
S2, inputting the ReID feature map into both a ReID detection head network and a semantic segmentation network used to extract the region of interest of the sample image, and jointly training the two networks, wherein the total loss function used in joint training is a weighted sum of a first loss function used by the ReID detection head network and a second loss function used by the semantic segmentation network, thereby obtaining an optimized backbone network and ReID detection head network.
In one example, as shown in fig. 2, the ReID detection head network and the semantic segmentation network share the same encoder (i.e., the backbone network) with shared weights. The ReID feature map is input into the ReID detection head network and the semantic segmentation network respectively for joint training; the loss functions of the two branches are obtained and summed with weights, so that the model parameters of the backbone network and the ReID detection head network are optimized and the capability of the target re-identification model is improved.
It should be noted that training the two network branches on the same batch of sample images improves the accuracy and generalization of the target re-identification model, compared with training the ReID detection head branch on ReID data and the semantic segmentation branch on a separate segmentation dataset.
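A minimal sketch of one such joint-training step is given below, under the assumption of a PyTorch-style API; all module and loss names are placeholders rather than the patent's notation:

```python
import torch

def joint_training_step(backbone, reid_head, seg_head, images, labels, mask,
                        reid_loss_fn, seg_loss_fn, optimizer, seg_weight):
    """One joint-training step: both branches read the same shared backbone
    features, and the total loss is the weighted sum of the branch losses."""
    feat = backbone(images)            # shared ReID feature map
    reid_out = reid_head(feat)         # ReID detection head branch
    seg_out = seg_head(feat)           # semantic segmentation branch

    loss_reid = reid_loss_fn(reid_out, labels)   # first loss function
    loss_seg = seg_loss_fn(seg_out, mask)        # second loss function
    total = loss_reid + seg_weight * loss_seg    # weighted summation

    optimizer.zero_grad()
    total.backward()                   # gradients reach both heads and the
    optimizer.step()                   # shared backbone (weight sharing)
    return total.item()
```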
In a specific embodiment, the method further comprises:
inputting the sample image into a segmentation model to obtain a first mask map corresponding to the sample image, wherein the first mask map is used to distinguish the region of interest from the background region.
For example, using Swin-B as the initial model, the model weights are pre-trained on the COCO component segmentation dataset; a portion of high-quality images from the sample images is selected as a training subset, the region of interest and the background region are manually labeled, and further data are then added to the training in batches. During training, pedestrians and their personal belongings are continually corrected manually to serve as the region of interest, with everything else as the background region; by iterating the cycle of training the model, running inference, and correcting the results, a segmentation model of higher accuracy is finally obtained.
Further, a first mask map distinguishing the region of interest from the background region can be obtained by inputting the sample image into the segmentation model; the first mask map is a binary 0-1 matrix in which 1 denotes the region of interest and 0 denotes the background region.
It will be appreciated by those skilled in the art that the first mask map is not input into the backbone network; it is used only when computing the loss function of the semantic segmentation network.
In one embodiment, referring to fig. 3, in training the semantic segmentation network, the method includes the following steps S20-S22:
S20, generating a semantic segmentation feature map corresponding to the ReID feature map through the semantic segmentation network.
In one embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network is a 4-layer feature pyramid that upsamples progressively from top to bottom: the 1st pyramid layer applies a 1×1 convolution to the output of the 5th convolution layer and upsamples it by 2×, and the first fused feature is obtained by fusing this output with the 1×1-convolved output of the 4th convolution layer; the 2nd pyramid layer upsamples the first fused feature by 2× and fuses it with the 1×1-convolved output of the 3rd convolution layer to obtain the second fused feature; the 3rd pyramid layer upsamples the second fused feature by 2× and fuses it with the 1×1-convolved output of the 2nd convolution layer to obtain the third fused feature; and the 4th pyramid layer produces the semantic segmentation feature map from the third fused feature.
For example, as shown in fig. 4, the sample image passes through the 5 convolution layers of the ResNet to produce the ReID feature map C5 (size [1,2048,12,6]); a 1×1 convolution kernel compresses its channel count (size [1,64,12,6]), and bilinear interpolation upsamples it by 2× (size [1,64,24,12]). The C4 feature on the left has size [1,1024,24,12]; a 1×1 convolution kernel compresses its channel count to [1,64,24,12], and the two feature maps are added to produce the first fused feature. This proceeds in the same way until the final layer outputs the semantic segmentation feature map (size [1,64,96,48]).
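A sketch of this pyramid branch follows; it reproduces the tensor sizes of the example above (three 2× bilinear upsampling steps fusing the 1×1-convolved outputs of the 4th, 3rd, and 2nd convolution layers), while the class name and any detail not fixed by the text are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegFPN(nn.Module):
    """Top-down pyramid: 1x1 lateral convolutions compress each backbone
    stage to 64 channels; each step upsamples by 2x (bilinear) and adds
    the lateral feature of the next higher-resolution stage."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), mid=64):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, mid, kernel_size=1) for c in in_channels)

    def forward(self, c2, c3, c4, c5):
        # c2..c5: outputs of the backbone's 2nd..5th convolution layers
        p = self.laterals[3](c5)                       # [N, 64, 12, 6]
        for lateral, c in zip(self.laterals[2::-1], (c4, c3, c2)):
            p = F.interpolate(p, scale_factor=2.0, mode='bilinear',
                              align_corners=False)     # 2x upsampling
            p = p + lateral(c)                         # fuse by addition
        return p                                       # [N, 64, 96, 48]

seg_fpn = SegFPN()
c2, c3 = torch.randn(1, 256, 96, 48), torch.randn(1, 512, 48, 24)
c4, c5 = torch.randn(1, 1024, 24, 12), torch.randn(1, 2048, 12, 6)
print(seg_fpn(c2, c3, c4, c5).shape)  # torch.Size([1, 64, 96, 48])
```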
In another embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses the first two of the 5 convolution layers, and the 2nd convolution layer outputs the semantic segmentation feature map.
In a further embodiment,
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses all 5 convolution layers, and the 5th convolution layer outputs the semantic segmentation feature map.
S21, acquiring a second mask map aligned with the semantic segmentation feature map according to the first mask map.
For example, the first mask map is downsampled to the resolution of the semantic segmentation feature map so that the two can be compared pixel by pixel, thereby obtaining the second mask map.
In a preferred example, before the pixel-by-pixel comparison, the first mask map may undergo data augmentation operations such as random image padding, image cropping, image flipping, and image erasing.
S22, inputting the second mask map and the semantic segmentation feature map into the second loss function, and optimizing the semantic segmentation network so that the probability values in the semantic segmentation map corresponding to the region of interest in the second mask map are driven high and those corresponding to the background region are driven low.
In a preferred embodiment, the second loss function of the semantic segmentation network is the binary cross-entropy loss:
L_seg = -(y·log(p(x)) + (1-y)·log(1-p(x)))
where y is the value (1 or 0) of each pixel in the second mask map, x is the value of the corresponding pixel in the segmentation feature map, and p(·) is the Sigmoid activation function, which normalizes the feature value to a probability in [0, 1]. When y = 1, L_seg = -log(p(x)); when y = 0, L_seg = -log(1-p(x)).
Further, the optimization drives the probability values in the semantic segmentation map corresponding to the region of interest (y = 1) in the second mask map toward 1, and those corresponding to the background region (y = 0) toward 0.
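In PyTorch terms this loss might be sketched as follows, with the Sigmoid folded into the loss for numerical stability; the 1×1 projection from the 64-channel feature map to a single logit per pixel is an assumption, since the text only specifies a per-pixel probability:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

to_logit = nn.Conv2d(64, 1, kernel_size=1)   # hypothetical per-pixel head

seg_feat = torch.randn(1, 64, 96, 48)        # semantic segmentation feature map
mask2 = torch.randint(0, 2, (1, 1, 96, 48)).float()  # second mask map (0/1)

logits = to_logit(seg_feat)
# Computes -(y*log(p(x)) + (1-y)*log(1-p(x))) with p = Sigmoid,
# averaged over all pixels.
loss_seg = F.binary_cross_entropy_with_logits(logits, mask2)
```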
In a specific embodiment, the obtaining, according to the first mask map, a second mask map aligned with the semantic segmentation feature map includes:
pooling the first mask map by max pooling to obtain the second mask map; or
pooling the first mask map by voting to obtain the second mask map.
For example, as shown in fig. 5, a first mask map of 4×4 pixels is pooled into a second mask map of 2×2 pixels, with each of the four 2×2 bins containing anywhere from zero to four 1s. With max pooling, a bin pools to 1 as long as it contains a single 1. With voting, a bin pools to whichever value is in the majority (a tie defaults to 1, or alternatively to 0), so the downsampled second mask map is more accurate than with max pooling.
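A sketch of both alignment modes, assuming a 2× stride as in the figure (the tie-breaking rule for voting, here defaulting to 1, is an assumption as noted above):

```python
import torch
import torch.nn.functional as F

def downsample_mask(mask, stride=2, mode="vote"):
    """Align a 0/1 mask map with a lower-resolution feature map.
    mode="max":  a bin pools to 1 if it contains any 1 (max pooling).
    mode="vote": a bin pools to the majority value of its pixels."""
    m = mask.float()
    if mode == "max":
        return F.max_pool2d(m, kernel_size=stride)
    # Voting: average each bin, then threshold at 0.5 (ties default to 1).
    return (F.avg_pool2d(m, kernel_size=stride) >= 0.5).float()

first_mask = torch.tensor([[[[1., 0., 0., 0.],
                             [0., 0., 0., 0.],
                             [1., 1., 0., 1.],
                             [1., 0., 1., 1.]]]])
print(downsample_mask(first_mask, mode="max"))   # any 1 in a bin -> 1
print(downsample_mask(first_mask, mode="vote"))  # only majority-1 bins -> 1
```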
In one embodiment, referring to fig. 6, in training the ReID detection head network, the method includes the following steps S23-S24:
S23, sequentially passing the ReID feature map through a pooling layer and a convolution layer, and obtaining the distance between positive and negative sample images through a triplet loss function;
S24, passing the convolved features sequentially through a normalization layer and a fully connected layer, and obtaining a classification result through a cross-entropy function.
In a specific embodiment, the method comprises:
and coordinating the convergence speed of the ReID detection head network and the semantic segmentation network by dynamically adjusting the weighting coefficient of the second loss function.
Further, this dynamic weight adjustment makes the weight of the segmentation loss function increase with the number of training iterations; compared with a direct weighted summation with fixed coefficients, it markedly improves the convergence speed of the network model.
In a specific embodiment, the total loss function is:
L_total = L_reid + α · (cur_iter / total_iter) · L_seg
where L_reid is the first loss function, L_seg is the second loss function, α is a hyperparameter, cur_iter is the current iteration index, and total_iter is the total number of iterations.
In a specific embodiment, before the joint training of the ReID detection head network and the semantic segmentation network, the method further comprises:
training the backbone network and the ReID detection head network in advance for a preset number of iterations.
In one example, the ReID feature map is first input into the ReID detection head network for several rounds of training, after which the semantic segmentation branch is added to the training; the semantic segmentation network and the ReID detection head network share the same encoder (i.e., the backbone network) with shared weights. By delaying the training of the semantic segmentation branch, segmentation feature maps of higher semantic quality can be extracted, which improves the model as a whole.
In a second aspect, the present invention further provides a target re-identification method, comprising:
performing target re-identification using a ReID detection head network optimized by the method according to any embodiment of the first aspect of the invention.
It should be noted that the semantic segmentation network serves only as an auxiliary network that imposes a constraint on the ReID detection head network, so the model attends more to the region of interest and less to the background region, which markedly improves its generalization. In the test/inference stage, the semantic segmentation branch can be removed, reducing the network's computation without affecting inference speed.
In a specific embodiment, the target comprises one or more of a pedestrian and a vehicle.
Although the embodiments of the present invention take pedestrians as the re-identification target, other targets such as vehicles may be used in practical applications; the present invention is not limited in this respect.
It will be appreciated by those skilled in the art that all or part of the above-described methods of the embodiments may be implemented by instructing relevant hardware through a computer program, which may be stored in a computer readable storage medium; when executed by a processor, the computer program implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, and so on. The computer readable storage medium may include: any entity or device capable of carrying the computer program code, a medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer readable storage medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, the computer readable storage medium does not include electrical carrier signals and telecommunications signals.
Further, the invention also provides a control device. In one control-device embodiment according to the present invention, the control device comprises a processor and a storage device; the storage device may be configured to store a program for executing the method for improving the capability of a target re-identification model and the target re-identification method of the above method embodiments, and the processor may be configured to execute that program. For convenience of explanation, only the portions relevant to the embodiments of the present invention are shown; for specific technical details that are not disclosed, please refer to the method portions of the embodiments. The control device may be formed of various electronic devices.
Further, the invention also provides a computer readable storage medium. In one embodiment of the computer readable storage medium according to the present invention, the medium may be configured to store a program for executing the method for improving the capability of a target re-identification model and the target re-identification method of the above method embodiments; the program can be loaded and executed by a processor to implement those methods. For convenience of explanation, only the portions relevant to the embodiments of the present invention are shown; for specific technical details that are not disclosed, please refer to the method portions of the embodiments. The computer readable storage medium may be a storage device comprising various electronic devices; optionally, the computer readable storage medium in the embodiments of the present invention is a non-transitory computer readable storage medium.
Further, it should be understood that, since the modules are merely illustrated as functional units of the apparatus of the present invention, the physical devices corresponding to these modules may be the processor itself, or parts of software, hardware, or combinations thereof within the processor. Accordingly, the number of individual modules in the figures is merely illustrative.
Those skilled in the art will appreciate that the various modules in the apparatus may be adaptively split or combined. Such splitting or combining of specific modules does not cause the technical solution to deviate from the principle of the present invention, and therefore, the technical solution after splitting or combining falls within the protection scope of the present invention.
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions of related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modified or substituted solutions will fall within the scope of the present invention.

Claims (13)

1. A method for improving the capability of a target re-identification model, comprising:
inputting the sample image into a backbone network to obtain a ReID feature map;
and inputting the ReID feature map into both a ReID detection head network and a semantic segmentation network used to extract the region of interest of the sample image, and jointly training the ReID detection head network and the semantic segmentation network, wherein the total loss function used in joint training is a weighted sum of a first loss function used by the ReID detection head network and a second loss function used by the semantic segmentation network, thereby obtaining an optimized backbone network and ReID detection head network.
2. The method according to claim 1, wherein the method further comprises:
inputting the sample image into a segmentation model to obtain a first mask map corresponding to the sample image, wherein the first mask map is used to distinguish the region of interest from the background region.
3. The method of claim 2, wherein during training of the semantic segmentation network, the method comprises:
generating a semantic segmentation feature map corresponding to the ReID feature map through the semantic segmentation network;
acquiring a second mask map aligned with the semantic segmentation feature map according to the first mask map;
inputting the second mask map and the semantic segmentation feature map into the second loss function, and optimizing the semantic segmentation network so that the probability values in the semantic segmentation map corresponding to the region of interest in the second mask map are driven high and those corresponding to the background region are driven low.
4. The method of claim 3, wherein
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network is a 4-layer feature pyramid that upsamples progressively from top to bottom: the 1st pyramid layer applies a 1×1 convolution to the output of the 5th convolution layer and upsamples it by 2×, and the first fused feature is obtained by fusing this output with the 1×1-convolved output of the 4th convolution layer; the 2nd pyramid layer upsamples the first fused feature by 2× and fuses it with the 1×1-convolved output of the 3rd convolution layer to obtain the second fused feature; the 3rd pyramid layer upsamples the second fused feature by 2× and fuses it with the 1×1-convolved output of the 2nd convolution layer to obtain the third fused feature; and the 4th pyramid layer produces the semantic segmentation feature map from the third fused feature.
5. The method of claim 3, wherein
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses the first two of the 5 convolution layers, and the 2nd convolution layer outputs the semantic segmentation feature map.
6. The method of claim 3, wherein
the backbone network is a ResNet comprising 5 convolution layers that downsample progressively from bottom to top, where the i-th convolution layer downsamples the input image features by a factor of 2^i and the 5th convolution layer outputs the ReID feature map, with 1 ≤ i ≤ 5;
the semantic segmentation network reuses all 5 convolution layers, and the 5th convolution layer outputs the semantic segmentation feature map.
7. The method according to any of claims 3-6, wherein obtaining, from the first mask map, a second mask map aligned with the semantic segmentation feature map comprises:
pooling the first mask map by max pooling to obtain the second mask map; or
pooling the first mask map by voting to obtain the second mask map.
8. The method according to claim 1, wherein in training the ReID detection head network, the method comprises:
sequentially passing the ReID feature map through a pooling layer and a convolution layer, and obtaining the distance between positive and negative sample images through a triplet loss function; and
passing the convolved features sequentially through a normalization layer and a fully connected layer, and obtaining a classification result through a cross-entropy function.
9. The method according to claim 1, characterized in that the method comprises:
and coordinating the convergence speed of the ReID detection head network and the semantic segmentation network by dynamically adjusting the weighting coefficient of the second loss function.
10. The method of claim 9, wherein the total loss function is:
L_total = L_reid + α · (cur_iter / total_iter) · L_seg
where L_reid is the first loss function, L_seg is the second loss function, α is a hyperparameter, cur_iter is the current iteration index, and total_iter is the total number of iterations.
11. The method of claim 1, wherein, prior to the joint training of the ReID detection head network and the semantic segmentation network, the method further comprises:
training the backbone network and the ReID detection head network in advance for a preset number of iterations.
12. A method of target re-identification, comprising:
performing target re-identification using a ReID detection head network optimized according to the method of any one of claims 1-11.
13. The method of claim 12, wherein the target comprises one or more of a pedestrian and a vehicle.
CN202310512340.1A 2023-05-08 2023-05-08 Method for improving target re-identification model capacity and target re-identification method Pending CN116664833A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310512340.1A CN116664833A (en) 2023-05-08 2023-05-08 Method for improving target re-identification model capacity and target re-identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310512340.1A CN116664833A (en) 2023-05-08 2023-05-08 Method for improving target re-identification model capacity and target re-identification method

Publications (1)

Publication Number Publication Date
CN116664833A true CN116664833A (en) 2023-08-29

Family

ID=87712739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310512340.1A Pending CN116664833A (en) 2023-05-08 2023-05-08 Method for improving target re-identification model capacity and target re-identification method

Country Status (1)

Country Link
CN (1) CN116664833A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392181A (en) * 2023-12-13 2024-01-12 安徽蔚来智驾科技有限公司 Motion information prediction method, computer equipment, storage medium and intelligent equipment

Similar Documents

Publication Publication Date Title
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
Hu et al. Underwater image restoration based on convolutional neural network
CN111292264A (en) Image high dynamic range reconstruction method based on deep learning
US20100232685A1 (en) Image processing apparatus and method, learning apparatus and method, and program
CN113592736A (en) Semi-supervised image deblurring method based on fusion attention mechanism
CN111079764B (en) Low-illumination license plate image recognition method and device based on deep learning
CN113762209A (en) Multi-scale parallel feature fusion road sign detection method based on YOLO
CN112149476B (en) Target detection method, device, equipment and storage medium
CN113065645A (en) Twin attention network, image processing method and device
CN112784834A (en) Automatic license plate identification method in natural scene
CN116664833A (en) Method for improving target re-identification model capacity and target re-identification method
CN114463218A (en) Event data driven video deblurring method
CN114359669A (en) Picture analysis model adjusting method and device and computer readable storage medium
CN115880177A (en) Full-resolution low-illumination image enhancement method for aggregating context and enhancing details
CN117409083B (en) Cable terminal identification method and device based on infrared image and improved YOLOV5
CN110942097A (en) Imaging-free classification method and system based on single-pixel detector
CN113379861B (en) Color low-light-level image reconstruction method based on color recovery block
CN113393385B (en) Multi-scale fusion-based unsupervised rain removing method, system, device and medium
CN113628143A (en) Weighted fusion image defogging method and device based on multi-scale convolution
CN117011160A (en) Single image rain removing method based on dense circulation network convergence attention mechanism
KR102514531B1 (en) Apparatus and method for improving images in poor weather
CN114565597B (en) Night road pedestrian detection method based on YOLO v3-tiny-DB and transfer learning
CN114119428B (en) Image deblurring method and device
Abbasi et al. Fog-aware adaptive yolo for object detection in adverse weather
Zhou et al. Multi-scale and attention residual network for single image dehazing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination