WO2023142551A1 - Model training and image recognition methods and apparatuses, device, storage medium and computer program product - Google Patents

Model training and image recognition methods and apparatuses, device, storage medium and computer program product

Info

Publication number
WO2023142551A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
feature
loss value
image
target
Prior art date
Application number
PCT/CN2022/127109
Other languages
French (fr)
Chinese (zh)
Inventor
唐诗翔
朱烽
赵瑞
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023142551A1 publication Critical patent/WO2023142551A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/217: Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/22: Matching criteria, e.g. proximity measures

Definitions

  • The embodiments of the present disclosure are based on, and claim priority to, the Chinese patent application No. 202210107742.9, filed on January 28, 2022 and entitled "Model training and image recognition method and apparatus, device and storage medium".
  • The entire content of this Chinese patent application is hereby incorporated into this disclosure by reference.
  • The present disclosure relates to, but is not limited to, the field of computer technology, and in particular to model training and image recognition methods and apparatuses, a device, a storage medium and a computer program product.
  • Object re-identification is also referred to as object re-ID.
  • Object re-identification is a technology that uses computer vision technology to determine whether a specific object exists in an image or video sequence.
  • Object re-identification is widely regarded as a sub-problem of image retrieval: given an image containing an object, retrieve images containing that object across devices. Differences between devices, shooting angles, environments and other factors all affect the results of object re-identification.
  • Embodiments of the present disclosure provide model training and image recognition methods and apparatuses, a device, a storage medium, and a computer program product.
  • An embodiment of the present disclosure provides a model training method, which includes:
  • acquiring a first image sample containing a first object, and using a first network of a first model to be trained to perform feature extraction on the first image sample to obtain a first feature of the first object;
  • using a second network of the first model to update the first feature based on second features of at least one second object, to obtain a first target feature corresponding to the first feature, wherein the similarity between each second object and the first object is not less than a first threshold;
  • determining a target loss value based on the first target feature; and
  • updating the model parameters of the first model at least once based on the target loss value to obtain a trained first model.
  • An embodiment of the present disclosure provides an image recognition method, the method comprising:
  • acquiring a first image and a second image; and
  • using a trained target model to identify the object in the first image and the object in the second image to obtain a recognition result, wherein the trained target model includes the first model obtained by the above model training method, and the recognition result indicates whether the object in the first image and the object in the second image are the same object or different objects.
  • An embodiment of the present disclosure provides a model training device, which includes:
  • a first acquisition part configured to acquire a first image sample containing a first object
  • the feature extraction part is configured to use the first network of the first model to be trained to perform feature extraction on the first image sample to obtain the first feature of the first object;
  • the first update part is configured to use the second network of the first model to update the first feature based on the second features of at least one second object, to obtain the first target feature corresponding to the first feature, wherein the similarity between each second object and the first object is not less than the first threshold;
  • the first determination part is configured to determine a target loss value based on the first target feature
  • the second updating part is configured to update the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
  • An embodiment of the present disclosure provides an image recognition device, which includes:
  • a second acquisition part configured to acquire the first image and the second image
  • the identification part is configured to use the trained target model to identify the object in the first image and the object in the second image to obtain a recognition result, wherein the trained target model includes the first model obtained by the above model training method, and the recognition result indicates whether the object in the first image and the object in the second image are the same object or different objects.
  • An embodiment of the present disclosure provides an electronic device, including a processor and a memory, the memory stores a computer program that can run on the processor, and the above method is implemented when the processor executes the computer program.
  • An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the foregoing method is implemented.
  • An embodiment of the present disclosure provides a computer program product, where the computer program product includes a computer program or an instruction, and when the computer program or instruction is run on the electronic device, the electronic device is made to execute the above method.
  • In the embodiments of the present disclosure, a first image sample containing a first object is acquired; the first network of the first model to be trained is used to perform feature extraction on the first image sample to obtain the first feature of the first object; the second network of the first model is used to update the first feature based on the second features of at least one second object to obtain the first target feature corresponding to the first feature, wherein the similarity between each second object and the first object is not less than the first threshold; a target loss value is determined based on the first target feature; and the model parameters of the first model are updated at least once based on the target loss value to obtain the trained first model.
  • In this way, the features of the second object are introduced as noise at the feature level of the first image sample containing the first object, and the overall network structure of the first model is trained, so that the robustness of the first model can be enhanced and the performance of the first model can be improved.
  • At the same time, when the target loss value does not meet the preset condition, the model parameters of the first model are updated at least once. Since the target loss value is determined based on the first target feature, the consistency of the trained first model's predictions for different image samples of the same object can be improved, which in turn enables the trained first model to more accurately re-identify objects in images containing multiple objects.
  • FIG. 1 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an implementation flow of a model training method provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of an implementation flow of a model training method provided by an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of an implementation flow of an image recognition method provided by an embodiment of the present disclosure.
  • FIG. 5A is a schematic diagram of the composition and structure of a model training system provided by an embodiment of the present disclosure
  • FIG. 5B is a schematic diagram of a model training system provided by an embodiment of the present disclosure.
  • FIG. 5C is a schematic diagram of determining an occlusion mask provided by an embodiment of the present disclosure.
  • FIG. 5D is a schematic diagram of a first network provided by an embodiment of the present disclosure.
  • FIG. 5E is a schematic diagram of a second subnetwork provided by an embodiment of the present disclosure.
  • FIG. 5F is a schematic diagram of a second network provided by an embodiment of the present disclosure.
  • FIG. 5G is a schematic diagram of obtaining a target loss value provided by an embodiment of the present disclosure.
  • FIG. 5H is a schematic diagram of an occlusion score of a pedestrian image provided by an embodiment of the present disclosure.
  • FIG. 5I is a schematic diagram of an image retrieval result provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of the composition and structure of a model training device provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of the composition and structure of an image recognition device provided by an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of a hardware entity of an electronic device in an embodiment of the present disclosure.
  • The embodiment of the present disclosure provides a model training method, which introduces the features of the second object as noise at the feature level of the first image sample containing the first object and trains the overall network structure of the first model, so that the robustness of the first model can be enhanced and the performance of the first model can be improved. At the same time, when the target loss value does not meet the preset conditions, the model parameters of the first model are updated at least once; because the target loss value is determined based on the first target feature, the prediction consistency of the trained first model for different image samples of the same object can be improved, thereby enabling the trained first model to more accurately re-identify objects in images containing multiple objects.
  • Both the model training method and the image recognition method provided by the embodiments of the present disclosure can be executed by an electronic device. The electronic device can be a notebook computer, a tablet computer, a desktop computer, a set-top box, or a mobile device (for example, a mobile phone, a portable music player, or a personal digital assistant) or other terminal device, and can also be implemented as a server.
  • The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
  • Fig. 1 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 1, the method includes steps S11 to S15, wherein:
  • Step S11 acquiring a first image sample including a first object.
  • the first image sample may be any suitable image containing at least the first object.
  • The content contained in the first image sample may be determined according to the actual application scenario; for example, the sample may contain only the first object, or may contain the first object together with at least one other object.
  • the first object may include, but is not limited to, people, animals, plants, objects, and the like.
  • the first image sample is a face image containing Zhang San.
  • the first image sample is an image including Li Si's whole person.
  • the first image sample may include at least one image.
  • the first image sample is any image in the training set.
  • the first image sample includes a first sub-image and a second sub-image, wherein the first sub-image is an image in the training set, and the second sub-image is an image obtained by augmenting the first sub-image.
  • the augmentation processing may include, but is not limited to, at least one of occlusion processing, scaling processing, cropping processing, resizing processing, padding processing, flipping processing, color jittering processing, grayscale processing, Gaussian blur processing, random erasing processing, and the like.
  • those skilled in the art may use appropriate augmentation processing on the first sub-image to obtain the second sub-image according to actual conditions, which is not limited in the embodiments of the present disclosure.
  • the first image sample includes a first sub-image and a plurality of second sub-images, wherein the first sub-image is an image in the training set, and each second sub-image is an image obtained by performing augmentation processing on the first sub-image.
  • Step S12 using the first network of the first model to be trained, to perform feature extraction on the first image sample to obtain the first feature of the first object.
  • the first model may be any suitable model for object recognition based on image features.
  • the first model may include at least a first network.
  • the first feature may include, but not limited to, the original feature of the first image sample, or a feature obtained by processing the original feature.
  • the original feature may include but not limited to the face feature, body feature, etc. of the first object included in the image.
  • the first network may at least include a first sub-network, and the first sub-network is used to extract features of the first image using a feature extractor.
  • the feature extractor may include, but is not limited to, a recurrent neural network (Recurrent Neural Network, RNN), a convolutional neural network (Convolutional Neural Network, CNN), a Transformer-based feature extraction network, and the like.
  • those skilled in the art may use an appropriate first network in the first model to obtain the first feature according to actual conditions, which is not limited in the embodiments of the present disclosure.
  • the third feature of the first image sample is extracted through the first sub-network, and the third feature is determined as the first feature of the first object.
  • the third feature may include, but not limited to, the original feature of the first image sample and the like.
  • the first network may further include a second sub-network for determining the first feature of the first object based on the third feature of the first image sample.
  • the second sub-network may include an occlusion erasure network, which is used to perform occlusion erasure processing on the input third feature to obtain the first feature of the first object.
  • Step S13 using the second network of the first model to update the first feature based on the second feature of at least one second object to obtain the first target feature corresponding to the first feature.
  • the similarity between each second object and the first object is not less than the first threshold.
  • the first threshold may be preset or obtained by statistics. During implementation, those skilled in the art may independently determine the setting manner of the first threshold according to actual needs, which is not limited in the embodiments of the present disclosure.
  • the similarity between the facial features of the second object and the first object is not less than the first threshold.
  • the similarity between the wearing features of the second object and the first object is not less than the first threshold.
  • For example, neither the similarity between the appearance characteristics of the second object and those of the first object nor the similarity between their clothing characteristics is less than the first threshold.
  • the second feature can be obtained based on the training set, or can be pre-input.
  • the second object may include, but is not limited to, people, animals, plants, objects, and the like.
  • the similarity between each second object and the first object may be obtained based on the similarity between the second feature of each second object and the first feature of the first object. In some implementations, the similarity between each second object and the first object may be obtained based on the similarity between the feature center of each second object and the first feature of the first object.
  • the first model may include a second memory feature library
  • the second memory feature library may include at least one feature of at least one object. The feature center of the second object may be obtained based on at least one feature belonging to the second object in the second memory feature library.
  • features of multiple image samples of at least one object in the training set may be extracted, and the extracted features may be stored in the second memory feature library according to their identity.
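  • For illustration only, the following is a minimal PyTorch-style sketch of how feature centers of second objects whose similarity to the first object is not less than the first threshold might be selected from such a memory feature library. The function and variable names are hypothetical, and cosine similarity is only one possible similarity measure; the disclosure does not prescribe a specific implementation.

```python
import torch
import torch.nn.functional as F

def select_second_feature_centers(first_feature, memory_bank, first_threshold=0.5):
    """Return feature centers of candidate second objects.

    first_feature: (D,) first feature of the first object.
    memory_bank: dict mapping object id -> (N_k, D) tensor of stored features,
                 a stand-in for the second memory feature library.
    """
    selected = []
    f = F.normalize(first_feature, dim=0)
    for obj_id, feats in memory_bank.items():
        center = feats.mean(dim=0)                      # feature center of this object
        sim = torch.dot(f, F.normalize(center, dim=0))  # cosine similarity
        if sim >= first_threshold:                      # not less than the first threshold
            selected.append(center)
    return torch.stack(selected) if selected else torch.empty(0, first_feature.numel())
```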
  • the second network may include a fifth sub-network and a sixth sub-network, the fifth sub-network is used to aggregate the second feature with the first feature to obtain the first aggregated sub-feature; the sixth sub-network The network is used to update the first aggregation sub-feature to obtain the first target feature.
  • Step S14 Determine the target loss value based on the first target feature.
  • the target loss value may include, but is not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a contrastive loss value, and the like.
  • Step S15 based on the target loss value, update the model parameters of the first model at least once to obtain the trained first model.
  • For example, the target loss value is compared with a threshold; if the target loss value is greater than the threshold, the model parameters of the first model are updated, and if the target loss value is not greater than the threshold, the first model is determined as the trained first model. For another example, the target loss value is compared with the previous target loss value; if the target loss value is greater than the previous target loss value, the model parameters of the first model are updated, and if the target loss value is not greater than the previous target loss value, the first model is determined as the trained first model.
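  • As an illustration only, the following PyTorch-style sketch shows how steps S11 to S15 could be iterated with the threshold-based stopping criterion described above. The module and attribute names (first_network, second_network, target_loss) are hypothetical placeholders, not the actual structure of the first model.

```python
import torch

def train_first_model(first_model, data_loader, optimizer, loss_threshold=0.1, max_steps=10000):
    """Hypothetical sketch of steps S11 to S15."""
    for step, (first_image_sample, label) in enumerate(data_loader):
        # S12: the first network extracts the first feature of the first object
        first_feature = first_model.first_network(first_image_sample)
        # S13: the second network updates the first feature using second features
        first_target_feature = first_model.second_network(first_feature)
        # S14: determine the target loss value based on the first target feature
        target_loss = first_model.target_loss(first_target_feature, label)
        # S15: stop when the preset condition is met, otherwise update the parameters once
        if target_loss.item() <= loss_threshold or step >= max_steps:
            break
        optimizer.zero_grad()
        target_loss.backward()
        optimizer.step()
    return first_model
```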
  • In the embodiments of the present disclosure, a first image sample containing a first object is acquired; the first network of the first model to be trained is used to perform feature extraction on the first image sample to obtain the first feature of the first object; the second network of the first model is used to update the first feature based on the second features of at least one second object to obtain the first target feature corresponding to the first feature, wherein the similarity between each second object and the first object is not less than the first threshold; the target loss value is determined based on the first target feature; and the model parameters of the first model are updated at least once based on the target loss value to obtain the trained first model.
  • In this way, the features of the second object are introduced as noise at the feature level of the first image sample containing the first object, and the overall network structure of the first model is trained, so that the robustness of the first model can be enhanced and the performance of the first model can be improved.
  • At the same time, when the target loss value does not meet the preset condition, the model parameters of the first model are updated at least once. Since the target loss value is determined based on the first target feature, the consistency of the trained first model's predictions for different image samples of the same object can be improved, which in turn enables the trained first model to more accurately re-identify objects in images containing multiple objects.
  • the first image sample includes label information
  • the first model includes a first feature memory library
  • the first feature memory library includes at least one feature belonging to at least one object
  • the above step S14 includes steps S141 to S143, wherein:
  • Step S141 Determine a first loss value based on the first target feature and label information.
  • The label information may include, but is not limited to, label values, identifiers, and the like.
  • the first loss value may include, but not limited to, a cross-entropy loss value and the like.
  • the first loss value can be calculated by the following formula (1-1):
  • where W is a linear matrix, W_i and W_j are elements of W, y_i represents the label information of the i-th object, f_i represents the first target feature of the i-th object, and ID_S represents the total number of objects in the training set.
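  • The body of formula (1-1) is not reproduced in this text. A plausible reconstruction, assuming the standard softmax cross-entropy identity loss that matches the variable definitions above, is:

```latex
L_{1} = -\sum_{i} \log \frac{\exp\left(W_{y_i}^{\top} f_i\right)}{\sum_{j=1}^{ID_S} \exp\left(W_j^{\top} f_i\right)} \tag{1-1}
```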
  • Step S142 Determine a second loss value based on the first target feature and at least one feature of at least one object in the first feature memory.
  • the second loss value may include but not limited to contrastive loss and the like.
  • Step S143 Determine a target loss value based on the first loss value and the second loss value.
  • the target loss value may include, but not limited to, the sum of the first loss value and the second loss value, the sum after weighting the first loss value and the second loss value respectively, and the like.
  • the target loss value can be calculated by the following formula (1-2):
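  • The body of formula (1-2) is likewise not reproduced here. Based on the preceding description (the sum, or weighted sum, of the first loss value and the second loss value), one plausible form is:

```latex
L_{target} = \lambda_{1} L_{1} + \lambda_{2} L_{2} \tag{1-2}
```

  • where λ_1 and λ_2 are assumed weighting coefficients (both equal to 1 for the unweighted sum).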
  • step S142 includes step S1421 to step S1422, wherein:
  • Step S1421. From at least one feature of at least one object in the first feature memory, determine a first feature center of the first object and a second feature center of at least one second object.
  • the first feature center may be determined based on the features of the first object in the first feature memory and the first target feature.
  • Each second feature center may be determined based on each feature of each second object in the first feature memory.
  • the feature center of each object can be calculated by the following formula (1-3):
  • where c_k represents the feature center of the k-th object, B_k represents the feature set belonging to the k-th object in the mini-batch, m is the momentum coefficient used for the update, and f_i' is the first feature of the i-th sample.
  • m can be 0.2.
  • That is, when f_i' and B_k belong to the same object, the feature center c_k of that object changes; when f_i' and B_k do not belong to the same object, the feature center c_k remains consistent with the previous c_k.
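  • The body of formula (1-3) is not shown above. A plausible reconstruction, assuming the usual momentum update of a feature center consistent with the variable definitions, is:

```latex
c_k \leftarrow m \, c_k + (1 - m) \, \frac{1}{\lvert B_k \rvert} \sum_{f_i' \in B_k} f_i' \tag{1-3}
```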
  • Step S1422. Determine a second loss value based on the first target feature, the first feature center and each second feature center.
  • the second loss value can be calculated by the following formula (1-4):
  • where τ is a predefined temperature parameter, c_i represents the first feature center of the i-th object, c_j represents each second feature center, f_i represents the first target feature of the i-th object, and ID_S represents the total number of objects in the training set.
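  • The body of formula (1-4) is not reproduced here. Assuming a standard InfoNCE-style contrastive loss over the feature centers, consistent with the variable definitions above, it could read:

```latex
L_{2} = -\sum_{i} \log \frac{\exp\left(\langle f_i, c_i \rangle / \tau\right)}{\sum_{j=1}^{ID_S} \exp\left(\langle f_i, c_j \rangle / \tau\right)} \tag{1-4}
```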
  • step S15 includes step S151 or step S152, wherein:
  • Step S151 if the target loss value does not meet the preset condition, update the model parameters of the first model to obtain an updated first model; based on the updated first model, determine a trained first model.
  • the manner of updating the model parameters of the first model may include but not limited to at least one of gradient descent method, momentum update method, Newton momentum method and the like.
  • those skilled in the art may independently determine the update mode according to actual needs, which is not limited in the embodiments of the present disclosure.
  • Step S152 if the target loss value satisfies the preset condition, determine the updated first model as the trained first model.
  • the preset conditions may include, but are not limited to, the target loss value being smaller than a threshold, the change of the target loss value converging, and the like.
  • those skilled in the art may independently determine the preset conditions according to actual needs, which are not limited by the embodiments of the present disclosure.
  • determining the first model after training based on the updated first model in step S151 includes steps S1511 to S1515, wherein:
  • Step S1511 acquiring the next first image sample
  • Step S1512 Using the updated first network of the first model to be trained, perform feature extraction on the next first image sample to obtain the next first feature;
  • Step S1513 using the updated second network of the first model to update the next first feature based on the second feature of at least one second object, to obtain the next first target feature corresponding to the next first feature;
  • Step S1514 based on the next first target feature, determine the next target loss value
  • Step S1515 Based on the next target loss value, perform at least one next update on the model parameters of the updated first model to obtain the trained first model.
  • step S1511 to step S1515 correspond to the above step S11 to step S15 respectively, and for implementation, reference may be made to the implementation manner of the above step S11 to step S15.
  • In this way, the model parameters of the first model are updated again, and the trained first model is determined based on the first model after this next update, so that the performance of the trained first model can be further improved through continuous iterative updating.
  • the first feature memory library includes feature sets belonging to at least one object, each feature set includes at least one feature of the object to which it belongs, and the method further includes step S16, wherein:
  • Step S16 based on the first target feature, update the feature set belonging to the first object in the first feature storage.
  • the way of updating may include but not limited to adding the first target feature to the first feature storage, replacing a certain feature in the first feature storage with the first target feature, and so on.
  • the first feature center belonging to the first object can be accurately obtained, which further improves the recognition accuracy of the trained first model.
  • Fig. 2 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 2, the method includes steps S21 to S25, wherein:
  • Step S21 acquiring a first sub-image and a second sub-image containing the first object.
  • the second sub-image may be an image after at least occlusion processing is performed on the first sub-image.
  • the second sub-image may include at least one image.
  • the multiple images may be images obtained by at least performing occlusion processing on the first sub-image respectively.
  • Performing at least occlusion processing may include but not limited to only occlusion processing, or occlusion processing and other processing, and the like.
  • other processing may include, but is not limited to, at least one of scaling, cropping, resizing, padding, flipping, color jittering, grayscale, Gaussian blur, and random erasing.
  • those skilled in the art may use an appropriate processing method on the first sub-image to obtain the second sub-image according to actual conditions, which is not limited in the embodiments of the present disclosure.
  • step S21 includes step S211 to step S212, wherein:
  • Step S211 acquiring a first sub-image including a first object.
  • the first sub-image may be any suitable image containing at least the first object.
  • The content contained in the first sub-image may be determined according to the actual application scene; for example, the first sub-image may contain only the first object, or may contain the first object together with at least one other object.
  • the first object may include, but is not limited to, people, animals, plants, objects, and the like.
  • the first sub-image is a face image containing Zhang San.
  • the first sub-image is an image including Li Si's whole person.
  • Step S212 based on the preset occlusion set, perform at least occlusion processing on the first sub-image to obtain a second sub-image.
  • the occlusion set includes at least one occlusion image.
  • the occlusion set may include, but is not limited to, one established based on at least one of a training set, other images, and the like.
  • the occlusion set includes at least a variety of occluder images and background images, such as leaves, vehicles, trash cans, buildings, trees, flowers, and the like. For example, image samples occluded by backgrounds and objects can be found in the training set, and the occluding parts can be manually cropped out to form an occlusion library.
  • a suitable image containing at least one object occlusion is selected, and the occlusion part is manually cut out to form an occlusion library.
  • those skilled in the art may choose an appropriate way to establish an occlusion set according to actual requirements, which is not limited by the embodiments of the present disclosure.
  • the position of the occluder may include, but not limited to, a specified position, a specified size, and the like.
  • For example, the specified position can be set to one of four positions, with the occluder covering one quarter to one half of the area.
  • those skilled in the art may determine the position of the barrier according to actual needs, which is not limited by the embodiments of the present disclosure.
  • performing at least occlusion processing may include, but is not limited to, occlusion processing and other processing.
  • the occlusion image is randomly selected from the occlusion library, and the size of the occlusion image is adjusted based on the adjustment rules.
  • the adjustment rule may include but not limited to adjusting the size of the occluder image, adjusting the size of the first image sample, and the like.
  • For example, if the height of the occluder image exceeds twice its width, the occlusion is regarded as vertical occlusion: the height of the occluder image can be adjusted to the height of the first image sample, and the width of the occluder image can be adjusted to one quarter to one half of the width of the first image sample. Otherwise, the occlusion is regarded as horizontal occlusion: the width of the occluder image can be adjusted to the width of the first image sample, and the height of the occluder image can be adjusted to one quarter to one half of the height of the first image sample.
  • those skilled in the art may determine the adjustment rule according to actual needs, which is not limited by the embodiments of the present disclosure.
  • For example, when the at least occlusion processing includes occlusion processing, resizing processing, padding processing and cropping processing, the resizing processing, padding processing and cropping processing may first be performed on the first image sample, and then the occlusion processing may be performed based on the occlusion set, as in the sketch below.
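  • A minimal sketch of this occlusion augmentation, assuming the occlusion library is a list of PIL images; the paste-position logic is a hypothetical choice, since the disclosure only specifies that the occluder covers part of the image:

```python
import random
from PIL import Image

def occlude(first_sub_image: Image.Image, occlusion_library: list) -> Image.Image:
    """Randomly pick an occluder, resize it with the vertical/horizontal rule, and paste it."""
    occluder = random.choice(occlusion_library)
    W, H = first_sub_image.size
    ow, oh = occluder.size
    ratio = random.uniform(0.25, 0.5)                              # one quarter to one half
    if oh > 2 * ow:                                                # vertical occlusion
        occluder = occluder.resize((int(W * ratio), H))
        position = (random.choice([0, W - occluder.size[0]]), 0)   # left or right side (assumed)
    else:                                                          # horizontal occlusion
        occluder = occluder.resize((W, int(H * ratio)))
        position = (0, random.choice([0, H - occluder.size[1]]))   # top or bottom (assumed)
    second_sub_image = first_sub_image.copy()
    second_sub_image.paste(occluder, position)
    return second_sub_image
```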
  • the method also includes step S213, wherein:
  • Step S213 based on the first sub-image and the second sub-image, determine an occlusion mask.
  • the occlusion mask is used to represent the occlusion information of the image.
  • the occlusion mask can be used for training the first model on object occlusion.
  • the occlusion mask may be determined based on pixel differences between the first sub-image and the second sub-image.
  • the difference between the first sub-image and the second sub-image can be calculated based on the following formula (2-1):
  • where x represents the first sub-image and x' represents the second sub-image.
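  • The body of formula (2-1) is not reproduced in this text. One plausible reconstruction, assuming the difference is computed as the mean absolute pixel difference between the two sub-images, is:

```latex
d = \frac{1}{\lvert x \rvert} \sum_{p} \left| x_p - x'_p \right| \tag{2-1}
```

  • where p indexes pixels and |x| is the number of pixels; this per-pixel form is an assumption, and the same expression can be applied per part when the images are divided as described below.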
  • step S213 includes steps S2131 to S2133, wherein:
  • Step S2131 Divide the first sub-image and the second sub-image into at least one first sub-part image and at least one second sub-part image respectively.
  • Fine-grained occlusion masks tend to contain many false labels due to the misalignment of semantics (e.g., body parts) between different images, so the first sub-image and the second sub-image can be roughly divided horizontally into a plurality of parts, and the occlusion mask is determined based on the pixel differences between each part of the first sub-image and the corresponding part of the second sub-image, for example, into four parts or five parts. During implementation, those skilled in the art may divide the first sub-image and the second sub-image according to actual requirements, which is not limited in the embodiments of the present disclosure.
  • Step S2132 based on each first sub-part image and each second sub-part image, determine an occlusion sub-mask.
  • The pixel difference between each first sub-part image and the corresponding second sub-part image can be obtained based on the above formula (2-1), and the occlusion sub-mask of each part can be determined based on the pixel difference of that part.
  • Step S2133 Determine an occlusion mask based on each occlusion sub-mask.
  • If d_i is not less than the first threshold, it indicates that this part of the image is occluded, and the occlusion sub-mask mask_i can be set to 0; otherwise, it indicates that this part is not occluded, and mask_i can be set to 1. The occlusion mask is then obtained by combining the occlusion sub-masks of all parts.
  • For example, if the first sub-image and the second sub-image are divided into four parts, and there is no occlusion in the first, second and third parts while there is occlusion in the fourth part, the occlusion mask should be 1110.
  • those skilled in the art may determine the occlusion mask according to actual needs, which is not limited by the embodiments of the present disclosure.
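  • A small NumPy sketch of steps S2131 to S2133 under these conventions (occluded part set to 0, unoccluded part set to 1); the part count, the threshold value and the use of the mean absolute difference are illustrative assumptions:

```python
import numpy as np

def occlusion_mask(first_sub_image, second_sub_image, num_parts=4, threshold=0.1):
    """Split both images horizontally, compare corresponding parts, build the mask."""
    x = np.asarray(first_sub_image, dtype=np.float32) / 255.0
    x_occ = np.asarray(second_sub_image, dtype=np.float32) / 255.0
    parts_x = np.array_split(x, num_parts, axis=0)       # horizontal strips
    parts_o = np.array_split(x_occ, num_parts, axis=0)
    mask = []
    for p, q in zip(parts_x, parts_o):
        d_i = np.abs(p - q).mean()                        # per-part pixel difference, cf. formula (2-1)
        mask.append(0 if d_i >= threshold else 1)         # occluded -> 0, not occluded -> 1
    return mask                                           # e.g. [1, 1, 1, 0] for the "1110" example
```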
  • Step S22 Using the first network of the first model to be trained, perform feature extraction on the first sub-image to obtain the first sub-feature of the first object, and perform feature extraction on the second sub-image to obtain the second sub-feature of the first object.
  • the first model may be any suitable model for object recognition based on image features.
  • the first model may include at least a first network.
  • the first sub-feature may include, but not limited to, the original feature of the first sub-image, or a feature obtained by processing the original feature.
  • the second sub-feature may include, but not limited to, the original feature of the second sub-image, or a feature obtained by processing the original feature.
  • the original features may include but not limited to facial features, body features, etc. of the objects contained in the image.
  • Step S23 using the second network of the first model, update the first sub-feature and the second sub-feature respectively based on the second features of at least one second object, to obtain the first target sub-feature corresponding to the first sub-feature and the second target sub-feature corresponding to the second sub-feature.
  • the similarity between each second object and the first object is not less than the first threshold.
  • the first threshold may be preset or obtained by statistics. During implementation, those skilled in the art may independently determine the setting manner of the first threshold according to actual needs, which is not limited in the embodiments of the present disclosure.
  • the similarity between the facial features of the second object and the first object is not less than the first threshold.
  • the similarity between the wearing features of the second object and the first object is not less than the first threshold.
  • For example, neither the similarity between the appearance characteristics of the second object and those of the first object nor the similarity between their clothing characteristics is less than the first threshold.
  • the second feature can be obtained based on the training set, or can be pre-input.
  • the second object may include, but is not limited to, people, animals, plants, objects, and the like.
  • the similarity between each second object and the first object may be obtained based on the similarity between the second feature of each second object and the first feature of the first object. In some implementations, the similarity between each second object and the first object may be obtained based on the similarity between the feature center of each second object and the first feature of the first object.
  • the first model may include a second memory feature library
  • the second memory feature library may include at least one feature of at least one object. The feature center of the second object may be obtained based on at least one feature belonging to the second object in the second memory feature library.
  • features of multiple image samples of at least one object in the training set may be extracted, and the extracted features may be stored in the second memory feature library according to their identity.
  • Step S24 Determine a target loss value based on the first target sub-feature and the second target sub-feature.
  • the target loss value may include, but is not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a contrastive loss value, and the like.
  • Step S25 based on the target loss value, update the model parameters of the first model at least once to obtain the trained first model.
  • the above-mentioned step S25 corresponds to the above-mentioned step S15, and the implementation manner of the above-mentioned step S15 can be referred to for implementation.
  • In the embodiments of the present disclosure, a first sub-image and a second sub-image containing the first object are acquired, where the second sub-image is an image obtained by performing at least occlusion processing on the first sub-image; the first network of the first model to be trained is used to perform feature extraction on the first sub-image to obtain the first sub-feature of the first object, and on the second sub-image to obtain the second sub-feature of the first object; the second network of the first model is used to update the first sub-feature and the second sub-feature respectively based on the second features of at least one second object, to obtain the first target sub-feature corresponding to the first sub-feature and the second target sub-feature corresponding to the second sub-feature, wherein the similarity between each second object and the first object is not less than the first threshold; a target loss value is determined based on the first target sub-feature and the second target sub-feature; and the model parameters of the first model are updated at least once based on the target loss value to obtain the trained first model.
  • In this way, when the target loss value does not meet the preset condition, the model parameters of the first model are updated at least once. Since the target loss value is determined based on the first target feature, the consistency of the trained first model's predictions for different image samples of the same object can be improved, so that the trained first model can more accurately re-identify objects in images containing object occlusion and/or multiple objects.
  • step S24 includes step S241 to step S243, wherein:
  • Step S241 Determine a first target loss value based on the first target sub-feature and the second target sub-feature.
  • the first target loss value may include, but is not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a contrastive loss value, and the like.
  • step S241 includes step S2411 to step S2413, wherein:
  • Step S2411 Based on the first target sub-feature, determine a third target sub-loss value.
  • step S2411 corresponds to the above-mentioned step S14, and the implementation manner of the above-mentioned step S14 can be referred to for implementation.
  • Step S2412. Based on the second target sub-feature, determine the fourth target sub-loss value.
  • step S2412 corresponds to the above-mentioned step S14, and the implementation of the above-mentioned step S14 can be referred to for implementation.
  • Step S2413 Determine the first target loss value based on the third target sub-loss value and the fourth target sub-loss value.
  • the first target loss value may include but not limited to the sum between the third target sub-loss value and the fourth target sub-loss value, the sum after weighting the third target sub-loss value and the fourth target sub-loss value, etc. .
  • those skilled in the art may determine the first target loss value according to actual needs, which is not limited by the embodiments of the present disclosure.
  • Step S242 Determine a second target loss value based on the first sub-feature and the second sub-feature.
  • the second target loss value may include, but is not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a contrastive loss value, and the like.
  • Step S243 Determine a target loss value based on the first target loss value and the second target loss value.
  • the target loss value may include, but not limited to, the sum of the first target loss value and the second target loss value, the sum after weighting the first target loss value and the second target loss value respectively, and the like.
  • those skilled in the art may determine the target loss value according to actual needs, which is not limited by the embodiments of the present disclosure.
  • the target loss value is determined based on the first sub-feature, the second sub-feature, the first target sub-feature and the second target sub-feature. In this way, the accuracy of the target loss value can be improved, so as to accurately judge whether the first model is converged.
  • the first network includes a first subnet and a second subnet
  • step S22 includes steps S221 to S222, wherein:
  • Step S221. Using the first sub-network of the first model to be trained, perform feature extraction on the first sub-image and the second sub-image respectively, to obtain the third sub-feature corresponding to the first sub-image and the fourth sub-feature corresponding to the second sub-image.
  • the first network includes at least a first subnetwork, and the first subnetwork is used to extract features of the image using a feature extractor.
  • the feature extractor may include, but is not limited to, RNN, CNN, a Transform-based feature extraction network, and the like.
  • those skilled in the art may use an appropriate first sub-network in the first model to obtain the third sub-feature according to actual conditions, which is not limited in the embodiments of the present disclosure.
  • a feature of the first sub-image is extracted through the first sub-network, and the feature is determined as a third sub-feature of the first object.
  • the third sub-feature may include but not limited to the original feature of the first sub-image and the like.
  • Step S222 using the second sub-network of the first model, determining the first sub-feature based on the third sub-feature, and determining the second sub-feature based on the fourth sub-feature.
  • the second sub-network may include an occlusion erasure network, which is used to perform occlusion erasure processing on input features and output unoccluded features.
  • the first sub-feature of the first object is obtained after occlusion and erasure processing is performed on the third sub-feature through the second sub-network.
  • the second sub-feature of the first object is obtained after the fourth sub-feature is occluded and erased through the second sub-network.
  • In this way, the overall network structure of the first model is trained by introducing occluder images as noise at the image level of the first image sample containing the first object, so that the robustness of the first model can be enhanced and the performance of the first model can be improved, which further enables the trained first model to more accurately re-identify objects in images containing object occlusions.
  • step S242 includes step S2421 to step S2423, wherein:
  • Step S2421 Based on the first sub-feature and the second sub-feature, determine a first target sub-loss value.
  • the first target sub-loss value may include, but is not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a contrastive loss value, and the like.
  • Step S2422 Based on the third sub-feature and the fourth sub-feature, determine a second target sub-loss value.
  • the second target sub-loss value may include, but is not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a contrastive loss value, and the like.
  • Step S2423 Determine a second target loss value based on the first target sub-loss value and the second target sub-loss value.
  • the second target loss value may include but not limited to the sum between the first target sub-loss value and the second target sub-loss value, the sum after weighting the first target sub-loss value and the second target sub-loss value, etc. .
  • those skilled in the art may determine the second target loss value according to actual needs, which is not limited by the embodiments of the present disclosure.
  • the second target loss value is determined based on the first sub-feature, the second sub-feature, the third sub-feature and the fourth sub-feature. In this way, the accuracy of the second target loss value can be improved, so as to accurately judge whether the first model converges.
  • the first sub-image includes label information
  • step S2422 includes steps S251 to S253, wherein:
  • Step S251. Determine a seventh sub-loss value based on the third sub-feature and label information.
  • The label information may include, but is not limited to, label values, identifiers, and the like.
  • the seventh sub-loss value may include but not limited to a cross-entropy loss value and the like.
  • the seventh sub-loss value can be calculated by the above formula (1-1); in this case, f_i in formula (1-1) is the third sub-feature.
  • Step S252 Determine an eighth sub-loss value based on the fourth sub-feature and label information.
  • the eighth sub-loss value may include but not limited to a cross-entropy loss value and the like.
  • the eighth sub-loss value may be determined according to the above formula (1-1); in this case, f_i in formula (1-1) is the fourth sub-feature.
  • Step S253 Determine a second target sub-loss value based on the seventh sub-loss value and the eighth sub-loss value.
  • the second target sub-loss value may include, but not limited to, the sum between the seventh sub-loss value and the eighth sub-loss value, the sum after weighting the seventh sub-loss value and the eighth sub-loss value, and the like.
  • those skilled in the art may determine the second target sub-loss value according to actual requirements, which is not limited in the embodiments of the present disclosure.
  • the second target sub-loss value is determined based on the third sub-feature, the fourth sub-feature and label information. In this way, the accuracy of the second target sub-loss value can be improved, so as to accurately judge whether the first model is converged.
  • the second subnetwork includes a third subnetwork and a fourth subnetwork
  • step S222 includes steps S2221 to S2222, wherein:
  • Step S2221 using the third sub-network of the first model to determine the first occlusion score based on the third sub-feature, and determine the second occlusion score based on the fourth sub-feature.
  • the second sub-network includes at least a third sub-network
  • the third sub-network is used to perform semantic analysis based on features of the image to obtain an occlusion score corresponding to the image.
  • the third subnetwork includes a pooling subnetwork and at least one occlusion erasure subnetwork, the first occlusion score includes at least one first occlusion subscore, and the second occlusion score includes at least one second occlusion subscore;
  • the above step S2221 includes steps S261 to S262, wherein:
  • Step S261 Divide the third sub-feature into at least one third sub-part feature by using the pooling sub-network, and divide the fourth sub-feature into at least one fourth sub-part feature.
  • the pooling sub-network is used to divide the input feature to obtain at least one sub-part feature of the feature.
  • the number of third sub-part features may be the same as the number of parts into which the first sub-image is divided. For example, if the first sub-image is divided into four parts, the third sub-feature can be divided into four third sub-part features through the pooling sub-network, and each third sub-part feature corresponds to one part feature f_i.
  • Step S262. Using each occlusion erasure sub-network, determine a first occlusion sub-score based on each third sub-part feature, and determine a second occlusion sub-score based on each fourth sub-part feature.
  • each occlusion erasure sub-network is used to perform semantic analysis on the input feature to obtain the occlusion score of the image corresponding to the feature.
  • each occlusion erasing sub-network consists of two fully connected layers, a layer normalization and an activation function, wherein the layer normalization is located between the two fully connected layers, and the activation function is located at the end .
  • the activation function can be a sigmoid function.
  • the number of occlusion-erasing sub-networks is the same as the number of parts into which the first sub-image is divided. For example, if the first sub-image is divided into four parts, and the feature corresponding to each part is f_i, the third sub-network includes four occlusion-erasing sub-networks, and each occlusion-erasing sub-network is used to output the occlusion score corresponding to f_i. Similarly, if the first sub-image is divided into five parts, and the feature corresponding to each part is f_i, the third sub-network includes five occlusion-erasing sub-networks, and each occlusion-erasing sub-network is used to output the occlusion score corresponding to f_i.
  • the occlusion score can be calculated by the following formula (2-2):
  • where W_cp and W_rg are matrices (the weights of the two fully connected layers), LN is layer normalization, c represents the channel dimension, and f_i represents the feature of the i-th part in the third sub-feature or the fourth sub-feature.
  • For example, the third sub-feature is divided into four third sub-part features through the pooling sub-network, and each third sub-part feature is input into the corresponding occlusion-erasing sub-network. The first fully connected layer W_cp compresses the channel dimension to a quarter of the original, layer normalization is performed on the features with the compressed channel dimension, the layer-normalized features are then compressed to one dimension, and finally the occlusion score corresponding to the third sub-part feature is output through the Sigmoid function.
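  • Read this way, one occlusion-erasing sub-network could be sketched as the following PyTorch module; the class name and the assumption that the two fully connected layers are plain nn.Linear layers are illustrative, not taken from the disclosure:

```python
import torch
import torch.nn as nn

class OcclusionErasingSubNetwork(nn.Module):
    """FC (c -> c/4), LayerNorm, FC (c/4 -> 1), Sigmoid, cf. formula (2-2)."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_cp = nn.Linear(channels, channels // 4)   # compress channel dimension to a quarter
        self.ln = nn.LayerNorm(channels // 4)            # layer normalization between the two FC layers
        self.w_rg = nn.Linear(channels // 4, 1)          # compress to one dimension
        self.act = nn.Sigmoid()                          # activation function at the end

    def forward(self, f_i: torch.Tensor) -> torch.Tensor:
        # s_i = Sigmoid(W_rg * LN(W_cp * f_i)), one plausible reading of formula (2-2)
        return self.act(self.w_rg(self.ln(self.w_cp(f_i))))
```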
  • Step S2222. Using the fourth sub-network, determine the first sub-feature based on the third sub-feature and the first occlusion score, and determine the second sub-feature based on the fourth sub-feature and the second occlusion score.
  • the second subnetwork further includes a fourth subnetwork, and the fourth subnetwork is used to determine features after occlusion erasure.
  • step S2222 includes steps S271 to S272, wherein:
  • Step S271 using the fourth sub-network, determine each first sub-part feature based on each third sub-part feature of the third sub-feature and each first occlusion sub-score, and determine each second sub-part feature based on each fourth sub-part feature of the fourth sub-feature and each second occlusion sub-score.
  • the first sub-part feature or the second sub-part feature can be calculated by the following formula (2-3):
  • where s_i denotes the i-th occlusion score and f_i denotes the i-th third sub-part feature or fourth sub-part feature.
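  • The body of formula (2-3) is not reproduced here. Given the definitions above, a plausible reconstruction is that each sub-part feature is re-weighted by its occlusion score:

```latex
\tilde{f}_i = s_i \cdot f_i \tag{2-3}
```

  • where the notation \tilde{f}_i for the resulting first or second sub-part feature is assumed for illustration.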
  • the second feature memory may be updated based on the first sub-feature.
  • the way of updating may include, but not limited to, adding the first sub-feature to the second feature storage, replacing a certain feature in the second feature storage with the first sub-feature, and so on.
  • Step S272 Determine the first sub-feature based on each first sub-part feature, and determine the second sub-feature based on each second sub-part feature.
  • the first sub-feature can be obtained by concatenating the at least one first sub-part feature, and the second sub-feature can likewise be obtained by concatenating the at least one second sub-part feature.
  • the accuracy of the first sub-feature and the second sub-feature can be improved by using the pooling sub-network, at least one occlusion-erasing sub-network and the fourth sub-network.
  • the first sub-image includes label information
  • the first model includes a second feature memory
  • the second feature memory includes at least one feature belonging to at least one object
  • the above step S2421 includes steps S281 to S285, wherein:
  • Step S281. Determine an occlusion mask based on the first sub-image and the second sub-image.
  • step S281 corresponds to the above-mentioned step S213, and the implementation manner of the above-mentioned step S213 can be referred to for implementation.
  • Step S282. Determine a third loss value based on the first occlusion score, the second occlusion score and the occlusion mask.
  • the third loss value may include, but not limited to, a mean square error loss value and the like.
  • Step S283 Determine a fourth loss value based on the first sub-feature, the second sub-feature and label information.
  • the fourth loss value may include but not limited to a cross-entropy loss value and the like.
  • Step S284 Determine a fifth loss value based on the first sub-feature, the second sub-feature, and at least one feature of at least one object in the second feature memory.
  • the fifth loss value may include, but is not limited to, a contrastive loss value and the like.
  • Step S285 based on the third loss value, the fourth loss value and the fifth loss value, determine the first target sub-loss value.
  • the first target sub-loss value may include but not limited to the sum of the third loss value, the fourth loss value and the fifth loss value, after weighting the third loss value, the fourth loss value and the fifth loss value respectively and so on.
  • those skilled in the art may determine the first target sub-loss value according to actual needs, which is not limited in the embodiments of the present disclosure.
  • the first target sub-loss value is determined based on the occlusion mask, the first sub-feature, the second sub-feature, label information and other object characteristics. In this way, the accuracy of the first target sub-loss value can be improved, so as to accurately judge whether the first model is converged.
  • step S282 includes step S2821 to step S2823, wherein:
  • Step S2821 Determine a first sub-loss value based on the first occlusion score and the occlusion mask.
  • the first sub-loss value may include, but not limited to, a mean square error loss value and the like.
  • the first sub-loss value can be calculated according to the following formula (2-4):
  • where N is the total number of occlusion-erasing sub-networks, s_i represents the i-th occlusion score, and mask_i represents the i-th occlusion sub-mask in the occlusion mask.
  • For example, when the occlusion mask is 1110, mask_1 is 1 and mask_4 is 0.
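  • The body of formula (2-4) is not reproduced in this text. Assuming the mean square error form indicated above, the first sub-loss value could read:

```latex
L_{sub1} = \frac{1}{N} \sum_{i=1}^{N} \left( s_i - mask_i \right)^2 \tag{2-4}
```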
  • Step S2822 Determine a second sub-loss value based on the second occlusion score and the occlusion mask.
  • the second sub-loss value may include, but not limited to, a mean square error loss value and the like.
  • the manner of determining the second sub-loss value may be the same as that of determining the first sub-loss value, see step S2821 for details.
  • Step S2823 Determine a third loss value based on the first sub-loss value and the second sub-loss value.
  • the third loss value may include, but not limited to, the sum of the first sub-loss value and the second sub-loss value, the sum after weighting the first sub-loss value and the second sub-loss value, and the like.
  • those skilled in the art may determine the third loss value according to actual requirements, which is not limited in the embodiments of the present disclosure.
  • the third loss value is determined based on the first occlusion score, the second occlusion score and the occlusion mask. In this way, the accuracy of the third loss value can be improved, so as to accurately judge whether the first model is converged.
  • step S283 includes step S2831 to step S2833, wherein:
  • Step S2831 Determine a third sub-loss value based on the first sub-feature and label information.
• the label information may include, but is not limited to, label values, identifiers, and the like.
  • the third sub-loss value may include, but not limited to, a cross-entropy loss value and the like.
  • the third sub-loss value can be calculated by the above formula (1-1), at this time, f i in the formula (1-1) is the first sub-feature.
  • Step S2832 Determine a fourth sub-loss value based on the second sub-feature and label information.
  • the fourth sub-loss value may include but not limited to a cross-entropy loss value and the like.
  • the fourth sub-loss value can be calculated by the above formula (1-1), and at this time, f i in the formula (1-1) is the second sub-feature.
  • Step S2833 Determine a fourth loss value based on the third sub-loss value and the fourth sub-loss value.
  • the fourth loss value may include, but not limited to, the sum between the third sub-loss value and the fourth sub-loss value, the sum after weighting the third sub-loss value and the fourth sub-loss value, and the like.
  • those skilled in the art may determine the fourth loss value according to actual requirements, which is not limited in the embodiments of the present disclosure.
  • the fourth loss value is determined based on the first sub-feature, the second sub-feature and label information. In this way, the accuracy of the fourth loss value can be improved, so as to accurately judge whether the first model is converged.
  • step S284 includes step S2841 to step S2844, wherein:
  • Step S2841 From at least one feature of at least one object in the second feature memory, determine a third feature center of the first object and a fourth feature center of at least one second object.
  • the third feature center may be determined based on the feature of the first object in the second feature memory library and the first sub-feature.
  • Each fourth feature center may be determined based on each feature of each second object in the second feature memory.
  • the feature center of each object can be calculated by the following formula (2-5):
  • c x represents the feature center of the x-th object
  • B k represents the feature set belonging to the k-th object in the mini-batch
  • m is the set update momentum coefficient
• f_i′ is the first sub-feature of the i-th sample.
  • m can be 0.2.
• In the case that f_i′ and B_k belong to the same object, the feature center c_k belonging to that object will change; in the case that f_i′ and B_k do not belong to the same object, the feature center c_k remains consistent with the previous c_k.
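• A minimal sketch of one plausible reading of formula (2-5), assuming the common momentum-average form of a feature center update; the exact form of (2-5), the helper name and the tensor shapes are assumptions.
```python
import torch

def update_center(c_k: torch.Tensor, batch_feats_k: torch.Tensor, m: float = 0.2) -> torch.Tensor:
    # c_k:            (d,) current feature center of the k-th object
    # batch_feats_k:  (n, d) first sub-features f_i' of the k-th object in the mini-batch (set B_k)
    # m:              update momentum coefficient (for example 0.2)
    if batch_feats_k.numel() == 0:
        return c_k                                # no sample of this object: c_k stays unchanged
    return m * c_k + (1.0 - m) * batch_feats_k.mean(dim=0)
```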
  • Step S2842 based on the first sub-feature, the third feature center and each fourth feature center, determine the fifth sub-loss value.
  • the fifth sub-loss value may include but not limited to contrastive loss and the like.
• the fifth sub-loss value can be calculated by the following formula (2-6):
$-\log\frac{\exp\left(\langle f_i, c_y\rangle/\tau\right)}{\sum_{z=1}^{ID_S}\exp\left(\langle f_i, c_z\rangle/\tau\right)} \quad (2\text{-}6)$
• where $\tau$ is a predefined temperature parameter, $c_y$ represents the third feature center of the y-th object, $c_z$ represents the z-th fourth feature center, $f_i$ represents the first sub-feature of the i-th object, and $ID_S$ represents the total number of objects in the training set.
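• A minimal sketch of the temperature-scaled contrastive form of formula (2-6); the feature normalization and the use of a cross-entropy call to realize the negative log-softmax are implementation assumptions for illustration only.
```python
import torch
import torch.nn.functional as F

def center_contrastive_loss(f_i: torch.Tensor, centers: torch.Tensor, y: int,
                            tau: float = 0.05) -> torch.Tensor:
    # f_i:     (d,) first sub-feature of the sample
    # centers: (ID_S, d) feature centers (the third feature center plus the fourth feature centers)
    # y:       index of the third feature center c_y, i.e. the object the sample belongs to
    # tau:     predefined temperature parameter
    logits = F.normalize(centers, dim=1) @ F.normalize(f_i, dim=0) / tau   # (ID_S,)
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([y]))
```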
  • Step S2843 based on the second sub-feature, the third feature center and each fourth feature center, determine the sixth sub-loss value.
  • the sixth sub-loss value may include but not limited to contrastive loss and the like.
  • the manner of determining the sixth sub-loss value may be the same as that of determining the fifth sub-loss value, see step S2842 for details.
  • Step S2844 Determine a sixth loss value based on the fifth sub-loss value and the sixth sub-loss value.
  • the sixth loss value may include, but not limited to, the sum between the fifth sub-loss value and the sixth sub-loss value, the sum after weighting the fifth sub-loss value and the sixth sub-loss value, and the like.
  • those skilled in the art may determine the sixth loss value according to actual needs, which is not limited in the embodiments of the present disclosure.
  • the sixth loss value is determined based on the first sub-feature, the second sub-feature and other object characteristics. In this way, the accuracy of the sixth loss value can be improved, so as to accurately judge whether the first model is converged.
  • the second network includes a fifth subnetwork and a sixth subnetwork
  • step S23 includes steps S231 to S232, wherein:
• Step S231. Using the fifth sub-network, aggregate the first sub-feature and the second sub-feature respectively with the second feature of at least one second object, to obtain the first aggregated sub-feature corresponding to the first sub-feature and the second aggregated sub-feature corresponding to the second sub-feature.
  • the second network includes at least a fifth sub-network
• the fifth sub-network is used to aggregate the first sub-feature with the second features of at least one second object to obtain the first aggregated sub-feature, and to aggregate the second sub-feature with the second features of at least one second object to obtain the second aggregated sub-feature.
  • Step S232 Using the sixth sub-network, determine the first target sub-feature based on the first aggregated sub-feature, and determine the second target sub-feature based on the second aggregated sub-feature.
  • the second network further includes a sixth sub-network for determining the first target sub-feature based on the first aggregated sub-feature, and determining the second target sub-feature based on the second aggregated sub-feature.
• the overall network structure of the first model is trained by introducing the features of the second object as noise at the feature level of the first image sample containing the first object, so that the robustness of the first model can be enhanced and the performance of the first model can be improved, thereby enabling the trained first model to more accurately re-identify objects in images containing multiple objects.
  • step S231 includes step S2311 to step S2314, wherein:
  • Step S2311 based on the first sub-feature and each second feature, determine a first attention matrix.
  • the first attention matrix is used to represent the degree of association between the first sub-feature and each second feature.
  • X second features belonging to at least one second object are determined, where X is a positive integer.
  • X can be 10.
• the X second features belonging to second objects that are closest to the first sub-feature can be searched for in the second feature memory library, and based on each second feature, X first centers can be determined; when looking up, the distance can be calculated based on the cosine distance between features.
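• A minimal sketch of the memory lookup described above: retrieving the X second features closest to the first sub-feature by cosine similarity; the memory layout and the function name are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def search_second_features(f_prime: torch.Tensor, memory: torch.Tensor, X: int = 10) -> torch.Tensor:
    # f_prime: (d,) first sub-feature; memory: (M, d) features in the second feature memory
    sims = F.normalize(memory, dim=1) @ F.normalize(f_prime, dim=0)   # cosine similarity, (M,)
    idx = sims.topk(k=min(X, memory.shape[0])).indices
    return memory[idx]                                                # (X, d) retrieved second features
```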
  • the network parameters of the fifth sub-network include a first prediction matrix and a second prediction matrix
  • step S2311 includes steps S2321 to S2323, wherein:
  • Step S2321 based on the first sub-feature and the first prediction matrix, determine the first prediction feature.
• the first predictive feature can be calculated by the following formula (2-7):
$f_q = W_1 f' \quad (2\text{-}7)$
• where $W_1$ is the first prediction matrix, $f'$ represents the first sub-feature, and both d and d' are feature dimensions of $f'$.
  • Step S2322. Based on each second feature and the second predictive matrix, determine a second predictive feature.
• the second predictive feature can be calculated by the following formula (2-8):
$f_c = W_2 c \quad (2\text{-}8)$
• where $W_2$ is the second prediction matrix, c is a second feature, and both d and d' are feature dimensions of the first sub-feature.
  • Step S2323 Determine a first attention matrix based on the first predictive feature and each second predictive feature.
• the first attention matrix can be determined by the following formula (2-9):
$m_i = \frac{\exp\left(\langle f_q, f_{c_i}\rangle/\alpha\right)}{\sum_{j=1}^{X}\exp\left(\langle f_q, f_{c_j}\rangle/\alpha\right)} \quad (2\text{-}9)$
• where X represents the total number of second features, i ∈ {1, 2, ..., X}, and $\alpha$ is a scaling factor.
  • Step S2312 based on each second feature and each first attention matrix, determine the first aggregation sub-feature.
  • the network parameters of the fifth sub-network also include a third prediction matrix
  • step S2312 includes steps S2331 to S2332, wherein:
  • Step S2331. Based on each second feature and the third predictive matrix, determine a third predictive feature.
• the third predictive feature can be calculated by the following formula (2-10):
$f_v = W_3 c \quad (2\text{-}10)$
• where $W_3$ is the third prediction matrix, c is a second feature, and both d and d' are feature dimensions of the first sub-feature.
  • Step S2332 based on each third predictive feature and each first attention matrix, determine the first aggregation sub-feature.
• the first aggregation sub-feature can be determined by the following formula (2-11):
$f_d = \sum_{i=1}^{X} m_i \cdot f_{v_i} \quad (2\text{-}11)$
• where $m_i$ represents the i-th first attention matrix, and $f_{v_i}$ represents the i-th third predictive feature.
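• A minimal sketch combining formulas (2-7) to (2-11): project the first sub-feature with the first prediction matrix, project the retrieved second features with the second and third prediction matrices, form normalized attention weights and aggregate. The softmax normalization, the scaling factor value and the single-head layout are assumptions (the text below also mentions a multi-head variant).
```python
import torch
import torch.nn as nn

class FeatureAggregation(nn.Module):
    def __init__(self, d: int, d_prime: int):
        super().__init__()
        self.W1 = nn.Linear(d, d_prime, bias=False)   # first prediction matrix, formula (2-7)
        self.W2 = nn.Linear(d, d_prime, bias=False)   # second prediction matrix, formula (2-8)
        self.W3 = nn.Linear(d, d_prime, bias=False)   # third prediction matrix, formula (2-10)
        self.alpha = d_prime ** 0.5                    # assumed scaling factor in formula (2-9)

    def forward(self, f_prime: torch.Tensor, second_feats: torch.Tensor) -> torch.Tensor:
        # f_prime: (d,) first sub-feature; second_feats: (X, d) retrieved second features
        f_q = self.W1(f_prime)                                   # (d',)
        f_c = self.W2(second_feats)                              # (X, d')
        f_v = self.W3(second_feats)                              # (X, d')
        m = torch.softmax(f_c @ f_q / self.alpha, dim=0)         # (X,) first attention weights
        return (m.unsqueeze(1) * f_v).sum(dim=0)                 # (d',) first aggregation sub-feature
```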
  • Step S2313 Determine a second attention matrix based on the second sub-features and each second feature.
  • the second attention matrix is used to characterize the degree of association between the second sub-features and each second feature.
  • the manner of determining the second attention matrix may be the same as that of determining the first attention matrix, see step S2321 to step S2323.
  • Step S2314 based on each second feature and each second attention matrix, determine a second aggregation sub-feature.
  • the manner of determining the second aggregation sub-feature may be the same as that of determining the first aggregation sub-feature, see step S2331 to step S2332 for details.
• each first center is divided into multiple parts by a multi-head operation, and an attention weight is assigned to each part, so as to ensure that more unique patterns similar to the target object and non-target objects can be aggregated; the robustness of the first model is thereby enhanced, so that the trained first model can more accurately re-identify objects in images containing multiple objects.
  • the sixth subnetwork includes the seventh subnetwork and the eighth subnetwork, and the above step S232 includes steps S2341 to S2343, wherein:
  • Step S2341 Determine an occlusion mask based on the first sub-image and the second sub-image.
  • the occlusion mask is used to represent the occlusion information of the image.
  • the occlusion mask may be determined based on pixel differences between the first sub-image and the second sub-image.
  • Step S2342 Using the seventh sub-network, determine the fifth sub-feature based on the first aggregation sub-feature and the occlusion mask, and determine the sixth sub-feature based on the second aggregation sub-feature and the occlusion mask.
  • the seventh sub-network may be an FFN 1 ( ⁇ ) neural network including two fully connected layers and an activation function.
  • the fifth sub-feature or the sixth sub-feature can be obtained by the following formula (2-12):
  • mask is the occlusion mask and f d is the first aggregated sub-feature or the second aggregated sub-feature.
  • Step S2343 Using the eighth sub-network, determine the first target sub-feature based on the first sub-feature and the fifth sub-feature, and determine the second target sub-feature based on the second sub-feature and the sixth sub-feature.
  • the eighth sub-network may be an FFN 2 ( ⁇ ) neural network including two fully connected layers and an activation function.
  • the first target sub-feature or the second target sub-feature can be obtained by the following formula (2-13):
  • f" is the fifth sub-feature or the sixth sub-feature
  • f' is the first sub-feature or the second sub-feature
• the target feature is obtained in this way, which can ensure that the features of other objects are only added to the human body part of the first object and not to the pre-identified object occlusion part, in order to better simulate the features of multi-pedestrian images.
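• A heavily hedged sketch of one possible reading of formulas (2-12) and (2-13): FFN_1(·) takes the aggregated sub-feature together with the occlusion mask, and FFN_2(·) fuses its output with the original sub-feature into the target sub-feature. The concatenation-based combination and the layer sizes are assumptions; the exact operators of (2-12) and (2-13) are not reproduced here.
```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    # two fully connected layers and an activation function, as FFN_1(.) / FFN_2(.)
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(), nn.Linear(d_out, d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def target_sub_feature(f_prime: torch.Tensor, f_d: torch.Tensor, mask: torch.Tensor,
                       ffn1: FFN, ffn2: FFN) -> torch.Tensor:
    # f_prime: (d,) first (or second) sub-feature; f_d: (d',) aggregated sub-feature
    # mask: (N,) occlusion mask, conditioning the diffused noise on the visible parts
    f_pp = ffn1(torch.cat([f_d, mask]))              # assumed form of formula (2-12)
    return ffn2(torch.cat([f_prime, f_pp]))          # assumed form of formula (2-13)

# usage sketch: ffn1 = FFN(d_prime + n_parts, d_prime); ffn2 = FFN(d + d_prime, d)
```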
  • Fig. 3 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 3, the method includes steps S31 to S37, wherein:
  • Step S31 acquiring a first image sample including a first object.
  • Step S32 using the first network of the first model to be trained, to perform feature extraction on the first image sample to obtain the first feature of the first object.
• Step S33. Using the second network of the first model, update the first feature based on the second feature of at least one second object to obtain the first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than the first threshold.
  • Step S34 Determine a target loss value based on the first target feature.
  • Step S35 based on the target loss value, update the model parameters of the first model at least once to obtain the trained first model.
  • steps S31 to S35 correspond to the above-mentioned steps S11 to S15 respectively, and for implementation, reference may be made to the specific implementation manners of the above-mentioned steps S11 to S15.
  • Step S36 Determine an initial second model based on the trained first model.
  • the network of the trained first model may be adjusted according to an actual usage scenario, and the adjusted first model may be determined as the initial second model.
• where the first model includes a first network and a second network, the second network in the trained first model can be removed, the first network of the first model can be adjusted according to the actual scene, and the adjusted first model can be determined as the initial second model.
  • Step S37 based on at least one second image sample, update the model parameters of the second model to obtain a trained second model.
  • the second image sample may have label information, or may not have label information.
  • those skilled in the art may determine a suitable second image sample according to an actual application scenario, which is not limited here.
  • fine-tuning training may be performed on model parameters of the second model to obtain a trained second model.
  • an initial second model is determined based on the trained first model, and model parameters of the second model are updated based on at least one second image sample to obtain a trained second model.
• the model parameters of the trained first model can be migrated to the second model to be applicable to various application scenarios, which can not only reduce the amount of calculation in practical applications, but also improve the training efficiency of the second model and the detection accuracy of the trained second model.
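• A minimal sketch of steps S36 and S37 under illustrative assumptions: the second network of the trained first model is removed, its first network is reused as the backbone of the initial second model with a new head adjusted to the scenario, and the second model is fine-tuned on labelled second image samples; all attribute and module names here are hypothetical.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_second_model(first_model: nn.Module, feature_dim: int, num_classes: int) -> nn.Module:
    # Step S36 sketch: keep the first network (feature extraction), drop the second network,
    # and attach a new classification head for the actual usage scenario.
    return nn.Sequential(first_model.first_network, nn.Linear(feature_dim, num_classes))

def finetune(second_model: nn.Module, loader, epochs: int = 5, lr: float = 1e-4) -> nn.Module:
    # Step S37 sketch: update the model parameters of the second model on second image samples.
    opt = torch.optim.AdamW(second_model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:                      # labelled second image samples
            loss = F.cross_entropy(second_model(images), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return second_model
```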
  • Fig. 4 is an image recognition method provided by an embodiment of the present disclosure. As shown in Fig. 4, the method includes steps S41 to S42, wherein:
  • Step S41 acquiring a first image and a second image.
  • the first image and the second image may be any suitable images to be recognized. During implementation, those skilled in the art may select an appropriate image according to an actual application scenario, which is not limited by the embodiments of the present disclosure.
  • the first image may include an occluded image or an unoccluded image.
  • the sources of the first image and the second image may be the same or different.
  • both the first image and the second image are images captured by a camera.
  • the first image is an image captured by a camera
  • the second image may be a frame of an image in a video.
  • Step S42 using the trained target model, to recognize the object in the first image and the object in the second image, and obtain a recognition result.
  • the trained target model may include but not limited to at least one of the first model and the second model.
  • the recognition result indicates that the object in the first image and the object in the second image are the same object or different objects.
• the first target feature corresponding to the first image and the second target feature corresponding to the second image are obtained respectively, and the recognition result is obtained based on the similarity between the first target feature and the second target feature.
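• A minimal sketch of steps S41 and S42 under illustrative assumptions: the trained target model extracts a target feature from each image, and the recognition result is obtained by thresholding the cosine similarity between the two target features; the threshold value and the decision rule are assumptions.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognize(model: torch.nn.Module, image_1: torch.Tensor, image_2: torch.Tensor,
              threshold: float = 0.5) -> bool:
    # image_1, image_2: (C, H, W) images to be recognized
    f1 = F.normalize(model(image_1.unsqueeze(0)), dim=1)    # first target feature, (1, d)
    f2 = F.normalize(model(image_2.unsqueeze(0)), dim=1)    # second target feature, (1, d)
    similarity = (f1 * f2).sum().item()                     # cosine similarity
    return similarity >= threshold    # True: same object; False: different objects
```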
• since the model training method in the above embodiments can introduce real noise at the feature level, or introduce real noise at both the image level and the feature level, the overall network structure of the target model is trained, the robustness of the target model is enhanced, and the performance of the target model is effectively improved. Therefore, recognizing images based on the first model and/or the second model obtained by the model training method in the above embodiments enables more accurate pedestrian re-identification.
• FIG. 5A is a schematic diagram of the composition and structure of a model training system 50 provided by an embodiment of the present disclosure. As shown in FIG. 5A, the model training system 50 includes an augmentation part 51, an occlusion erasing part 52, a feature diffusion part 53, an updating part 54 and a feature memory part 55, wherein:
  • the augmentation part 51 is configured to at least perform occlusion processing on the first sub-image containing the first object to obtain the second sub-image.
  • the occlusion erasing part 52 is configured to use the first network of the first model to be trained to perform feature extraction on the first sub-image, obtain the first sub-feature of the first object, and perform feature extraction on the second sub-image, Get the second subfeature of the first object.
  • the feature diffusion part 53 is configured to use the second network of the first model to update the first sub-feature and the second sub-feature respectively based on the second feature of at least one second object, and obtain the first sub-feature corresponding to the first sub-feature A target sub-feature and a second target sub-feature corresponding to the second sub-feature, the similarity between each second object and the first object is not less than the first threshold.
  • the updating part 54 is configured to determine a target loss value based on the first target sub-feature and the second target sub-feature; based on the target loss value, update the model parameters of the first model at least once to obtain the trained first model.
  • the feature memory part 55 is configured to store at least one feature of at least one object.
  • the feature memory part 55 includes a first feature memory and a second feature memory, the first feature memory is used to store the first sub-feature of at least one object, and the second feature memory is used to store at least The first target subfeature of an object.
  • FIG. 5B is a schematic diagram of a model training system 500 provided by an embodiment of the present disclosure.
• the model training system 500 performs augmentation processing on an input first image 501 to obtain a second image 502; after the first image 501 and the second image 502 are input to the occlusion erasing part 52, the first sub-feature f1′ and the second sub-feature f2′ are obtained respectively, and the second feature memory 552 is updated based on the first sub-feature f1′; after the first sub-feature f1′, the second sub-feature f2′ and at least one feature of at least one other object selected from the second feature memory 552 are input to the feature diffusion part 53, the first target sub-feature fd1′ and the second target sub-feature fd2′ are obtained respectively;
• based on the first target sub-feature fd1′, the first feature memory 551 is updated, and the network parameters of the first model are then updated based on the target loss value.
  • the augmentation part 51 is further configured to: determine an occlusion mask based on the first sub-image and the second sub-image.
• Fig. 5C is a schematic diagram of determining an occlusion mask provided by an embodiment of the present disclosure. As shown in Fig. 5C, a pixel comparison operation 503 is performed between the first sub-image 501 and the second sub-image 502; after the pixel comparison operation 503, a binarization operation 504 is performed on the comparison result, and a corresponding occlusion mask 505 is obtained after the binarization operation 504.
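• A minimal sketch of the pixel comparison operation 503 and binarization operation 504 in FIG. 5C under illustrative assumptions: the first and second sub-images are compared pixel-wise, the difference is binarized, and one occlusion sub-mask per horizontal part is derived; the part splitting and the majority rule are assumptions.
```python
import torch

def occlusion_mask(img1: torch.Tensor, img2: torch.Tensor, n_parts: int = 4,
                   eps: float = 1e-3) -> torch.Tensor:
    # img1: (C, H, W) first sub-image; img2: (C, H, W) second (occluded) sub-image
    diff = (img1 - img2).abs().sum(dim=0)            # pixel comparison operation 503, (H, W)
    changed = (diff > eps).float()                    # binarization operation 504
    parts = changed.chunk(n_parts, dim=0)             # split along height into sub-part images
    # assumed rule: a sub-part whose pixels mostly changed is marked occluded (0), else visible (1)
    return torch.tensor([0.0 if p.mean() > 0.5 else 1.0 for p in parts])
```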
  • the first network includes a first sub-network and a second sub-network
• the occlusion erasing part 52 is further configured to: use the first sub-network of the first model to be trained to perform feature extraction on the first sub-image and the second sub-image respectively, to obtain the third sub-feature corresponding to the first sub-image and the fourth sub-feature corresponding to the second sub-image; and use the second sub-network of the first model to determine the first sub-feature based on the third sub-feature, and determine the second sub-feature based on the fourth sub-feature.
  • FIG. 5D is a schematic diagram of a first network 510 provided by an embodiment of the present disclosure.
  • the first network 510 includes a first sub-network 511 and a second sub-network 512.
• the first sub-image 501 and the second sub-image 502 are input into the first sub-network 511 to obtain the third sub-feature f1 corresponding to the first sub-image 501 and the fourth sub-feature f2 corresponding to the second sub-image 502, and the third sub-feature f1 and the fourth sub-feature f2 are input into the second sub-network 512 to obtain the first sub-feature f1′ and the second sub-feature f2′.
  • the second subnetwork includes a third subnetwork and a fourth subnetwork
• the occlusion erasing part 52 is further configured to: use the third sub-network of the first model to determine the first occlusion score based on the third sub-feature, and determine the second occlusion score based on the fourth sub-feature; and use the fourth sub-network to determine the first sub-feature based on the third sub-feature and the first occlusion score, and determine the second sub-feature based on the fourth sub-feature and the second occlusion score.
  • FIG. 5E is a schematic diagram of a second subnetwork 512 provided by an embodiment of the present disclosure.
• the second sub-network 512 includes a third sub-network 521 and a fourth sub-network 522. The third sub-feature f1 and the fourth sub-feature f2 are input into the third sub-network 521 to obtain the first occlusion score s1 corresponding to the third sub-feature f1 and the second occlusion score s2 corresponding to the fourth sub-feature f2, respectively; the first occlusion score s1 and the third sub-feature f1 are input to the fourth sub-network 522 to obtain the first sub-feature f1′, and the second occlusion score s2 and the fourth sub-feature f2 are input to the fourth sub-network 522 to obtain the second sub-feature f2′.
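• A minimal sketch of the third sub-network 521 and fourth sub-network 522 in FIG. 5E under illustrative assumptions: each part feature from the pooling sub-network is scored, and the scores weight the part features before they are concatenated into the first sub-feature f1′; the sigmoid scoring head and the weight-then-concatenate form are assumptions.
```python
import torch
import torch.nn as nn

class OcclusionErasing(nn.Module):
    def __init__(self, d_part: int):
        super().__init__()
        self.scorer = nn.Linear(d_part, 1)     # occlusion erasure sub-network (shared across parts)

    def forward(self, part_feats: torch.Tensor):
        # part_feats: (n_parts, d_part) sub-part features from the pooling sub-network
        scores = torch.sigmoid(self.scorer(part_feats)).squeeze(1)   # (n_parts,) occlusion scores s_i
        erased = part_feats * scores.unsqueeze(1)                    # suppress occluded parts
        return erased.flatten(), scores   # first sub-feature f1' (concatenated parts), first occlusion score s1
```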
  • the second network includes a fifth sub-network and a sixth sub-network
• the feature diffusion part 53 is further configured to: use the fifth sub-network to aggregate the first sub-feature and the second sub-feature respectively with the second feature of at least one second object, to obtain the first aggregated sub-feature corresponding to the first sub-feature and the second aggregated sub-feature corresponding to the second sub-feature; and use the sixth sub-network to determine the first target sub-feature based on the first aggregated sub-feature, and determine the second target sub-feature based on the second aggregated sub-feature.
  • FIG. 5F is a schematic diagram of a second network 520 provided by an embodiment of the present disclosure.
• the second network 520 includes a fifth sub-network 521 and a sixth sub-network 522. The first sub-feature f1′ is input to the fifth sub-network 521, which searches the second feature memory 552 for the K nearest first centers belonging to second objects based on the first sub-feature f1′; based on the first sub-feature f1′ and the first prediction matrix W1, the first prediction feature fq is determined; based on the first centers and the second prediction matrix W2, the second prediction feature fc is determined; and based on the first centers and the third prediction matrix W3, the third prediction feature fv is determined.
• the first attention matrix mi is determined based on the first prediction feature fq and the second prediction feature fc, and the first aggregation sub-feature fd is determined based on the first attention matrix mi and the third prediction feature fv.
• the first aggregation sub-feature fd is then input to the sixth sub-network 522 to obtain the first target sub-feature fd′.
  • the feature diffusion part 53 is further configured to: determine a first attention matrix based on the first sub-feature and each second feature, and the first attention matrix is used to characterize the first sub-feature and each second feature The degree of association between the second features; based on each second feature and each first attention matrix, determine the first aggregation sub-feature; based on the second sub-feature and each second feature, determine the second attention matrix, The second attention matrix is used to characterize the degree of association between the second sub-features and each second feature; based on each second feature and each second attention matrix, the second aggregated sub-features are determined.
  • the network parameters of the fifth sub-network include a first prediction matrix and a second prediction matrix
  • the feature diffusion part 53 is further configured to: determine the first prediction feature based on the first sub-feature and the first prediction matrix ; Based on each second feature and the second predictive matrix, determine a second predictive feature; determine the first attention matrix based on the first predictive feature and each second predictive feature.
  • the network parameters of the fifth sub-network include a third predictive matrix
  • the feature diffusion part 53 is further configured to: determine a third predictive feature based on each second feature and the third predictive matrix; The third predictive feature and each of the first attention matrices determine a first aggregated sub-feature.
  • the sixth sub-network includes a seventh sub-network and an eighth sub-network
  • the feature diffusion part 53 is further configured to: use the seventh sub-network to determine the fifth sub-network based on the first aggregation sub-feature and the occlusion mask Sub-features, and determine the sixth sub-feature based on the second aggregated sub-feature and the occlusion mask; use the eighth sub-network, based on the first sub-feature and the fifth sub-feature, determine the first target sub-feature, and based on the second sub-feature and a sixth sub-feature to determine the second target sub-feature.
  • the updating part 54 is further configured to: determine the first target loss value based on the first target sub-feature and the second target sub-feature; determine the second target loss value based on the first sub-feature and the second sub-feature Loss value; determining the target loss value based on the first target loss value and the second target loss value; based on the target loss value, updating the model parameters of the first model at least once to obtain the trained first model.
  • the updating part 54 is further configured to: update the model parameters of the first model when the target loss value does not meet the preset condition, to obtain the updated first model, based on the updated The first model is to determine the first model after training; if the target loss value satisfies the preset condition, determine the updated first model as the first model after training.
• the updating part 54 is further configured to: determine the first target sub-loss value based on the first sub-feature and the second sub-feature; determine the second target sub-loss value based on the third sub-feature and the fourth sub-feature; and determine the second target loss value based on the first target sub-loss value and the second target sub-loss value.
  • the first sub-image includes label information
  • the first model includes a second feature memory bank
  • the second feature memory bank includes at least one feature belonging to at least one object
• the updating part 54 is further configured to: determine the third loss value based on the first occlusion score, the second occlusion score and the occlusion mask; determine the fourth loss value based on the first sub-feature, the second sub-feature and the label information; determine the fifth loss value based on the first sub-feature, the second sub-feature and at least one feature of at least one object in the second feature memory bank; and determine the first target sub-loss value based on the third loss value, the fourth loss value and the fifth loss value.
  • the updating part 54 is further configured to: determine the first sub-loss value based on the first occlusion score and the occlusion mask; determine the second sub-loss value based on the second occlusion score and the occlusion mask; The first sub-loss value and the second sub-loss value determine the third loss value.
  • the updating part 54 is further configured to: determine a third sub-loss value based on the first sub-feature and label information; determine a fourth sub-loss value based on the second sub-feature and label information; The sub-loss value and the fourth sub-loss value determine the fourth loss value.
• the updating part 54 is further configured to: determine, from at least one feature of at least one object in the second feature memory library, the third feature center of the first object and the fourth feature center of at least one second object; determine the fifth sub-loss value based on the first sub-feature, the third feature center and each fourth feature center; determine the sixth sub-loss value based on the second sub-feature, the third feature center and each fourth feature center; and determine the fifth loss value based on the fifth sub-loss value and the sixth sub-loss value.
  • the updating part 54 is further configured to: determine the seventh sub-loss value based on the third sub-feature and label information; determine the eighth sub-loss value based on the fourth sub-feature and label information; The sub-loss value and the eighth sub-loss value determine the second target sub-loss value.
  • FIG. 5G is a schematic diagram of obtaining a target loss value 540 provided by an embodiment of the present disclosure.
• the target loss value 540 mainly includes the loss values of three parts: the feature extraction part, the occlusion erasing part 52 and the feature diffusion part 53, where:
  • the loss values for this part of feature extraction include:
  • the loss values for this part of the occlusion erasure part 52 include:
  • the loss values for this part of the characteristic diffusion part 53 include:
• the ninth sub-loss value Loss11 (corresponding to the above-mentioned first loss value) determined based on the first target sub-feature fd1′ and the label information of the first sub-image 501, and the tenth sub-loss value Loss12 (corresponding to the above-mentioned first loss value) determined based on the second target sub-feature fd2′ and the label information of the first sub-image 501;
• the eleventh sub-loss value Loss21 (corresponding to the above-mentioned second loss value) determined based on the first target sub-feature fd1′ and the first feature memory bank 551, and the twelfth sub-loss value Loss22 (corresponding to the above-mentioned second loss value) determined based on the second target sub-feature fd2′ and the first feature memory bank 551.
  • the model training system further includes: a second determination part and a third determination part; the second determination part is configured to determine an initial second model based on the trained first model; the third determination part, It is configured to update the model parameters of the second model based on at least one second image sample to obtain the trained second model.
  • the method provided by the embodiment of the present disclosure has at least the following improvements:
• in the related art, pedestrian re-identification (ReID) modeling is mainly based on pose estimation algorithms or human body parsing algorithms for auxiliary training.
  • the modeling of pedestrian re-identification in the embodiment of the present disclosure uses deep learning to perform occluded pedestrian re-identification.
• a Feature Erasing and Diffusion Network (FED) is proposed to simultaneously process NPO and NTP; specifically, based on an Occlusion Erasing Module (OEM), NPO features are eliminated, supplemented by an NPO augmentation strategy that simulates NPO on the holistic pedestrian image and generates accurate occlusion masks.
• the pedestrian features and other memory features are then diffused to synthesize NTP features in the feature space, which realizes the simulation of NPO occlusion interference at the image level and NTP interference at the feature level.
• TP: target pedestrians.
  • the method provided by the embodiments of the present disclosure has at least the following beneficial effects: 1) Make full use of the occlusion information of the picture and the characteristics of other pedestrians to simulate the interference of non-pedestrian occlusion and non-target pedestrians, and can better comprehensively analyze various influencing factors, Improve the model's perception of TP; 2) Use deep learning to make the results of pedestrian re-identification more accurate, and improve the accuracy of pedestrian re-identification in real and complex scenes.
• The following abbreviations are used: Occluded-DukeMTMC (O-Duke); Occluded-REID (O-REID); Partial-REID (P-REID); Cumulative Matching Characteristic (CMC); mean Average Precision (mAP).
  • Table 1 shows the performance comparison of each pedestrian ReID method on the three data sets of O-Duke, O-REID and P-REID. Since there is no corresponding training set for O-REID and P-REID, the model trained on Market-1501 is used for testing.
• the compared pedestrian ReID methods include: Part-based Convolutional Baseline (PCB), Deep Spatial feature Reconstruction (DSR), High-Order re-identification (HOReID), Part-Aware Transformer (PAT), and Transformer-based Object Re-Identification (TransReID); TransReID adopts a Vision Transformer without sliding-window settings.
  • FED achieves the highest Rank-1 and mAP on both O-Duke and O-REID datasets. Especially on the O-REID dataset, it reached 86.3%/79.3% on Rank-1/mAP, surpassing other methods by at least 4.7%/2.6%. On O-Duke, it reaches 68.1%/56.4% on Rank-1/mAP, surpassing other methods by at least 3.6%/0.7%. On the P-REID dataset, the highest mAP accuracy is achieved, reaching 80.5%, which exceeds other methods by 3.9%. Therefore, a good performance is achieved on the occluded ReID dataset.
• Ablation results for NPO Augmentation (NPO Aug), OEM and FDM are shown.
  • Numbers 1 to 5 represent baseline, baseline+NPO Aug, baseline+NPO Aug+OEM, baseline+NPO Aug+FDM and FED, respectively.
  • Model 1 uses ViT as the feature extractor and is optimized by cross-entropy loss (ID Loss) and Triplet Loss.
• ID Loss: cross-entropy loss; Triplet Loss: triplet loss.
  • FDM improves Rank-1 and mAP by 1.7% and 2.4%, respectively. This means that optimizing a network with diffusion features can greatly improve the model's perception of TP. In the end, FED achieved the highest accuracy, showing that each component works both individually and together.
  • the number of searches K in the feature memory search operation is analyzed.
• K is set to 2, 4, 6 and 8, and experiments are performed on DukeMTMC-reID, Market-1501 and Occluded-DukeMTMC.
• the performance on the two holistic person ReID datasets, DukeMTMC-reID and Market-1501, is stable across the various K values, varying within 0.5%.
• for Market-1501, NPO and NTP are few, failing to highlight the effectiveness of FDM.
• for DukeMTMC-reID, a large amount of training data comes with NPO and NTP, and the loss constraints can make the network achieve high accuracy.
• for Occluded-DukeMTMC, since all the training data are holistic pedestrians, the introduction of FDM can greatly simulate the multi-pedestrian situation in the test set; as K increases, FDM can better preserve the characteristics of TP and introduce realistic noise.
  • FIG. 5H is a schematic diagram of occlusion scores of pedestrian images provided by an embodiment of the present disclosure.
• occlusion scores of some pedestrian images produced by the OEM are shown, including images with NPO and images with non-target pedestrians (NTP). From FIG. 5H, it can be seen that for images 551 and 552 with vertical object occlusion, the occlusion score is hardly affected, because symmetric pedestrians with less than half occlusion are not a critical issue for pedestrian ReID.
• the OEM can accurately identify NPO and flag it with a small occlusion score.
• for images 555 and 556 with multi-pedestrian occlusion, the OEM identifies each stripe as valuable; therefore, the subsequent FDM is crucial to improving the model performance.
  • FIG. 5I is a schematic diagram of an image retrieval result provided by an embodiment of the present disclosure. As shown in FIG. 5I , it shows the retrieval results of TransReID and FED.
  • Figure 561 and Figure 562 are object occlusion images. It is obvious that FED has a better recognition ability for NPO and can accurately retrieve the target pedestrian.
  • Figure 563 and Figure 564 are multi-pedestrian images, and FED has a stronger perception of TP and achieves higher retrieval accuracy.
  • FIG. 6 is a schematic diagram of the composition and structure of a model training device provided by an embodiment of the present disclosure.
  • the model training device 60 includes a first acquisition part 61 , feature extraction part 62 , first update part 63 , first determination part 64 and second update part 65 .
  • a first acquiring part 61 configured to acquire a first image sample containing a first object
  • the feature extraction part 62 is configured to use the first network of the first model to be trained to perform feature extraction on the first image sample to obtain the first feature of the first object;
  • the first updating part 63 is configured to use the second network of the first model to update the first features respectively based on the second features of at least one second object to obtain the first target features corresponding to the first features, each the similarity between the second object and the first object is not less than a first threshold;
  • the first determining part 64 is configured to determine a target loss value based on the first target feature
  • the second updating part 65 is configured to update the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
  • the first image sample includes label information
  • the first model includes a first feature memory
  • the first feature memory includes at least one feature belonging to at least one object
• the first determination part 64 is further configured to: determine the first loss value based on the first target feature and the label information; determine the second loss value based on the first target feature and at least one feature of at least one object in the first feature memory; and determine the target loss value based on the first loss value and the second loss value.
  • the first determining part 64 is further configured to: determine the first feature center of the first object and the at least one second object from at least one feature of at least one object in the first feature memory library second feature center; determining a second loss value based on the first target feature, the first feature center, and each second feature center.
• the first feature memory library includes feature sets belonging to at least one object, each feature set includes at least one feature of the object to which it belongs, and the device further includes: a third updating part configured to update, based on the first target feature, the feature set belonging to the first object in the first feature memory.
  • the first acquisition part 61 is further configured to: acquire the first sub-image and the second sub-image containing the first object, the second sub-image is an image obtained by at least performing occlusion processing on the first sub-image
  • the feature extraction part 62 is also configured to: use the first network of the first model to be trained to perform feature extraction on the first sub-image, obtain the first sub-feature of the first object, and perform feature extraction on the second sub-image Extract to obtain the second sub-feature of the first object;
• the first update part 63 is also configured to: use the second network of the first model to update the first sub-feature and the second sub-feature respectively, based on the second feature of at least one second object, to obtain the first target sub-feature corresponding to the first sub-feature and the second target sub-feature corresponding to the second sub-feature;
• the first determining part 64 is also configured to: determine the target loss value based on the first target sub-feature and the second target sub-feature.
• the first determination part 64 is further configured to: determine the first target loss value based on the first target sub-feature and the second target sub-feature; determine the second target loss value based on the first sub-feature and the second sub-feature; and determine the target loss value based on the first target loss value and the second target loss value.
  • the first acquisition part 61 is further configured to: acquire the first sub-image containing the first object; based on the preset occlusion set, at least perform occlusion processing on the first sub-image to obtain the second sub-image , the occlusion set includes at least one occlusion image.
  • the first network includes a first sub-network and a second sub-network
• the feature extraction part 62 is further configured to: use the first sub-network of the first model to be trained to perform feature extraction on the first sub-image and the second sub-image respectively, to obtain the third sub-feature corresponding to the first sub-image and the fourth sub-feature corresponding to the second sub-image; and use the second sub-network of the first model to determine the first sub-feature based on the third sub-feature, and determine the second sub-feature based on the fourth sub-feature.
• the first determining part 64 is further configured to: determine the first target sub-loss value based on the first sub-feature and the second sub-feature; determine the second target sub-loss value based on the third sub-feature and the fourth sub-feature; and determine the second target loss value based on the first target sub-loss value and the second target sub-loss value.
  • the first sub-image includes label information
  • the first determining part 64 is further configured to: determine a seventh sub-loss value based on the third sub-feature and label information; based on the fourth sub-feature and label information, Determine an eighth sub-loss value; determine a second target sub-loss value based on the seventh sub-loss value and the eighth sub-loss value.
  • the second sub-network includes a third sub-network and a fourth sub-network
• the feature extraction part 62 is further configured to: use the third sub-network of the first model to determine the first occlusion score based on the third sub-feature, and determine the second occlusion score based on the fourth sub-feature; and use the fourth sub-network to determine the first sub-feature based on the third sub-feature and the first occlusion score, and determine the second sub-feature based on the fourth sub-feature and the second occlusion score.
  • the third subnetwork includes a pooling subnetwork and at least one occlusion erasure subnetwork
  • the first occlusion score includes at least one first occlusion subscore
  • the second occlusion score includes at least one second occlusion subscore
  • the feature extraction part 62 is also configured to: divide the third sub-feature into at least one third sub-part feature by using the pooling sub-network, and divide the fourth sub-feature into at least one fourth sub-part feature; use each The occlusion erasure sub-network determines each first occlusion subscore based on each third subsection feature, and determines each second occlusion subscore based on each fourth subsection feature.
  • the feature extraction part 62 is further configured to: use the fourth sub-network to determine the first sub-part feature based on each third sub-part feature and each first occlusion sub-score of the third sub-feature , and based on each fourth sub-part feature and each second occlusion sub-score of the fourth sub-feature, determine the second sub-part feature; based on each first sub-part feature, determine the first sub-feature, and based on each The second sub-part feature, to determine the second sub-feature.
  • the first sub-image includes label information
  • the first model includes a second feature memory bank
  • the second feature memory bank includes at least one feature belonging to at least one object
• the first determining part 64 is further configured to: determine the occlusion mask based on the first sub-image and the second sub-image; determine the third loss value based on the first occlusion score, the second occlusion score and the occlusion mask; determine the fourth loss value based on the first sub-feature, the second sub-feature and the label information; determine the fifth loss value based on the first sub-feature, the second sub-feature and at least one feature of at least one object in the second feature memory bank; and determine the first target sub-loss value based on the third loss value, the fourth loss value and the fifth loss value.
  • the first determining part 64 is further configured to: divide the first sub-image and the second sub-image into at least one first sub-part image and at least one second sub-part image; An occlusion sub-mask is determined for a sub-partial image and each second sub-partial image; based on each occlusion sub-mask, an occlusion mask is determined.
  • the first determining part 64 is further configured to: determine the first sub-loss value based on the first occlusion score and the occlusion mask; determine the second sub-loss value based on the second occlusion score and the occlusion mask ; Determine a third loss value based on the first sub-loss value and the second sub-loss value.
  • the first determining part 64 is further configured to: determine a third sub-loss value based on the first sub-feature and label information; determine a fourth sub-loss value based on the second sub-feature and label information; The third sub-loss value and the fourth sub-loss value determine the fourth loss value.
• the first determination part 64 is further configured to: determine, from at least one feature of at least one object in the second feature memory library, the third feature center of the first object and the fourth feature center of at least one second object; determine the fifth sub-loss value based on the first sub-feature, the third feature center and each fourth feature center; determine the sixth sub-loss value based on the second sub-feature, the third feature center and each fourth feature center; and determine the fifth loss value based on the fifth sub-loss value and the sixth sub-loss value.
  • the second network includes a fifth sub-network and a sixth sub-network
• the first updating part 63 is further configured to: use the fifth sub-network to aggregate the first sub-feature and the second sub-feature respectively with the second feature of at least one second object, to obtain the first aggregated sub-feature corresponding to the first sub-feature and the second aggregated sub-feature corresponding to the second sub-feature; and use the sixth sub-network to determine the first target sub-feature based on the first aggregated sub-feature, and determine the second target sub-feature based on the second aggregated sub-feature.
  • the first updating part 63 is further configured to: determine a first attention matrix based on the first sub-feature and each second feature, and the first attention matrix is used to characterize the first sub-feature and each second feature A degree of association between the second features; based on each second feature and each first attention matrix, determine the first aggregation sub-feature; based on the second sub-feature and each second feature, determine the second attention matrix , the second attention matrix is used to characterize the degree of association between the second sub-feature and each second feature; based on each second feature and each second attention matrix, the second aggregation sub-feature is determined.
  • the sixth sub-network includes a seventh sub-network and an eighth sub-network
• the first updating part 63 is further configured to: use the seventh sub-network to determine the fifth sub-feature based on the first aggregation sub-feature and the occlusion mask, and determine the sixth sub-feature based on the second aggregation sub-feature and the occlusion mask; and use the eighth sub-network to determine the first target sub-feature based on the first sub-feature and the fifth sub-feature, and determine the second target sub-feature based on the second sub-feature and the sixth sub-feature.
  • FIG. 7 is a schematic diagram of the composition and structure of an image recognition device provided by an embodiment of the present disclosure.
  • the image recognition device 70 includes a second acquisition part 71 and identification part 72.
  • a second acquiring part 71 configured to acquire the first image and the second image
• the identification part 72 is configured to use the trained target model to identify the object in the first image and the object in the second image to obtain a recognition result, wherein the trained target model includes the first model obtained by using the above-mentioned model training method;
• the recognition result indicates that the object in the first image and the object in the second image are the same object or different objects.
  • a "part" may be a part of a circuit, a part of a processor, a part of a program or software, etc., of course it may also be a unit, a module or a non-modular one.
  • An embodiment of the present disclosure provides an electronic device, including a memory and a processor, the memory stores a computer program that can run on the processor, and the processor implements the above method when executing the computer program.
  • An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the foregoing method is implemented.
  • Computer readable storage media may be transitory or non-transitory.
  • An embodiment of the present disclosure provides a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program. When the computer program is read and executed by a computer, part or all of the steps in the above method are implemented.
  • the computer program product can be specifically realized by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in one embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and the like.
  • FIG. 8 is a schematic diagram of a hardware entity of an electronic device in an embodiment of the present disclosure.
  • the hardware entity of the electronic device 800 includes: a processor 801, a communication interface 802, and a memory 803, wherein:
  • the processor 801 generally controls the overall operation of the electronic device 800 .
  • the communication interface 802 can enable the electronic device to communicate with other terminals or servers through the network.
• the memory 803 is configured to store instructions and applications executable by the processor 801, and can also cache data to be processed or already processed by the processor 801 and by various modules in the electronic device 800 (for example, image data, audio data, voice communication data and video communication data); it can be realized by a flash memory (FLASH) or a random access memory (RAM). Data transmission may be performed between the processor 801, the communication interface 802 and the memory 803 through the bus 804.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are schematic.
  • the division of the units is a logical function division.
  • the coupling, or direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical or other forms of.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed to multiple network units; Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
• all the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may be used as a single unit separately, or two or more units may be integrated into one unit; the above-mentioned integrated unit can be realized in the form of hardware or in the form of hardware plus a software functional unit.
  • the essence of the technical solution of the present disclosure or the part that contributes to related technologies can be embodied in the form of software products, which are stored in a storage medium and include several instructions to make a An electronic device (which may be a personal computer, a server, or a network device, etc.) executes all or part of the methods described in various embodiments of the present disclosure.
  • the aforementioned storage medium includes various media capable of storing program codes such as removable storage devices, ROMs, magnetic disks or optical disks.
  • Embodiments of the present disclosure provide a model training and image recognition method, device, storage medium, and computer program product.
• the model training method includes: acquiring a first image sample containing a first object; performing feature extraction on the first image sample by using the first network of the first model to be trained, to obtain the first feature of the first object; updating the first feature based on the second feature of at least one second object by using the second network of the first model, to obtain the first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than the first threshold; determining the target loss value based on the first target feature; and updating the model parameters of the first model at least once based on the target loss value, to obtain the trained first model.
• on the one hand, the above scheme can enhance the robustness of the first model and improve the performance of the first model; on the other hand, it can improve the consistency of the trained first model's predictions for different image samples of the same object, thereby enabling the trained first model to more accurately re-identify objects in images containing multiple objects.

Abstract

Embodiments of the present disclosure provide model training and image recognition methods and apparatuses, a device, a storage medium and a computer program product. The model training method comprises: acquiring a first image sample containing a first object; performing feature extraction on the first image sample using a first network of a first model to be trained, to obtain a first feature of the first object; updating the first feature on the basis of a second feature of at least one second object, using a second network of the first model, to obtain a first target feature corresponding to the first feature, a degree of similarity between each second object and the first object being not less than a first threshold; determining a target loss value on the basis of the first target feature; and updating a model parameter of the first model at least once on the basis of the target loss value, to obtain a trained first model.

Description

Model training and image recognition methods and apparatuses, device, storage medium and computer program product
Cross-Reference to Related Applications
The embodiments of the present disclosure are based on, and claim priority to, the Chinese patent application No. 202210107742.9, filed on January 28, 2022 and entitled "Model training and image recognition method and apparatus, device and storage medium", the entire content of which is hereby incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to, but is not limited to, the field of computer technology, and in particular to model training and image recognition methods and apparatuses, a device, a storage medium and a computer program product.
Background
Object re-identification, also known as object re-ID, is a technology that uses computer vision to determine whether a specific object is present in an image or a video sequence. Object re-identification is widely regarded as a sub-problem of image retrieval: given an image containing an object, retrieve images containing that object across devices. Differences between devices, shooting angles, environments and other factors all affect the result of object re-identification.
Summary
Embodiments of the present disclosure provide model training and image recognition methods and apparatuses, a device, a storage medium and a computer program product.
The technical solutions of the embodiments of the present disclosure are implemented as follows:
An embodiment of the present disclosure provides a model training method, which includes:
acquiring a first image sample containing a first object;
performing feature extraction on the first image sample by using a first network of a first model to be trained, to obtain a first feature of the first object;
updating the first feature based on a second feature of at least one second object by using a second network of the first model, to obtain a first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than a first threshold;
determining a target loss value based on the first target feature; and
updating model parameters of the first model at least once based on the target loss value, to obtain a trained first model.
An embodiment of the present disclosure provides an image recognition method, which includes:
acquiring a first image and a second image; and
recognizing an object in the first image and an object in the second image by using a trained target model, to obtain a recognition result, where the trained target model includes the first model obtained by the above model training method, and the recognition result indicates that the object in the first image and the object in the second image are the same object or different objects.
An embodiment of the present disclosure provides a model training apparatus, which includes:
a first acquisition part, configured to acquire a first image sample containing a first object;
a feature extraction part, configured to perform feature extraction on the first image sample by using a first network of a first model to be trained, to obtain a first feature of the first object;
a first update part, configured to update the first feature based on a second feature of at least one second object by using a second network of the first model, to obtain a first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than a first threshold;
a first determination part, configured to determine a target loss value based on the first target feature; and
a second update part, configured to update model parameters of the first model at least once based on the target loss value, to obtain a trained first model.
An embodiment of the present disclosure provides an image recognition apparatus, which includes:
a second acquisition part, configured to acquire a first image and a second image; and
a recognition part, configured to recognize an object in the first image and an object in the second image by using a trained target model, to obtain a recognition result, where the trained target model includes the first model obtained by the above model training method, and the recognition result indicates that the object in the first image and the object in the second image are the same object or different objects.
An embodiment of the present disclosure provides an electronic device, including a processor and a memory, where the memory stores a computer program executable on the processor, and the processor implements the above method when executing the computer program.
An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the above method.
An embodiment of the present disclosure provides a computer program product, which includes a computer program or instructions, where the computer program or instructions, when run on an electronic device, cause the electronic device to execute the above method.
In the embodiments of the present disclosure, a first image sample containing a first object is acquired; feature extraction is performed on the first image sample by using a first network of a first model to be trained, to obtain a first feature of the first object; the first feature is updated based on a second feature of at least one second object by using a second network of the first model, to obtain a first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than a first threshold; a target loss value is determined based on the first target feature; and model parameters of the first model are updated at least once based on the target loss value, to obtain a trained first model. In this way, features of second objects are introduced as noise at the feature level of the first image sample containing the first object, and the overall network structure of the first model is trained, which can enhance the robustness of the first model and improve its performance. Moreover, the model parameters of the first model are updated at least once when the target loss value does not satisfy a preset condition; since the target loss value is determined based on the first target feature, the consistency of the trained first model's predictions for different image samples of the same object can be improved, so that the trained first model can more accurately re-identify objects in images containing multiple objects.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure.
Fig. 1 is a schematic flowchart of a model training method provided by an embodiment of the present disclosure;
Fig. 2 is a schematic flowchart of a model training method provided by an embodiment of the present disclosure;
Fig. 3 is a schematic flowchart of a model training method provided by an embodiment of the present disclosure;
Fig. 4 is a schematic flowchart of an image recognition method provided by an embodiment of the present disclosure;
Fig. 5A is a schematic diagram of the composition and structure of a model training system provided by an embodiment of the present disclosure;
Fig. 5B is a schematic diagram of a model training system provided by an embodiment of the present disclosure;
Fig. 5C is a schematic diagram of determining an occlusion mask provided by an embodiment of the present disclosure;
Fig. 5D is a schematic diagram of a first network provided by an embodiment of the present disclosure;
Fig. 5E is a schematic diagram of a second sub-network provided by an embodiment of the present disclosure;
Fig. 5F is a schematic diagram of a second network provided by an embodiment of the present disclosure;
Fig. 5G is a schematic diagram of obtaining a target loss value provided by an embodiment of the present disclosure;
Fig. 5H is a schematic diagram of occlusion scores of a pedestrian image provided by an embodiment of the present disclosure;
Fig. 5I is a schematic diagram of an image retrieval result provided by an embodiment of the present disclosure;
Fig. 6 is a schematic diagram of the composition and structure of a model training apparatus provided by an embodiment of the present disclosure;
Fig. 7 is a schematic diagram of the composition and structure of an image recognition apparatus provided by an embodiment of the present disclosure;
Fig. 8 is a schematic diagram of a hardware entity of an electronic device in an embodiment of the present disclosure.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present disclosure clearer, the present disclosure is described in further detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limiting the present disclosure, and all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present disclosure. In the following description, reference to "some embodiments" describes a subset of all possible embodiments; it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict. In the following description, the terms "first/second/third" are merely used to distinguish similar objects and do not represent a specific ordering of the objects; it is understood that, where permitted, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments of the present disclosure described herein can be implemented in an order other than that illustrated or described herein. Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field to which the present disclosure belongs. The terms used herein are only for the purpose of describing the embodiments of the present disclosure and are not intended to limit the present disclosure.
In the related art, most algorithms use deep neural networks to extract image features and then implement retrieval through a distance metric. However, because pedestrian re-identification scenes are complex, the target pedestrian is often occluded by non-pedestrian objects or interfered with by non-target pedestrians. These algorithms do not take into account the impact of occlusion on retrieval accuracy, and the extracted pedestrian feature representations contain a large amount of noise, which reduces retrieval accuracy. Although some existing algorithms introduce human-body parsing or pose estimation algorithms to assist the pedestrian re-identification model in extracting pedestrian features, human-body parsing and pose estimation algorithms are not sufficiently robust; they struggle to provide accurate auxiliary information and may even mislead the model into extracting wrong features, reducing retrieval accuracy.
Embodiments of the present disclosure provide a model training method in which features of a second object are introduced as noise at the feature level of a first image sample containing a first object, and the overall network structure of a first model is trained. This can enhance the robustness of the first model and improve its performance. Meanwhile, when the target loss value does not satisfy a preset condition, the model parameters of the first model are updated at least once; since the target loss value is determined based on the first target feature, the consistency of the trained first model's predictions for different image samples of the same object can be improved, so that the trained first model can more accurately re-identify objects in images containing multiple objects. Both the model training method and the image recognition method provided by the embodiments of the present disclosure may be executed by an electronic device, which may be implemented as various types of terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, or a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), or as a server. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms. The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure.
Fig. 1 is a schematic flowchart of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 1, the method includes steps S11 to S15:
Step S11: acquire a first image sample containing a first object.
Here, the first image sample may be any suitable image that contains at least the first object. The content of the first image sample may be determined according to the actual application scenario; for example, it may include only the first object, or the first object together with at least one of things and other objects. The first object may include, but is not limited to, a person, an animal, a plant, an article, and the like. For example, the first image sample is a face image containing Zhang San. For another example, the first image sample is an image containing the whole person of Li Si. In some implementations, the first image sample may include at least one image. For example, the first image sample is any image in a training set. For another example, the first image sample includes a first sub-image and a second sub-image, where the first sub-image is an image in the training set and the second sub-image is an image obtained by applying augmentation to the first sub-image. The augmentation may include, but is not limited to, at least one of occlusion, scaling, cropping, resizing, padding, flipping, color jitter, grayscale conversion, Gaussian blur, random erasing, and the like; a minimal augmentation sketch is given below. In implementation, those skilled in the art may apply suitable augmentation to the first sub-image to obtain the second sub-image according to the actual situation, which is not limited by the embodiments of the present disclosure. For yet another example, the first image sample includes a first sub-image and multiple second sub-images, where the first sub-image is an image in the training set and each second sub-image is an image obtained by separately augmenting the first sub-image.
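As an illustration only, the following sketch builds such an augmentation pipeline with torchvision; the specific transforms and their parameters are assumptions made for the example and are not prescribed by this disclosure.

```python
# Hypothetical augmentation pipeline for producing a second sub-image from a
# first sub-image; the transform choices and parameters are illustrative only.
import torchvision.transforms as T

augment = T.Compose([
    T.Resize((256, 128)),                      # resizing
    T.Pad(10),                                 # padding
    T.RandomCrop((256, 128)),                  # cropping
    T.RandomHorizontalFlip(p=0.5),             # flipping
    T.ColorJitter(0.2, 0.2, 0.2, 0.1),         # color jitter
    T.RandomGrayscale(p=0.1),                  # grayscale conversion
    T.GaussianBlur(kernel_size=3),             # Gaussian blur
    T.ToTensor(),
    T.RandomErasing(p=0.5),                    # random erasing
])

# second_sub_image = augment(first_sub_image)  # first_sub_image: a PIL image
```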
Step S12: perform feature extraction on the first image sample by using the first network of the first model to be trained, to obtain a first feature of the first object.
Here, the first model may be any suitable model that performs object recognition based on image features. The first model may include at least the first network. The first feature may include, but is not limited to, an original feature of the first image sample, or a feature obtained by processing the original feature. The original feature may include, but is not limited to, a face feature, a body feature and the like of the first object contained in the image. In some implementations, the first network may include at least a first sub-network, and the first sub-network uses a feature extractor to extract features of the first image. The feature extractor may include, but is not limited to, a recurrent neural network (RNN), a convolutional neural network (CNN), a Transformer-based feature extraction network, and the like. In implementation, those skilled in the art may adopt a suitable first network in the first model to obtain the first feature according to the actual situation, which is not limited by the embodiments of the present disclosure. For example, a third feature of the first image sample is extracted through the first sub-network, and the third feature is determined as the first feature of the first object. Here, the third feature may include, but is not limited to, the original feature of the first image sample, and the like.
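Purely as an illustration of such a first sub-network, the sketch below uses a ResNet-50 backbone from torchvision as the feature extractor; the choice of backbone and the feature dimension are assumptions of the example, not requirements of this disclosure.

```python
# Minimal sketch of a first sub-network: a CNN backbone used as a feature extractor.
import torch.nn as nn
import torchvision.models as models

class FirstSubNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)                       # CNN backbone (assumption)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head

    def forward(self, images):                 # images: (B, 3, H, W)
        feats = self.encoder(images)           # (B, 2048, 1, 1)
        return feats.flatten(1)                # third feature, shape (B, 2048)

# first_feature = FirstSubNetwork()(first_image_sample_batch)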
In some implementations, the first network may further include a second sub-network, and the second sub-network is configured to determine the first feature of the first object based on the third feature of the first image sample. In some implementations, the second sub-network may include an occlusion erasing network, and the occlusion erasing network is configured to perform occlusion erasing on the input third feature to obtain the first feature of the first object.
Step S13: update the first feature based on a second feature of at least one second object by using the second network of the first model, to obtain a first target feature corresponding to the first feature.
Here, the similarity between each second object and the first object is not less than a first threshold. The first threshold may be preset or obtained by statistics. In implementation, those skilled in the art may determine how to set the first threshold according to actual needs, which is not limited by the embodiments of the present disclosure. For example, the similarity between the facial appearance features of the second object and those of the first object is not less than the first threshold. For another example, the similarity between the clothing features of the second object and those of the first object is not less than the first threshold. For yet another example, both the similarity between the facial appearance features and the similarity between the clothing features of the second object and the first object are not less than the first threshold.
The second feature may be obtained based on the training set, or may be input in advance. The second object may include, but is not limited to, a person, an animal, a plant, an article, and the like.
In some implementations, the similarity between each second object and the first object may be obtained based on the similarity between the second feature of each second object and the first feature of the first object. In some implementations, the similarity between each second object and the first object may be obtained based on the similarity between the feature center of each second object and the first feature of the first object. The first model may include a second memory feature library, and the second memory feature library may include at least one feature of at least one object. The feature center of a second object may be obtained based on at least one feature belonging to the second object in the second memory feature library. In some implementations, features of multiple image samples of at least one object in the training set may be extracted, and the extracted features may be stored in the second memory feature library by identity.
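As a hedged illustration of this selection step, the sketch below picks second objects whose feature centers have a cosine similarity to the first feature that is not less than the first threshold; the similarity measure and the threshold value are assumptions of the example.

```python
# Hypothetical selection of second objects: identities whose feature centers are
# similar enough to the first feature (cosine similarity >= first_threshold).
import torch.nn.functional as F

def select_second_features(first_feature, feature_centers, first_threshold=0.5):
    # first_feature: (D,), feature_centers: (num_identities, D)
    sims = F.cosine_similarity(first_feature.unsqueeze(0), feature_centers, dim=1)
    keep = sims >= first_threshold
    return feature_centers[keep], sims[keep]   # second features and their similarities
```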
In some implementations, the second network may include a fifth sub-network and a sixth sub-network. The fifth sub-network is configured to aggregate the second feature with the first feature to obtain a first aggregated sub-feature, and the sixth sub-network is configured to update the first aggregated sub-feature to obtain the first target feature.
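The disclosure does not fix a concrete form for these two sub-networks; the following is only one plausible sketch, assuming a similarity-weighted aggregation as the fifth sub-network and a small residual MLP as the sixth sub-network.

```python
# Illustrative fifth/sixth sub-networks: aggregate neighbour identity features into
# the first feature, then refine the aggregated feature. The architecture is assumed.
import torch.nn as nn
import torch.nn.functional as F

class SecondNetwork(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, first_feature, second_features):
        # fifth sub-network: similarity-weighted aggregation of the second features
        weights = F.softmax(second_features @ first_feature, dim=0)          # (K,)
        aggregated = first_feature + (weights.unsqueeze(1) * second_features).sum(0)
        # sixth sub-network: update the aggregated sub-feature
        return aggregated + self.update(aggregated)                          # first target feature
```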
Step S14: determine a target loss value based on the first target feature.
Here, the target loss value may include, but is not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a contrastive loss value, and the like.
Step S15: update the model parameters of the first model at least once based on the target loss value, to obtain the trained first model.
Here, whether the model parameters of the first model need to be updated may be determined based on the target loss value. For example, the target loss value is compared with a threshold: when the target loss value is greater than the threshold, the model parameters of the first model are updated; when the target loss value is not greater than the threshold, the first model is determined as the trained first model. For another example, the target loss value is compared with the previous target loss value: when the target loss value is greater than the previous target loss value, the model parameters of the first model are updated; when the target loss value is substantially equal to the previous target loss value, the first model is determined as the trained first model.
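A minimal sketch of such an update loop is given below, assuming a gradient-descent optimizer and a loss threshold as the stopping condition; both are example choices rather than requirements of the disclosure.

```python
# Illustrative training loop: keep updating model parameters until the target
# loss value satisfies the (assumed) preset condition, here "loss below a threshold".
import torch

def train(first_model, data_loader, compute_target_loss, threshold=0.1, max_steps=10000):
    optimizer = torch.optim.SGD(first_model.parameters(), lr=0.01, momentum=0.9)
    for step, batch in zip(range(max_steps), data_loader):
        target_loss = compute_target_loss(first_model, batch)  # based on the first target feature
        if target_loss.item() <= threshold:                    # preset condition satisfied
            break
        optimizer.zero_grad()
        target_loss.backward()                                 # update the model parameters
        optimizer.step()
    return first_model                                         # trained first model
```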
In the embodiments of the present disclosure, a first image sample containing a first object is acquired; feature extraction is performed on the first image sample by using the first network of the first model to be trained, to obtain a first feature of the first object; the first feature is updated based on a second feature of at least one second object by using the second network of the first model, to obtain a first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than the first threshold; a target loss value is determined based on the first target feature; and the model parameters of the first model are updated at least once based on the target loss value, to obtain the trained first model. In this way, features of second objects are introduced as noise at the feature level of the first image sample containing the first object, and the overall network structure of the first model is trained, which can enhance the robustness of the first model and improve its performance. Moreover, the model parameters of the first model are updated at least once when the target loss value does not satisfy the preset condition; since the target loss value is determined based on the first target feature, the consistency of the trained first model's predictions for different image samples of the same object can be improved, so that the trained first model can more accurately re-identify objects in images containing multiple objects.
In some implementations, the first image sample includes label information, and the first model includes a first feature memory library, where the first feature memory library includes at least one feature belonging to at least one object; the above step S14 includes steps S141 to S143:
Step S141: determine a first loss value based on the first target feature and the label information.
Here, the label information may include, but is not limited to, a label value, an identifier, and the like. The first loss value may include, but is not limited to, a cross-entropy loss value and the like. In some implementations, the first loss value may be calculated by the following formula (1-1):
L_1 = −log( exp(W_{y_i}·f_i) / Σ_{j=1..ID_S} exp(W_j·f_i) )      (1-1);
where W is a linear matrix, W_i and W_j are elements of W, y_i denotes the label information of the i-th object, f_i denotes the first target feature of the i-th object, and ID_S denotes the total number of objects in the training set.
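For illustration, the following sketch computes such a cross-entropy classification loss over identity logits; the linear classifier and its dimensions are assumptions of the example.

```python
# Illustrative first loss: softmax cross-entropy over identity logits W·f_i.
import torch.nn as nn
import torch.nn.functional as F

num_identities = 751            # ID_S, assumed for the example
feature_dim = 2048
classifier = nn.Linear(feature_dim, num_identities, bias=False)  # rows act as W_j

def first_loss(first_target_features, labels):
    # first_target_features: (B, feature_dim); labels: (B,) identity indices y_i
    logits = classifier(first_target_features)
    return F.cross_entropy(logits, labels)
```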
Step S142: determine a second loss value based on the first target feature and at least one feature of at least one object in the first feature memory library.
Here, the first feature memory library stores at least one feature of the first object and at least one feature of at least one second object. The second loss value may include, but is not limited to, a contrastive loss and the like.
Step S143: determine the target loss value based on the first loss value and the second loss value.
Here, the target loss value may include, but is not limited to, the sum of the first loss value and the second loss value, the weighted sum of the first loss value and the second loss value, and the like. In implementation, those skilled in the art may determine the target loss value according to actual needs, which is not limited by the embodiments of the present disclosure. In some implementations, the target loss value may be calculated by the following formula (1-2):
L = L_1 + L_2      (1-2);
where L_1 denotes the first loss value and L_2 denotes the second loss value.
In some implementations, step S142 includes steps S1421 to S1422:
Step S1421: determine, from at least one feature of at least one object in the first feature memory library, a first feature center of the first object and a second feature center of each of the at least one second object.
In some implementations, the first feature center may be determined based on the features of the first object in the first feature memory library and the first target feature. Each second feature center may be determined based on each feature of each second object in the second feature memory library. In some implementations, the feature center of each object may be calculated by the following formula (1-3):
c_k ← m·c_k + (1 − m)·(1/|B_k|)·Σ_{f_i′ ∈ B_k} f_i′      (1-3);
where c_k denotes the feature center of the k-th object, B_k denotes the feature set belonging to the k-th object in the mini-batch, m is the preset momentum coefficient for the update, and f_i′ is the first feature of the i-th sample. In some implementations, m may be 0.2.
In some implementations, when f_i′ and B_k belong to the same object, the feature center c_k of that object changes; when f_i′ and B_k do not belong to the same object, the feature center c_k of that object remains the same as the previous c_k.
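The following sketch shows one way to maintain such feature centers with a momentum update over a mini-batch; the momentum value and tensor shapes are assumptions taken from the example above.

```python
# Illustrative momentum update of per-identity feature centers (memory bank).
import torch

def update_centers(centers, batch_features, batch_labels, m=0.2):
    # centers: (num_identities, D); batch_features: (B, D); batch_labels: (B,)
    for k in batch_labels.unique():
        batch_k = batch_features[batch_labels == k]            # feature set B_k of identity k
        centers[k] = m * centers[k] + (1.0 - m) * batch_k.mean(dim=0)
    return centers
```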
Step S1422: determine the second loss value based on the first target feature, the first feature center and each second feature center.
In some implementations, the second loss value may be calculated by the following formula (1-4):
L_2 = −log( exp(f_i·c_i/τ) / Σ_{j=1..ID_S} exp(f_i·c_j/τ) )      (1-4);
where τ is a predefined temperature parameter, c_i denotes the first feature center of the i-th object, c_j denotes each second feature center, f_i denotes the first target feature of the i-th object, and ID_S denotes the total number of objects in the training set.
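As an illustration, the sketch below computes such a temperature-scaled contrastive loss against the feature centers; the temperature value is an assumption of the example.

```python
# Illustrative second loss: contrastive loss of the first target feature against
# all identity feature centers, with the ground-truth center as the positive.
import torch.nn.functional as F

def second_loss(first_target_features, centers, labels, tau=0.05):
    # first_target_features: (B, D); centers: (ID_S, D); labels: (B,)
    logits = first_target_features @ centers.t() / tau    # (B, ID_S), entries f_i · c_j / tau
    return F.cross_entropy(logits, labels)                # equals -log softmax at c_{y_i}
```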
In some implementations, the above step S15 includes step S151 or step S152:
Step S151: when the target loss value does not satisfy a preset condition, update the model parameters of the first model to obtain an updated first model, and determine the trained first model based on the updated first model.
Here, the manner of updating the model parameters of the first model may include, but is not limited to, at least one of a gradient descent method, a momentum update method, a Newton momentum method, and the like. In implementation, those skilled in the art may determine the update manner according to actual needs, which is not limited by the embodiments of the present disclosure.
Step S152: when the target loss value satisfies the preset condition, determine the updated first model as the trained first model.
Here, the preset condition may include, but is not limited to, the target loss value being smaller than a threshold, the change of the target loss value converging, and the like. In implementation, those skilled in the art may determine the preset condition according to actual needs, which is not limited by the embodiments of the present disclosure.
In some implementations, determining the trained first model based on the updated first model in step S151 includes steps S1511 to S1515:
Step S1511: acquire a next first image sample;
Step S1512: perform feature extraction on the next first image sample by using the first network of the updated first model to be trained, to obtain a next first feature;
Step S1513: update the next first feature based on the second feature of at least one second object by using the second network of the updated first model, to obtain a next first target feature corresponding to the next first feature;
Step S1514: determine a next target loss value based on the next first target feature;
Step S1515: update the model parameters of the updated first model at least once based on the next target loss value, to obtain the trained first model.
Here, the above steps S1511 to S1515 correspond to the foregoing steps S11 to S15 respectively, and may be implemented with reference to the implementations of the foregoing steps S11 to S15.
In the embodiments of the present disclosure, when the target loss value does not satisfy the preset condition, the model parameters of the first model are updated again, and the trained first model is determined based on the first model after this next update, so that the performance of the trained first model can be further improved through continuous iterative updates.
In some implementations, the first feature memory library includes feature sets belonging to at least one object, and each feature set includes at least one feature of the object to which it belongs. The method further includes step S16:
Step S16: update the feature set belonging to the first object in the first feature memory library based on the first target feature.
Here, the update manner may include, but is not limited to, adding the first target feature to the first feature memory library, replacing a certain feature in the first feature memory library with the first target feature, and the like.
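As a hedged illustration of step S16, the sketch below maintains the per-identity feature sets as bounded queues and either appends the first target feature or implicitly replaces the oldest stored feature; the queue capacity is an assumption of the example.

```python
# Illustrative update of the first feature memory library: per-identity feature sets.
from collections import defaultdict, deque

capacity = 50                                    # assumed maximum features per identity
memory = defaultdict(lambda: deque(maxlen=capacity))

def update_memory(identity, first_target_feature):
    # Appends the new feature; when the deque is full, the oldest feature is
    # dropped, which realises the "replace a certain feature" variant.
    memory[identity].append(first_target_feature.detach().cpu())
```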
In the embodiments of the present disclosure, by updating the features of the first object in the first feature memory library, the first feature center belonging to the first object can be obtained accurately, which further improves the recognition accuracy of the trained first model.
Fig. 2 is a schematic flowchart of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 2, the method includes steps S21 to S25:
Step S21: acquire a first sub-image and a second sub-image containing a first object.
Here, the second sub-image may be an image obtained by performing at least occlusion processing on the first sub-image. The second sub-image may include at least one image. In some implementations, when the second sub-image includes multiple images, the multiple images may be images obtained by separately performing at least occlusion processing on the first sub-image. Performing at least occlusion processing may include, but is not limited to, occlusion processing alone, or occlusion processing combined with other processing. In some implementations, the other processing may include, but is not limited to, at least one of scaling, cropping, resizing, padding, flipping, color jitter, grayscale conversion, Gaussian blur, random erasing, and the like. In implementation, those skilled in the art may apply a suitable processing manner to the first sub-image to obtain the second sub-image according to the actual situation, which is not limited by the embodiments of the present disclosure.
In some implementations, step S21 includes steps S211 to S212:
Step S211: acquire a first sub-image containing the first object.
Here, the first sub-image may be any suitable image that contains at least the first object. The content of the first sub-image may be determined according to the actual application scenario; for example, it may include only the first object, or the first object together with at least one of things and other objects. The first object may include, but is not limited to, a person, an animal, a plant, an article, and the like. For example, the first sub-image is a face image containing Zhang San. For another example, the first sub-image is an image containing the whole person of Li Si.
Step S212: perform at least occlusion processing on the first sub-image based on a preset occlusion set, to obtain the second sub-image.
Here, the occlusion set includes at least one occlusion image. The occlusion set may be established based on, but not limited to, at least one of the training set, other images, and the like. The occlusion set includes at least images of various occluding objects, background images and the like, such as leaves, vehicles, trash cans, buildings, trees and flowers. For example, image samples with background or object occlusion are found in the training set, and the occluded parts are manually cropped out to form an occlusion library. For another example, suitable images containing at least one kind of object occlusion are selected, and the occluding parts are manually cropped out to form the occlusion library. In implementation, those skilled in the art may choose an appropriate way to establish the occlusion set according to actual needs, which is not limited by the embodiments of the present disclosure.
The placement of the occluder may include, but is not limited to, a specified position, a specified size, and the like. In some implementations, since occlusion often occurs in a quarter to a half of the regions at the four positions of top, bottom, left and right, the specified position may be set within a quarter to a half of the regions at these four positions. In implementation, those skilled in the art may determine the position of the occluder according to actual needs, which is not limited by the embodiments of the present disclosure.
In some implementations, performing at least occlusion processing may include, but is not limited to, occlusion processing together with other processing. For example, when the processing includes occlusion processing and resizing, an occluder image is randomly selected from the occlusion library, the occluder image is resized based on an adjustment rule, and the resized occluder image is pasted at the lower right corner of the first image sample based on a preset rule. The adjustment rule may include, but is not limited to, adjusting the size of the occluder image, adjusting the size of the first image sample, and the like. For example, if the height of the occluder image exceeds twice its width, the occlusion is regarded as vertical occlusion, the height of the occluder image may be kept at its vertical height, and the width of the occluder image may be adjusted to a quarter to a half of the width of the first image sample; otherwise, the occlusion is regarded as horizontal occlusion, the width of the occluder image may be kept at its horizontal width, and the height of the occluder image may be adjusted to a quarter to a half of the height of the first image sample. In implementation, those skilled in the art may determine the adjustment rule according to actual needs, which is not limited by the embodiments of the present disclosure. For another example, when the processing includes occlusion processing, resizing, padding and cropping, first the first image sample is resized, padded and cropped; then an occluder image is randomly selected from the occlusion library and resized based on the adjustment rule; and then, based on a preset rule, one corner of the first image sample is randomly selected as a starting point, and the resized occluder image is pasted at that starting point.
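The following sketch implements one reading of these rules: a random occluder is resized according to its aspect ratio and pasted at a randomly chosen corner of the sample. The exact scaling factors and the corner choice are assumptions of the example.

```python
# Illustrative occlusion augmentation: paste a resized occluder at a random corner.
import random
from PIL import Image

def occlude(sample: Image.Image, occluders: list) -> Image.Image:
    occ = random.choice(occluders).copy()
    W, H = sample.size
    if occ.height > 2 * occ.width:                       # vertical occlusion
        new_w = random.randint(W // 4, W // 2)
        occ = occ.resize((new_w, min(occ.height, H)))
    else:                                                # horizontal occlusion
        new_h = random.randint(H // 4, H // 2)
        occ = occ.resize((min(occ.width, W), new_h))
    out = sample.copy()
    corner = random.choice([(0, 0), (W - occ.width, 0),
                            (0, H - occ.height), (W - occ.width, H - occ.height)])
    out.paste(occ, corner)
    return out
```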
In some implementations, the method further includes step S213:
Step S213: determine an occlusion mask based on the first sub-image and the second sub-image.
Here, the occlusion mask is used to represent the occlusion information of the image. The occlusion mask may be used for training the first model with respect to object occlusion. In some implementations, the occlusion mask may be determined based on the pixel difference between the first sub-image and the second sub-image. In implementation, the difference between the first sub-image and the second sub-image may be calculated based on the following formula (2-1):
d = |x − x′|      (2-1);
where x denotes the first sub-image and x′ denotes the second sub-image.
In some implementations, the above step S213 includes steps S2131 to S2133:
Step S2131: divide the first sub-image and the second sub-image into at least one first sub-part image and at least one second sub-part image, respectively.
In some implementations, because the semantics (for example, body parts) of different images are misaligned, a fine-grained occlusion mask tends to contain many false labels. Therefore, the first sub-image and the second sub-image may be roughly divided horizontally into multiple parts, and the occlusion mask is determined based on the pixel difference between each part of the first sub-image and the corresponding part of the second sub-image, for example, dividing into four parts, dividing into five parts, and so on. In implementation, those skilled in the art may divide the first sub-image and the second sub-image according to actual needs, which is not limited by the embodiments of the present disclosure.
Step S2132: determine an occlusion sub-mask based on each first sub-part image and each second sub-part image.
In some implementations, the pixel difference between each first sub-part image and the corresponding second sub-part image may be obtained based on the above formula (2-1), and each occlusion sub-mask is determined based on the pixel difference of each part.
Step S2133: determine the occlusion mask based on each occlusion sub-mask.
In some implementations, when d_i is not less than the first threshold, this part of the image is regarded as occluded and the occlusion sub-mask mask_i may be set to 0; otherwise, this part is regarded as not occluded and mask_i may be set to 1. The corresponding occlusion mask is then formed by the occlusion sub-masks of the parts. For example, both the first sub-image and the second sub-image are divided into four parts; when the first, second and third parts are not occluded and the fourth part is occluded, the occlusion mask should be 1110. In implementation, those skilled in the art may determine the occlusion mask according to actual needs, which is not limited by the embodiments of the present disclosure.
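A minimal sketch of this part-based mask computation is given below, assuming four horizontal stripes and a fixed pixel-difference threshold; both values are example assumptions.

```python
# Illustrative occlusion mask: split both images into horizontal parts, compare
# per-part pixel differences, and mark each part as occluded (0) or visible (1).
import numpy as np

def occlusion_mask(x: np.ndarray, x_occ: np.ndarray, parts=4, threshold=10.0):
    # x, x_occ: (H, W, C) arrays of the first and second sub-images
    d = np.abs(x.astype(np.float32) - x_occ.astype(np.float32))   # formula (2-1)
    stripes = np.array_split(d, parts, axis=0)                    # horizontal division
    return [0 if s.mean() >= threshold else 1 for s in stripes]   # e.g. [1, 1, 1, 0]
```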
Step S22: perform feature extraction on the first sub-image by using the first network of the first model to be trained to obtain a first sub-feature of the first object, and perform feature extraction on the second sub-image to obtain a second sub-feature of the first object.
Here, the first model may be any suitable model that performs object recognition based on image features. The first model may include at least the first network. The first sub-feature may include, but is not limited to, an original feature of the first sub-image, or a feature obtained by processing the original feature. The second sub-feature may include, but is not limited to, an original feature of the second sub-image, or a feature obtained by processing the original feature. The original feature may include, but is not limited to, a face feature, a body feature and the like of the object contained in the image.
Step S23: update the first sub-feature and the second sub-feature respectively based on a second feature of at least one second object by using the second network of the first model, to obtain a first target sub-feature corresponding to the first sub-feature and a second target sub-feature corresponding to the second sub-feature.
Here, the similarity between each second object and the first object is not less than the first threshold. The first threshold may be preset or obtained by statistics. In implementation, those skilled in the art may determine how to set the first threshold according to actual needs, which is not limited by the embodiments of the present disclosure. For example, the similarity between the facial appearance features of the second object and those of the first object is not less than the first threshold. For another example, the similarity between the clothing features of the second object and those of the first object is not less than the first threshold. For yet another example, both the similarity between the facial appearance features and the similarity between the clothing features of the second object and the first object are not less than the first threshold.
The second feature may be obtained based on the training set, or may be input in advance. The second object may include, but is not limited to, a person, an animal, a plant, an article, and the like.
In some implementations, the similarity between each second object and the first object may be obtained based on the similarity between the second feature of each second object and the first feature of the first object. In some implementations, the similarity between each second object and the first object may be obtained based on the similarity between the feature center of each second object and the first feature of the first object. The first model may include a second memory feature library, and the second memory feature library may include at least one feature of at least one object. The feature center of a second object may be obtained based on at least one feature belonging to the second object in the second memory feature library. In some implementations, features of multiple image samples of at least one object in the training set may be extracted, and the extracted features may be stored in the second memory feature library by identity.
Step S24: determine a target loss value based on the first target sub-feature and the second target sub-feature.
Here, the target loss value may include, but is not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a contrastive loss value, and the like.
Step S25: update the model parameters of the first model at least once based on the target loss value, to obtain the trained first model.
Here, the above step S25 corresponds to the foregoing step S15, and may be implemented with reference to the implementation of the foregoing step S15.
In the embodiments of the present disclosure, a first sub-image and a second sub-image containing a first object are acquired, where the second sub-image is an image obtained by performing at least occlusion processing on the first sub-image; feature extraction is performed on the first sub-image by using the first network of the first model to be trained to obtain a first sub-feature of the first object, and feature extraction is performed on the second sub-image to obtain a second sub-feature of the first object; the first sub-feature and the second sub-feature are updated respectively based on a second feature of at least one second object by using the second network of the first model, to obtain a first target sub-feature corresponding to the first sub-feature and a second target sub-feature corresponding to the second sub-feature, where the similarity between each second object and the first object is not less than the first threshold; a target loss value is determined based on the first target sub-feature and the second target sub-feature; and the model parameters of the first model are updated at least once based on the target loss value, to obtain the trained first model. In this way, occluding object images and features of other objects are introduced as noise at the image level and at the feature level, respectively, of the first image sample containing the first object, and the overall network structure of the first model is trained, which can enhance the robustness of the first model and improve its performance. Meanwhile, the model parameters of the first model are updated at least once when the target loss value does not satisfy the preset condition; since the target loss value is determined based on the first target feature, the consistency of the trained first model's predictions for different image samples of the same object can be improved, so that the trained first model can more accurately re-identify objects in images containing object occlusion and/or multiple objects.
在一些实施方式中,步骤S24包括步骤S241至步骤S243,其中:In some embodiments, step S24 includes step S241 to step S243, wherein:
步骤S241、基于第一目标子特征和第二目标子特征,确定第一目标损失值。Step S241: Determine a first target loss value based on the first target sub-feature and the second target sub-feature.
这里,第一目标损失值可以包括但不限于均方误差损失值、交叉熵损失值、对比损失值等中的至少一种。Here, the first target loss value may include, but not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a comparison loss value, and the like.
在一些实施方式中,步骤S241包括步骤S2411至步骤S2413,其中:In some embodiments, step S241 includes step S2411 to step S2413, wherein:
步骤S2411、基于第一目标子特征,确定第三目标子损失值。Step S2411. Based on the first target sub-feature, determine a third target sub-loss value.
这里,上述步骤S2411对应于前述步骤S14,在实施时可以参照前述步骤S14的实施方式。Here, the above-mentioned step S2411 corresponds to the above-mentioned step S14, and the implementation manner of the above-mentioned step S14 can be referred to for implementation.
步骤S2412、基于第二目标子特征,确定第四目标子损失值。Step S2412. Based on the second target sub-feature, determine the fourth target sub-loss value.
这里,上述步骤S2412对应于前述步骤S14,在实施时可以参照前述步骤S14的实施方式。Here, the above-mentioned step S2412 corresponds to the above-mentioned step S14, and the implementation of the above-mentioned step S14 can be referred to for implementation.
步骤S2413、基于第三目标子损失值和第四目标子损失值,确定第一目标损失值。Step S2413: Determine the first target loss value based on the third target sub-loss value and the fourth target sub-loss value.
这里,第一目标损失值可以包括但不限于第三目标子损失值和第四目标子损失值之间的和、对第三目标子损失值和第四目标子损失值分别加权之后的和等。在实施时,本领域技术人员可以根据实际需求确定第一目标损失值的方式,本公开实施例不作限定。Here, the first target loss value may include but not limited to the sum between the third target sub-loss value and the fourth target sub-loss value, the sum after weighting the third target sub-loss value and the fourth target sub-loss value, etc. . During implementation, those skilled in the art may determine the first target loss value according to actual needs, which is not limited by the embodiments of the present disclosure.
步骤S242、基于第一子特征和第二子特征,确定第二目标损失值。Step S242: Determine a second target loss value based on the first sub-feature and the second sub-feature.
这里,第二目标损失值可以包括但不限于均方误差损失值、交叉熵损失值、对比损失值等中的至少一种。Here, the second target loss value may include but not limited to at least one of a mean square error loss value, a cross-entropy loss value, a comparison loss value, and the like.
步骤S243、基于第一目标损失值和第二目标损失值,确定目标损失值。Step S243: Determine a target loss value based on the first target loss value and the second target loss value.
这里,目标损失值可以包括但不限于第一目标损失值和第二目标损失值之间的和、对第一目标损失值和第二目标损失值分别加权之后的和等。在实施时,本领域技术人员可以根据实际需求确定目标损失值的方式,本公开实施例不作限定。Here, the target loss value may include, but not limited to, the sum of the first target loss value and the second target loss value, the sum after weighting the first target loss value and the second target loss value respectively, and the like. During implementation, those skilled in the art may determine the target loss value according to actual needs, which is not limited by the embodiments of the present disclosure.
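As a non-limiting illustration of the weighting option mentioned above, a one-line sketch of the target loss as a weighted sum; the weight values are placeholders, not values prescribed by the disclosure.

```python
def target_loss(first_target_loss, second_target_loss, w1: float = 1.0, w2: float = 1.0):
    """Target loss as a (possibly weighted) sum of the first and second target loss values."""
    return w1 * first_target_loss + w2 * second_target_loss
```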
在本公开实施方式中,基于第一子特征、第二子特征、第一目标子特征和第二目标子特征,确定目标损失值。这样,可以提高目标损失值的准确度,以便于准确判断第一模型是否收敛。In an embodiment of the present disclosure, the target loss value is determined based on the first sub-feature, the second sub-feature, the first target sub-feature and the second target sub-feature. In this way, the accuracy of the target loss value can be improved, so as to accurately judge whether the first model is converged.
在一些实施方式中,第一网络包括第一子网络和第二子网络,步骤S22包括步骤S221至步骤S222,其中:In some implementations, the first network includes a first subnet and a second subnet, and step S22 includes steps S221 to S222, wherein:
步骤S221、利用待训练的第一模型的第一子网络,分别对第一子图像和第二子图像进行特征提取,得到第一子图像对应的第三子特征和第二子图像对应的第四子特征。Step S221. Using the first sub-network of the first model to be trained, perform feature extraction on the first sub-image and the second sub-image respectively, to obtain the third sub-feature corresponding to the first sub-image and the third sub-feature corresponding to the second sub-image. Four features.
这里,第一网络至少包括第一子网络,该第一子网络用于采用特征提取器来提取该图像的特征。该特征提取器可以包括但不限于RNN、CNN、基于转换器(Transform)的特征提取网络等。在实施时,本领域技术人员可以根据实际情况在第一模型中采用合适的第一子网络得到第三子特征,本公开实施例不作限定。例如,通过该第一子网络提取第一子图像的特征,并将该特征确定为第一对象的第三子特征。其中,第三子特征可以包括但不限于第一子图像的原始特征等。Here, the first network includes at least a first subnetwork, and the first subnetwork is used to extract features of the image using a feature extractor. The feature extractor may include, but is not limited to, RNN, CNN, a Transform-based feature extraction network, and the like. During implementation, those skilled in the art may use an appropriate first sub-network in the first model to obtain the third sub-feature according to actual conditions, which is not limited in the embodiments of the present disclosure. For example, a feature of the first sub-image is extracted through the first sub-network, and the feature is determined as a third sub-feature of the first object. Wherein, the third sub-feature may include but not limited to the original feature of the first sub-image and the like.
步骤S222、利用第一模型的第二子网络,基于第三子特征确定第一子特征,并基于第四子特征确定第二子特征。Step S222, using the second sub-network of the first model, determining the first sub-feature based on the third sub-feature, and determining the second sub-feature based on the fourth sub-feature.
在一些实施方式中,第二子网络可以包括遮挡擦除网络,该遮挡擦除网络用于对输入的特征进行遮挡擦除处理,输出无遮挡的特征。例如,通过第二子网络对第三子特征进行遮挡擦除处理后,得到第一对象的第一子特征。又例如,通过第二子网络对第四子特征进行遮挡擦除处理后,得到第一对象的第二子特征。In some implementations, the second sub-network may include an occlusion erasure network, which is used to perform occlusion erasure processing on input features and output unoccluded features. For example, the first sub-feature of the first object is obtained after occlusion and erasure processing is performed on the third sub-feature through the second sub-network. For another example, the second sub-feature of the first object is obtained after the fourth sub-feature is occluded and erased through the second sub-network.
在本公开实施方式中,通过在包含第一对象的第一图像样本的图片层面引入物体图像作为噪声,对第一模型的整体网络结构进行训练,从而可以增强第一模型的鲁棒性和提高第一模型的性能,进而能够使得训练后的第一模型能够更加准确的对包含有物体遮挡的图像中的对象进行重识别。In the embodiment of the present disclosure, the overall network structure of the first model is trained by introducing the object image as noise at the picture level containing the first image sample of the first object, so that the robustness and the improvement of the first model can be enhanced. The performance of the first model can further enable the trained first model to more accurately re-identify objects in images containing object occlusions.
在一些实施方式中,步骤S242包括步骤S2421至步骤S2423,其中:In some embodiments, step S242 includes step S2421 to step S2423, wherein:
步骤S2421、基于第一子特征和第二子特征,确定第一目标子损失值。Step S2421. Based on the first sub-feature and the second sub-feature, determine a first target sub-loss value.
这里,第一目标子损失值可以包括但不限于均方误差损失值、交叉熵损失值、对比损失值等中的至少一种。Here, the first target sub-loss value may include but not limited to at least one of a mean square error loss value, a cross-entropy loss value, a comparison loss value, and the like.
步骤S2422、基于第三子特征和第四子特征,确定第二目标子损失值。Step S2422. Based on the third sub-feature and the fourth sub-feature, determine a second target sub-loss value.
这里,第二目标子损失值可以包括但不限于均方误差损失值、交叉熵损失值、对比损失值等中的至少一种。Here, the second target sub-loss value may include, but not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a comparison loss value, and the like.
步骤S2423、基于第一目标子损失值和第二目标子损失值,确定第二目标损失值。Step S2423: Determine a second target loss value based on the first target sub-loss value and the second target sub-loss value.
这里,第二目标损失值可以包括但不限于第一目标子损失值和第二目标子损失值之间的和、对第一目标子损失值和第二目标子损失值分别加权之后的和等。在实施时,本领域技术人员可以根据实际需求确定第二目标损失值的方式,本公开实施例不作限定。Here, the second target loss value may include but not limited to the sum between the first target sub-loss value and the second target sub-loss value, the sum after weighting the first target sub-loss value and the second target sub-loss value, etc. . During implementation, those skilled in the art may determine the second target loss value according to actual needs, which is not limited by the embodiments of the present disclosure.
在本公开实施方式中,基于第一子特征、第二子特征、第三子特征和第四子特征,确定第二目标损失值。这样,可以提高第二目标损失值的准确度,以便于准确判断第一模型是否收敛。In an embodiment of the present disclosure, the second target loss value is determined based on the first sub-feature, the second sub-feature, the third sub-feature and the fourth sub-feature. In this way, the accuracy of the second target loss value can be improved, so as to accurately judge whether the first model converges.
在一些实施方式中,第一子图像包括标签信息,步骤S2422包括步骤S251至步骤S253,其中:In some implementations, the first sub-image includes label information, and step S2422 includes steps S251 to S253, wherein:
步骤S251、基于第三子特征和标签信息,确定第七子损失值。Step S251. Determine a seventh sub-loss value based on the third sub-feature and label information.
这里,标签信息可以包括但不限于标签值、标识等。第七子损失值可以包括但不限于交叉熵损失值等。在一些实施方式中,可以通过上述公式(1-1)计算第七子损失值,此时公式(1-1)中的f i是第三子特征。 Here, tag information may include, but not limited to, tag values, identifiers, and the like. The seventh sub-loss value may include but not limited to a cross-entropy loss value and the like. In some implementation manners, the seventh sub-loss value can be calculated by the above formula (1-1), and at this time, f i in the formula (1-1) is the third sub-feature.
步骤S252、基于第四子特征和标签信息,确定第八子损失值。Step S252: Determine an eighth sub-loss value based on the fourth sub-feature and label information.
这里,第八子损失值可以包括但不限于交叉熵损失值等。在一些实施方式中,可以根据上述公式(1-1)确定第八子损失值,此时公式(1-1)中的f i是第四子特征。 Here, the eighth sub-loss value may include but not limited to a cross-entropy loss value and the like. In some implementation manners, the eighth sub-loss value may be determined according to the above formula (1-1), at this time, f i in the formula (1-1) is the fourth sub-feature.
步骤S253、基于第七子损失值和第八子损失值,确定第二目标子损失值。Step S253: Determine a second target sub-loss value based on the seventh sub-loss value and the eighth sub-loss value.
这里,第二目标子损失值可以包括但不限于第七子损失值和第八子损失值之间的和、对第七子损失值和第八子损失值分别加权之后的和等。在实施时,本领域技术人员可以根据实际需求确定第二目标子损失值的方式,本公开实施例不作限定。Here, the second target sub-loss value may include, but not limited to, the sum between the seventh sub-loss value and the eighth sub-loss value, the sum after weighting the seventh sub-loss value and the eighth sub-loss value, and the like. During implementation, those skilled in the art may determine the second target sub-loss value according to actual requirements, which is not limited in the embodiments of the present disclosure.
在本公开实施方式中,基于第三子特征、第四子特征和标签信息,确定第二目标子损失值。这样,可以提高 第二目标子损失值的准确度,以便于准确判断第一模型是否收敛。In an embodiment of the present disclosure, the second target sub-loss value is determined based on the third sub-feature, the fourth sub-feature and label information. In this way, the accuracy of the second target sub-loss value can be improved, so as to accurately judge whether the first model is converged.
在一些实施方式中,第二子网络包括第三子网络和第四子网络,步骤S222包括步骤S2221至步骤S2222,其中:In some implementations, the second subnetwork includes a third subnetwork and a fourth subnetwork, and step S222 includes steps S2221 to S2222, wherein:
步骤S2221、利用第一模型的第三子网络,基于第三子特征确定第一遮挡分数,并基于第四子特征确定第二遮挡分数。Step S2221, using the third sub-network of the first model to determine the first occlusion score based on the third sub-feature, and determine the second occlusion score based on the fourth sub-feature.
这里,第二子网络至少包括第三子网络,该第三子网络用于基于对图像的特征进行语义分析,以得到该图像对应的遮挡分数。Here, the second sub-network includes at least a third sub-network, and the third sub-network is used to perform semantic analysis based on features of the image to obtain an occlusion score corresponding to the image.
在一些实施方式中,第三子网络包括池化子网络和至少一个遮挡擦除子网络,第一遮挡分数包括至少一个第一遮挡子分数,第二遮挡分数包括至少一个第二遮挡子分数;上述步骤S2221包括步骤261至步骤S262,其中:In some embodiments, the third subnetwork includes a pooling subnetwork and at least one occlusion erasure subnetwork, the first occlusion score includes at least one first occlusion subscore, and the second occlusion score includes at least one second occlusion subscore; The above step S2221 includes step 261 to step S262, wherein:
步骤S261、利用池化子网络,将第三子特征划分为至少一个第三子部分特征,并将第四子特征划分为至少一个第四子部分特征。Step S261. Divide the third sub-feature into at least one third sub-part feature by using the pooling sub-network, and divide the fourth sub-feature into at least one fourth sub-part feature.
Here, the pooling sub-network is used to divide an input feature into at least one sub-part feature. The number of third sub-part features may be the same as the number of parts into which the first sub-image is divided. For example, if the first sub-image is divided into four parts, the pooling sub-network may divide the third sub-feature into four third sub-part features, each third sub-part feature corresponding to f_i.
步骤S262、利用每一遮挡擦除子网络,基于每一第三子部分特征,确定第一遮挡子分数,并基于每一第四子部分特征,确定第二遮挡子分数。Step S262. Using each occlusion erasure sub-network, determine a first occlusion sub-score based on each third sub-part feature, and determine a second occlusion sub-score based on each fourth sub-part feature.
这里,每一遮挡擦除子网络用于对输入的特征进行语义分析以得到该特征对应的图像的遮挡分数。在一些实施方式中,每一遮挡擦除子网络包括两个全连接层、一个层归一化和一个激活函数构成,其中,层归一化位于两个全连接层之间,激活函数位于最后。在一些实施方式中,激活函数可以是Sigmoid函数。在一些实施方式中,遮挡擦除子网络的数量与第一子图像划分的数量相同。例如,将第一子图像划分为四个部分,每一部分对应的特征为f i,此时第三子网络包括四个遮挡擦除子网络,每一遮挡擦除子网络用于输出fi对应的遮挡分数。又例如,将第一子图像划分为五个部分,每一部分对应的特征为f i,此时第三子网络包括五个遮挡擦除子网络,每一遮挡擦除子模块用于输出f i对应的遮挡分数。 Here, each occlusion erasure sub-network is used to perform semantic analysis on the input feature to obtain the occlusion score of the image corresponding to the feature. In some implementations, each occlusion erasing sub-network consists of two fully connected layers, a layer normalization and an activation function, wherein the layer normalization is located between the two fully connected layers, and the activation function is located at the end . In some embodiments, the activation function can be a sigmoid function. In some embodiments, the number of occlusion erasure sub-networks is the same as the number of first sub-image divisions. For example, the first sub-image is divided into four parts, and the corresponding feature of each part is f i . At this time, the third sub-network includes four occlusion-erasing sub-networks, and each occlusion-erasing sub-network is used to output the corresponding Occlusion score. For another example, the first sub-image is divided into five parts, and the corresponding feature of each part is f i . At this time, the third sub-network includes five occlusion-erasing sub-networks, and each occlusion-erasing sub-module is used to output f i The corresponding occlusion score.
在一些实施方式中,可以通过如下公式(2-2)计算遮挡分数:In some implementations, the occlusion score can be calculated by the following formula (2-2):
s_i = Sigmoid(W_rg · LN(W_cp · f_i))   (2-2);
where W_cp and W_rg are learnable weight matrices (W_cp compresses the channel dimension and W_rg projects the compressed feature to a single value), LN is layer normalization, c denotes the channel dimension, and f_i denotes the feature of the i-th part of the third sub-feature or the fourth sub-feature.
For example, the third sub-feature is divided into four third sub-part features by the pooling sub-network, and each third sub-part feature is input into its corresponding occlusion-erasure sub-network. The first fully connected layer W_cp compresses the channel dimension to one quarter of its original size, layer normalization is applied to the compressed feature, the normalized feature is then compressed to one dimension by the second fully connected layer, and finally the Sigmoid function outputs the first occlusion sub-score s_i corresponding to the third sub-part feature.
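The following is a minimal PyTorch-style sketch of one occlusion-erasure sub-network consistent with formula (2-2) and the description above (two fully connected layers with layer normalization between them and a Sigmoid at the end). The module name, the bias settings and the exact layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class OcclusionScoreHead(nn.Module):
    """One occlusion-erasure sub-network: s_i = Sigmoid(W_rg · LN(W_cp · f_i))."""

    def __init__(self, channels: int):
        super().__init__()
        self.w_cp = nn.Linear(channels, channels // 4, bias=False)  # compress channels to c/4
        self.ln = nn.LayerNorm(channels // 4)                       # layer normalization
        self.w_rg = nn.Linear(channels // 4, 1, bias=False)         # project to a scalar score
        self.sigmoid = nn.Sigmoid()

    def forward(self, part_feature: torch.Tensor) -> torch.Tensor:
        # part_feature: (B, c) feature of one part of the image; output: (B, 1) occlusion score
        return self.sigmoid(self.w_rg(self.ln(self.w_cp(part_feature))))
```

When the first sub-image is divided into four parts, four such heads would be instantiated, one per part feature f_i.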
步骤S2222、利用第四子网络,基于第三子特征和第一遮挡分数,确定第一子特征,并基于第四子特征和第二遮挡分数,确定第二子特征。Step S2222. Using the fourth sub-network, determine the first sub-feature based on the third sub-feature and the first occlusion score, and determine the second sub-feature based on the fourth sub-feature and the second occlusion score.
这里,第二子网络还包括第四子网络,第四子网络用于确定遮挡擦除后的特征。Here, the second subnetwork further includes a fourth subnetwork, and the fourth subnetwork is used to determine features after occlusion erasure.
在一些实施方式中,步骤S2222包括步骤S271至步骤272,其中:In some embodiments, step S2222 includes step S271 to step 272, wherein:
步骤S271、利用第四子网络,基于第三子特征的每一第三子部分特征和每一第一遮挡子分数,确定第一子部分特征,并基于第四子特征的每一第四子部分特征和每一第二遮挡子分数,确定第二子部分特征。Step S271, using the fourth sub-network, based on each third sub-part feature of the third sub-feature and each first occlusion sub-score, determine the first sub-part feature, and based on each fourth sub-part feature of the fourth sub-feature The partial feature and each second occlusion sub-score determine a second sub-part feature.
在一些实施方式中,可以通过如下公式(2-3)计算第一子部分特征或第二子部分特征:In some embodiments, the first sub-part feature or the second sub-part feature can be calculated by the following formula (2-3):
f_i′ = s_i · f_i   (2-3);
where s_i denotes the i-th occlusion score and f_i denotes the i-th third sub-part feature or fourth sub-part feature.
在一些实施方式中,可以基于第一子特征,更新第二特征记忆库。更新的方式可以包括但不限于将第一子特征新增至第二特征记忆库中、将第二特征记忆库中的某一特征替换为第一子特征等。In some implementations, the second feature memory may be updated based on the first sub-feature. The way of updating may include, but not limited to, adding the first sub-feature to the second feature storage, replacing a certain feature in the second feature storage with the first sub-feature, and so on.
步骤S272、基于每一第一子部分特征,确定第一子特征,并基于每一第二子部分特征,确定第二子特征。Step S272: Determine the first sub-feature based on each first sub-part feature, and determine the second sub-feature based on each second sub-part feature.
在一些实施方式中,将至少一个第一子部分特征进行拼接,便可以得到第一子特征。In some embodiments, the first sub-features can be obtained by concatenating at least one first sub-feature.
在本公开实施方式中,通过池化子网络、至少一个遮挡擦除子网络及第四子网络,可以提高第一子特征和第二子特征的准确度。In the embodiments of the present disclosure, the accuracy of the first sub-feature and the second sub-feature can be improved by using the pooling sub-network, at least one occlusion-erasing sub-network and the fourth sub-network.
在一些实施方式中,第一子图像包括标签信息,第一模型包括第二特征记忆库,第二特征记忆库中包括属于至少一个对象的至少一个特征,上述步骤S2421包括步骤S281至步骤S285,其中:In some embodiments, the first sub-image includes label information, the first model includes a second feature memory, and the second feature memory includes at least one feature belonging to at least one object, and the above step S2421 includes steps S281 to S285, in:
步骤S281、基于第一子图像和第二子图像,确定遮挡掩码。Step S281. Determine an occlusion mask based on the first sub-image and the second sub-image.
这里,上述步骤S281对应于前述步骤S213,在实施时可以参照前述步骤S213的实施方式。Here, the above-mentioned step S281 corresponds to the above-mentioned step S213, and the implementation manner of the above-mentioned step S213 can be referred to for implementation.
步骤S282、基于第一遮挡分数、第二遮挡分数和遮挡掩码,确定第三损失值。Step S282. Determine a third loss value based on the first occlusion score, the second occlusion score and the occlusion mask.
这里,第三损失值可以包括但不限于均方误差损失值等。Here, the third loss value may include, but not limited to, a mean square error loss value and the like.
步骤S283、基于第一子特征、第二子特征和标签信息,确定第四损失值。Step S283: Determine a fourth loss value based on the first sub-feature, the second sub-feature and label information.
这里,第四损失值可以包括但不限于交叉熵损失值等。Here, the fourth loss value may include but not limited to a cross-entropy loss value and the like.
步骤S284、基于第一子特征、第二子特征和第二特征记忆库中的至少一个对象的至少一个特征,确定第五 损失值。Step S284: Determine a fifth loss value based on the first sub-feature, the second sub-feature, and at least one feature of at least one object in the second feature memory.
这里,第五损失值可以包括但不限于对比损失值等。Here, the fifth loss value may include, but not limited to, a comparison loss value and the like.
步骤S285、基于第三损失值、第四损失值和第五损失值,确定第一目标子损失值。Step S285, based on the third loss value, the fourth loss value and the fifth loss value, determine the first target sub-loss value.
这里,第一目标子损失值可以包括但不限于第三损失值、第四损失值和第五损失值之间的和,对第三损失值、第四损失值和第五损失值分别加权之后的和等。在实施时,本领域技术人员可以根据实际需求确定第一目标子损失值的方式,本公开实施例不作限定。Here, the first target sub-loss value may include but not limited to the sum of the third loss value, the fourth loss value and the fifth loss value, after weighting the third loss value, the fourth loss value and the fifth loss value respectively and so on. During implementation, those skilled in the art may determine the first target sub-loss value according to actual needs, which is not limited in the embodiments of the present disclosure.
在本公开实施方式中,基于遮挡掩码、第一子特征、第二子特征、标签信息和其它对象的特征,确定第一目标子损失值。这样,可以提高第一目标子损失值的准确度,以便于准确判断第一模型是否收敛。In the embodiments of the present disclosure, the first target sub-loss value is determined based on the occlusion mask, the first sub-feature, the second sub-feature, label information and other object characteristics. In this way, the accuracy of the first target sub-loss value can be improved, so as to accurately judge whether the first model is converged.
在一些实施方式中,步骤S282包括步骤S2821至步骤S2823,其中:In some embodiments, step S282 includes step S2821 to step S2823, wherein:
步骤S2821、基于第一遮挡分数和遮挡掩码,确定第一子损失值。Step S2821: Determine a first sub-loss value based on the first occlusion score and the occlusion mask.
这里,第一子损失值可以包括但不限于均方误差损失值等。在一些实施方式中,可以根据如下公式(2-4)计算第一子损失值:Here, the first sub-loss value may include, but not limited to, a mean square error loss value and the like. In some implementations, the first sub-loss value can be calculated according to the following formula (2-4):
first sub-loss value = (1/N) \sum_{i=1}^{N} (s_i − mask_i)^2   (2-4);
where N is the total number of occlusion-erasure sub-networks, s_i denotes the i-th occlusion score, and mask_i denotes the i-th occlusion sub-mask in the occlusion mask. For example, when the occlusion mask is 1110, mask_1 is 1 and mask_4 is 0.
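A short sketch of this mean-squared-error sub-loss between the N occlusion scores and the N occlusion sub-masks, matching the reconstruction of formula (2-4) given above; the function name is illustrative.

```python
import torch

def occlusion_score_loss(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean squared error between N occlusion scores s_i and N occlusion sub-masks mask_i.

    scores: (N,) outputs of the N occlusion-erasure sub-networks
    mask:   (N,) binary occlusion sub-masks, e.g. tensor([1., 1., 1., 0.]) for mask "1110"
    """
    return ((scores - mask) ** 2).mean()
```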
步骤S2822、基于第二遮挡分数和遮挡掩码,确定第二子损失值。Step S2822: Determine a second sub-loss value based on the second occlusion score and the occlusion mask.
这里,第二子损失值可以包括但不限于均方误差损失值等。在实施时,确定第二子损失值与确定第一子损失值的方式可以相同,具体参见步骤S2821。Here, the second sub-loss value may include, but not limited to, a mean square error loss value and the like. During implementation, the manner of determining the second sub-loss value may be the same as that of determining the first sub-loss value, see step S2821 for details.
步骤S2823、基于第一子损失值和第二子损失值,确定第三损失值。Step S2823: Determine a third loss value based on the first sub-loss value and the second sub-loss value.
这里,第三损失值可以包括但不限于第一子损失值和第二子损失值之间的和、对第一子损失值和第二子损失值分别加权之后的和等。在实施时,本领域技术人员可以根据实际需求确定第三损失值的方式,本公开实施例不作限定。Here, the third loss value may include, but not limited to, the sum of the first sub-loss value and the second sub-loss value, the sum after weighting the first sub-loss value and the second sub-loss value, and the like. During implementation, those skilled in the art may determine the third loss value according to actual requirements, which is not limited in the embodiments of the present disclosure.
在本公开实施方式中,基于第一遮挡分数、第二遮挡分数和遮挡掩码,确定第三损失值。这样,可以提高第三损失值的准确度,以便于准确判断第一模型是否收敛。In an embodiment of the present disclosure, the third loss value is determined based on the first occlusion score, the second occlusion score and the occlusion mask. In this way, the accuracy of the third loss value can be improved, so as to accurately judge whether the first model is converged.
在一些实施方式中,步骤S283包括步骤S2831至步骤S2833,其中:In some embodiments, step S283 includes step S2831 to step S2833, wherein:
步骤S2831、基于第一子特征和标签信息,确定第三子损失值。Step S2831. Determine a third sub-loss value based on the first sub-feature and label information.
这里,标签信息可以包括但不限于标签值、标识等。第三子损失值可以包括但不限于交叉熵损失值等。在一些实施方式中,可以通过上述公式(1-1)计算第三子损失值,此时公式(1-1)中的f i是第一子特征。 Here, tag information may include, but not limited to, tag values, identifiers, and the like. The third sub-loss value may include, but not limited to, a cross-entropy loss value and the like. In some implementation manners, the third sub-loss value can be calculated by the above formula (1-1), at this time, f i in the formula (1-1) is the first sub-feature.
步骤S2832、基于第二子特征和标签信息,确定第四子损失值。Step S2832. Determine a fourth sub-loss value based on the second sub-feature and label information.
这里,第四子损失值可以包括但不限于交叉熵损失值等。在一些实施方式中,可以通过上述公式(1-1)计算第四子损失值,此时公式(1-1)中的f i是第二子特征。 Here, the fourth sub-loss value may include but not limited to a cross-entropy loss value and the like. In some implementation manners, the fourth sub-loss value can be calculated by the above formula (1-1), and at this time, f i in the formula (1-1) is the second sub-feature.
步骤S2833、基于第三子损失值和第四子损失值,确定第四损失值。Step S2833: Determine a fourth loss value based on the third sub-loss value and the fourth sub-loss value.
这里,第四损失值可以包括但不限于第三子损失值和第四子损失值之间的和、对第三子损失值和第四子损失值分别加权之后的和等。在实施时,本领域技术人员可以根据实际需求确定第四损失值的方式,本公开实施例不作限定。Here, the fourth loss value may include, but not limited to, the sum between the third sub-loss value and the fourth sub-loss value, the sum after weighting the third sub-loss value and the fourth sub-loss value, and the like. During implementation, those skilled in the art may determine the fourth loss value according to actual requirements, which is not limited in the embodiments of the present disclosure.
在本公开实施方式中,基于第一子特征、第二子特征和标签信息,确定第四损失值。这样,可以提高第四损失值的准确度,以便于准确判断第一模型是否收敛。In an embodiment of the present disclosure, the fourth loss value is determined based on the first sub-feature, the second sub-feature and label information. In this way, the accuracy of the fourth loss value can be improved, so as to accurately judge whether the first model is converged.
在一些实施方式中,步骤S284包括步骤S2841至步骤S2844,其中:In some embodiments, step S284 includes step S2841 to step S2844, wherein:
步骤S2841、从第二特征记忆库中的至少一个对象的至少一个特征中,确定第一对象的第三特征中心和至少一个第二对象的第四特征中心。Step S2841. From at least one feature of at least one object in the second feature memory, determine a third feature center of the first object and a fourth feature center of at least one second object.
这里,第二特征记忆库中至少存储了第一对象的至少一个特征和至少一个第二对象的至少一个特征。在一些实施方式中,第三特征中心可以是基于第二特征记忆库中的第一对象的特征和该第一子特征,确定第三特征中心。每一第四特征中心可以是基于第二特征记忆库中的每一第二对象的每一特征确定的。在一些实施方式中,可以通过如下公式(2-5)计算每个对象的特征中心:Here, at least one feature of the first object and at least one feature of at least one second object are stored in the second feature storage. In some implementations, the third feature center may be determined based on the feature of the first object in the second feature memory library and the first sub-feature. Each fourth feature center may be determined based on each feature of each second object in the second feature memory. In some embodiments, the feature center of each object can be calculated by the following formula (2-5):
c_k ← m · c_k + (1 − m) · (1/|B_k|) \sum_{f_i′ ∈ B_k} f_i′   (2-5);
where c_k denotes the feature center of the k-th object, B_k denotes the set of features belonging to the k-th object in the mini-batch, m is a preset update momentum coefficient, and f_i′ is the first sub-feature of the i-th sample. In some implementations, m may be 0.2.
在一些实施方式中,在f i′和B k都属于同一对象的情况下,属于该对象的特征中心c k会变化,在f i′和B k不属于同一对象的情况下,属于该对象的特征中心c k与上一次c k的一致。 In some implementations, when f i ' and B k both belong to the same object, the feature center c k belonging to the object will change, and in the case that f i ' and B k do not belong to the same object, the feature center c k belonging to the object The feature center c k is consistent with the previous c k .
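The following sketch illustrates one possible momentum-style update of a feature center consistent with the description around formula (2-5); the exact update rule and the handling of an empty B_k are assumptions for illustration.

```python
import torch

def update_center(center_k: torch.Tensor, batch_feats_k: torch.Tensor, m: float = 0.2):
    """Momentum update of the feature center of object k.

    center_k:      (d,)   current feature center c_k
    batch_feats_k: (n, d) features f_i' in the mini-batch that belong to object k (set B_k)
    If no feature in the mini-batch belongs to object k, the center stays unchanged,
    as described above.
    """
    if batch_feats_k.numel() == 0:
        return center_k
    return m * center_k + (1.0 - m) * batch_feats_k.mean(dim=0)
```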
步骤S2842、基于第一子特征、第三特征中心和每一第四特征中心,确定第五子损失值。Step S2842, based on the first sub-feature, the third feature center and each fourth feature center, determine the fifth sub-loss value.
这里,第五子损失值可以包括但不限于对比损失等。在一些实施方式中,可以通过如下公式(2-6)计算第五子损失值:Here, the fifth sub-loss value may include but not limited to contrastive loss and the like. In some implementations, the fifth sub-loss value can be calculated by the following formula (2-6):
fifth sub-loss value = −log( exp(f_i · c_y / τ) / \sum_{z=1}^{ID_S} exp(f_i · c_z / τ) )   (2-6);
where τ is a predefined temperature parameter, c_y denotes the third feature center of the y-th object (the object to which the sample belongs), c_z denotes the z-th fourth feature center, f_i denotes the first sub-feature of the i-th object, and ID_S denotes the total number of objects in the training set; the sum in the denominator runs over the centers of all ID_S objects.
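A hedged sketch of an InfoNCE-style contrastive sub-loss over the feature centers, consistent with the reconstruction of formula (2-6) above; using the dot product as the similarity measure and the default temperature value are assumptions.

```python
import torch

def center_contrastive_loss(f_i: torch.Tensor, center_pos: torch.Tensor,
                            all_centers: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """-log( exp(f_i·c_y / tau) / sum_z exp(f_i·c_z / tau) ).

    f_i:         (d,)       first sub-feature of the sample
    center_pos:  (d,)       feature center c_y of the sample's own object
    all_centers: (ID_S, d)  feature centers of all objects in the training set
    """
    logits = all_centers @ f_i / tau          # (ID_S,) similarity to every center
    pos = (center_pos @ f_i) / tau            # similarity to the positive center
    return torch.logsumexp(logits, dim=0) - pos
```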
步骤S2843、基于第二子特征、第三特征中心和每一第四特征中心,确定第六子损失值。Step S2843, based on the second sub-feature, the third feature center and each fourth feature center, determine the sixth sub-loss value.
这里,第六子损失值可以包括但不限于对比损失等。在实施时,确定第六子损失值与确定第五子损失值的方式可以相同,具体参见步骤S2842。Here, the sixth sub-loss value may include but not limited to contrastive loss and the like. During implementation, the manner of determining the sixth sub-loss value may be the same as that of determining the fifth sub-loss value, see step S2842 for details.
步骤S2844、基于第五子损失值和第六子损失值,确定第六损失值。Step S2844: Determine a sixth loss value based on the fifth sub-loss value and the sixth sub-loss value.
这里,第六损失值可以包括但不限于第五子损失值和第六子损失值之间的和、对第五子损失值和第六子损失值分别加权之后的和等。在实施时,本领域技术人员可以根据实际需求确定第六损失值的方式,本公开实施例不作限定。Here, the sixth loss value may include, but not limited to, the sum between the fifth sub-loss value and the sixth sub-loss value, the sum after weighting the fifth sub-loss value and the sixth sub-loss value, and the like. During implementation, those skilled in the art may determine the sixth loss value according to actual needs, which is not limited in the embodiments of the present disclosure.
在本公开实施方式中,基于第一子特征、第二子特征和其它对象的特征,确定第六损失值。这样,可以提高第六损失值的准确度,以便于准确判断第一模型是否收敛。In an embodiment of the present disclosure, the sixth loss value is determined based on the first sub-feature, the second sub-feature and other object characteristics. In this way, the accuracy of the sixth loss value can be improved, so as to accurately judge whether the first model is converged.
在一些实施方式中,第二网络包括第五子网络和第六子网络,步骤S23包括步骤S231至步骤S232,其中:In some embodiments, the second network includes a fifth subnetwork and a sixth subnetwork, and step S23 includes steps S231 to S232, wherein:
步骤S231、利用第五子网络,将第一子特征和第二子特征分别与至少一个第二对象的第二特征进行聚合,得到第一子特征对应的第一聚合子特征和第二子特征对应的第二聚合子特征。Step S231, using the fifth sub-network to aggregate the first sub-feature and the second sub-feature with the second feature of at least one second object respectively, to obtain the first aggregated sub-feature and the second sub-feature corresponding to the first sub-feature The corresponding second aggregate subfeature.
这里,第二网络至少包括第五子网络,该第五子网络用于将第一子特征与至少一个第二对象的第二特征进行聚合,得到第一聚合子特征,将第二子特征与至少一个第二对象的第二特征进行聚合,得到第二聚合子特征。Here, the second network includes at least a fifth sub-network, and the fifth sub-network is used to aggregate the first sub-features with the second features of at least one second object to obtain the first aggregated sub-features, and combine the second sub-features with A second feature of at least one second object is aggregated to obtain a second aggregated sub-feature.
步骤S232、利用第六子网络,基于第一聚合子特征确定第一目标子特征,并基于第二聚合子特征确定第二目标子特征。Step S232. Using the sixth sub-network, determine the first target sub-feature based on the first aggregated sub-feature, and determine the second target sub-feature based on the second aggregated sub-feature.
这里,第二网络还包括第六子网络,该第六子网络用于基于第一聚合子特征确定第一目标子特征,基于第二聚合子特征确定第二目标子特征。Here, the second network further includes a sixth sub-network for determining the first target sub-feature based on the first aggregated sub-feature, and determining the second target sub-feature based on the second aggregated sub-feature.
在本公开实施方式中,通过在包含第一对象的第一图像样本的特征层面引入第二对象的特征作为噪声,对第一模型的整体网络结构进行训练,从而可以增强第一模型的鲁棒性和提高第一模型的性能,进而能够使得训练后的第一模型能够更加准确的对包含多个对象的图像中的对象进行重识别。In the embodiment of the present disclosure, the overall network structure of the first model is trained by introducing the features of the second object as noise at the feature level of the first image sample containing the first object, so that the robustness of the first model can be enhanced and improve the performance of the first model, thereby enabling the trained first model to more accurately re-identify objects in images containing multiple objects.
在一些实施方式中,步骤S231包括步骤S2311至步骤S2314,其中:In some embodiments, step S231 includes step S2311 to step S2314, wherein:
步骤S2311、基于第一子特征和每一第二特征,确定第一注意力矩阵。Step S2311, based on the first sub-feature and each second feature, determine a first attention matrix.
Here, the first attention matrix is used to characterize the degree of association between the first sub-feature and each second feature. In some implementations, based on the first sub-feature, X second features belonging to at least one second object are determined, where X is a positive integer. In some implementations, X may be 10. In some implementations, a K-nearest-neighbor search may be performed in the second feature memory library to find the X second features belonging to second objects that are closest to the first sub-feature, and X first centers c_1^{knn}, …, c_X^{knn} may be determined based on each of these second features. The search may be computed according to the cosine distance between features.
在一些实施方式中,所述第五子网络的网络参数包括第一预测矩阵和第二预测矩阵,步骤S2311包括步骤S2321至步骤S2323,其中:In some implementations, the network parameters of the fifth sub-network include a first prediction matrix and a second prediction matrix, and step S2311 includes steps S2321 to S2323, wherein:
步骤S2321、基于第一子特征和第一预测矩阵,确定第一预测特征。Step S2321, based on the first sub-feature and the first prediction matrix, determine the first prediction feature.
在一些实施方式中,可以通过如下公式(2-7)计算第一预测特征:In some embodiments, the first predictive feature can be calculated by the following formula (2-7):
f_q = f′ W_1   (2-7);
where f′ denotes the first sub-feature, W_1 ∈ R^{d×d′} is the first prediction matrix, and d and d′ are feature dimensions of f′.
步骤S2322、基于每一第二特征和第二预测矩阵,确定第二预测特征。Step S2322. Based on each second feature and the second predictive matrix, determine a second predictive feature.
在一些实施方式中,可以通过如下公式(2-8)计算第二预测特征:In some embodiments, the second predictive feature can be calculated by the following formula (2-8):
f_{c_i} = c_i^{knn} W_2   (2-8);
where c_i^{knn} denotes the i-th first center, i ∈ 1, 2, …, X, W_2 ∈ R^{d×d′} is the second prediction matrix, and d and d′ are feature dimensions of the first sub-feature.
步骤S2323、基于第一预测特征和每一第二预测特征,确定第一注意力矩阵。Step S2323: Determine a first attention matrix based on the first predictive feature and each second predictive feature.
在一些实施方式中,可以通过如下公式(2-9)确定第一注意力矩阵:In some embodiments, the first attention matrix can be determined by the following formula (2-9):
m_i = exp(f_q · f_{c_i} / α) / \sum_{j=1}^{X} exp(f_q · f_{c_j} / α)   (2-9);
where X denotes the total number of second features, i ∈ 1, 2, …, X, f_{c_i} is the second prediction feature of the i-th first center, and α is a scale factor.
步骤S2312、基于每一第二特征和每一第一注意力矩阵,确定第一聚合子特征。Step S2312, based on each second feature and each first attention matrix, determine the first aggregation sub-feature.
在一些实施方式中,第五子网络的网络参数还包括第三预测矩阵,步骤S2312包括步骤S2331至步骤S2332,其中:In some implementations, the network parameters of the fifth sub-network also include a third prediction matrix, and step S2312 includes steps S2331 to S2332, wherein:
步骤S2331、基于每一第二特征和第三预测矩阵,确定第三预测特征。Step S2331. Based on each second feature and the third predictive matrix, determine a third predictive feature.
在一些实施方式中,可以通过如下公式(2-10)计算第三预测特征:In some embodiments, the third predictive feature can be calculated by the following formula (2-10):
f_{v_i} = c_i^{knn} W_3   (2-10);
where c_i^{knn} denotes the i-th first center, i ∈ 1, 2, …, X, W_3 ∈ R^{d×d′} is the third prediction matrix, and d and d′ are feature dimensions of the first sub-feature.
步骤S2332、基于每一第三预测特征和每一第一注意力矩阵,确定第一聚合子特征。Step S2332, based on each third predictive feature and each first attention matrix, determine the first aggregation sub-feature.
在一些实施方式中,可以通过如下公式(2-11)确定第一聚合子特征:In some embodiments, the first aggregation sub-feature can be determined by the following formula (2-11):
f_d = \sum_{i=1}^{X} m_i · f_{v_i}   (2-11);
where m_i denotes the i-th first attention matrix and f_{v_i} denotes the i-th third prediction feature.
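The sketch below chains formulas (2-7) to (2-11) as reconstructed above: the first sub-feature and the X nearest first centers are projected with W_1, W_2 and W_3, softmax attention weights are computed, and the projected centers are aggregated. The single-head form, the scale factor and the layer dimensions are simplifying assumptions; the disclosure also describes a multi-head variant.

```python
import torch
import torch.nn as nn

class FeatureDiffusion(nn.Module):
    """Attention-based aggregation of X nearest first centers into an aggregated sub-feature."""

    def __init__(self, d: int, d_prime: int):
        super().__init__()
        self.w1 = nn.Linear(d, d_prime, bias=False)  # query projection, cf. formula (2-7)
        self.w2 = nn.Linear(d, d_prime, bias=False)  # key projection,   cf. formula (2-8)
        self.w3 = nn.Linear(d, d_prime, bias=False)  # value projection, cf. formula (2-10)

    def forward(self, f_prime: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # f_prime: (d,) first sub-feature; centers: (X, d) nearest first centers
        f_q = self.w1(f_prime)                            # (d',)
        f_c = self.w2(centers)                            # (X, d')
        f_v = self.w3(centers)                            # (X, d')
        scale = f_q.shape[-1] ** 0.5                      # assumed scale factor
        attn = torch.softmax(f_c @ f_q / scale, dim=0)    # (X,) cf. formula (2-9)
        return (attn.unsqueeze(1) * f_v).sum(dim=0)       # (d',) cf. formula (2-11)
```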
步骤S2313、基于第二子特征和每一第二特征,确定第二注意力矩阵。Step S2313: Determine a second attention matrix based on the second sub-features and each second feature.
这里,第二注意力矩阵用于表征第二子特征和每一第二特征之间的关联度。在实施时,确定第二注意力矩阵与确定第一注意力矩阵的方式可以相同,参见步骤S2321至步骤S2323。Here, the second attention matrix is used to characterize the degree of association between the second sub-features and each second feature. During implementation, the manner of determining the second attention matrix may be the same as that of determining the first attention matrix, see step S2321 to step S2323.
步骤S2314、基于每一第二特征和每一第二注意力矩阵,确定第二聚合子特征。Step S2314, based on each second feature and each second attention matrix, determine a second aggregation sub-feature.
这里,确定第二聚合子特征与确定第一聚合子特征的方式可以相同,具体参见步骤S2331至步骤S2332。Here, the manner of determining the second aggregation sub-feature may be the same as that of determining the first aggregation sub-feature, see step S2331 to step S2332 for details.
在本公开实施方式中,通过多头操作将每个第一中心分成多个部分,并为每个部分分配注意力权重,从而确保可以聚合更多类似于目标对象和非目标对象的独特模式,以增强第一模型的鲁棒性,进而能够使得训练后的第一模型能够更加准确的对包含多个对象的图像中的对象进行重识别。In the embodiment of the present disclosure, each first center is divided into multiple parts by multi-head operation, and attention weight is assigned to each part, so as to ensure that more unique patterns similar to target objects and non-target objects can be aggregated to The robustness of the first model is enhanced, so that the trained first model can more accurately re-identify objects in images containing multiple objects.
在一些实施方式中,第六子网络包括第七子网络和第八子网络,上述步骤S232包括步骤S2341至步骤S2343,其中:In some embodiments, the sixth subnetwork includes the seventh subnetwork and the eighth subnetwork, and the above step S232 includes steps S2341 to S2343, wherein:
步骤S2341、基于第一子图像和第二子图像,确定遮挡掩码。Step S2341. Determine an occlusion mask based on the first sub-image and the second sub-image.
这里,遮挡掩码用于表示图像的遮挡信息。在一些实施方式中,可以基于第一子图像和第二子图像之间的像素差异,确定该遮挡掩码。Here, the occlusion mask is used to represent the occlusion information of the image. In some implementations, the occlusion mask may be determined based on pixel differences between the first sub-image and the second sub-image.
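A minimal sketch of deriving a per-part occlusion mask from the pixel difference between the two sub-images followed by binarization, in the spirit of the description above; splitting the image into horizontal stripes, the number of parts and the binarization threshold are assumptions for illustration.

```python
import torch

def occlusion_mask(first_img: torch.Tensor, second_img: torch.Tensor,
                   num_parts: int = 4, eps: float = 1e-3) -> torch.Tensor:
    """Per-part occlusion mask from the pixel difference of the two sub-images.

    first_img, second_img: (C, H, W) tensors with values in [0, 1]
    Returns a (num_parts,) binary mask whose i-th entry is 1 when the i-th horizontal
    stripe is unchanged (unoccluded) and 0 when it was altered by the occlusion
    augmentation, e.g. tensor([1., 1., 1., 0.]) for the mask "1110".
    """
    diff = (first_img - second_img).abs().mean(dim=0)   # (H, W) per-pixel difference
    stripes = diff.chunk(num_parts, dim=0)               # split along height into parts
    return torch.tensor([float(s.mean() < eps) for s in stripes])
```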
步骤S2342、利用第七子网络,基于第一聚合子特征和遮挡掩码确定第五子特征,并基于第二聚合子特征和遮挡掩码确定第六子特征。Step S2342. Using the seventh sub-network, determine the fifth sub-feature based on the first aggregation sub-feature and the occlusion mask, and determine the sixth sub-feature based on the second aggregation sub-feature and the occlusion mask.
这里,第七子网络可以是包括两个全连接层和一个激活函数的FFN 1(·)神经网络。在一些实施方式中,可以通过如下公式(2-12),得到第五子特征或第六子特征: Here, the seventh sub-network may be an FFN 1 (·) neural network including two fully connected layers and an activation function. In some embodiments, the fifth sub-feature or the sixth sub-feature can be obtained by the following formula (2-12):
f″ = mask · FFN_1(f_d)   (2-12);
where mask is the occlusion mask and f_d is the first aggregated sub-feature or the second aggregated sub-feature.
步骤S2343、利用第八子网络,基于第一子特征和第五子特征,确定第一目标子特征,并基于第二子特征和第六子特征,确定第二目标子特征。Step S2343. Using the eighth sub-network, determine the first target sub-feature based on the first sub-feature and the fifth sub-feature, and determine the second target sub-feature based on the second sub-feature and the sixth sub-feature.
这里,第八子网络可以是包括两个全连接层和一个激活函数的FFN 2(·)神经网络。在一些实施方式中,可以通过如下公式(2-13),得到第一目标子特征或第二目标子特征: Here, the eighth sub-network may be an FFN 2 (·) neural network including two fully connected layers and an activation function. In some embodiments, the first target sub-feature or the second target sub-feature can be obtained by the following formula (2-13):
f_d′ = FFN_2(f″ + f′)   (2-13);
where f″ is the fifth sub-feature or the sixth sub-feature, and f′ is the first sub-feature or the second sub-feature.
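A compact sketch of FFN_1(·) and FFN_2(·) and of how the first target sub-feature is produced from the aggregated feature, the occlusion mask and the first sub-feature, following formulas (2-12) and (2-13); the hidden size, the activation choice and the way the mask is broadcast are assumptions.

```python
import torch
import torch.nn as nn

def make_ffn(d: int, hidden: int) -> nn.Sequential:
    """Two fully connected layers with one activation, as described for FFN1(.) and FFN2(.)."""
    return nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))

class DiffusionHead(nn.Module):
    def __init__(self, d: int, hidden: int = 1024):
        super().__init__()
        self.ffn1 = make_ffn(d, hidden)
        self.ffn2 = make_ffn(d, hidden)

    def forward(self, f_d: torch.Tensor, f_prime: torch.Tensor, mask_value: torch.Tensor):
        # mask_value: scalar or broadcastable tensor derived from the occlusion mask
        f_pp = mask_value * self.ffn1(f_d)     # formula (2-12): f'' = mask · FFN1(f_d)
        return self.ffn2(f_pp + f_prime)       # formula (2-13): f_d' = FFN2(f'' + f')
```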
In the embodiments of the present disclosure, the target feature is obtained based on the occlusion mask, the first sub-feature and the first aggregated sub-feature. This ensures that the features of other objects are added only to the body parts of the first object rather than to the previously identified occluded parts, so as to better simulate the features of images containing multiple pedestrians.
图3为本公开实施例提供的一种模型训练方法的实现流程示意图,如图3所示,该方法包括步骤S31至步骤S37,其中:Fig. 3 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 3, the method includes steps S31 to S37, wherein:
步骤S31、获取包含第一对象的第一图像样本。Step S31 , acquiring a first image sample including a first object.
步骤S32、利用待训练的第一模型的第一网络,对第一图像样本进行特征提取,得到第一对象的第一特征。Step S32 , using the first network of the first model to be trained, to perform feature extraction on the first image sample to obtain the first feature of the first object.
步骤S33、利用第一模型的第二网络,基于至少一个第二对象的第二特征,对第一特征进行更新,得到第一特征对应的第一目标特征,每一第二对象与第一对象的相似度不小于第一阈值。Step S33, using the second network of the first model to update the first feature based on the second feature of at least one second object to obtain the first target feature corresponding to the first feature, and each second object is related to the first object The similarity of is not less than the first threshold.
步骤S34、基于第一目标特征,确定目标损失值。Step S34: Determine a target loss value based on the first target feature.
步骤S35、基于目标损失值,对第一模型的模型参数进行至少一次更新,得到训练后的第一模型。Step S35 , based on the target loss value, update the model parameters of the first model at least once to obtain the trained first model.
这里,上述步骤S31至步骤S35分别对应于前述步骤S11至步骤S15,在实施时,可以参照前述步骤S11至步骤S15的具体实施方式。Here, the above-mentioned steps S31 to S35 correspond to the above-mentioned steps S11 to S15 respectively, and for implementation, reference may be made to the specific implementation manners of the above-mentioned steps S11 to S15.
步骤S36、基于训练后的第一模型,确定初始的第二模型。Step S36: Determine an initial second model based on the trained first model.
这里,可以根据实际的使用场景对训练后的第一模型的网络进行调整,并将调整后的第一模型确定为初始的第二模型。在一些实施方式中,第一模型包括第一网络和第二网络,可以将训练后的第一模型中的第二网络移除,并根据实际的场景对该第一模型的第一网络进行调整,并将调整后的第一模型确定为初始的第二模型。Here, the network of the trained first model may be adjusted according to an actual usage scenario, and the adjusted first model may be determined as the initial second model. In some embodiments, the first model includes a first network and a second network, the second network in the trained first model can be removed, and the first network of the first model can be adjusted according to the actual scene , and determine the adjusted first model as the initial second model.
步骤S37、基于至少一个第二图像样本,对第二模型的模型参数进行更新,得到训练后的第二模型。Step S37 , based on at least one second image sample, update the model parameters of the second model to obtain a trained second model.
这里,第二图像样本可以具有标签信息,也可以是无标签信息。在实施时,本领域技术人员可以根据实际的应用场景确定合适的第二图像样本,这里并不限定。在一些实施方式中,可以基于至少一个第二图像样本,对第二模型的模型参数进行微调训练,得到训练后的第二模型。Here, the second image sample may have label information, or may not have label information. During implementation, those skilled in the art may determine a suitable second image sample according to an actual application scenario, which is not limited here. In some implementations, based on at least one second image sample, fine-tuning training may be performed on model parameters of the second model to obtain a trained second model.
在本公开实施例中,基于训练后的第一模型,确定初始的第二模型,并基于至少一个第二图像样本,对第二模型的模型参数进行更新,得到训练后的第二模型。这样,可以将训练后的第一模型的模型参数迁移至第二模型,以适用于多种应用场景中,不仅可以在实际应用中减小计算量,还可以提高第二模型的训练效率以及训练后的第二模型的检测准确性。In an embodiment of the present disclosure, an initial second model is determined based on the trained first model, and model parameters of the second model are updated based on at least one second image sample to obtain a trained second model. In this way, the model parameters of the trained first model can be migrated to the second model to be applicable to various application scenarios, which can not only reduce the amount of calculation in practical applications, but also improve the training efficiency and training efficiency of the second model. After the detection accuracy of the second model.
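As a non-limiting illustration of this transfer step, the sketch below reuses the first network of the trained first model as the initial second model and fine-tunes it on second image samples; the attribute name first_network, the optimizer and the hyper-parameters are assumptions for illustration.

```python
import copy
import torch

def build_second_model(trained_first_model):
    """Drop the second network and reuse the first network as the initial second model."""
    return copy.deepcopy(trained_first_model.first_network)  # assumed attribute name

def finetune(second_model, loader, loss_fn, lr: float = 1e-4, epochs: int = 5):
    """Fine-tune the second model's parameters on at least one second image sample."""
    opt = torch.optim.Adam(second_model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = loss_fn(second_model(images), labels)
            loss.backward()
            opt.step()
    return second_model
```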
图4为本公开实施例提供的一种图像识别方法,如图4所示,该方法包括步骤S41至步骤S42,其中:Fig. 4 is an image recognition method provided by an embodiment of the present disclosure. As shown in Fig. 4, the method includes steps S41 to S42, wherein:
步骤S41、获取第一图像和第二图像。Step S41, acquiring a first image and a second image.
这里,第一图像和第二图像可以是任意合适的待进行识别的图像。在实施时,本领域技术人员可以根据实际应用场景中选择合适的图像,本公开实施例不作限定。在一些实施方式中,第一图像可以包括带有遮挡的图像,也可以包括未遮挡的图像。在一些实施方式中,第一图像和第二图像的来源可以相同,也可以不同。例如,第一 图像和第二图像均是通过摄像头拍摄的图像。又例如,第一图像是通过摄像头拍摄的图像,第二图像可以是视频中的某一帧图像。Here, the first image and the second image may be any suitable images to be recognized. During implementation, those skilled in the art may select an appropriate image according to an actual application scenario, which is not limited by the embodiments of the present disclosure. In some implementations, the first image may include an occluded image or an unoccluded image. In some embodiments, the sources of the first image and the second image may be the same or different. For example, both the first image and the second image are images captured by a camera. For another example, the first image is an image captured by a camera, and the second image may be a frame of an image in a video.
步骤S42、利用已训练的目标模型,对第一图像中的对象和第二图像中的对象进行识别,得到识别结果。Step S42 , using the trained target model, to recognize the object in the first image and the object in the second image, and obtain a recognition result.
这里,已训练的目标模型可以包括但不限于第一模型、第二模型中的至少之一。该识别结果表征第一图像中的对象和第二图像中的对象为同一对象或者不同对象。在一些实施方式中,基于该目标模型分别获取第一图像对应的第一目标特征和第二图像对应的第二目标特征,并基于第一目标特征和第二目标特征之间的相似度,得到该识别结果。Here, the trained target model may include but not limited to at least one of the first model and the second model. The recognition result indicates that the object in the first image and the object in the second image are the same object or different objects. In some implementations, based on the target model, the first target feature corresponding to the first image and the second target feature corresponding to the second image are obtained respectively, and based on the similarity between the first target feature and the second target feature, it is obtained The recognition result.
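A minimal sketch of this recognition step: the trained target model extracts a target feature for each image, and the two objects are judged to be the same when the cosine similarity of the features exceeds a threshold; the threshold value is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognize(model, first_image: torch.Tensor, second_image: torch.Tensor,
              threshold: float = 0.6) -> bool:
    """True if the objects in the two images are judged to be the same object."""
    feat1 = model(first_image.unsqueeze(0))    # (1, d) first target feature
    feat2 = model(second_image.unsqueeze(0))   # (1, d) second target feature
    sim = F.cosine_similarity(feat1, feat2).item()
    return sim >= threshold
```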
在本公开实施例中,由于上述实施例中的模型训练方法可以在特征层面引入真实噪声,或在图片层面和特征层面均引入真实噪声,对目标模型的整体网络结构进行训练,增强了目标模型的鲁棒性,有有效地提高了目标模型的性能,因此,基于采用上述实施例中的模型训练方法得到的第一模型和/或第二模型对图像进行识别,能够更加准确的对行人进行重识别。In the embodiment of the present disclosure, since the model training method in the above embodiment can introduce real noise at the feature level, or introduce real noise at both the picture level and the feature level, the overall network structure of the target model is trained, and the target model is enhanced. The robustness of the target model has effectively improved the performance of the target model. Therefore, based on the first model and/or the second model obtained by using the model training method in the above embodiment to identify the image, the pedestrian can be more accurately identified. Re-identify.
图5A为本公开实施例提供的一种模型训练系统50的组成结构示意图,如图5A所示,该模型训练系统50包括增广部分51、遮挡擦除部分52、特征扩散部分53、更新部分54和特征记忆库部分55,其中:FIG. 5A is a schematic diagram of the composition and structure of a model training system 50 provided by an embodiment of the present disclosure. As shown in FIG. 54 and feature memory part 55, wherein:
增广部分51,被配置为对包含第一对象的第一子图像至少进行遮挡处理后,得到第二子图像。The augmentation part 51 is configured to at least perform occlusion processing on the first sub-image containing the first object to obtain the second sub-image.
遮挡擦除部分52,被配置为利用待训练的第一模型的第一网络,对第一子图像进行特征提取,得到第一对象的第一子特征,并对第二子图像进行特征提取,得到第一对象的第二子特征。The occlusion erasing part 52 is configured to use the first network of the first model to be trained to perform feature extraction on the first sub-image, obtain the first sub-feature of the first object, and perform feature extraction on the second sub-image, Get the second subfeature of the first object.
特征扩散部分53,被配置为利用第一模型的第二网络,基于至少一个第二对象的第二特征,对第一子特征和第二子特征分别进行更新,得到第一子特征对应的第一目标子特征和第二子特征对应的第二目标子特征,每一第二对象与第一对象的相似度不小于第一阈值。The feature diffusion part 53 is configured to use the second network of the first model to update the first sub-feature and the second sub-feature respectively based on the second feature of at least one second object, and obtain the first sub-feature corresponding to the first sub-feature A target sub-feature and a second target sub-feature corresponding to the second sub-feature, the similarity between each second object and the first object is not less than the first threshold.
更新部分54,被配置为基于第一目标子特征和第二目标子特征,确定目标损失值;基于目标损失值,对第一模型的模型参数进行至少一次更新,得到训练后的第一模型。The updating part 54 is configured to determine a target loss value based on the first target sub-feature and the second target sub-feature; based on the target loss value, update the model parameters of the first model at least once to obtain the trained first model.
特征记忆库部分55,被配置为存储至少一个对象的至少一个特征。The feature memory part 55 is configured to store at least one feature of at least one object.
在一些实施方式中,特征记忆库部分55包括第一特征记忆库和第二特征记忆库,第一特征记忆库用于存储至少一个对象的第一子特征,第二特征记忆库用于存储至少一个对象的第一目标子特征。In some embodiments, the feature memory part 55 includes a first feature memory and a second feature memory, the first feature memory is used to store the first sub-feature of at least one object, and the second feature memory is used to store at least The first target subfeature of an object.
FIG. 5B is a schematic diagram of a model training system 500 provided by an embodiment of the present disclosure. As shown in FIG. 5B, the model training system 500 performs augmentation processing on an input first image 501 to obtain a second image 502. After the first image 501 and the second image 502 are input to the occlusion erasing part 52, the first sub-feature f1′ and the second sub-feature f2′ are obtained respectively, and the second feature memory library 552 is updated based on the first sub-feature f1′. After the first sub-feature f1′, the second sub-feature f2′ and at least one feature of at least one other object selected from the second feature memory library 552 are input to the feature diffusion part 53, the first target sub-feature fd1′ and the second target sub-feature fd2′ are obtained respectively. Based on the first target sub-feature fd1′, the first feature memory library 551 and the network parameters in the occlusion erasing part 52 and the feature diffusion part 53 are updated.
在一些实施方式中,增广部分51,还被配置为:基于第一子图像和第二子图像,确定遮挡掩码。In some implementations, the augmentation part 51 is further configured to: determine an occlusion mask based on the first sub-image and the second sub-image.
图5C为本公开实施例提供的一种确定遮挡掩码的示意图,如图5C所示,将第一子图像501和第二子图像502之间进行像素比较操作503,经过像素比较操作503后,对比较结果进行二值化操作504,经过二值化操作504后,得到对应的遮挡掩码505。Fig. 5C is a schematic diagram of determining an occlusion mask provided by an embodiment of the present disclosure. As shown in Fig. 5C, a pixel comparison operation 503 is performed between the first sub-image 501 and the second sub-image 502, and after the pixel comparison operation 503 , perform a binarization operation 504 on the comparison result, and obtain a corresponding occlusion mask 505 after the binarization operation 504 .
在一些实施方式中,第一网络包括第一子网络和第二子网络,遮挡擦除部分52,还被配置为:利用待训练的第一模型的第一子网络,分别对第一子图像和第二子图像进行特征提取,得到第一子图像对应的第三子特征和第二子图像对应的第四子特征;利用第一模型的第二子网络,基于第三子特征确定第一子特征,并基于第四子特征确定第二子特征。In some implementations, the first network includes a first sub-network and a second sub-network, and the occlusion erasing part 52 is further configured to: use the first sub-network of the first model to be trained to respectively perform the first sub-image Perform feature extraction with the second sub-image to obtain the third sub-feature corresponding to the first sub-image and the fourth sub-feature corresponding to the second sub-image; use the second sub-network of the first model to determine the first sub-feature based on the third sub-feature sub-features, and determine the second sub-features based on the fourth sub-features.
图5D为本公开实施例提供的一种第一网络510的示意图,如图5D所示,第一网络510包括第一子网络511和第二子网络512,将第一子图像501和第二子图像502输入至第一子网络511中,得到第一子图像501对应的第三子特征f1,第二子图像502对应的第四子特征f2,将第三子特征f1和第四子特征f2输入至第二子网络512中,得到第一子特征f1′和第二子特征f2′。FIG. 5D is a schematic diagram of a first network 510 provided by an embodiment of the present disclosure. As shown in FIG. 5D , the first network 510 includes a first sub-network 511 and a second sub-network 512. The sub-image 502 is input into the first sub-network 511 to obtain the third sub-feature f1 corresponding to the first sub-image 501, the fourth sub-feature f2 corresponding to the second sub-image 502, and the third sub-feature f1 and the fourth sub-feature f2 is input into the second sub-network 512 to obtain the first sub-feature f1' and the second sub-feature f2'.
在一些实施方式中,第二子网络包括第三子网络和第四子网络,遮挡擦除部分52,还被配置为:利用第一模型的第三子网络,基于第三子特征确定第一遮挡分数,并基于第四子特征确定第二遮挡分数;利用第四子网络,基于第三子特征和第一遮挡分数,确定第一子特征,并基于第四子特征和第二遮挡分数,确定第二子特征。In some embodiments, the second subnetwork includes a third subnetwork and a fourth subnetwork, and the occlusion erasing part 52 is further configured to: use the third subnetwork of the first model to determine the first occlusion score, and determine the second occlusion score based on the fourth sub-feature; utilize the fourth sub-network, based on the third sub-feature and the first occlusion score, determine the first sub-feature, and based on the fourth sub-feature and the second occlusion score, Determine the second sub-feature.
图5E为本公开实施例提供的一种第二子网络512的示意图,如图5E所示,第二子网络512包括第三子网络521和第四子网络522,将第三子特征f1和第四子特征f2输入至第三子网络521中,分别得到第三子特征f1对应的第一遮挡分数s1,和第四特征f2对应的第二遮挡分数s2,将第一遮挡分数s1和第三子特征f1输入至第四子网络522中,得到第一子特征f1′,将第二遮挡分数s2和第四子特征f2输入至第四子网络522中,得到第二子特征f2′。FIG. 5E is a schematic diagram of a second subnetwork 512 provided by an embodiment of the present disclosure. As shown in FIG. 5E, the second subnetwork 512 includes a third subnetwork 521 and a fourth subnetwork 522, and the third subnetwork f1 and The fourth sub-feature f2 is input into the third sub-network 521, and the first occlusion score s1 corresponding to the third sub-feature f1 and the second occlusion score s2 corresponding to the fourth feature f2 are respectively obtained, and the first occlusion score s1 and the second occlusion score The three sub-features f1 are input to the fourth sub-network 522 to obtain the first sub-feature f1', and the second occlusion score s2 and the fourth sub-feature f2 are input to the fourth sub-network 522 to obtain the second sub-feature f2'.
在一些实施方式中,第二网络包括第五子网络和第六子网络,特征扩散部分53,还被配置为:利用第五子网络,将第一子特征和第二子特征分别与至少一个第二对象的第二特征进行聚合,得到第一子特征对应的第一聚合子特征和第二子特征对应的第二聚合子特征;利用第六子网络,基于第一聚合子特征确定第一目标子特征,并基于第二聚合子特征确定第二目标子特征。In some embodiments, the second network includes a fifth sub-network and a sixth sub-network, and the feature diffusion part 53 is further configured to: use the fifth sub-network to combine the first sub-feature and the second sub-feature with at least one The second feature of the second object is aggregated to obtain the first aggregated sub-feature corresponding to the first sub-feature and the second aggregated sub-feature corresponding to the second sub-feature; the sixth sub-network is used to determine the first aggregated sub-feature based on the first aggregated sub-feature target sub-features, and determine second target sub-features based on the second aggregated sub-features.
FIG. 5F is a schematic diagram of a second network 520 provided by an embodiment of the present disclosure. As shown in FIG. 5F, the second network 520 includes a fifth sub-network 521 and a sixth sub-network 522. When the first sub-feature f1′ is input into the fifth sub-network 521, the fifth sub-network 521 searches the second feature memory bank 552, based on the first sub-feature f1′, for the K nearest first centers belonging to second objects. A first prediction feature fq is determined based on the first sub-feature f1′ and the first prediction matrix W1; a second prediction feature fc is determined based on the first centers and the second prediction matrix W2; and a third prediction feature fv is determined based on the first centers and the third prediction matrix W3. A first attention matrix mi is determined based on the first prediction feature fq and the second prediction feature fc, and a first aggregated sub-feature fd is determined based on the first attention matrix mi and the third prediction feature fv. The first aggregated sub-feature fd is input into FFN_1(·) to obtain a fifth feature f″, and the weighted combination of the first sub-feature f1′ and the fifth feature f″ is input into the sixth sub-network 522 to obtain the first target sub-feature fd′.
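The diffusion step in FIG. 5F is a query-key-value attention over the K nearest centers retrieved from the memory bank. The sketch below is a hypothetical reading of that flow, not the exact implementation; the projection sizes, the softmax scaling and the residual weight alpha are assumptions, while W1, W2 and W3 stand for the first, second and third prediction matrices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDiffusion(nn.Module):
    """Illustrative sketch of the fifth sub-network in FIG. 5F (assumed form)."""

    def __init__(self, dim: int):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)  # first prediction matrix  -> fq
        self.W2 = nn.Linear(dim, dim, bias=False)  # second prediction matrix -> fc
        self.W3 = nn.Linear(dim, dim, bias=False)  # third prediction matrix  -> fv
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))  # FFN_1(.)

    def forward(self, f1_prime: torch.Tensor, centers: torch.Tensor, alpha: float = 0.5):
        # f1_prime: (batch, dim) erased sub-feature; centers: (batch, K, dim) K nearest first centers.
        fq = self.W1(f1_prime).unsqueeze(1)                 # (batch, 1, dim) first prediction feature
        fc = self.W2(centers)                               # (batch, K, dim) second prediction feature
        fv = self.W3(centers)                               # (batch, K, dim) third prediction feature
        m = F.softmax(fq @ fc.transpose(1, 2) / fq.size(-1) ** 0.5, dim=-1)  # first attention matrix
        fd = (m @ fv).squeeze(1)                            # first aggregated sub-feature
        f5 = self.ffn(fd)                                   # fifth feature f''
        # Weighted combination of f1' and f'' fed to the sixth sub-network; alpha is an assumed weight.
        return alpha * f1_prime + (1.0 - alpha) * f5
```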
In some implementations, the feature diffusion part 53 is further configured to: determine a first attention matrix based on the first sub-feature and each second feature, where the first attention matrix characterizes the degree of association between the first sub-feature and each second feature; determine the first aggregated sub-feature based on each second feature and each first attention matrix; determine a second attention matrix based on the second sub-feature and each second feature, where the second attention matrix characterizes the degree of association between the second sub-feature and each second feature; and determine the second aggregated sub-feature based on each second feature and each second attention matrix.
In some implementations, the network parameters of the fifth sub-network include a first prediction matrix and a second prediction matrix, and the feature diffusion part 53 is further configured to: determine a first prediction feature based on the first sub-feature and the first prediction matrix; determine a second prediction feature based on each second feature and the second prediction matrix; and determine the first attention matrix based on the first prediction feature and each second prediction feature.
In some implementations, the network parameters of the fifth sub-network include a third prediction matrix, and the feature diffusion part 53 is further configured to: determine a third prediction feature based on each second feature and the third prediction matrix; and determine the first aggregated sub-feature based on each third prediction feature and each first attention matrix.
In some implementations, the sixth sub-network includes a seventh sub-network and an eighth sub-network, and the feature diffusion part 53 is further configured to: use the seventh sub-network to determine a fifth sub-feature based on the first aggregated sub-feature and the occlusion mask, and determine a sixth sub-feature based on the second aggregated sub-feature and the occlusion mask; and use the eighth sub-network to determine the first target sub-feature based on the first sub-feature and the fifth sub-feature, and determine the second target sub-feature based on the second sub-feature and the sixth sub-feature.
In some implementations, the updating part 54 is further configured to: determine a first target loss value based on the first target sub-feature and the second target sub-feature; determine a second target loss value based on the first sub-feature and the second sub-feature; determine the target loss value based on the first target loss value and the second target loss value; and update the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
In some implementations, the updating part 54 is further configured to: when the target loss value does not satisfy a preset condition, update the model parameters of the first model to obtain an updated first model, and determine the trained first model based on the updated first model; and when the target loss value satisfies the preset condition, determine the updated first model as the trained first model.
In some implementations, the updating part 54 is further configured to: determine a first target sub-loss value based on the first sub-feature and the second sub-feature; determine a second target sub-loss value based on the third sub-feature and the fourth sub-feature; and determine the second target loss value based on the first target sub-loss value and the second target sub-loss value.
In some implementations, the first sub-image includes label information, the first model includes a second feature memory bank, and the second feature memory bank includes at least one feature belonging to at least one object. The updating part 54 is further configured to: determine a third loss value based on the first occlusion score, the second occlusion score and the occlusion mask; determine a fourth loss value based on the first sub-feature, the second sub-feature and the label information; determine a fifth loss value based on the first sub-feature, the second sub-feature and the at least one feature of the at least one object in the second feature memory bank; and determine the first target sub-loss value based on the third loss value, the fourth loss value and the fifth loss value.
In some implementations, the updating part 54 is further configured to: determine a first sub-loss value based on the first occlusion score and the occlusion mask; determine a second sub-loss value based on the second occlusion score and the occlusion mask; and determine the third loss value based on the first sub-loss value and the second sub-loss value.
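One natural reading of this third loss value is a regression of the predicted occlusion scores onto the occlusion mask. The mean-squared-error form and the simple sum below are assumptions (the later experiments do refer to an MSE loss tied to the occlusion mask, but the exact formula is not reproduced in this part of the disclosure).

```python
import torch.nn.functional as F

def third_loss(s1, s2, occlusion_mask):
    # s1, s2: predicted occlusion scores of the two branches, shape (batch, parts).
    # occlusion_mask: per-part visibility derived from the first and second sub-images, same shape.
    loss31 = F.mse_loss(s1, occlusion_mask)   # first sub-loss value
    loss32 = F.mse_loss(s2, occlusion_mask)   # second sub-loss value
    return loss31 + loss32                    # third loss value (the sum is an assumed combination)
```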
In some implementations, the updating part 54 is further configured to: determine a third sub-loss value based on the first sub-feature and the label information; determine a fourth sub-loss value based on the second sub-feature and the label information; and determine the fourth loss value based on the third sub-loss value and the fourth sub-loss value.
In some implementations, the updating part 54 is further configured to: determine, from the at least one feature of the at least one object in the second feature memory bank, a third feature center of the first object and a fourth feature center of each of at least one second object; determine a fifth sub-loss value based on the first sub-feature, the third feature center and each fourth feature center; determine a sixth sub-loss value based on the second sub-feature, the third feature center and each fourth feature center; and determine the fifth loss value based on the fifth sub-loss value and the sixth sub-loss value.
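The fifth loss value contrasts an erased sub-feature against identity centers taken from the second feature memory bank. A minimal sketch, assuming an InfoNCE-style form with the temperature τ mentioned in the experimental setup (the disclosure does not spell out the formula here):

```python
import torch
import torch.nn.functional as F

def center_contrastive_loss(f, positive_center, negative_centers, tau: float = 0.05):
    # f: (dim,) erased sub-feature, e.g. f1'; positive_center: (dim,) third feature center of the first object;
    # negative_centers: (num_neg, dim) fourth feature centers of the second objects.
    f = F.normalize(f, dim=0)
    pos = torch.dot(f, F.normalize(positive_center, dim=0)) / tau
    neg = F.normalize(negative_centers, dim=1) @ f / tau            # (num_neg,)
    logits = torch.cat([pos.unsqueeze(0), neg])
    return -F.log_softmax(logits, dim=0)[0]                         # e.g. the fifth sub-loss value
```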
In some implementations, the updating part 54 is further configured to: determine a seventh sub-loss value based on the third sub-feature and the label information; determine an eighth sub-loss value based on the fourth sub-feature and the label information; and determine the second target sub-loss value based on the seventh sub-loss value and the eighth sub-loss value.
FIG. 5G is a schematic diagram of obtaining a target loss value 540 provided by an embodiment of the present disclosure. As shown in FIG. 5G, the target loss value 540 mainly includes loss values from three parts: feature extraction, the occlusion erasing part 52 and the feature diffusion part 53, where:
the loss values of the feature extraction part include:
a seventh loss value Loss7 determined based on the third sub-feature f1 and the label information of the first sub-image 501, and an eighth loss value Loss8 determined based on the fourth sub-feature f2 and the label information of the first sub-image 501;
the loss values of the occlusion erasing part 52 include:
a first sub-loss value Loss31 determined based on the occlusion mask 541 and the first occlusion score s1, and a second sub-loss value Loss32 determined based on the occlusion mask 541 and the second occlusion score s2;
a third sub-loss value Loss41 determined based on the first sub-feature f1′ and the label information of the first sub-image 501, and a fourth sub-loss value Loss42 determined based on the second sub-feature f2′ and the label information of the first sub-image 501;
a fifth sub-loss value Loss51 determined based on the first sub-feature f1′ and the second feature memory bank 552, and a sixth sub-loss value Loss52 determined based on the second sub-feature f2′ and the second feature memory bank 552;
the loss values of the feature diffusion part 53 include:
a ninth sub-loss value Loss11 (corresponding to the first loss value described above) determined based on the first target sub-feature fd1′ and the label information of the first sub-image 501, and a tenth sub-loss value Loss12 (corresponding to the first loss value described above) determined based on the second target sub-feature fd2′ and the label information of the first sub-image 501;
an eleventh sub-loss value Loss21 (corresponding to the second loss value described above) determined based on the first target sub-feature fd1′ and the first feature memory bank 551, and a twelfth sub-loss value Loss22 (corresponding to the second loss value described above) determined based on the second target sub-feature fd2′ and the first feature memory bank 551.
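FIG. 5G enumerates twelve sub-losses across the three stages. A minimal sketch of how they could be combined into the target loss value 540 is given below; the equal weighting is an assumption, since the disclosure only states that the target loss value is determined from the first and second target loss values.

```python
def target_loss(losses: dict):
    # losses holds the sub-losses named in FIG. 5G, e.g. {"Loss7": ..., "Loss31": ..., "Loss11": ...}.
    feature_extraction = losses["Loss7"] + losses["Loss8"]
    occlusion_erasing = (losses["Loss31"] + losses["Loss32"] +
                         losses["Loss41"] + losses["Loss42"] +
                         losses["Loss51"] + losses["Loss52"])
    feature_diffusion = (losses["Loss11"] + losses["Loss12"] +
                         losses["Loss21"] + losses["Loss22"])
    # Equal weighting of the three parts is an assumption; the disclosure does not fix the weights here.
    return feature_extraction + occlusion_erasing + feature_diffusion
```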
In some implementations, the model training system further includes a second determination part and a third determination part. The second determination part is configured to determine an initial second model based on the trained first model; the third determination part is configured to update the model parameters of the second model based on at least one second image sample to obtain a trained second model.
Compared with methods in the related art, the method provided by the embodiments of the present disclosure offers at least the following improvements:
1) In the related art, modeling for pedestrian re-identification (ReID) mainly relies on pose estimation algorithms or human parsing algorithms for auxiliary training. In the embodiments of the present disclosure, occluded pedestrian re-identification is instead modeled with deep learning.
2) In the related art, the modeling of pedestrian re-identification mainly enhances the model's robustness to occlusion through random erasing, focusing on robustness to Non-Pedestrian Occlusions (NPO) while ignoring feature interference from Non-Target Pedestrians (NTP). In the embodiments of the present disclosure, a Feature Erasing and Diffusion Network (FED) is proposed to handle NPO and NTP simultaneously. Specifically, an Occlusion Erasing Module (OEM) eliminates NPO features, aided by an NPO augmentation strategy that simulates NPO on holistic pedestrian images and generates precise occlusion masks. Subsequently, a Feature Diffusion Module (FDM) diffuses pedestrian features with other memorized features to synthesize NTP features in the feature space. By simulating NPO interference at the image level and NTP interference at the feature level, the method greatly improves the model's perception of Target Pedestrians (TP) and mitigates the influence of NPO and NTP.
The method provided by the embodiments of the present disclosure has at least the following beneficial effects: 1) the occlusion information of the image and the features of other pedestrians are fully exploited to simulate non-pedestrian occlusion and non-target pedestrian interference, so that the various influencing factors can be better analyzed jointly and the model's perception of the TP is improved; 2) deep learning makes the pedestrian re-identification results more accurate and improves the accuracy of pedestrian re-identification in real, complex scenes.
To better illustrate the beneficial effects of the embodiments of the present disclosure, experimental data of the method provided by the embodiments of the present disclosure are compared below with those of methods in the related art.
(1) Datasets: Occluded-DukeMTMC (O-Duke), Occluded-REID (O-REID) and Partial-REID (P-REID) are ReID datasets with occlusion; Market-1501 and DukeMTMC-reID are ReID datasets with few occlusions.
(2) Evaluation metrics: to ensure a fair comparison with existing pedestrian ReID methods, all methods are evaluated under the Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP). The CMC curve evaluates the accuracy of person retrieval; mAP is the mean of the average precision over all queries. All experiments are performed in the single-query setting.
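For reference, the two reported metrics can be computed from a ranked gallery as in the sketch below; these are the standard definitions of rank-1 CMC and average precision, not code from the disclosure.

```python
import numpy as np

def rank1_and_ap(ranked_gallery_labels: np.ndarray, query_label: int):
    """ranked_gallery_labels: gallery identity labels sorted by descending similarity to the query."""
    matches = (ranked_gallery_labels == query_label).astype(np.float32)
    rank1 = float(matches[0])                       # CMC at rank 1
    if matches.sum() == 0:
        return rank1, 0.0
    cum_hits = np.cumsum(matches)
    ranks_of_hits = np.flatnonzero(matches) + 1     # 1-based ranks of the correct gallery images
    ap = float((cum_hits[matches == 1] / ranks_of_hits).mean())
    return rank1, ap                                # mAP is the mean of ap over all queries
```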
(3) Initialization of some model parameters: input images are resized to 256×128. The first model is trained end-to-end with a stochastic gradient descent (SGD) optimizer with momentum 0.9 and weight decay 1e-4. The learning rate is initialized to 0.008 with cosine learning-rate decay. For each input branch, the batch size is 64, containing 16 identities with 4 samples per identity. All experiments are run on two RTX 1080Ti GPUs. The temperature τ in the contrastive loss is set to 0.05, and the number of heads in the FDM is set to 0.8.
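The stated optimization setup maps onto a configuration such as the following; PyTorch is assumed (the disclosure does not name a framework), and the placeholder model and the schedule length num_epochs are illustrative only.

```python
import torch

num_epochs = 120                      # assumed schedule length; not specified in this part of the disclosure
model = torch.nn.Linear(768, 702)     # placeholder standing in for the first model (FED)
optimizer = torch.optim.SGD(model.parameters(), lr=0.008,
                            momentum=0.9, weight_decay=1e-4)
# Cosine learning-rate decay starting from the initial learning rate of 0.008.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
# Each input branch uses a batch of 64 images: 16 identities with 4 samples per identity.
```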
For the NPO-augmentation occlusion set, occluders are cropped only from the training data of O-Duke and used to augment all other datasets. This is because Market-1501 contains very few occluded images, while DukeMTMC-reID already contains much occluded data in its training set.
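The NPO augmentation described here amounts to pasting an occluder crop onto a holistic pedestrian image and recording which region was covered. The sketch below is one possible realization; the band-shaped placement, the patch sizing and the row-level mask are assumptions rather than the disclosed strategy.

```python
import random
from PIL import Image

def npo_augment(person_img: Image.Image, occluder_img: Image.Image):
    """Paste an occluder crop (e.g. cut from O-Duke training images) onto a holistic image."""
    w, h = person_img.size
    ow, oh = w, h // 3                       # assumed: the occluder covers roughly one third of the height
    occluder = occluder_img.resize((ow, oh))
    top = random.choice([0, h - oh])         # assumed: occlude either the top or the bottom band
    augmented = person_img.copy()
    augmented.paste(occluder, (0, top))
    mask = [0 if top <= y < top + oh else 1 for y in range(h)]   # 1 = visible row, 0 = occluded row
    return augmented, mask
```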
(4) Experimental results
1) Comparison between the method provided by the embodiments of the present disclosure and existing methods on occluded ReID datasets
Table 1 compares the performance of pedestrian ReID methods on the O-Duke, O-REID and P-REID datasets. Since O-REID and P-REID have no corresponding training sets, the model trained on Market-1501 is used for testing. The pedestrian ReID methods include: the Part-based Convolutional Baseline (PCB), Deep Spatial feature Reconstruction (DSR), High-Order Re-IDentification (HOReID), the Part-Aware Transformer (PAT), Transformer-based Object Re-Identification (TransReID) using a Vision Transformer backbone without the sliding-window setting, and a transformer-based ViT Baseline. The ViT Baseline outperforms TransReID on the O-REID and P-REID datasets because TransReID uses many dataset-specific tokens.
Table 1: Performance comparison of the methods on the O-Duke, O-REID and P-REID datasets
Comparing FED with existing methods, FED achieves the highest Rank-1 and mAP on both the O-Duke and O-REID datasets. In particular, on O-REID it reaches 86.3%/79.3% Rank-1/mAP, surpassing the other methods by at least 4.7%/2.6%. On O-Duke it reaches 68.1%/56.4% Rank-1/mAP, surpassing the other methods by at least 3.6%/0.7%. On P-REID it achieves the highest mAP, 80.5%, exceeding the other methods by 3.9%. FED therefore performs well on occluded ReID datasets.
2) Comparison between the method provided by the embodiments of the present disclosure and existing methods on holistic person ReID datasets
Experiments are also conducted on holistic person ReID datasets, namely Market-1501 and DukeMTMC-reID. When training on DukeMTMC-reID, the MSE loss is not computed, because the training set contains a large number of NPO and accurate occlusion masks cannot be obtained. The results are shown in Table 2. TransReID is used without the sliding-window setting, with an image size of 256×128. TransReID achieves better performance than FED on the holistic person datasets, because TransReID is designed specifically for holistic person ReID and encodes camera information during training. Nevertheless, FED still reaches 84.9% Rank-1 accuracy on DukeMTMC-reID, surpassing the other CNN-based methods and approaching TransReID.
Table 2: Performance comparison of the methods on the Market-1501 and DukeMTMC-reID datasets
3) Effectiveness of FED
Table 3 presents ablation studies of the NPO augmentation strategy (NPO Aug), the OEM and the FDM. Rows 1 to 5 correspond to the baseline, baseline + NPO Aug, baseline + NPO Aug + OEM, baseline + NPO Aug + FDM, and FED, respectively. Model 1 uses ViT as the feature extractor and is optimized with a cross-entropy loss (ID loss) and a triplet loss. Comparing model 1 (baseline) with model 2 (baseline + NPO Aug) shows a large improvement of 4.9% in Rank-1, indicating that the augmented images are realistic and valuable. Comparing model 2 (baseline + NPO Aug) with model 3 (baseline + NPO Aug + OEM) shows that the OEM can further improve the representation by removing latent NPO information. Comparing model 2 (baseline + NPO Aug) with model 4 (baseline + NPO Aug + FDM), the FDM improves Rank-1 and mAP by 1.7% and 2.4%, respectively, which means that optimizing the network with diffused features can greatly improve the model's perception of the TP. Finally, FED achieves the highest accuracy, indicating that each component works both individually and jointly.
Table 3: Effectiveness of FED
4) K-nearest-neighbour analysis of the feature memory bank
Here, the number of retrieved neighbours K in the feature memory bank search operation is analysed. In Table 4, K is set to 2, 4, 6 and 8, and experiments are performed on DukeMTMC-reID, Market-1501 and Occluded-DukeMTMC. The performance on the two holistic person ReID datasets, DukeMTMC-reID and Market-1501, is stable across the different values of K, fluctuating within 0.5%. For Market-1501, NPO and NTP are rare, so the effectiveness of the FDM is not prominent. For DukeMTMC-reID, a large amount of the training data contains NPO and NTP, and the loss constraints allow the network to reach high accuracy. For Occluded-DukeMTMC, since all the training data are holistic pedestrians, introducing the FDM can closely simulate the multi-pedestrian situations in the test set. As K increases, the FDM better preserves the characteristics of the TP while introducing realistic noise.
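The search operation analysed here retrieves, for each erased feature, the K closest identity centers of other objects from the feature memory bank. A minimal sketch, assuming cosine similarity and one stored center per identity:

```python
import torch
import torch.nn.functional as F

def knn_centers(feature: torch.Tensor, memory_centers: torch.Tensor, own_id: int, k: int = 8):
    # feature: (dim,) erased sub-feature; memory_centers: (num_ids, dim) one center per identity.
    sims = F.normalize(memory_centers, dim=1) @ F.normalize(feature, dim=0)   # cosine similarities
    sims[own_id] = float("-inf")              # exclude the target identity itself
    topk = sims.topk(k).indices
    return memory_centers[topk]               # the K nearest centers used by the feature diffusion module
```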
Table 4: K-nearest-neighbour analysis
5) Qualitative analysis of FED
FIG. 5H is a schematic diagram of occlusion scores of pedestrian images provided by an embodiment of the present disclosure. FIG. 5H shows the occlusion scores produced by the OEM for several pedestrian images, including images with NPO and with non-target pedestrians (NTP). As can be seen from FIG. 5H, for images 551 and 552 with vertical object occlusion, the occlusion scores are hardly affected, because a symmetric pedestrian occluded by less than half is not a critical problem for pedestrian ReID. For images 553 and 554 with horizontal occlusion, the OEM can accurately identify the NPO and mark it with a small occlusion score. For the multi-pedestrian images 555 and 556, the OEM marks every stripe as valuable; the subsequent FDM is therefore crucial for improving model performance.
6) Examples of retrieval results using feature and distribution representations
FIG. 5I is a schematic diagram of image retrieval results provided by an embodiment of the present disclosure. As shown in FIG. 5I, retrieval results of TransReID and FED are presented. Images 561 and 562 are object-occluded images; it is clear that FED recognizes NPO better and can accurately retrieve the target pedestrian. Images 563 and 564 are multi-pedestrian images, where FED has a stronger perception of the TP and achieves higher retrieval accuracy.
Based on the above embodiments, an embodiment of the present disclosure provides a model training apparatus. FIG. 6 is a schematic diagram of the composition and structure of a model training apparatus provided by an embodiment of the present disclosure. As shown in FIG. 6, the model training apparatus 60 includes a first acquisition part 61, a feature extraction part 62, a first updating part 63, a first determination part 64 and a second updating part 65.
The first acquisition part 61 is configured to acquire a first image sample containing a first object;
the feature extraction part 62 is configured to perform feature extraction on the first image sample by using a first network of a first model to be trained, to obtain a first feature of the first object;
the first updating part 63 is configured to update the first feature by using a second network of the first model, based on second features of at least one second object, to obtain a first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than a first threshold;
the first determination part 64 is configured to determine a target loss value based on the first target feature;
the second updating part 65 is configured to update the model parameters of the first model at least once based on the target loss value, to obtain the trained first model.
In some implementations, the first image sample includes label information, the first model includes a first feature memory bank, and the first feature memory bank includes at least one feature belonging to at least one object. The first determination part 64 is further configured to: determine a first loss value based on the first target feature and the label information; determine a second loss value based on the first target feature and the at least one feature of the at least one object in the first feature memory bank; and determine the target loss value based on the first loss value and the second loss value.
In some implementations, the first determination part 64 is further configured to: determine, from the at least one feature of the at least one object in the first feature memory bank, a first feature center of the first object and a second feature center of each of at least one second object; and determine the second loss value based on the first target feature, the first feature center and each second feature center.
In some implementations, the first feature memory bank includes feature sets belonging to at least one object, each feature set including at least one feature of the object to which it belongs. The apparatus further includes a third updating part configured to update, based on the first target feature, the feature set belonging to the first object in the first feature memory bank.
In some implementations, the first acquisition part 61 is further configured to acquire a first sub-image and a second sub-image containing the first object, the second sub-image being an image obtained by performing at least occlusion processing on the first sub-image; the feature extraction part 62 is further configured to perform, by using the first network of the first model to be trained, feature extraction on the first sub-image to obtain a first sub-feature of the first object and feature extraction on the second sub-image to obtain a second sub-feature of the first object; the first updating part 63 is further configured to update, by using the second network of the first model and based on the second features of the at least one second object, the first sub-feature and the second sub-feature respectively, to obtain a first target sub-feature corresponding to the first sub-feature and a second target sub-feature corresponding to the second sub-feature; and the first determination part 64 is further configured to determine the target loss value based on the first target sub-feature and the second target sub-feature.
In some implementations, the first determination part 64 is further configured to: determine a first target loss value based on the first target sub-feature and the second target sub-feature; determine a second target loss value based on the first sub-feature and the second sub-feature; and determine the target loss value based on the first target loss value and the second target loss value.
In some implementations, the first acquisition part 61 is further configured to: acquire a first sub-image containing the first object; and perform at least occlusion processing on the first sub-image based on a preset occlusion set to obtain the second sub-image, the occlusion set including at least one occlusion image.
In some implementations, the first network includes a first sub-network and a second sub-network, and the feature extraction part 62 is further configured to: use the first sub-network of the first model to be trained to perform feature extraction on the first sub-image and the second sub-image respectively, to obtain a third sub-feature corresponding to the first sub-image and a fourth sub-feature corresponding to the second sub-image; and use the second sub-network of the first model to determine the first sub-feature based on the third sub-feature, and determine the second sub-feature based on the fourth sub-feature.
In some implementations, the first determination part 64 is further configured to: determine a first target sub-loss value based on the first sub-feature and the second sub-feature; determine a second target sub-loss value based on the third sub-feature and the fourth sub-feature; and determine the second target loss value based on the first target sub-loss value and the second target sub-loss value.
In some implementations, the first sub-image includes label information, and the first determination part 64 is further configured to: determine a seventh sub-loss value based on the third sub-feature and the label information; determine an eighth sub-loss value based on the fourth sub-feature and the label information; and determine the second target sub-loss value based on the seventh sub-loss value and the eighth sub-loss value.
In some implementations, the second sub-network includes a third sub-network and a fourth sub-network, and the feature extraction part 62 is further configured to: use the third sub-network of the first model to determine a first occlusion score based on the third sub-feature, and determine a second occlusion score based on the fourth sub-feature; and use the fourth sub-network to determine the first sub-feature based on the third sub-feature and the first occlusion score, and determine the second sub-feature based on the fourth sub-feature and the second occlusion score.
In some implementations, the third sub-network includes a pooling sub-network and at least one occlusion erasing sub-network, the first occlusion score includes at least one first occlusion sub-score, and the second occlusion score includes at least one second occlusion sub-score. The feature extraction part 62 is further configured to: use the pooling sub-network to divide the third sub-feature into at least one third sub-part feature and divide the fourth sub-feature into at least one fourth sub-part feature; and use each occlusion erasing sub-network to determine each first occlusion sub-score based on each third sub-part feature, and determine each second occlusion sub-score based on each fourth sub-part feature.
In some implementations, the feature extraction part 62 is further configured to: use the fourth sub-network to determine a first sub-part feature based on each third sub-part feature of the third sub-feature and each first occlusion sub-score, and determine a second sub-part feature based on each fourth sub-part feature of the fourth sub-feature and each second occlusion sub-score; and determine the first sub-feature based on each first sub-part feature, and determine the second sub-feature based on each second sub-part feature.
In some implementations, the first sub-image includes label information, the first model includes a second feature memory bank, and the second feature memory bank includes at least one feature belonging to at least one object. The first determination part 64 is further configured to: determine an occlusion mask based on the first sub-image and the second sub-image; determine a third loss value based on the first occlusion score, the second occlusion score and the occlusion mask; determine a fourth loss value based on the first sub-feature, the second sub-feature and the label information; determine a fifth loss value based on the first sub-feature, the second sub-feature and the at least one feature of the at least one object in the second feature memory bank; and determine the first target sub-loss value based on the third loss value, the fourth loss value and the fifth loss value.
In some implementations, the first determination part 64 is further configured to: divide the first sub-image and the second sub-image into at least one first sub-part image and at least one second sub-part image respectively; determine an occlusion sub-mask based on each first sub-part image and each second sub-part image; and determine the occlusion mask based on each occlusion sub-mask.
In some implementations, the first determination part 64 is further configured to: determine a first sub-loss value based on the first occlusion score and the occlusion mask; determine a second sub-loss value based on the second occlusion score and the occlusion mask; and determine the third loss value based on the first sub-loss value and the second sub-loss value.
In some implementations, the first determination part 64 is further configured to: determine a third sub-loss value based on the first sub-feature and the label information; determine a fourth sub-loss value based on the second sub-feature and the label information; and determine the fourth loss value based on the third sub-loss value and the fourth sub-loss value.
In some implementations, the first determination part 64 is further configured to: determine, from the at least one feature of the at least one object in the second feature memory bank, a third feature center of the first object and a fourth feature center of each of at least one second object; determine a fifth sub-loss value based on the first sub-feature, the third feature center and each fourth feature center; determine a sixth sub-loss value based on the second sub-feature, the third feature center and each fourth feature center; and determine the fifth loss value based on the fifth sub-loss value and the sixth sub-loss value.
In some implementations, the second network includes a fifth sub-network and a sixth sub-network, and the first updating part 63 is further configured to: use the fifth sub-network to aggregate the first sub-feature and the second sub-feature respectively with the second features of the at least one second object, to obtain a first aggregated sub-feature corresponding to the first sub-feature and a second aggregated sub-feature corresponding to the second sub-feature; and use the sixth sub-network to determine the first target sub-feature based on the first aggregated sub-feature, and determine the second target sub-feature based on the second aggregated sub-feature.
In some implementations, the first updating part 63 is further configured to: determine a first attention matrix based on the first sub-feature and each second feature, where the first attention matrix characterizes the degree of association between the first sub-feature and each second feature; determine the first aggregated sub-feature based on each second feature and each first attention matrix; determine a second attention matrix based on the second sub-feature and each second feature, where the second attention matrix characterizes the degree of association between the second sub-feature and each second feature; and determine the second aggregated sub-feature based on each second feature and each second attention matrix.
In some implementations, the sixth sub-network includes a seventh sub-network and an eighth sub-network, and the first updating part 63 is further configured to: use the seventh sub-network to determine a fifth sub-feature based on the first aggregated sub-feature and the occlusion mask, and determine a sixth sub-feature based on the second aggregated sub-feature and the occlusion mask; and use the eighth sub-network to determine the first target sub-feature based on the first sub-feature and the fifth sub-feature, and determine the second target sub-feature based on the second sub-feature and the sixth sub-feature.
Based on the above embodiments, an embodiment of the present disclosure provides an image recognition apparatus. FIG. 7 is a schematic diagram of the composition and structure of an image recognition apparatus provided by an embodiment of the present disclosure. As shown in FIG. 7, the image recognition apparatus 70 includes a second acquisition part 71 and a recognition part 72.
The second acquisition part 71 is configured to acquire a first image and a second image;
the recognition part 72 is configured to recognize an object in the first image and an object in the second image by using a trained target model to obtain a recognition result, where the trained target model includes the first model obtained by the above model training method, and the recognition result indicates whether the object in the first image and the object in the second image are the same object or different objects.
The description of the above apparatus embodiments is similar to the description of the above method embodiments, and has beneficial effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present disclosure, reference may be made to the description of the method embodiments of the present disclosure.
In the embodiments of the present disclosure and other embodiments, a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may also be a unit, and may be modular or non-modular.
It should be noted that, in the embodiments of the present disclosure, if the above method is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the part contributing to the related art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disk. In this way, the embodiments of the present disclosure are not limited to any specific combination of hardware and software.
An embodiment of the present disclosure provides an electronic device, including a memory and a processor. The memory stores a computer program executable on the processor, and the processor implements the above method when executing the computer program.
An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the above method is implemented. The computer-readable storage medium may be transitory or non-transitory.
An embodiment of the present disclosure provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program; when the computer program is read and executed by a computer, part or all of the steps of the above method are implemented. The computer program product may be implemented by hardware, software or a combination thereof. In one embodiment, the computer program product is embodied as a computer storage medium; in another embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
It should be noted that FIG. 8 is a schematic diagram of a hardware entity of an electronic device in an embodiment of the present disclosure. As shown in FIG. 8, the hardware entity of the electronic device 800 includes a processor 801, a communication interface 802 and a memory 803, where:
the processor 801 generally controls the overall operation of the electronic device 800; the communication interface 802 enables the electronic device to communicate with other terminals or servers through a network; and the memory 803 is configured to store instructions and applications executable by the processor 801, and may also cache data to be processed or already processed by the processor 801 and the modules of the electronic device 800 (for example, image data, audio data, voice communication data and video communication data), and may be implemented by a flash memory (FLASH) or a random access memory (RAM). Data may be transferred among the processor 801, the communication interface 802 and the memory 803 through a bus 804.
It should be pointed out here that the description of the above storage medium and device embodiments is similar to the description of the above method embodiments, and has beneficial effects similar to those of the method embodiments. For technical details not disclosed in the storage medium and device embodiments of the present disclosure, reference may be made to the description of the method embodiments of the present disclosure.
It should be understood that reference throughout the specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present disclosure, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure. The serial numbers of the above embodiments of the present disclosure are for description only and do not represent the superiority or inferiority of the embodiments. It should be noted that, herein, the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus including that element.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are illustrative. For example, the division of the units is a division by logical function, and there may be other division methods in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms. The units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments. In addition, all functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may serve as a single unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of hardware plus a software functional unit.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium, and when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk or an optical disk. Alternatively, if the above integrated unit of the present disclosure is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present disclosure, in essence or in the part contributing to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk or an optical disk.
The above are embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in the present disclosure, and such changes or substitutions shall fall within the protection scope of the present disclosure.
Industrial Applicability
Embodiments of the present disclosure provide a model training and image recognition method and apparatus, a device, a storage medium and a computer program product. The model training method includes: acquiring a first image sample containing a first object; performing feature extraction on the first image sample by using a first network of a first model to be trained, to obtain a first feature of the first object; updating the first feature by using a second network of the first model, based on second features of at least one second object, to obtain a first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than a first threshold; determining a target loss value based on the first target feature; and updating the model parameters of the first model at least once based on the target loss value, to obtain the trained first model. With the above solution, on the one hand, the robustness and performance of the first model can be enhanced; on the other hand, the consistency of the trained first model's predictions for different image samples of the same object can be improved, so that the trained first model can more accurately re-identify objects in images containing multiple objects.

Claims (27)

  1. A model training method, the method comprising:
    acquiring a first image sample containing a first object;
    performing feature extraction on the first image sample by using a first network of a first model to be trained, to obtain a first feature of the first object;
    updating the first feature by using a second network of the first model based on a second feature of at least one second object, to obtain a first target feature corresponding to the first feature, wherein a similarity between each second object and the first object is not less than a first threshold;
    determining a target loss value based on the first target feature; and
    updating model parameters of the first model at least once based on the target loss value, to obtain a trained first model.
  2. The method according to claim 1, wherein the first image sample comprises label information, the first model comprises a first feature memory bank, and the first feature memory bank comprises at least one feature belonging to at least one object; and determining the target loss value based on the first target feature comprises:
    determining a first loss value based on the first target feature and the label information;
    determining a second loss value based on the first target feature and the at least one feature of the at least one object in the first feature memory bank; and
    determining the target loss value based on the first loss value and the second loss value.
  3. The method according to claim 2, wherein determining the second loss value based on the first target feature and the at least one feature of the at least one object in the first feature memory bank comprises:
    determining, from the at least one feature of the at least one object in the first feature memory bank, a first feature center of the first object and a second feature center of the at least one second object; and
    determining the second loss value based on the first target feature, the first feature center and each second feature center.
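As an illustration of the feature-center loss in claim 3, the sketch below assumes each center is the mean of that object's features stored in the memory bank and that the second loss is a softmax cross-entropy over cosine similarities to the centers; the temperature value and the cosine metric are assumptions, not details from the claims.

```python
import torch
import torch.nn.functional as F

def center_loss(target_feature, own_feats, similar_feats_per_object, temperature=0.05):
    """target_feature: (D,) first target feature.
    own_feats: (N, D) features of the first object stored in the memory bank.
    similar_feats_per_object: list of (M_i, D) tensors, one per similar second object."""
    first_center = own_feats.mean(dim=0)                       # first feature center
    second_centers = [f.mean(dim=0) for f in similar_feats_per_object]
    centers = torch.stack([first_center] + second_centers)     # (1 + K, D)
    sims = F.cosine_similarity(target_feature.unsqueeze(0), centers, dim=1) / temperature
    # Index 0 is the center of the object the target feature belongs to.
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```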
  4. The method according to claim 2 or 3, wherein the first feature memory bank comprises feature sets belonging to at least one object, each feature set comprising at least one feature of the object to which it belongs; and the method further comprises:
    updating, based on the first target feature, the feature set belonging to the first object in the first feature memory bank.
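Claim 4 leaves the update rule open; a common choice for such feature memory banks is an exponential-moving-average (momentum) update, sketched below with an assumed momentum coefficient and dictionary layout.

```python
import torch
import torch.nn.functional as F

def update_memory_bank(memory, object_id, target_feature, momentum=0.9):
    """memory: dict mapping object id -> (N, D) tensor of stored features."""
    stored = memory[object_id]
    # Blend every stored feature of this object toward the new target feature.
    updated = momentum * stored + (1.0 - momentum) * target_feature.unsqueeze(0)
    memory[object_id] = F.normalize(updated, dim=1)
    return memory
```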
  5. The method according to any one of claims 1 to 4, wherein
    acquiring the first image sample containing the first object comprises: acquiring a first sub-image and a second sub-image containing the first object, the second sub-image being an image obtained by performing at least occlusion processing on the first sub-image;
    performing feature extraction on the first image sample by using the first network of the first model to be trained to obtain the first feature of the first object comprises: performing feature extraction on the first sub-image by using the first network of the first model to be trained to obtain a first sub-feature of the first object, and performing feature extraction on the second sub-image to obtain a second sub-feature of the first object;
    updating the first feature by using the second network of the first model based on the second feature of the at least one second object to obtain the first target feature corresponding to the first feature comprises: updating the first sub-feature and the second sub-feature respectively by using the second network of the first model based on the second feature of the at least one second object, to obtain a first target sub-feature corresponding to the first sub-feature and a second target sub-feature corresponding to the second sub-feature; and
    determining the target loss value based on the first target feature comprises: determining the target loss value based on the first target sub-feature and the second target sub-feature.
  6. The method according to claim 5, wherein determining the target loss value based on the first target sub-feature and the second target sub-feature comprises:
    determining a first target loss value based on the first target sub-feature and the second target sub-feature;
    determining a second target loss value based on the first sub-feature and the second sub-feature; and
    determining the target loss value based on the first target loss value and the second target loss value.
  7. The method according to claim 6, wherein acquiring the first sub-image and the second sub-image containing the first object comprises:
    acquiring the first sub-image containing the first object; and
    performing at least occlusion processing on the first sub-image based on a preset occlusion set, to obtain the second sub-image, the occlusion set comprising at least one occlusion image.
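A minimal sketch of the occlusion processing in claim 7: sample an occlusion image from the preset occlusion set, resize it, and paste it over a random region of the first sub-image. The patch size and placement policy are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def occlude(first_sub_image, occlusion_set, patch_frac=0.4):
    """first_sub_image: (3, H, W) tensor; occlusion_set: list of (3, h, w) occluder tensors."""
    _, H, W = first_sub_image.shape
    ph, pw = int(H * patch_frac), int(W * patch_frac)
    occluder = random.choice(occlusion_set)
    patch = F.interpolate(occluder.unsqueeze(0), size=(ph, pw),
                          mode="bilinear", align_corners=False).squeeze(0)
    top, left = random.randint(0, H - ph), random.randint(0, W - pw)
    second_sub_image = first_sub_image.clone()
    second_sub_image[:, top:top + ph, left:left + pw] = patch   # paste the occluder
    return second_sub_image
```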
  8. The method according to claim 6 or 7, wherein the first network comprises a first sub-network and a second sub-network; and performing feature extraction on the first sub-image by using the first network of the first model to be trained to obtain the first sub-feature of the first object, and performing feature extraction on the second sub-image to obtain the second sub-feature of the first object, comprises:
    performing feature extraction on the first sub-image and the second sub-image respectively by using the first sub-network of the first model to be trained, to obtain a third sub-feature corresponding to the first sub-image and a fourth sub-feature corresponding to the second sub-image; and
    determining, by using the second sub-network of the first model, the first sub-feature based on the third sub-feature, and the second sub-feature based on the fourth sub-feature.
  9. The method according to claim 8, wherein determining the second target loss value based on the first sub-feature and the second sub-feature comprises:
    determining a first target sub-loss value based on the first sub-feature and the second sub-feature;
    determining a second target sub-loss value based on the third sub-feature and the fourth sub-feature; and
    determining the second target loss value based on the first target sub-loss value and the second target sub-loss value.
  10. The method according to claim 9, wherein the first sub-image comprises label information; and determining the second target sub-loss value based on the third sub-feature and the fourth sub-feature comprises:
    determining a seventh sub-loss value based on the third sub-feature and the label information;
    determining an eighth sub-loss value based on the fourth sub-feature and the label information; and
    determining the second target sub-loss value based on the seventh sub-loss value and the eighth sub-loss value.
  11. The method according to any one of claims 8 to 10, wherein the second sub-network comprises a third sub-network and a fourth sub-network; and determining, by using the second sub-network of the first model, the first sub-feature based on the third sub-feature and the second sub-feature based on the fourth sub-feature comprises:
    determining, by using the third sub-network of the first model, a first occlusion score based on the third sub-feature, and a second occlusion score based on the fourth sub-feature; and
    determining, by using the fourth sub-network, the first sub-feature based on the third sub-feature and the first occlusion score, and the second sub-feature based on the fourth sub-feature and the second occlusion score.
  12. The method according to claim 11, wherein the third sub-network comprises a pooling sub-network and at least one occlusion erasure sub-network, the first occlusion score comprises at least one first occlusion sub-score, and the second occlusion score comprises at least one second occlusion sub-score; and determining, by using the third sub-network of the first model, the first occlusion score based on the third sub-feature and the second occlusion score based on the fourth sub-feature comprises:
    dividing, by using the pooling sub-network, the third sub-feature into at least one third sub-part feature, and the fourth sub-feature into at least one fourth sub-part feature; and
    determining, by using each occlusion erasure sub-network, each first occlusion sub-score based on each third sub-part feature, and each second occlusion sub-score based on each fourth sub-part feature.
  13. The method according to claim 12, wherein determining, by using the fourth sub-network, the first sub-feature based on the third sub-feature and the first occlusion score, and the second sub-feature based on the fourth sub-feature and the second occlusion score, comprises:
    determining, by using the fourth sub-network, a first sub-part feature based on each third sub-part feature of the third sub-feature and each first occlusion sub-score, and a second sub-part feature based on each fourth sub-part feature of the fourth sub-feature and each second occlusion sub-score; and
    determining the first sub-feature based on each first sub-part feature, and the second sub-feature based on each second sub-part feature.
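Claims 12 and 13 can be read as part-level pooling followed by occlusion-aware weighting. The sketch below splits a feature map into horizontal part features, scores each part with a small occlusion-erasure head, and combines the parts weighted by their scores; the horizontal split, the sigmoid head and the weighted sum are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PartOcclusionHead(nn.Module):
    def __init__(self, channels=256, num_parts=4):
        super().__init__()
        self.num_parts = num_parts
        # Pooling sub-network: one pooled feature per horizontal part.
        self.part_pool = nn.AdaptiveAvgPool2d((num_parts, 1))
        # One occlusion-erasure sub-network per part, each predicting a visibility score.
        self.erase_heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid()) for _ in range(num_parts)])

    def forward(self, feat_map):
        """feat_map: (B, C, H, W) sub-feature; returns (B, C) feature and (B, P) occlusion scores."""
        parts = self.part_pool(feat_map).squeeze(-1).transpose(1, 2)            # (B, P, C)
        scores = torch.cat(
            [head(parts[:, i]) for i, head in enumerate(self.erase_heads)], dim=1)  # (B, P)
        weighted = parts * scores.unsqueeze(-1)   # occlusion-weighted part features
        return weighted.sum(dim=1), scores
```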
  14. The method according to any one of claims 11 to 13, wherein the first sub-image comprises label information, the first model comprises a second feature memory bank, and the second feature memory bank comprises at least one feature belonging to at least one object; and determining the first target sub-loss value based on the first sub-feature and the second sub-feature comprises:
    determining an occlusion mask based on the first sub-image and the second sub-image;
    determining a third loss value based on the first occlusion score, the second occlusion score and the occlusion mask;
    determining a fourth loss value based on the first sub-feature, the second sub-feature and the label information;
    determining a fifth loss value based on the first sub-feature, the second sub-feature and the at least one feature of the at least one object in the second feature memory bank; and
    determining the first target sub-loss value based on the third loss value, the fourth loss value and the fifth loss value.
  15. The method according to claim 14, wherein determining the occlusion mask based on the first sub-image and the second sub-image comprises:
    dividing the first sub-image and the second sub-image into at least one first sub-part image and at least one second sub-part image, respectively;
    determining an occlusion sub-mask based on each first sub-part image and each second sub-part image; and
    determining the occlusion mask based on each occlusion sub-mask.
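A minimal sketch of the part-wise occlusion mask in claim 15, assuming the sub-images are split into horizontal strips and a strip is marked occluded when it differs from the corresponding strip of the unoccluded first sub-image; the strip count and the difference threshold are assumptions.

```python
import torch

def occlusion_mask(first_sub_image, second_sub_image, num_parts=4, threshold=1e-3):
    """Both images: (3, H, W); returns a (num_parts,) mask, 1 = visible, 0 = occluded."""
    first_parts = torch.chunk(first_sub_image, num_parts, dim=1)    # split along height
    second_parts = torch.chunk(second_sub_image, num_parts, dim=1)
    sub_masks = [float((a - b).abs().mean().item() < threshold)     # occlusion sub-masks
                 for a, b in zip(first_parts, second_parts)]
    return torch.tensor(sub_masks)
```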
  16. The method according to claim 14 or 15, wherein determining the third loss value based on the first occlusion score, the second occlusion score and the occlusion mask comprises:
    determining a first sub-loss value based on the first occlusion score and the occlusion mask;
    determining a second sub-loss value based on the second occlusion score and the occlusion mask; and
    determining the third loss value based on the first sub-loss value and the second sub-loss value.
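One plausible instantiation of claim 16 treats each occlusion score as a per-part visibility probability and supervises it against the occlusion mask with binary cross-entropy, summing the two sub-loss values; both the BCE form and the plain sum are assumptions, not details from the claims.

```python
import torch.nn.functional as F

def third_loss(first_scores, second_scores, mask):
    """first_scores, second_scores: (P,) per-part visibility predictions in [0, 1];
    mask: (P,) 0/1 occlusion mask used as the supervision target."""
    first_sub_loss = F.binary_cross_entropy(first_scores, mask)
    second_sub_loss = F.binary_cross_entropy(second_scores, mask)
    return first_sub_loss + second_sub_loss   # third loss value
```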
  17. The method according to any one of claims 14 to 16, wherein determining the fourth loss value based on the first sub-feature, the second sub-feature and the label information comprises:
    determining a third sub-loss value based on the first sub-feature and the label information;
    determining a fourth sub-loss value based on the second sub-feature and the label information; and
    determining the fourth loss value based on the third sub-loss value and the fourth sub-loss value.
  18. The method according to any one of claims 14 to 17, wherein determining the fifth loss value based on the first sub-feature, the second sub-feature and the at least one feature of the at least one object in the second feature memory bank comprises:
    determining, from the at least one feature of the at least one object in the second feature memory bank, a third feature center of the first object and a fourth feature center of the at least one second object;
    determining a fifth sub-loss value based on the first sub-feature, the third feature center and each fourth feature center;
    determining a sixth sub-loss value based on the second sub-feature, the third feature center and each fourth feature center; and
    determining the fifth loss value based on the fifth sub-loss value and the sixth sub-loss value.
  19. The method according to any one of claims 14 to 18, wherein the second network comprises a fifth sub-network and a sixth sub-network; and updating the first sub-feature and the second sub-feature respectively by using the second network of the first model based on the second feature of the at least one second object, to obtain the first target sub-feature corresponding to the first sub-feature and the second target sub-feature corresponding to the second sub-feature, comprises:
    aggregating, by using the fifth sub-network, the first sub-feature and the second sub-feature respectively with the second feature of the at least one second object, to obtain a first aggregated sub-feature corresponding to the first sub-feature and a second aggregated sub-feature corresponding to the second sub-feature; and
    determining, by using the sixth sub-network, the first target sub-feature based on the first aggregated sub-feature, and the second target sub-feature based on the second aggregated sub-feature.
  20. The method according to claim 19, wherein aggregating, by using the fifth sub-network, the first sub-feature and the second sub-feature respectively with the second feature of the at least one second object, to obtain the first aggregated sub-feature corresponding to the first sub-feature and the second aggregated sub-feature corresponding to the second sub-feature, comprises:
    determining a first attention matrix based on the first sub-feature and each second feature, the first attention matrix being used to characterize the degree of association between the first sub-feature and each second feature;
    determining the first aggregated sub-feature based on each second feature and each first attention matrix;
    determining a second attention matrix based on the second sub-feature and each second feature, the second attention matrix being used to characterize the degree of association between the second sub-feature and each second feature; and
    determining the second aggregated sub-feature based on each second feature and each second attention matrix.
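The aggregation in claim 20 can be illustrated with scaled dot-product attention: the attention matrix comes from the similarity between the query sub-feature and each second feature, and the aggregated sub-feature is the attention-weighted sum of the second features. The scaled dot-product form is an assumption made for illustration.

```python
import math
import torch

def aggregate(sub_feature, second_features):
    """sub_feature: (D,) query; second_features: (K, D) features of similar second objects."""
    d = sub_feature.shape[-1]
    # Attention matrix: association between the sub-feature and each second feature.
    attn = torch.softmax(second_features @ sub_feature / math.sqrt(d), dim=0)   # (K,)
    # Aggregated sub-feature: attention-weighted sum of the second features.
    return attn @ second_features                                               # (D,)

# first_aggregated = aggregate(first_sub_feature, second_features)
# second_aggregated = aggregate(second_sub_feature, second_features)
```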
  21. The method according to claim 19 or 20, wherein the sixth sub-network comprises a seventh sub-network and an eighth sub-network; and determining, by using the sixth sub-network, the first target sub-feature based on the first aggregated sub-feature and the second target sub-feature based on the second aggregated sub-feature comprises:
    determining, by using the seventh sub-network, a fifth sub-feature based on the first aggregated sub-feature and the occlusion mask, and a sixth sub-feature based on the second aggregated sub-feature and the occlusion mask; and
    determining, by using the eighth sub-network, the first target sub-feature based on the first sub-feature and the fifth sub-feature, and the second target sub-feature based on the second sub-feature and the sixth sub-feature.
  22. An image recognition method, the method comprising:
    acquiring a first image and a second image; and
    recognizing an object in the first image and an object in the second image by using a trained target model to obtain a recognition result, wherein the trained target model comprises a first model obtained by the model training method according to any one of claims 1 to 21, and the recognition result indicates whether the object in the first image and the object in the second image are the same object or different objects.
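A minimal sketch of the recognition step in claim 22, assuming the target model maps an image to a feature vector and that the recognition result is obtained by thresholding the cosine similarity between the two features; the threshold value and the cosine metric are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognize(target_model, first_image, second_image, threshold=0.5):
    """Images: (3, H, W) tensors; returns True if both images show the same object."""
    feats = target_model(torch.stack([first_image, second_image]))   # (2, D)
    similarity = F.cosine_similarity(feats[0:1], feats[1:2]).item()
    return similarity >= threshold   # recognition result: same object or not
```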
  23. A model training apparatus, the apparatus comprising:
    a first acquisition part configured to acquire a first image sample containing a first object;
    a feature extraction part configured to perform feature extraction on the first image sample by using a first network of a first model to be trained, to obtain a first feature of the first object;
    a first updating part configured to update the first feature by using a second network of the first model based on a second feature of at least one second object, to obtain a first target feature corresponding to the first feature, wherein a similarity between each second object and the first object is not less than a first threshold;
    a first determining part configured to determine a target loss value based on the first target feature; and
    a second updating part configured to update model parameters of the first model at least once based on the target loss value, to obtain a trained first model.
  24. An image recognition apparatus, the apparatus comprising:
    a second acquisition part configured to acquire a first image and a second image; and
    a recognition part configured to recognize an object in the first image and an object in the second image by using a trained target model to obtain a recognition result, wherein the trained target model comprises a first model obtained by the model training method according to any one of claims 1 to 21, and the recognition result indicates whether the object in the first image and the object in the second image are the same object or different objects.
  25. An electronic device, comprising a processor and a memory, the memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 22.
  26. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 22.
  27. A computer program product comprising a computer program or instructions, wherein the computer program or instructions, when run on an electronic device, cause the electronic device to execute the method according to any one of claims 1 to 22.
PCT/CN2022/127109 2022-01-28 2022-10-24 Model training and image recognition methods and apparatuses, device, storage medium and computer program product WO2023142551A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210107742.9A CN114445681A (en) 2022-01-28 2022-01-28 Model training and image recognition method and device, equipment and storage medium
CN202210107742.9 2022-01-28

Publications (1)

Publication Number Publication Date
WO2023142551A1 true WO2023142551A1 (en) 2023-08-03

Family

ID=81371764

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/127109 WO2023142551A1 (en) 2022-01-28 2022-10-24 Model training and image recognition methods and apparatuses, device, storage medium and computer program product

Country Status (2)

Country Link
CN (1) CN114445681A (en)
WO (1) WO2023142551A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445681A (en) * 2022-01-28 2022-05-06 上海商汤智能科技有限公司 Model training and image recognition method and device, equipment and storage medium
CN115022282B (en) * 2022-06-06 2023-07-21 天津大学 Novel domain name generation model establishment and application
CN115393953B (en) * 2022-07-28 2023-08-08 深圳职业技术学院 Pedestrian re-recognition method, device and equipment based on heterogeneous network feature interaction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329785A (en) * 2020-11-25 2021-02-05 Oppo广东移动通信有限公司 Image management method, device, terminal and storage medium
CN113421192A (en) * 2021-08-24 2021-09-21 北京金山云网络技术有限公司 Training method of object statistical model, and statistical method and device of target object
CN113780243A (en) * 2021-09-29 2021-12-10 平安科技(深圳)有限公司 Training method, device and equipment of pedestrian image recognition model and storage medium
CN114445681A (en) * 2022-01-28 2022-05-06 上海商汤智能科技有限公司 Model training and image recognition method and device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117372818A (en) * 2023-12-06 2024-01-09 深圳须弥云图空间科技有限公司 Target re-identification method and device
CN117372818B (en) * 2023-12-06 2024-04-12 深圳须弥云图空间科技有限公司 Target re-identification method and device

Also Published As

Publication number Publication date
CN114445681A (en) 2022-05-06

Similar Documents

Publication Publication Date Title
WO2023142551A1 (en) Model training and image recognition methods and apparatuses, device, storage medium and computer program product
Cheng et al. Low-resolution face recognition
WO2021077984A1 (en) Object recognition method and apparatus, electronic device, and readable storage medium
CN110163115B (en) Video processing method, device and computer readable storage medium
Su et al. Multi-type attributes driven multi-camera person re-identification
CN103069415B (en) Computer-implemented method, computer program and computer system for image procossing
CN102549603B (en) Relevance-based image selection
Bianco et al. Predicting image aesthetics with deep learning
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
US20230087863A1 (en) De-centralised learning for re-indentification
CN107003977A (en) System, method and apparatus for organizing the photo of storage on a mobile computing device
Dai et al. Cross-view semantic projection learning for person re-identification
Douze et al. The 2021 image similarity dataset and challenge
CN112508094A (en) Junk picture identification method, device and equipment
WO2020224221A1 (en) Tracking method and apparatus, electronic device, and storage medium
CN110516707B (en) Image labeling method and device and storage medium thereof
US20130343618A1 (en) Searching for Events by Attendants
Wieschollek et al. Transfer learning for material classification using convolutional networks
Ma et al. Low illumination person re-identification
Zhang et al. Multi-level and multi-scale horizontal pooling network for person re-identification
CN111666976A (en) Feature fusion method and device based on attribute information and storage medium
Guehairia et al. Deep random forest for facial age estimation based on face images
Deng et al. A deep multi-feature distance metric learning method for pedestrian re-identification
Zhang et al. Complementary networks for person re-identification
Islam et al. Large-scale geo-facial image analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22923349

Country of ref document: EP

Kind code of ref document: A1