CN114445681A - Model training and image recognition method and device, equipment and storage medium - Google Patents

Model training and image recognition method and device, equipment and storage medium

Info

Publication number
CN114445681A
CN114445681A (application number CN202210107742.9A)
Authority
CN
China
Prior art keywords
sub
feature
image
target
determining
Prior art date
Legal status
Pending
Application number
CN202210107742.9A
Other languages
Chinese (zh)
Inventor
唐诗翔
朱烽
赵瑞
Current Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202210107742.9A priority Critical patent/CN114445681A/en
Publication of CN114445681A publication Critical patent/CN114445681A/en
Priority to PCT/CN2022/127109 priority patent/WO2023142551A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the disclosure discloses a method, a device, equipment and a storage medium for model training and image recognition, wherein the method for model training comprises the following steps: obtaining a first image sample containing a first object; performing feature extraction on the first image sample by using a first network of a first model to be trained to obtain a first feature of the first object; updating the first feature by using a second network of the first model based on a second feature of at least one second object to obtain a first target feature corresponding to the first feature, wherein the similarity between each second object and the first object is not less than a first threshold value; determining a target loss value based on the first target feature; and updating the model parameters of the first model at least once based on the target loss value to obtain the trained first model.

Description

Model training and image recognition method and device, equipment and storage medium
Technical Field
The present disclosure relates to, but not limited to, the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for model training and image recognition.
Background
Object re-identification (also called object re-ID) is a technique that uses computer vision to determine whether a specific object exists in an image or video sequence. Object re-identification is widely regarded as a sub-problem of image retrieval, i.e., given an image containing an object, retrieving images containing that object across devices. Factors such as differences between devices, shooting angles and environments can affect the result of object re-identification.
Disclosure of Invention
The embodiment of the disclosure provides a model training and image recognition method, a model training and image recognition device, equipment and a storage medium.
The technical scheme of the embodiment of the disclosure is realized as follows:
the embodiment of the disclosure provides a model training method, which comprises the following steps:
obtaining a first image sample containing a first object;
performing feature extraction on the first image sample by using a first network of a first model to be trained to obtain a first feature of the first object;
updating the first feature by using a second network of the first model based on a second feature of at least one second object to obtain a first target feature corresponding to the first feature, wherein the similarity between each second object and the first object is not less than a first threshold value;
determining a target loss value based on the first target feature;
and updating the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
The embodiment of the disclosure provides an image identification method, which comprises the following steps:
acquiring a first image and a second image;
recognizing the object in the first image and the object in the second image by using a trained target model to obtain a recognition result, wherein the trained target model comprises: a first model obtained by adopting the model training method; the recognition result indicates whether the object in the first image and the object in the second image are the same object or different objects.
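As an illustration of this recognition step, a minimal inference sketch is given below. The model object, its input/output shapes, the cosine-similarity measure and the decision threshold are assumptions made for the example and are not fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def recognize(model: torch.nn.Module,
              first_image: torch.Tensor,
              second_image: torch.Tensor,
              same_object_threshold: float = 0.6) -> bool:
    """Decide whether the objects in two images are the same object.

    `model` is assumed to be a trained first model mapping an image tensor of
    shape (1, 3, H, W) to a feature vector; the threshold is illustrative only.
    """
    model.eval()
    with torch.no_grad():
        feat_a = F.normalize(model(first_image), dim=-1)
        feat_b = F.normalize(model(second_image), dim=-1)
    # Cosine similarity between the two extracted features.
    similarity = (feat_a * feat_b).sum(dim=-1).item()
    return similarity >= same_object_threshold
```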
The embodiment of the present disclosure provides a model training device, which includes:
a first acquisition module for acquiring a first image sample containing a first object;
the characteristic extraction module is used for extracting the characteristic of the first image sample by utilizing a first network of a first model to be trained to obtain a first characteristic of the first object;
a first updating module, configured to update, by using a second network of the first model, the first features based on second features of at least one second object, respectively, to obtain first target features corresponding to the first features, where a similarity between each second object and the first object is not less than a first threshold;
a first determination module to determine a target loss value based on the first target feature;
and the second updating module is used for updating the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
An embodiment of the present disclosure provides an image recognition apparatus, including:
the second acquisition module is used for acquiring the first image and the second image;
a recognition module, configured to recognize, by using a trained target model, an object in the first image and an object in the second image to obtain a recognition result, where the trained target model includes: a first model obtained by adopting the model training method; the recognition result indicates whether the object in the first image and the object in the second image are the same object or different objects.
An embodiment of the present disclosure provides an electronic device, including a processor and a memory, where the memory stores a computer program executable on the processor, and the processor implements the above method when executing the computer program.
Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method.
In the embodiment of the disclosure, a first image sample containing a first object is obtained; feature extraction is performed on the first image sample by using a first network of a first model to be trained to obtain a first feature of the first object; the first feature is updated by using a second network of the first model based on a second feature of at least one second object to obtain a first target feature corresponding to the first feature, wherein the similarity between each second object and the first object is not less than a first threshold value; a target loss value is determined based on the first target feature; and the model parameters of the first model are updated at least once based on the target loss value to obtain the trained first model. In this way, the feature of the second object is introduced as noise at the feature level of the first image sample containing the first object, and the overall network structure of the first model is trained, so that the robustness of the first model can be enhanced and the performance of the first model can be improved. Meanwhile, the model parameters of the first model are updated at least once when the target loss value does not meet the preset condition, and the target loss value is determined based on the first target feature, so that the consistency of the trained first model's predictions for different image samples of the same object can be improved, and the trained first model can accurately re-identify objects in an image containing a plurality of objects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic flow chart illustrating an implementation process of a model training method according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart illustrating an implementation process of a model training method according to an embodiment of the present disclosure;
fig. 3 is a schematic flow chart illustrating an implementation process of a model training method according to an embodiment of the present disclosure;
fig. 4 is a schematic flow chart illustrating an implementation of an image recognition method according to an embodiment of the present disclosure;
fig. 5A is a schematic structural diagram of a model training system according to an embodiment of the present disclosure;
FIG. 5B is a schematic diagram of a model training system according to an embodiment of the present disclosure;
fig. 5C is a schematic diagram of determining an occlusion mask according to an embodiment of the disclosure;
fig. 5D is a schematic diagram of a first network provided by an embodiment of the present disclosure;
fig. 5E is a schematic diagram of a second sub-network provided by an embodiment of the disclosure;
fig. 5F is a schematic diagram of a second network provided by the embodiments of the present disclosure;
fig. 5G is a schematic diagram of obtaining a target loss value according to an embodiment of the disclosure;
FIG. 5H is a schematic diagram illustrating an occlusion score of a pedestrian image according to an embodiment of the present disclosure;
fig. 5I is a schematic diagram of a result of image retrieval according to an embodiment of the disclosure;
fig. 6 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present disclosure;
fig. 8 is a schematic diagram of a hardware entity of an electronic device in an embodiment of the disclosure.
Detailed Description
For the purpose of making the purpose, technical solutions and advantages of the present disclosure clearer, the present disclosure will be described in further detail with reference to the accompanying drawings, the described embodiments should not be construed as limiting the present disclosure, and all other embodiments obtained by a person of ordinary skill in the art without making creative efforts shall fall within the protection scope of the present disclosure.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the description that follows, references to the terms "first/second/third" are intended merely to distinguish similar objects and do not denote a particular order. It should be understood that "first/second/third" may, where permitted, be interchanged in a specific order or sequence so that the embodiments of the disclosure described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments of the disclosure only and is not intended to be limiting of the disclosure.
In the related art, most algorithms adopt a deep neural network to extract image features and then realize the retrieval function through distance measurement. However, because pedestrian re-identification scenes are complex, the target pedestrian is often occluded by a non-pedestrian object or interfered with by a non-target pedestrian. These algorithms do not account for the influence of the occlusion problem on retrieval accuracy, so the extracted pedestrian feature representation contains a large amount of noise, which reduces retrieval accuracy. Although some existing algorithms introduce a human body parsing algorithm or a pose estimation algorithm to assist the pedestrian re-identification model in extracting pedestrian features, these auxiliary algorithms have low robustness and can hardly provide accurate auxiliary information; they may even mislead the model into extracting wrong features, which reduces retrieval accuracy.
The embodiment of the disclosure provides a model training method, which introduces the feature of a second object as noise into the feature level of a first image sample containing a first object, trains the overall network structure of the first model, so as to enhance the robustness of the first model and improve the performance of the first model, and meanwhile, updates the model parameters of the first model at least once under the condition that a target loss value does not meet a preset condition. The model training method and the image recognition method provided by the embodiments of the present disclosure may be executed by an electronic device, and the electronic device may be various types of terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, and a portable game device), and may also be implemented as a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
In the following, the technical solutions in the embodiments of the present disclosure will be clearly and completely described with reference to the drawings in the embodiments of the present disclosure.
Fig. 1 is a schematic flow chart of an implementation of a model training method provided in an embodiment of the present disclosure, as shown in fig. 1, the method includes steps S11 to S15:
step S11, a first image sample containing a first object is acquired.
Here, the first image sample may be any suitable image containing at least the first object. The content contained in the first image sample may be determined according to the actual application scenario, for example, containing only the first object, or containing the first object together with at least one other object. The first object may include, but is not limited to, a human, an animal, a plant, an item, and the like. For example, the first image sample is a face image of Zhang San. As another example, the first image sample is a whole-body image of Li Si.
In some embodiments, the first image sample may include one image, or may include two or more images.
For example, the first image sample is any one of the images in the training set.
For another example, the first image sample includes a first sub-image and a second sub-image, where the first sub-image is an image in the training set, and the second sub-image is an image obtained by performing augmentation processing on the first sub-image.
Here, the augmentation process may include, but is not limited to, at least one of a shading process, a scaling process, a cropping process, a resizing process, a padding process, a flipping process, a color dithering process, a gradation process, a gaussian blur process, a random erasing process, and the like. In implementation, a person skilled in the art may obtain the second sub-image by applying a suitable augmentation process to the first sub-image according to actual situations, and the embodiment of the disclosure is not limited.
For another example, the first image sample includes a first sub-image and a plurality of second sub-images, where the first sub-image is an image in the training set, and each second sub-image is an image obtained by performing augmentation processing on the first sub-image.
Step S12, performing feature extraction on the first image sample by using a first network of the first model to be trained to obtain a first feature of the first object.
Here, the first model may be any suitable model for object recognition based on image features. The first model may include at least a first network. The first feature may include, but is not limited to, an original feature of the first image sample, a feature obtained by processing the original feature. The original features may include, but are not limited to, facial features, body features, etc. of the first object contained in the image.
In some embodiments, the first network may include at least a first sub-network for extracting features of the image with a feature extractor. The feature extractor may include, but is not limited to, a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN), a Transformer-based feature extraction network, and the like. In practice, a person skilled in the art may adopt an appropriate first network in the first model according to the actual situation to obtain the first feature, and the embodiment of the present disclosure is not limited.
For example, a third feature of the first image sample is extracted through the first sub-network and determined as the first feature of the first object. Here, the third feature may include, but is not limited to, an original feature of the first image sample, and the like.
In some embodiments, the first network may further include a second sub-network for determining the first feature of the first object based on the third feature of the first image sample.
In some embodiments, the second sub-network may include an occlusion erase network for performing occlusion erase processing on the input third feature to obtain the first feature of the first object.
Step S13, updating the first feature based on the second feature of the at least one second object by using the second network of the first model, to obtain a first target feature corresponding to the first feature.
Here, the similarity of each second object to the first object is not less than the first threshold. The first threshold may be preset or statistically obtained. In implementation, a person skilled in the art may autonomously determine the setting manner of the first threshold according to actual needs, and the disclosure is not limited thereto.
For example, the similarity between the appearance features of the second object and the first object is not less than the first threshold. For another example, the similarity between the wearing features of the second object and the first object is not less than the first threshold. For another example, the similarity between the appearance features and the similarity between the wearing features of the second object and the first object are both not less than the first threshold.
The second feature may be obtained based on a training set, or may be input in advance. The second object may include, but is not limited to, a human, an animal, a plant, an item, and the like.
In some embodiments, the similarity of each second object to the first object may be obtained based on a similarity between the second feature of each second object and the first feature of the first object.
In some embodiments, the similarity of each second object to the first object may be obtained based on a similarity between the feature center of each second object and the first feature of the first object.
Here, the first model may include a second memory feature library, which may include at least one feature of the at least one object. The feature center of the second object may be derived based on at least one feature belonging to the second object in the second memory feature library. In some embodiments, features of a plurality of image samples of at least one subject in the training set may be extracted and the extracted features stored as identities in a second memory feature library.
In some embodiments, the second network may include a fifth sub-network and a sixth sub-network, the fifth sub-network configured to aggregate the second feature with the first feature to obtain a first aggregated sub-feature; the sixth sub-network is configured to update the first aggregated sub-feature to obtain the first target feature.
Step S14, a target loss value is determined based on the first target feature.
Here, the target loss value may include, but is not limited to, at least one of a mean square error loss value, a cross entropy loss value, a contrast loss value, and the like.
And step S15, updating the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
Here, it may be determined whether the model parameters of the first model need to be updated based on the target loss value. For example, the target loss value is compared with a threshold value, and the model parameters of the first model are updated when the target loss value is greater than the threshold value; in the case where the target loss value is not greater than the threshold value, the first model is determined as a trained first model. For another example, the target loss value is compared with the last target loss value, and the model parameters of the first model are updated when the target loss value is greater than the last target loss value; in the case where the target loss value is substantially equal to the last target loss value, the first model is determined as a trained first model.
In the embodiment of the disclosure, a first image sample containing a first object is obtained; feature extraction is performed on the first image sample by using a first network of a first model to be trained to obtain a first feature of the first object; the first feature is updated by using a second network of the first model based on the second feature of at least one second object to obtain a first target feature corresponding to the first feature, wherein the similarity between each second object and the first object is not less than a first threshold value; a target loss value is determined based on the first target feature; and the model parameters of the first model are updated at least once based on the target loss value to obtain the trained first model. In this way, the feature of the second object is introduced as noise at the feature level of the first image sample containing the first object, and the overall network structure of the first model is trained, so that the robustness of the first model can be enhanced and the performance of the first model can be improved. Meanwhile, the model parameters of the first model are updated at least once when the target loss value does not meet the preset condition, and the target loss value is determined based on the first target feature, so that the consistency of the trained first model's predictions for different image samples of the same object can be improved, and the trained first model can accurately re-identify objects in an image containing a plurality of objects.
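As an illustration of steps S11 to S15, a simplified training loop is sketched below. The two networks, the data loader, the tensor of second features and the loss callable are placeholders assumed for the example; the optimizer and learning rate are likewise illustrative choices rather than details fixed by the disclosure.

```python
import torch

def train_first_model(first_network: torch.nn.Module,
                      second_network: torch.nn.Module,
                      data_loader,                      # yields (first_image_sample, label)
                      second_features: torch.Tensor,    # features of similar second objects
                      compute_target_loss,              # callable implementing step S14
                      num_epochs: int = 10,
                      lr: float = 3.5e-4):
    """Sketch of steps S11-S15: extract the first feature, update it with the
    second features of similar objects, compute the target loss, and update
    the model parameters until training finishes."""
    params = list(first_network.parameters()) + list(second_network.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(num_epochs):
        for first_image_sample, label in data_loader:            # step S11
            first_feature = first_network(first_image_sample)     # step S12
            first_target_feature = second_network(                # step S13
                first_feature, second_features)
            loss = compute_target_loss(first_target_feature, label)  # step S14
            optimizer.zero_grad()
            loss.backward()                                        # step S15
            optimizer.step()
    return first_network, second_network
```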
In some embodiments, the first image sample comprises label information, the first model comprises a first feature memory comprising at least one feature belonging to at least one object, and said step S14 comprises steps S141 to S143, wherein:
step S141, determining a first loss value based on the first target feature and the tag information.
Here, the tag information may include, but is not limited to, a tag value, an identification, and the like. The first loss value may include, but is not limited to, a cross-entropy loss value, and the like.
In some embodiments, the first loss value may be calculated by the following equation (1-1):
$L_{ce} = -\log \frac{\exp(W_{y_i}^{\top} f_i)}{\sum_{j=1}^{ID_s} \exp(W_j^{\top} f_i)}$   (1-1);
wherein W is a linear matrix, W_i and W_j are elements (columns) of W, y_i denotes the tag information of the i-th object, f_i denotes the first target feature of the i-th object, and ID_s denotes the total number of objects in the training set.
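A sketch of equation (1-1) in code follows; the classifier matrix W is assumed here to be a learnable parameter of shape (ID_s, feature_dim), which is one common way to realize such a loss, not necessarily the exact arrangement used in the disclosure.

```python
import torch
import torch.nn.functional as F

def first_loss(first_target_features: torch.Tensor,   # (batch, dim), rows f_i
               labels: torch.Tensor,                   # (batch,), values y_i
               classifier_weight: torch.Tensor         # (ID_s, dim), rows W_j
               ) -> torch.Tensor:
    """Cross-entropy loss of equation (1-1): softmax over W_j^T f_i,
    with the numerator taken at the labelled row W_{y_i}."""
    logits = first_target_features @ classifier_weight.t()  # (batch, ID_s)
    return F.cross_entropy(logits, labels)
```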
Step S142, determining a second loss value based on the first target feature and at least one feature of at least one object in the first feature memory library.
Here, at least one feature of the first object and at least one feature of the at least one second object are stored in the first feature repository. The second loss value may include, but is not limited to, contrast loss, and the like.
And step S143, determining a target loss value based on the first loss value and the second loss value.
Here, the target loss value may include, but is not limited to, a sum between the first loss value and the second loss value, a sum after weighting the first loss value and the second loss value, respectively, and the like. In practice, a person skilled in the art may determine the target loss value according to actual requirements, and the embodiment of the disclosure is not limited.
In some embodiments, the target loss value may be calculated by the following equation (1-2):
$L = L_{ce} + L_{con}$   (1-2);
where L_ce represents the first loss value and L_con represents the second loss value.
In some embodiments, the step S142 includes steps S1421 to S1422, wherein:
step S1421, determining a first feature center of the first object and a second feature center of the at least one second object from the at least one feature of the at least one object in the first feature memory library.
In some embodiments, the first feature center may be determined based on a feature of the first object in the first feature memory bank and the first target feature. Each second feature center may be determined based on each feature of each second object in the second feature memory library.
In some embodiments, the feature center of each object may be calculated by the following formula (1-3):
$c_k \leftarrow m \cdot c_k + (1-m) \cdot \frac{1}{|B_k|} \sum_{f_i' \in B_k} f_i'$   (1-3);
where c_k represents the feature center of the k-th object, B_k represents the set of features belonging to the k-th object in the mini-batch, m is the preset momentum coefficient of the update, and f_i' is the first feature of the i-th sample. In some embodiments, m may be 0.2.
In some embodiments, when f_i' and B_k belong to the same object, the feature center c_k of that object changes; when f_i' and B_k do not belong to the same object, the feature center c_k of that object remains consistent with the previous c_k.
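A sketch of the momentum update of equation (1-3) is shown below; storing the centers as a single tensor indexed by object identity and averaging the batch features per identity are assumptions made for illustration.

```python
import torch

def update_feature_centers(centers: torch.Tensor,         # (num_objects, dim), rows c_k
                           batch_features: torch.Tensor,  # (batch, dim), features f_i'
                           batch_labels: torch.Tensor,    # (batch,), object identities
                           momentum: float = 0.2) -> torch.Tensor:
    """Momentum update of equation (1-3): only the centers of objects that
    appear in the mini-batch change; all other centers stay as they were."""
    new_centers = centers.clone()
    for k in batch_labels.unique():
        bk = batch_features[batch_labels == k]             # B_k for object k
        new_centers[k] = momentum * centers[k] + (1.0 - momentum) * bk.mean(dim=0)
    return new_centers
```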
Step S1422, a second loss value is determined based on the first target feature, the first feature center, and each second feature center.
In some embodiments, the second loss value may be calculated by the following equation (1-4):
$L_{con} = -\log \frac{\exp(f_i \cdot c_i / \tau)}{\sum_{j=1}^{ID_S} \exp(f_i \cdot c_j / \tau)}$   (1-4);
where τ is a predefined temperature parameter, c_i represents the first feature center of the i-th object, c_j represents each second feature center, f_i represents the first target feature of the i-th object, and ID_S represents the total number of objects in the training set.
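A sketch of the second loss of equation (1-4) follows, assuming the feature centers are stacked into a single matrix; the temperature value is illustrative only.

```python
import torch
import torch.nn.functional as F

def second_loss(first_target_feature: torch.Tensor,  # (dim,), f_i
                object_index: int,                    # index i of the first object
                feature_centers: torch.Tensor,        # (ID_S, dim), rows c_j
                temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss of equation (1-4): pull f_i towards its own feature
    center c_i and push it away from the centers of the other objects."""
    logits = feature_centers @ first_target_feature / temperature  # (ID_S,)
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.tensor([object_index]))
```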
In some embodiments, the step S15 includes a step S151 or a step S152:
step S151, under the condition that the target loss value does not meet the preset condition, updating the model parameters of the first model to obtain an updated first model; based on the updated first model, a trained first model is determined.
Here, the manner of updating the model parameters of the first model may include, but is not limited to, at least one of a gradient descent method, a momentum update method, a newton momentum method, and the like. In implementation, a person skilled in the art may autonomously determine an update mode according to actual needs, and the embodiment of the present disclosure is not limited.
And S152, determining the updated first model as the trained first model under the condition that the target loss value meets a preset condition.
Here, the preset condition may include, but is not limited to, the target loss value being less than the threshold value, convergence of the change in the target loss value, and the like. In implementation, a person skilled in the art may autonomously determine the preset condition according to actual requirements, and the embodiment of the disclosure is not limited.
In some embodiments, the determining the trained first model based on the updated first model includes steps S1511 to S1515, wherein:
step S1511, obtaining the next first image sample;
step S1512, performing feature extraction on the next first image sample by using the first network of the updated first model to obtain a next first feature;
step S1513, updating the next first feature based on the second feature of the at least one second object by using the updated second network of the first model to obtain a next first target feature corresponding to the next first feature;
step S1514, determining a next target loss value based on the next first target feature;
step S1515, based on the next target loss value, performing at least one next update on the model parameters of the updated first model to obtain the trained first model.
Here, the above steps S1511 to S1515 correspond to the above steps S11 to S15, respectively, and may be implemented with reference to the embodiments of the above steps S11 to S15.
In the embodiment of the disclosure, the model parameter of the first model may be updated next time when the target loss value does not satisfy the preset condition, and the trained first model is determined based on the first model updated next time, so that the performance of the trained first model may be further improved through continuous iterative update.
In some embodiments, the first feature repository comprises feature sets belonging to at least one object, each feature set comprising at least one feature of the object, and the method further comprises:
step S16, based on the first target feature, updating the feature set belonging to the first object in the first feature memory base.
Here, the updating manner may include, but is not limited to, adding the first target feature to the first feature memory bank, replacing a feature in the first feature memory bank with the first target feature, and the like.
In the embodiment of the disclosure, by updating the features of the first object in the first feature memory library, the first feature center belonging to the first object can be accurately obtained, and the recognition accuracy of the trained first model is further improved.
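A minimal sketch of step S16 is given below, assuming the first feature memory library is kept as a mapping from object identity to a bounded set of features; the capacity per identity is an illustrative choice.

```python
from collections import defaultdict, deque
import torch

class FeatureMemoryBank:
    """Stores, per object identity, a bounded set of features (step S16)."""

    def __init__(self, max_features_per_object: int = 50):
        self.features = defaultdict(lambda: deque(maxlen=max_features_per_object))

    def update(self, object_id: int, first_target_feature: torch.Tensor) -> None:
        # Add the new feature; the oldest stored feature is replaced once
        # the capacity for this identity is reached.
        self.features[object_id].append(first_target_feature.detach())

    def feature_set(self, object_id: int) -> torch.Tensor:
        # Return all stored features of this identity as one tensor.
        return torch.stack(list(self.features[object_id]))
```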
Fig. 2 is a schematic flow chart of an implementation of a model training method provided in an embodiment of the present disclosure, as shown in fig. 2, the method includes steps S21 to S25:
step S21, a first sub-image and a second sub-image containing the first object are acquired.
Here, the second sub-image may be an image in which at least the first sub-image is subjected to occlusion processing. The second sub-image may include at least one image. In some embodiments, when the second sub-image includes a plurality of images, the plurality of images may be images obtained by performing at least occlusion processing on the first sub-image.
Performing at least occlusion processing may include, but is not limited to, only occlusion processing, or occlusion processing and other processing, and the like. In some embodiments, the other processing may include, but is not limited to, at least one of scaling processing, cropping processing, resizing processing, padding processing, flipping processing, color dithering processing, grayscale processing, gaussian blur processing, random wipe processing, and the like. In implementation, a person skilled in the art may obtain the second sub-image by applying an appropriate processing manner to the first sub-image according to actual situations, and the embodiment of the disclosure is not limited.
In some embodiments, the step S21 includes steps S211 to S212, wherein:
step S211, a first sub-image containing the first object is acquired.
Here, the first sub-image may be any suitable image containing at least the first object. The content contained in the first sub-image may be determined according to the actual application scenario, for example, containing only the first object, or containing the first object together with at least one other object. The first object may include, but is not limited to, a human, an animal, a plant, an item, and the like. For example, the first sub-image is a face image of Zhang San. For another example, the first sub-image is a whole-body image of Li Si.
Step S212, at least carrying out occlusion processing on the first sub-image based on a preset occlusion set to obtain a second sub-image.
Here, the occlusion set includes at least one occlusion image. The occlusion set may include, but is not limited to, being established based on at least one of a training set, other images, and the like. Wherein, the occlusion set at least comprises a plurality of occlusion object images, background images, and the like, such as: leaves, vehicles, trash cans, buildings, trees, flowers, and the like. For example, image samples of background and object occlusions are found in the training set and the occlusion parts are manually cropped out to compose an occlusion library. For another example, an appropriate image containing at least one object occlusion is selected, and the occlusion part is manually cropped to form an occlusion library. In implementation, a person skilled in the art may select an appropriate manner to establish the occlusion set according to actual requirements, and the embodiment of the present disclosure is not limited.
The preset rules may include, but are not limited to, specifying a location, specifying a size, and the like. In some embodiments, since occlusion often occurs in the quarter to half area of the four positions, top, bottom, left, and right, the designated position may be set to be in the quarter to half area of the four positions. In implementation, a person skilled in the art may determine the position of the obstruction according to actual requirements, and the embodiments of the present disclosure are not limited.
In some embodiments, performing at least occlusion processing may include, but is not limited to, occlusion processing and other processing.
For example, when at least the occlusion processing includes occlusion processing and resizing, an occlusion image is randomly selected from an occlusion library, the occlusion image is resized based on a resizing rule, and the resized occlusion image is pasted to the lower right corner of the first image sample based on a preset rule.
Here, the adjustment rule may include, but is not limited to, rules based on the size of the occlusion image, the size of the first image sample, and the like. For example, if the height of the occlusion image exceeds twice its width, it is regarded as a vertical occlusion; the height of the occlusion image can then be adjusted to the height of the first image sample, and its width adjusted to one quarter to one half of the width of the first image sample. Otherwise, it is regarded as a horizontal occlusion; the width of the occlusion image can then be adjusted to the width of the first image sample, and its height adjusted to one quarter to one half of the height of the first image sample. In implementation, a person skilled in the art may determine the adjustment rule according to actual requirements, and the embodiments of the present disclosure are not limited.
For another example, in the case where at least the occlusion processing including the occlusion processing, the resizing processing, the padding processing, and the cropping processing is performed, first, the resizing processing, the padding processing, and the cropping processing are performed on the first image sample; secondly, randomly selecting a shielding object image from a shielding library, and carrying out size adjustment processing on the shielding object image based on an adjustment rule; then, based on a preset rule, one corner of the first image sample is randomly selected as a starting point, and the resized obstruction image is pasted to the starting point.
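The occlusion augmentation described above might be sketched as follows. The occlusion library is assumed to be an in-memory list of PIL images, the corner choice is random, and the resizing rule mirrors the vertical/horizontal adjustment just described; all of these are illustrative assumptions rather than the disclosure's exact procedure.

```python
import random
from PIL import Image

def paste_random_occluder(sample: Image.Image, occluders: list) -> Image.Image:
    """Paste a randomly chosen occluder onto one corner of the image sample,
    resizing it according to the vertical/horizontal adjustment rule."""
    occ = random.choice(occluders).copy()
    w, h = sample.size
    if occ.height > 2 * occ.width:                       # vertical occluder
        occ = occ.resize((random.randint(w // 4, w // 2), h))
    else:                                                # horizontal occluder
        occ = occ.resize((w, random.randint(h // 4, h // 2)))
    corners = [(0, 0), (w - occ.width, 0), (0, h - occ.height),
               (w - occ.width, h - occ.height)]
    occluded = sample.copy()
    occluded.paste(occ, random.choice(corners))          # paste at a random corner
    return occluded
```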
In some embodiments, the method further comprises step S213:
step S213 determines an occlusion mask based on the first sub-image and the second sub-image.
Here, the occlusion mask is used to represent occlusion information of an image. The occlusion mask may be used for training of the first model for occlusion of objects.
In some implementations, the occlusion mask can be determined based on a pixel difference between the first sub-image and the second sub-image. In practice, the difference between the first sub-image and the second sub-image may be calculated based on the following equation (2-1):
d=|x-x′| (2-1);
where x denotes the first sub-image and x' denotes the second sub-image.
In some embodiments, the step S213 includes steps S2131 to S2133, wherein:
step S2131, divide the first sub-image and the second sub-image into at least one first subsection image and at least one second subsection image.
In some embodiments, since a fine-grained occlusion mask is prone to many false labels due to semantic (body part) misalignment between different images, the image may be roughly divided horizontally into a plurality of portions, for example four portions or five portions, and the occlusion mask determined based on the pixel difference between each portion of the first sub-image and each portion of the second sub-image. In implementation, a person skilled in the art may divide the image according to actual requirements, and the embodiment of the present disclosure is not limited.
Step S2132, determining an occlusion sub-mask based on each first subsection image and each second subsection image.
In some embodiments, the pixel difference between each first subsection image and each second subsection image may be derived based on the above equation (2-1), and each occlusion sub-mask may be determined based on the pixel difference of each subsection.
Step S2133, determining the occlusion mask based on each occlusion sub-mask.
In some embodiments, if d_i is not less than the first threshold, it indicates that this portion of the image is occluded, and mask_i can be set to 0; otherwise, this portion is not occluded, and mask_i may be set to 1. The corresponding occlusion mask is then composed of the occlusion sub-masks of the respective portions.
For example, each of the first sub-image and the second sub-image is divided into four parts, and if there is no occlusion in the first part, the second part, and the third part, and there is an occlusion in the fourth part, then the occlusion mask at this time should be 1110. In implementation, a person skilled in the art may determine the occlusion mask according to actual requirements, and the embodiments of the present disclosure are not limited.
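A sketch of steps S2131 to S2133 is shown below, under the assumption that the images are given as tensors of shape (C, H, W), the image is split into four horizontal parts, and the per-part pixel-difference threshold is an illustrative value.

```python
import torch

def occlusion_mask(first_sub_image: torch.Tensor,
                   second_sub_image: torch.Tensor,
                   num_parts: int = 4,
                   diff_threshold: float = 1e-3) -> list:
    """Equation (2-1) applied per horizontal part: mask_i = 0 where the part
    is occluded (large pixel difference), 1 where it is not."""
    diff = (first_sub_image - second_sub_image).abs()      # d = |x - x'|
    parts = torch.chunk(diff, num_parts, dim=-2)            # split along the height axis
    mask = []
    for part in parts:
        d_i = part.mean().item()                             # per-part difference
        mask.append(0 if d_i >= diff_threshold else 1)
    return mask                                              # e.g. [1, 1, 1, 0]
```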
Step S22, performing feature extraction on the first sub-image by using the first network of the first model to be trained to obtain a first sub-feature of the first object, and performing feature extraction on the second sub-image to obtain a second sub-feature of the first object.
Here, the first model may be any suitable model for object recognition based on image features. The first model may include at least a first network. The first sub-feature may include, but is not limited to, an original feature of the first sub-image, and a feature obtained by processing the original feature. The second sub-feature may include, but is not limited to, an original feature of the second sub-image, and a feature obtained by processing the original feature. The raw features may include, but are not limited to, facial features, body features, etc. of the object contained in the image.
Step S23, updating the first sub-feature and the second sub-feature based on the second feature of the at least one second object by using the second network of the first model, respectively, to obtain a first target sub-feature corresponding to the first sub-feature and a second target sub-feature corresponding to the second sub-feature.
Here, the degree of similarity of each second object to the first object is not less than the first threshold. The first threshold may be preset or statistically obtained. In implementation, a person skilled in the art may autonomously determine the setting manner of the first threshold according to actual needs, and the disclosure is not limited thereto.
For example, the similarity between the appearance features of the second object and the first object is not less than the first threshold. For another example, the similarity between the wearing features of the second object and the first object is not less than the first threshold. For another example, the similarity between the appearance features and the similarity between the wearing features of the second object and the first object are both not less than the first threshold.
The second feature may be obtained based on a training set, or may be input in advance. The second object may include, but is not limited to, a human, an animal, a plant, an item, and the like.
In some embodiments, the similarity of each second object to the first object may be obtained based on a similarity between the second feature of each second object and the first feature of the first object.
In some embodiments, the similarity of each second object to the first object may be obtained based on a similarity between the feature center of each second object and the first feature of the first object.
Here, the first model may include a second memory feature library, which may include at least one feature of the at least one object. The feature center of the second object may be derived based on at least one feature belonging to the second object in the second memory feature library. In some embodiments, features of a plurality of image samples of at least one subject in the training set may be extracted and the extracted features stored as identities in a second memory feature library.
And step S24, determining a target loss value based on the first target sub-feature and the second target sub-feature.
Here, the target loss value may include, but is not limited to, at least one of a mean square error loss value, a cross entropy loss value, a contrast loss value, and the like.
And step S25, updating the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
Here, the step S25 corresponds to the step S15, and the embodiment of the step S15 may be referred to for implementation.
In the embodiment of the present disclosure, a first sub-image and a second sub-image containing a first object are acquired, the second sub-image being an image obtained by performing at least occlusion processing on the first sub-image; feature extraction is performed on the first sub-image by using a first network of a first model to be trained to obtain a first sub-feature of the first object, and feature extraction is performed on the second sub-image to obtain a second sub-feature of the first object; the first sub-feature and the second sub-feature are respectively updated based on the second feature of at least one second object by using a second network of the first model to obtain a first target sub-feature corresponding to the first sub-feature and a second target sub-feature corresponding to the second sub-feature, wherein the similarity between each second object and the first object is not less than a first threshold value; a target loss value is determined based on the first target sub-feature and the second target sub-feature; and the model parameters of the first model are updated at least once based on the target loss value to obtain the trained first model. In this way, the occlusion image and the features of other objects are respectively introduced as noise at the image level and the feature level of the first image sample containing the first object, and the overall network structure of the first model is trained, so that the robustness of the first model can be enhanced and the performance of the first model can be improved. Meanwhile, when the target loss value does not meet the preset condition, the model parameters of the first model are updated at least once, and the target loss value is determined based on the first target feature, so that the consistency of the trained first model's predictions for different image samples of the same object can be improved, and the trained first model can accurately re-identify objects in images containing object occlusion and/or multiple objects.
In some embodiments, the step S24 includes steps S241 to S243, wherein:
and step S241, determining a first target loss value based on the first target sub-characteristic and the second target sub-characteristic.
Here, the first target loss value may include, but is not limited to, at least one of a mean square error loss value, a cross entropy loss value, a contrast loss value, and the like.
In some embodiments, the step S241 includes steps S2411 to S2413, wherein:
step S2411, determining a third target sub-loss value based on the first target sub-characteristics.
The step S2411 corresponds to the step S14, and the embodiment of the step S14 may be referred to for implementation.
Step S2412, determining a fourth target sub-loss value based on the second target sub-characteristics.
The step S2412 corresponds to the step S14, and the embodiment of the step S14 may be referred to for implementation.
Step S2413, determining a first target loss value based on the third target sub-loss value and the fourth target sub-loss value.
Here, the first target loss value may include, but is not limited to, a sum between the third target sub-loss value and the fourth target sub-loss value, a sum after the third target sub-loss value and the fourth target sub-loss value are weighted respectively, and the like. In practice, a person skilled in the art may determine the first target loss value according to actual requirements, and the embodiment of the disclosure is not limited.
And step S242, determining a second target loss value based on the first sub-characteristic and the second sub-characteristic.
Here, the second target loss value may include, but is not limited to, at least one of a mean square error loss value, a cross entropy loss value, a contrast loss value, and the like.
Step S243, determining a target loss value based on the first target loss value and the second target loss value.
Here, the target loss value may include, but is not limited to, a sum between the first target loss value and the second target loss value, a sum after weighting the first target loss value and the second target loss value, respectively, and the like. In practice, a person skilled in the art may determine the target loss value according to actual requirements, and the embodiment of the disclosure is not limited.
In an embodiment of the present disclosure, a target loss value is determined based on the first sub-feature, the second sub-feature, the first target sub-feature, and the second target sub-feature. In this way, the accuracy of the target loss value can be improved so as to accurately judge whether the first model converges.
In some embodiments, the first network comprises a first sub-network and a second sub-network, and the step S22 comprises steps S221 to S222, wherein:
step S221, feature extraction is respectively carried out on the first sub-image and the second sub-image by utilizing the first sub-network of the first model to be trained, and a third sub-feature corresponding to the first sub-image and a fourth sub-feature corresponding to the second sub-image are obtained.
Here, the first network comprises at least a first sub-network for extracting features of the image with a feature extractor. The feature extractor may include, but is not limited to, a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN), a Transformer-based feature extraction network, and the like. In practice, a person skilled in the art may use an appropriate first sub-network in the first model to obtain the third sub-feature according to the actual situation, and the embodiments of the present disclosure are not limited.
For example, a feature of the first sub-image is extracted by the first sub-network and determined as a third sub-feature of the first object. Here, the third sub-feature may include, but is not limited to, an original feature of the first sub-image, and the like.
Step S222, determining the first sub-feature based on the third sub-feature and determining the second sub-feature based on the fourth sub-feature using the second sub-network of the first model.
In some embodiments, the second subnetwork may include an occlusion erase network for performing occlusion erase processing on the input features and outputting the non-occluded features. For example, the first sub-feature of the first object is obtained by performing occlusion erasure processing on the third sub-feature through the second sub-network. For another example, the second sub-feature of the first object is obtained by performing occlusion erasure processing on the fourth sub-feature through the second sub-network.
In the embodiment of the disclosure, an occlusion image is introduced as noise at the image level of the first image sample containing the first object, and the overall network structure of the first model is trained, so that the robustness of the first model can be enhanced, the performance of the first model can be improved, and the trained first model can accurately re-identify the object in an image containing object occlusion.
In some embodiments, the step S242 includes steps S2421 to S2423, wherein:
and step S2421, determining a first target sub-loss value based on the first sub-feature and the second sub-feature.
Here, the first target sub-loss value may include, but is not limited to, at least one of a mean square error loss value, a cross entropy loss value, a contrast loss value, and the like.
And step S2422, determining a second target sub-loss value based on the third sub-feature and the fourth sub-feature.
Here, the second target sub-loss value may include, but is not limited to, at least one of a mean square error loss value, a cross entropy loss value, a contrast loss value, and the like.
Step S2423, determining a second target loss value based on the first target sub-loss value and the second target sub-loss value.
Here, the second target loss value may include, but is not limited to, a sum between the first target sub-loss value and the second target sub-loss value, a sum after the first target sub-loss value and the second target sub-loss value are weighted, respectively, and the like. In practice, a person skilled in the art may determine the second target loss value according to actual requirements, and the embodiment of the disclosure is not limited.
In an embodiment of the present disclosure, the second target loss value is determined based on the first sub-feature, the second sub-feature, the third sub-feature, and the fourth sub-feature. In this way, the accuracy of the second target loss value can be improved, so as to accurately judge whether the first model converges.
In some embodiments, the first sub-image includes label information, and the step S2422 includes steps S251 to S253, wherein:
and step S251, determining a seventh sub-loss value based on the third sub-characteristics and the label information.
Here, the tag information may include, but is not limited to, a tag value, an identification, and the like. The seventh sub-loss value may include, but is not limited to, a cross-entropy loss value. In some embodiments, the seventh sub-loss value may be calculated by the above equation (1-1), in which case f_i in equation (1-1) is the third sub-feature.
And step S252, determining an eighth sub-loss value based on the fourth sub-feature and the label information.
Here, the eighth sub-loss value may include, but is not limited to, a cross-entropy loss value. In some embodiments, the eighth sub-loss value may be determined according to the above equation (1-1), in which case f_i in equation (1-1) is the fourth sub-feature.
And step S253, determining a second target sub-loss value based on the seventh sub-loss value and the eighth sub-loss value.
Here, the second target sub-loss value may include, but is not limited to, a sum between the seventh sub-loss value and the eighth sub-loss value, a sum after the seventh sub-loss value and the eighth sub-loss value are weighted, respectively, and the like. In implementation, a person skilled in the art may determine the second target sub-loss value according to actual requirements, and the embodiments of the present disclosure are not limited.
In an embodiment of the present disclosure, the second target sub-loss value is determined based on the third sub-feature, the fourth sub-feature and the tag information. In this way, the accuracy of the second target sub-loss value can be improved, so as to accurately judge whether the first model converges.
In some embodiments, the second sub-network comprises a third sub-network and a fourth sub-network, and the step S222 comprises steps S2221 to S2222, wherein:
step S2221, determining, with a third sub-network of the first model, a first occlusion score based on the third sub-feature and a second occlusion score based on the fourth sub-feature.
Here, the second sub-network comprises at least a third sub-network for performing semantic analysis based on features of the image to obtain an occlusion score corresponding to the image.
In some embodiments, the third sub-network comprises a pooling sub-network and at least one occlusion erasure sub-network, the first occlusion score comprises at least one first occlusion sub-score, the second occlusion score comprises at least one second occlusion sub-score, and the step S2221 includes steps S261 to S262, wherein:
step S261, using the pooling sub-network, divides the third sub-feature into at least one third sub-portion feature, and divides the fourth sub-feature into at least one fourth sub-portion feature.
Here, the pooling sub-network is used to partition the input feature into at least one sub-portion feature of that feature. The number of third sub-portion features may be the same as the number of parts into which the first sub-image is divided. For example, when the first sub-image is divided into four parts, the third sub-feature may be divided into four third sub-portion features by the pooling sub-network, each third sub-portion feature corresponding to a part feature f_i.
Step S262, with each occlusion erase sub-network, determines a first occlusion sub-score based on each third sub-portion feature and a second occlusion sub-score based on each fourth sub-portion feature.
Here, each occlusion erasure sub-network is used to perform semantic analysis on the input feature to obtain an occlusion score for the image corresponding to the feature.
In some embodiments, each occlusion erasure subnetwork comprises two fully connected layers, a layer normalization and an activation function, wherein the layer normalization is located between the two fully connected layers and the activation function is located at the end.
In some embodiments, the activation function may be a Sigmoid function.
In some embodiments, the number of occlusion erasure sub-networks is the same as the number of parts into which the first sub-image is divided.
For example, when the first sub-image is divided into four parts, each part corresponding to a feature f_i, the third sub-network comprises four occlusion erasure sub-networks, each for outputting the occlusion score corresponding to f_i. For another example, when the first sub-image is divided into five parts, each part corresponding to a feature f_i, the third sub-network comprises five occlusion erasure sub-networks, each for outputting the occlusion score corresponding to f_i.
In some embodiments, the occlusion score may be calculated by the following equation (2-2):
s_i = Sigmoid(W_rg · LN(W_cp · f_i))   (2-2);
where W_cp is the matrix of the first fully connected layer, which compresses the channel dimension c to c/4; W_rg is the matrix of the second fully connected layer, which compresses the c/4-dimensional feature to one dimension; LN is layer normalization; and f_i represents the feature of the ith part in the third sub-feature or the fourth sub-feature.
For example, the third sub-feature is divided into four third sub-portion features by the pooling sub-network, and each third sub-portion feature is input into the corresponding occlusion erasure sub-network: the first fully connected layer W_cp compresses the channel dimension to one fourth of the original dimension, layer normalization is performed on the feature with the compressed channel dimension, the second fully connected layer W_rg compresses the layer-normalized feature to one dimension, and the Sigmoid function outputs the first occlusion sub-score s_i corresponding to the third sub-portion feature.
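By way of illustration only, the following is a minimal Python (PyTorch-style) sketch of one occlusion erasure sub-network as described above and in equation (2-2); the class name, the channel dimension of 768 and the reduction ratio of 4 are assumptions made for the sketch rather than limitations of the present disclosure.

import torch
import torch.nn as nn

class OcclusionErasureSubNetwork(nn.Module):
    # Predicts an occlusion score s_i for one sub-portion feature f_i, per equation (2-2).
    def __init__(self, channels):
        super().__init__()
        reduced = channels // 4                        # W_cp compresses c to c/4
        self.w_cp = nn.Linear(channels, reduced, bias=False)
        self.ln = nn.LayerNorm(reduced)                # layer normalization between the two FC layers
        self.w_rg = nn.Linear(reduced, 1, bias=False)  # W_rg compresses c/4 to 1
        self.sigmoid = nn.Sigmoid()                    # activation function at the end

    def forward(self, f_i):
        # f_i: (batch, channels) feature of the ith part; returns (batch, 1) occlusion score s_i
        return self.sigmoid(self.w_rg(self.ln(self.w_cp(f_i))))

scores = [OcclusionErasureSubNetwork(768)(torch.randn(2, 768)) for _ in range(4)]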
Step S2222, determining, by the fourth sub-network, the first sub-feature based on the third sub-feature and the first occlusion score, and determining the second sub-feature based on the fourth sub-feature and the second occlusion score.
Here, the second sub-network further comprises a fourth sub-network for determining the feature after occlusion erasure.
In some embodiments, the step S2222 includes steps S271 to S272, wherein:
step S271, determining, with the fourth sub-network, a first sub-portion feature based on each third sub-portion feature of the third sub-features and each first occlusion sub-score, and a second sub-portion feature based on each fourth sub-portion feature of the fourth sub-features and each second occlusion sub-score. In some embodiments, the first sub-portion feature or the second sub-portion feature may be calculated by the following equation (2-3):
f_i′ = s_i · f_i   (2-3);
Here, s_i represents the ith occlusion score, and f_i represents the ith third sub-portion feature or fourth sub-portion feature.
In some embodiments, the second feature repository may be updated based on the first sub-feature. The updating method may include, but is not limited to, adding the first sub-feature to the second feature memory bank, replacing a feature in the second feature memory bank with the first sub-feature, and the like.
Step S272, the first sub-feature is determined based on each first sub-portion feature, and the second sub-feature is determined based on each second sub-portion feature.
In some embodiments, the first sub-feature may be obtained by splicing the at least one first sub-portion feature.
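As a further illustration of equation (2-3) and step S272, the following hedged sketch weights each sub-portion feature by its occlusion sub-score and splices the weighted parts into the first sub-feature; the function name and tensor shapes are assumptions made for the example.

import torch

def erase_and_concat(part_features, occlusion_scores):
    # part_features: list of (batch, c) third sub-portion features f_i
    # occlusion_scores: list of (batch, 1) first occlusion sub-scores s_i
    weighted = [s_i * f_i for s_i, f_i in zip(occlusion_scores, part_features)]  # equation (2-3)
    return torch.cat(weighted, dim=-1)  # splice the weighted parts into the first sub-feature

parts = [torch.randn(2, 768) for _ in range(4)]
scores = [torch.rand(2, 1) for _ in range(4)]
first_sub_feature = erase_and_concat(parts, scores)  # shape (2, 3072)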
In some embodiments, the first sub-image includes label information, the first model includes a second feature memory bank including at least one feature belonging to at least one object, the step S2421 includes steps S281 to S285, wherein:
step S281, determining an occlusion mask based on the first sub-image and the second sub-image.
Here, step S281 corresponds to step S213; for implementation, reference may be made to the specific embodiment of step S213.
Step S282 determines a third loss value based on the first occlusion score, the second occlusion score, and the occlusion mask.
Here, the third loss value may include, but is not limited to, a mean square error loss value.
And step S283, determining a fourth loss value based on the first sub-feature, the second sub-feature and the label information.
Here, the fourth loss value may include, but is not limited to, a cross entropy loss value.
Step S284, determining a fifth loss value based on the first sub-feature, the second sub-feature and at least one feature of at least one object in the second feature memory base.
Here, the fifth loss value may include, but is not limited to, a contrast loss value.
Step S285, determining a first target sub-loss value based on the third loss value, the fourth loss value, and the fifth loss value.
Here, the first target sub-loss value may include, but is not limited to, a sum between the third loss value, the fourth loss value, and the fifth loss value, a sum after weighting the third loss value, the fourth loss value, and the fifth loss value, respectively, and the like. In implementation, a person skilled in the art may determine the first target sub-loss value according to actual requirements, and the embodiments of the present disclosure are not limited thereto.
In an embodiment of the present disclosure, a first target sub-penalty value is determined based on the occlusion mask, the first sub-feature, the second sub-feature, the tag information, and features of other objects. In this way, the accuracy of the first target sub-loss value may be improved so as to accurately judge whether the first model converges.
In some embodiments, the step S282 includes steps S2821 to S2823, wherein:
step S2821, determining a first sub-loss value based on the first occlusion fraction and the occlusion mask.
Here, the first sub-loss value may include, but is not limited to, a mean square error loss value.
In some embodiments, the first sub-loss value may be calculated according to the following equation (2-4):
L_mse = (1/N) · Σ_{i=1}^{N} (s_i − mask_i)²   (2-4);
where N is the total number of occlusion erasure sub-networks, s_i indicates the ith occlusion score, and mask_i represents the ith occlusion sub-mask in the occlusion mask. For example, when the occlusion mask is 1110, mask_1 is 1 and mask_4 is 0.
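As an illustration of the mean square error in equation (2-4), the following sketch computes the first sub-loss value from the occlusion scores and the occlusion mask; the function name is an assumption made for the example.

import torch

def occlusion_score_loss(scores, occlusion_mask):
    # scores: (N,) occlusion scores s_i from the N occlusion erasure sub-networks
    # occlusion_mask: (N,) occlusion mask, e.g. tensor([1., 1., 1., 0.]) for the mask 1110
    return torch.mean((scores - occlusion_mask) ** 2)

first_sub_loss = occlusion_score_loss(torch.tensor([0.9, 0.8, 0.7, 0.2]),
                                      torch.tensor([1.0, 1.0, 1.0, 0.0]))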
Step S2822, determining a second sub-loss value based on the second occlusion score and the occlusion mask.
Here, the second sub-loss value may include, but is not limited to, a mean square error loss value.
Here, the determination of the second sub loss value may be the same as the determination of the first sub loss value, see step S2821.
And step S2823, determining a third loss value based on the first sub loss value and the second sub loss value.
Here, the third loss value may include, but is not limited to, a sum of the first sub-loss value and the second sub-loss value, a weighted sum of the first sub-loss value and the second sub-loss value, and the like. In implementation, a person skilled in the art may determine the third loss value according to actual requirements, and the embodiments of the present disclosure are not limited in this regard.
In an embodiment of the disclosure, a third penalty value is determined based on the first occlusion score, the second occlusion score, and the occlusion mask. In this way, the accuracy of the third loss value can be improved, so as to accurately judge whether the first model converges.
In some embodiments, the step S283 includes steps S2831 to S2833, wherein:
and S2831, determining a third sub-loss value based on the first sub-characteristics and the label information.
Here, the tag information may include, but is not limited to, a tag value, an identification, and the like. The third sub-penalty value may include, but is not limited to, a cross-entropy penalty value. In some embodiments, the third sub-loss value may be calculated by the above equation (1-1), when f in equation (1-1)iIs the first sub-feature.
And S2832, determining a fourth sub-loss value based on the second sub-characteristics and the label information.
Here, the fourth sub-penalty value may include, but is not limited to, a cross-entropy penalty value. In some embodiments, the fourth sub-loss value may be calculated by the above equation (1-1), when f in equation (1-1)iIs the second sub-feature.
Step S2833 determines a fourth loss value based on the third sub-loss value and the fourth sub-loss value.
Here, the fourth loss value may include, but is not limited to, a sum between the third sub-loss value and the fourth sub-loss value, a sum after the third sub-loss value and the fourth sub-loss value are weighted, respectively, and the like. In practice, a person skilled in the art may determine the fourth loss value according to actual requirements, and the embodiment of the disclosure is not limited.
In an embodiment of the disclosure, the fourth loss value is determined based on the first sub-feature, the second sub-feature and the tag information. In this way, the accuracy of the fourth loss value can be improved so as to accurately judge whether the first model converges.
In some embodiments, the step S284 includes steps S2841 to S2844, wherein:
step S2841, determining a third feature center of the first object and a fourth feature center of the at least one second object from the at least one feature of the at least one object in the second feature memory library.
Here, the second feature memory bank stores at least one feature of at least one first object and at least one feature of at least one second object.
In some embodiments, the third feature center may be determined based on the feature of the first object in the second feature memory bank and the first sub-feature. Each fourth feature center may be determined based on each feature of each second object in the second feature memory library.
In some embodiments, the feature center of each object may be calculated by the following equation (2-5):
c_k = m · c_k + (1 − m) · (1/|B_k|) · Σ_{f_i′ ∈ B_k} f_i′   (2-5);
where c_k represents the feature center of the kth object, B_k represents the set of features belonging to the kth object in the mini-batch, m is the preset momentum update coefficient, and f_i′ is the first sub-feature of the ith sample. In some embodiments, m may be 0.2.
In some embodiments, when f_i′ and B_k belong to the same object, the feature center c_k of that object will change; when f_i′ and B_k do not belong to the same object, the feature center c_k of that object remains consistent with the previous c_k.
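The following is a minimal sketch of a momentum update of an object's feature center consistent with the reading of equation (2-5) above; the exact update rule, the function name and the feature dimension are assumptions made for the example.

import torch

def update_feature_center(center_k, batch_features_k, m=0.2):
    # center_k: (d,) current feature center c_k of the kth object
    # batch_features_k: (n_k, d) first sub-features in the mini-batch belonging to object k
    if batch_features_k.numel() == 0:
        return center_k  # no features of object k in the batch: the center stays unchanged
    return m * center_k + (1.0 - m) * batch_features_k.mean(dim=0)

c_k = update_feature_center(torch.zeros(768), torch.randn(4, 768))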
Step S2842, determining a fifth sub-loss value based on the first sub-feature, the third feature center and each fourth feature center.
Here, the fifth sub-loss value may include, but is not limited to, a contrast loss and the like.
In some embodiments, the fifth sub-loss value may be calculated by the following equation (3-6):
Loss = −log( exp(f_i · c_y / τ) / Σ_{z=1}^{ID_S} exp(f_i · c_z / τ) )   (3-6);
where τ is a predefined temperature parameter, c_y represents the third feature center of the y-th object, c_z denotes the z-th fourth feature center, f_i represents the first sub-feature of the ith object, and ID_S represents the total number of objects in the training set.
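For illustration, the following sketch computes a contrastive loss of one first sub-feature against the feature centers of all objects, in the spirit of equation (3-6); the dot-product similarity, the softmax over all ID_S centers, the function name and the dimensions are assumptions made for the example.

import torch
import torch.nn.functional as F

def center_contrastive_loss(f, centers, target_index, tau=0.05):
    # f: (d,) first sub-feature; centers: (ID_S, d) feature centers of all objects;
    # target_index: index y of the object the sample belongs to; tau: temperature parameter.
    logits = centers @ f / tau                       # scaled similarity to every center
    return F.cross_entropy(logits.unsqueeze(0),      # equals -log softmax at the target center
                           torch.tensor([target_index]))

fifth_sub_loss = center_contrastive_loss(torch.randn(768), torch.randn(100, 768), target_index=3)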
Step S2843, determining a sixth sub-loss value based on the second sub-feature, the third feature center, and each fourth feature center.
Here, the sixth sub-loss value may include, but is not limited to, a contrast loss, and the like. The sixth sub-loss value may be determined in the same manner as the fifth sub-loss value, see step S2842.
Step S2844, determining a fifth loss value based on the fifth sub-loss value and the sixth sub-loss value.
Here, the fifth loss value may include, but is not limited to, a sum of the fifth sub-loss value and the sixth sub-loss value, a weighted sum of the fifth sub-loss value and the sixth sub-loss value, and the like. In practice, a person skilled in the art may determine the fifth loss value according to actual requirements, and the embodiments of the present disclosure are not limited in this regard.
In the embodiment of the present disclosure, the fifth loss value is determined based on the first sub-feature, the second sub-feature and the features of other objects. In this way, the accuracy of the fifth loss value can be improved, so as to accurately judge whether the first model converges.
In some embodiments, the second network comprises a fifth sub-network and a sixth sub-network, and the step S23 comprises steps S231 to S232, wherein:
step S231, using a fifth sub-network to respectively aggregate the first sub-feature and the second sub-feature with the second feature of the at least one second object, so as to obtain a first aggregated sub-feature corresponding to the first sub-feature and a second aggregated sub-feature corresponding to the second sub-feature.
Here, the second network comprises at least a fifth sub-network for aggregating the first sub-feature with the second feature of the at least one second object to obtain a first aggregated sub-feature, and aggregating the second sub-feature with the second feature of the at least one second object to obtain a second aggregated sub-feature.
Step S232, determining, by the sixth sub-network, the first target sub-feature based on the first aggregation sub-feature, and the second target sub-feature based on the second aggregation sub-feature.
Here, the second network further comprises a sixth sub-network for determining the first target sub-feature based on the first aggregation sub-feature and the second target sub-feature based on the second aggregation sub-feature.
In the embodiment of the disclosure, the feature of the second object is introduced as noise in the feature level of the first image sample containing the first object, and the overall network structure of the first model is trained, so that the robustness of the first model can be enhanced, the performance of the first model can be improved, and the trained first model can accurately re-identify the object in the image containing a plurality of objects.
In some embodiments, the step S231 includes steps S2311 to S2314, wherein:
step S2311, a first attention matrix is determined based on the first sub-feature and each of the second features.
Here, the first attention matrix is used to characterize the degree of association between the first sub-feature and each of the second features.
In some embodiments, based on the first sub-feature, X second features belonging to at least one second object are determined, X being a positive integer. In some embodiments, X may be 10.
In some embodiments, a K-nearest neighbor algorithm may be used to search the second feature memory bank for the X second features, belonging to at least one second object, that are nearest to the first sub-feature, and X first centers may be determined based on the second features. During the search, the calculation may be based on the cosine distance between features.
In some embodiments, the network parameters of the fifth sub-network include a first prediction matrix and a second prediction matrix, and the step S2311 includes steps S2321 to S2323, in which:
step S2321, a first prediction characteristic is determined based on the first sub-characteristic and the first prediction matrix.
In some embodiments, the first predicted characteristic may be calculated by the following equation (2-7):
f_q = f′ · W_1   (2-7);
where f′ represents the first sub-feature, W_1 is the first prediction matrix of dimension d × d′, and d and d′ are both feature dimensions of f′.
Step S2322, second prediction characteristics are determined based on each second characteristic and the second prediction matrix.
In some embodiments, the second predicted characteristic may be calculated by the following equation (2-8):
f_c^i = ĉ_i · W_2   (2-8);
where ĉ_i represents the ith first center, i ∈ {1, 2, …, X}, W_2 is the second prediction matrix of dimension d × d′, and d and d′ are both feature dimensions of the first sub-feature.
Step S2323, a first attention matrix is determined based on the first predicted characteristic and each second predicted characteristic.
In some embodiments, the first attention matrix may be determined by the following equation (2-9):
m_i = Softmax( f_q · (f_c^i)^T / λ )   (2-9);
where X represents the total number of second features, i ∈ {1, 2, …, X}, and λ is a scale factor.
Step S2312, a first aggregated sub-feature is determined based on each second feature and each first attention matrix.
In some embodiments, the network parameters of the fifth sub-network further include a third prediction matrix, and the step S2312 includes steps S2331 through S2332, wherein:
step S2331, a third prediction feature is determined based on each second feature and the third prediction matrix.
In some embodiments, the third predicted characteristic may be calculated by the following equation (2-10):
f_v^i = ĉ_i · W_3   (2-10);
where ĉ_i represents the ith first center, i ∈ {1, 2, …, X}, W_3 is the third prediction matrix of dimension d × d′, and d and d′ are both feature dimensions of the first sub-feature.
In step S2332, a first aggregate sub-feature is determined based on each third predictive feature and each first attention matrix.
In some embodiments, the first aggregate sub-feature may be determined by the following equation (2-11):
f_d = Σ_{i=1}^{X} m_i · f_v^i   (2-11);
where m_i denotes the ith first attention matrix and f_v^i represents the ith third prediction feature.
Step S2313, a second attention matrix is determined based on the second sub-feature and each second feature.
Here, the second attention matrix is used to characterize the degree of association between the second sub-feature and each of the second features. In practice, the manner of determining the second attention matrix may be the same as the manner of determining the first attention matrix, see steps S2321 to S2323.
Step S2314, determining the second aggregate sub-feature based on each of the second features and each of the second attention matrices.
Here, the manner of determining the second aggregation sub-feature may be the same as that of determining the first aggregation sub-feature, see step S2331 through step S2332.
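By way of illustration, the following is a minimal single-head Python (PyTorch-style) sketch of the aggregation in steps S2311 to S2314 (equations (2-7) to (2-11)); the class name, the feature dimensions, the use of the square root of d′ as the scale factor and the omission of the multi-head splitting are assumptions made for brevity.

import torch
import torch.nn as nn

class FeatureAggregation(nn.Module):
    # Attention of the first sub-feature over the X first centers (single-head sketch).
    def __init__(self, d, d_prime):
        super().__init__()
        self.w1 = nn.Linear(d, d_prime, bias=False)  # first prediction matrix, equation (2-7)
        self.w2 = nn.Linear(d, d_prime, bias=False)  # second prediction matrix, equation (2-8)
        self.w3 = nn.Linear(d, d_prime, bias=False)  # third prediction matrix, equation (2-10)

    def forward(self, f_prime, centers):
        # f_prime: (batch, d) first sub-feature; centers: (batch, X, d) the X first centers
        f_q = self.w1(f_prime)                       # first prediction feature
        f_c = self.w2(centers)                       # second prediction features
        f_v = self.w3(centers)                       # third prediction features
        attn = torch.softmax(torch.einsum('bd,bxd->bx', f_q, f_c)
                             / f_q.shape[-1] ** 0.5, dim=-1)   # attention weights, equation (2-9)
        return torch.einsum('bx,bxd->bd', attn, f_v)           # aggregation, equation (2-11)

agg = FeatureAggregation(d=3072, d_prime=3072)
f_d = agg(torch.randn(2, 3072), torch.randn(2, 10, 3072))      # X = 10 first centers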
In the embodiment of the disclosure, each first center is divided into a plurality of parts through a multi-head operation, and attention weight is allocated to each part, so that more unique patterns similar to a target object and a non-target object can be aggregated, the robustness of the first model is enhanced, and the trained first model can accurately re-identify an object in an image containing a plurality of objects.
In some embodiments, the sixth sub-network comprises a seventh sub-network and an eighth sub-network, and the step S232 comprises steps S2341 to S2343, wherein:
step S2341 determines an occlusion mask based on the first sub-image and the second sub-image.
Here, the occlusion mask is used to represent occlusion information of an image. In some implementations, the occlusion mask can be determined based on a pixel difference between the first sub-image and the second sub-image.
Step S2342, determining, by the seventh sub-network, a fifth sub-feature based on the first aggregation sub-feature and the occlusion mask, and a sixth sub-feature based on the second aggregation sub-feature and the occlusion mask.
Here, the seventh sub-network may be a neural network FFN_1(·) comprising two fully connected layers and one activation function.
In some embodiments, the fifth or sixth sub-feature may be obtained by the following equations (2-12):
f″ = mask · FFN_1(f_d)   (2-12);
where mask is the occlusion mask, and f_d is the first aggregated sub-feature or the second aggregated sub-feature.
Step S2343, determining, by the eighth sub-network, the first target sub-feature based on the first sub-feature and the fifth sub-feature, and determining the second target sub-feature based on the second sub-feature and the sixth sub-feature.
Here, the eighth sub-network may be a neural network FFN_2(·) comprising two fully connected layers and one activation function.
In some embodiments, the first target sub-feature or the second target sub-feature may be obtained by the following equations (2-13):
f_d′ = FFN_2(f″ + f′)   (2-13);
where f″ is the fifth sub-feature or the sixth sub-feature, and f′ is the first sub-feature or the second sub-feature.
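For illustration, the following sketch applies equations (2-12) and (2-13) with the seventh and eighth sub-networks modeled as small feed-forward networks; the ReLU activation, the dimensions and the broadcasting of the occlusion mask are assumptions made for the example.

import torch
import torch.nn as nn

def make_ffn(dim):
    # two fully connected layers and one activation function (activation choice assumed)
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

class OcclusionAwareFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ffn1 = make_ffn(dim)  # seventh sub-network FFN_1
        self.ffn2 = make_ffn(dim)  # eighth sub-network FFN_2

    def forward(self, f_prime, f_d, mask):
        # f_prime: first sub-feature f'; f_d: first aggregated sub-feature; mask: occlusion mask weights
        f_double_prime = mask * self.ffn1(f_d)       # equation (2-12)
        return self.ffn2(f_double_prime + f_prime)   # equation (2-13)

fusion = OcclusionAwareFusion(3072)
first_target_sub_feature = fusion(torch.randn(2, 3072), torch.randn(2, 3072), torch.ones(2, 1))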
In the embodiment of the disclosure, the target feature is obtained based on the occlusion mask, the first sub-feature and the first aggregated sub-feature, which ensures that the features of other objects are added only to the human body part of the first object and not to the previously identified occluded part, so as to better simulate multi-pedestrian image features.
Fig. 3 is a schematic flow chart of an implementation of a model training method provided in an embodiment of the present disclosure, and as shown in fig. 3, the method includes steps S31 to S37:
step S31, a first image sample containing a first object is acquired.
Step S32, extracting the features of the first image sample by using the first network of the first model to be trained to obtain the first features of the first object.
Step S33, updating the first feature based on the second feature of at least one second object by using the second network of the first model to obtain a first target feature corresponding to the first feature, where a similarity between each second object and the first object is not less than a first threshold.
Step S34, a target loss value is determined based on the first target feature.
And step S35, updating the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
Here, the steps S31 to S35 correspond to the steps S11 to S15, respectively, and in the implementation, specific embodiments of the steps S11 to S15 may be referred to.
Step S36, an initial second model is determined based on the trained first model.
Here, the trained network of the first model may be adjusted according to an actual usage scenario, and the adjusted first model may be determined as the initial second model.
In some embodiments, the first model includes a first network and a second network, the second network in the trained first model may be removed, the first network of the first model may be adjusted according to an actual scene, and the adjusted first model may be determined as an initial second model.
And step S37, updating the model parameters of the second model based on at least one second image sample to obtain the trained second model.
Here, the second image sample may carry label information or may be unlabeled. In implementation, a person skilled in the art may select an appropriate second image sample according to the actual application scenario, which is not limited herein.
In some embodiments, the model parameters of the second model may be fine-tuned and trained based on at least one second image sample, resulting in the trained second model.
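By way of a non-limiting sketch of steps S36 to S37, the following Python (PyTorch-style) code keeps the first network of the trained first model, discards the second network, and fine-tunes on the second image samples; the attributes first_network and feature_dim, the data loader and the added classification head are assumptions made only for this example.

import torch
import torch.nn as nn

def build_and_finetune_second_model(trained_first_model, second_image_loader,
                                    num_classes, epochs=5, lr=1e-3):
    # Assumed interface: the trained first model exposes its first network (feature
    # extractor) and the dimension of the features it outputs.
    second_model = nn.Sequential(
        trained_first_model.first_network,
        nn.Linear(trained_first_model.feature_dim, num_classes),  # scene-specific head
    )
    optimizer = torch.optim.SGD(second_model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in second_image_loader:   # at least one second image sample
            optimizer.zero_grad()
            loss = criterion(second_model(images), labels)
            loss.backward()
            optimizer.step()
    return second_model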
In the embodiment of the disclosure, an initial second model is determined based on the trained first model, and model parameters of the second model are updated based on at least one second image sample to obtain the trained second model. Therefore, the model parameters of the trained first model can be transferred to the second model, so that the method is suitable for various application scenarios, the calculated amount can be reduced in practical application, and the training efficiency of the second model and the detection accuracy of the trained second model can be improved.
Fig. 4 is an image recognition method provided in the embodiment of the present disclosure, and as shown in fig. 4, the method includes steps S41 to S42, where:
step S41, acquiring a first image and a second image.
Here, the first image and the second image may be any suitable images to be recognized. In implementation, a person skilled in the art may select a suitable image according to an actual application scenario, and the embodiment of the disclosure is not limited.
In some embodiments, the first image may include an image with occlusion, and may also include an image without occlusion.
In some embodiments, the source of the first image and the second image may be the same or different.
For example, the first image and the second image are both images taken by a camera. For another example, the first image may be an image captured by a camera, and the second image may be an image of a frame in a video.
And step S42, recognizing the object in the first image and the object in the second image by using the trained target model to obtain a recognition result.
Here, the trained target model may include, but is not limited to, at least one of the first model and the second model. The recognition result represents that the object in the first image and the object in the second image are the same object or different objects.
In some embodiments, a first target feature corresponding to the first image and a second target feature corresponding to the second image are respectively obtained based on the target model, and the recognition result is obtained based on a similarity between the first target feature and the second target feature.
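As an illustrative sketch of steps S41 to S42, the following code obtains the target features of the two images with a trained target model and compares them by cosine similarity; the model interface and the similarity threshold of 0.5 are assumptions made for the example only.

import torch
import torch.nn.functional as F

def recognize_same_object(target_model, first_image, second_image, threshold=0.5):
    # target_model is assumed to map an image tensor (C, H, W) to its target feature.
    with torch.no_grad():
        first_target_feature = target_model(first_image.unsqueeze(0))
        second_target_feature = target_model(second_image.unsqueeze(0))
    similarity = F.cosine_similarity(first_target_feature, second_target_feature).item()
    return similarity >= threshold  # True: same object; False: different objects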
In the embodiment of the present disclosure, the model training method described in the above embodiments introduces real noise at the feature level, or at both the picture level and the feature level, to train the overall network structure of the target model, which enhances the robustness of the target model and effectively improves its performance. Therefore, recognizing images based on the first model and/or the second model obtained with the model training method of the above embodiments allows pedestrians to be re-identified more accurately.
Fig. 5A is a schematic structural diagram of a model training system 50 provided in an embodiment of the present disclosure, and as shown in fig. 5A, the model training system 50 includes an augmentation module 51, a blocking and erasing module 52, a feature diffusion module 53, an update module 54, and a feature memory library module 55, where:
the augmentation module 51 is configured to perform at least occlusion processing on the first sub-image including the first object to obtain a second sub-image.
And the occlusion erasing module 52 is configured to perform feature extraction on the first sub-image by using the first network of the first model to be trained to obtain a first sub-feature of the first object, and perform feature extraction on the second sub-image to obtain a second sub-feature of the first object.
The feature diffusion module 53 is configured to update the first sub-feature and the second sub-feature based on a second feature of at least one second object by using a second network of the first model, to obtain a first target sub-feature corresponding to the first sub-feature and a second target sub-feature corresponding to the second sub-feature, where a similarity between each second object and the first object is not less than a first threshold.
An update module 54 for determining a target loss value based on the first target sub-feature and the second target sub-feature; and updating the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
A feature memory library module 55 for storing at least one feature of at least one object.
In some embodiments, the feature repository module 55 comprises a first feature repository for storing a first sub-feature of the at least one object and a second feature repository for storing a first target sub-feature of the at least one object.
Fig. 5B is a schematic diagram of a model training system 500 according to an embodiment of the disclosure. As shown in Fig. 5B, the model training system 500 performs augmentation processing on the input first image 501 to obtain a second image 502, and inputs the first image 501 and the second image 502 into the occlusion erasure module 52 to obtain a first sub-feature f1′ and a second sub-feature f2′, respectively. The second feature memory bank 552 is updated based on the first sub-feature f1′. The first sub-feature f1′, the second sub-feature f2′ and at least one feature of at least one other object selected from the second feature memory bank 552 are input into the feature diffusion module 53 to obtain a first target sub-feature fd1′ and a second target sub-feature fd2′, respectively. The first feature memory bank 551 and the network parameters in the occlusion erasure module 52 and the feature diffusion module 53 are updated based on the first target sub-feature fd1′.
In some embodiments, the augmentation module 51 is further configured to: based on the first sub-image and the second sub-image, an occlusion mask is determined.
Fig. 5C is a schematic diagram of determining an occlusion mask according to an embodiment of the present disclosure, as shown in fig. 5C, a pixel comparison operation 503 is performed between the first sub-image 501 and the second sub-image 502, after the pixel comparison operation 503, a binarization operation 504 is performed on a comparison result, and after the binarization operation 504 is performed, a corresponding occlusion mask 505 is obtained.
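The following is a minimal sketch of the pixel comparison operation 503 and the binarization operation 504; the horizontal-strip partition into four parts and the per-part reduction are assumptions made to match the occlusion mask examples (such as 1110) used above.

import torch

def compute_occlusion_mask(first_sub_image, second_sub_image, num_parts=4):
    # first_sub_image / second_sub_image: (C, H, W) tensors of the same size.
    diff = (first_sub_image - second_sub_image).abs().sum(dim=0)   # pixel comparison
    strips = diff.chunk(num_parts, dim=0)                          # split along the height
    # binarization: a part whose pixels were changed by the occlusion processing is marked 0
    return torch.tensor([0.0 if strip.sum() > 0 else 1.0 for strip in strips])

mask = compute_occlusion_mask(torch.rand(3, 256, 128), torch.rand(3, 256, 128))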
In some embodiments, the first network includes a first sub-network and a second sub-network, and the occlusion wipe module 52 is further configured to: respectively extracting the features of the first sub-image and the second sub-image by using a first sub-network of a first model to be trained to obtain a third sub-feature corresponding to the first sub-image and a fourth sub-feature corresponding to the second sub-image; a second sub-network of the first model is utilized, the first sub-feature is determined based on the third sub-feature, and the second sub-feature is determined based on the fourth sub-feature.
Fig. 5D is a schematic diagram of a first network 510 according to an embodiment of the disclosure. As shown in Fig. 5D, the first network 510 includes a first sub-network 511 and a second sub-network 512. The first sub-image 501 and the second sub-image 502 are input into the first sub-network 511 to obtain a third sub-feature f1 corresponding to the first sub-image 501 and a fourth sub-feature f2 corresponding to the second sub-image 502, and the third sub-feature f1 and the fourth sub-feature f2 are input into the second sub-network 512 to obtain a first sub-feature f1′ and a second sub-feature f2′.
In some embodiments, the second sub-network includes a third sub-network and a fourth sub-network, and the occlusion erasure module 52 is further configured to: determining, with a third sub-network of the first model, a first occlusion score based on the third sub-feature and a second occlusion score based on the fourth sub-feature; determining, with the fourth sub-network, the first sub-feature based on the third sub-feature and the first occlusion score, and the second sub-feature based on the fourth sub-feature and the second occlusion score.
Fig. 5E is a schematic diagram of a second sub-network 512 provided in an embodiment of the disclosure. As shown in Fig. 5E, the second sub-network 512 includes a third sub-network 521 and a fourth sub-network 522. The third sub-feature f1 and the fourth sub-feature f2 are input into the third sub-network 521 to obtain a first occlusion score s1 corresponding to the third sub-feature f1 and a second occlusion score s2 corresponding to the fourth sub-feature f2. The first occlusion score s1 and the third sub-feature f1 are input into the fourth sub-network 522 to obtain a first sub-feature f1′, and the second occlusion score s2 and the fourth sub-feature f2 are input into the fourth sub-network 522 to obtain a second sub-feature f2′.
In some embodiments, the second network comprises a fifth subnetwork and a sixth subnetwork, and the feature diffusion module 53 is further configured to: aggregating the first sub-feature and the second sub-feature with a second feature of at least one second object respectively by using a fifth sub-network to obtain a first aggregated sub-feature corresponding to the first sub-feature and a second aggregated sub-feature corresponding to the second sub-feature; with the sixth sub-network, a first target sub-feature is determined based on the first aggregated sub-feature and a second target sub-feature is determined based on the second aggregated sub-feature.
Fig. 5F is a schematic diagram of a second network 520 according to an embodiment of the disclosure. As shown in Fig. 5F, the second network 520 includes a fifth sub-network 521 and a sixth sub-network 522. When the first sub-feature f1′ is input into the fifth sub-network 521, the fifth sub-network 521 searches the second feature memory bank 552 for the K nearest first centers belonging to second objects based on the first sub-feature f1′. A first prediction feature f_q is determined based on the first sub-feature f1′ and the first prediction matrix W_1, a second prediction feature f_c is determined based on the first centers and the second prediction matrix W_2, and a third prediction feature f_v is determined based on the first centers and the third prediction matrix W_3. A first attention matrix m_i is determined based on the first prediction feature f_q and the second prediction feature f_c, and a first aggregated sub-feature f_d is determined based on the first attention matrix m_i and the third prediction feature f_v. The first aggregated sub-feature f_d is input into FFN_1(·) to obtain a fifth sub-feature f″, and the first sub-feature f1′ and the fifth sub-feature f″ are combined and input into the sixth sub-network 522 to obtain the first target sub-feature f_d′.
In some embodiments, the feature diffusion module 53 is further configured to: determining a first attention matrix based on the first sub-feature and each second feature, wherein the first attention matrix is used for characterizing the correlation degree between the first sub-feature and each second feature; determining a first aggregate sub-feature based on each second feature and each first attention matrix; determining a second attention matrix based on the second sub-features and each second feature, wherein the second attention matrix is used for characterizing the association degree between the second sub-features and each second feature; based on each second feature and each second attention matrix, a second aggregate sub-feature is determined.
In some embodiments, the network parameters of the fifth sub-network include a first prediction matrix and a second prediction matrix, and the feature diffusion module 53 is further configured to: determining a first prediction feature based on the first sub-feature and the first prediction matrix; determining a second prediction feature based on each second feature and the second prediction matrix; and determining a first attention matrix based on the first prediction feature and each second prediction feature.
In some embodiments, the network parameters of the fifth sub-network include a third prediction matrix, and the feature diffusion module 53 is further configured to: determining a third prediction feature based on each second feature and the third prediction matrix; based on each third predictive feature and each first attention matrix, a first aggregate sub-feature is determined.
In some embodiments, the sixth sub-network comprises a seventh sub-network and an eighth sub-network, and the feature diffusion module 53 is further configured to: determining, with the seventh sub-network, a fifth sub-feature based on the first aggregated sub-feature and the occlusion mask, and a sixth sub-feature based on the second aggregated sub-feature and the occlusion mask; determining, with the eighth sub-network, the first target sub-feature based on the first sub-feature and the fifth sub-feature, and the second target sub-feature based on the second sub-feature and the sixth sub-feature.
In some embodiments, the update module 54 is further configured to: determining a first target loss value based on the first target sub-feature and the second target sub-feature; determining a second target loss value based on the first sub-feature and the second sub-feature; determining a target loss value based on the first target loss value and the second target loss value; and updating the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
In some embodiments, the update module 54 is further configured to: under the condition that the target loss value does not meet the preset condition, updating the model parameters of the first model to obtain an updated first model, and determining the trained first model based on the updated first model; and under the condition that the target loss value meets a preset condition, determining the updated first model as the trained first model.
In some embodiments, the update module 54 is further configured to: determining a first target sub-loss value based on the first sub-feature and the second sub-feature; determining a second target sub-loss value based on the third sub-feature and the fourth sub-feature; a second target penalty value is determined based on the first target sub-penalty value and the second target sub-penalty value.
In some embodiments, the first sub-image includes label information, the first model includes a second feature memory bank including at least one feature belonging to the at least one object, and the update module 54 is further configured to: determining a third penalty value based on the first occlusion score, the second occlusion score, and the occlusion mask; determining a fourth loss value based on the first sub-feature, the second sub-feature and the label information; determining a fifth loss value based on the first sub-feature, the second sub-feature, and at least one feature of at least one object in the second feature memory; a first target sub-loss value is determined based on the third loss value, the fourth loss value, and the fifth loss value.
In some embodiments, the update module 54 is further configured to: determining a first sub-penalty value based on the first occlusion score and the occlusion mask; determining a second sub-penalty value based on the second occlusion score and the occlusion mask; a third penalty value is determined based on the first sub-penalty value and the second sub-penalty value.
In some embodiments, the update module 54 is further configured to: determining a third sub-loss value based on the first sub-feature and the tag information; determining a fourth sub-loss value based on the second sub-feature and the tag information; a fourth penalty value is determined based on the third sub-penalty value and the fourth sub-penalty value.
In some embodiments, the update module 54 is further configured to: determining a third feature center of the first object and a fourth feature center of the at least one second object from the at least one feature of the at least one object in the second feature memory library; determining a fifth sub-loss value based on the first sub-feature, the third feature center, and each fourth feature center; determining a sixth sub-loss value based on the second sub-feature, the third feature center, and each fourth feature center; a fifth loss value is determined based on the fifth sub-loss value and the sixth sub-loss value.
In some embodiments, the update module 54 is further configured to: determining a seventh sub-loss value based on the third sub-feature and the tag information; determining an eighth sub-loss value based on the fourth sub-feature and the tag information; a second target sub-penalty value is determined based on the seventh sub-penalty value and the eighth sub-penalty value.
Fig. 5G is a schematic diagram of obtaining a target loss value 540 according to an embodiment of the present disclosure, as shown in fig. 5G, the target loss value 540 mainly includes loss values of three parts, namely, the feature extraction module, the occlusion erasing module 52, and the feature diffusion module 53, where:
the loss values for this portion of feature extraction include:
a seventh sub-loss value Loss7 determined based on the third sub-feature f1 and the label information of the first sub-image 501, and an eighth sub-loss value Loss8 determined based on the fourth sub-feature f2 and the label information of the first sub-image 501;
the loss values for this portion of the occlusion erasing module 52 include:
a first sub-loss value Loss31 determined based on the occlusion mask 541 and the first occlusion score s1, and a second sub-loss value Loss32 determined based on the occlusion mask 541 and the second occlusion score s2;
a third sub-loss value Loss41 determined based on the first sub-feature f1′ and the label information of the first sub-image 501, and a fourth sub-loss value Loss42 determined based on the second sub-feature f2′ and the label information of the first sub-image 501;
a fifth sub-loss value Loss51 determined based on the first sub-feature f1′ and the second feature memory bank 552, and a sixth sub-loss value Loss52 determined based on the second sub-feature f2′ and the second feature memory bank 552;
the loss values for this portion of feature diffusion module 53 include:
a ninth sub-loss value Loss11 (corresponding to the first loss value described above) determined based on the first target sub-feature fd1′ and the label information of the first sub-image 501, and a tenth sub-loss value Loss12 (corresponding to the first loss value described above) determined based on the second target sub-feature fd2′ and the label information of the first sub-image 501;
an eleventh loss value Loss21 (corresponding to the second loss value described above) determined based on the first target sub-feature fd1′ and the first feature memory bank 551, and a twelfth loss value Loss22 (corresponding to the second loss value described above) determined based on the second target sub-feature fd2′ and the first feature memory bank 551.
In some embodiments, the model training system further comprises: a second determination module and a third determination module; the second determining module is configured to determine an initial second model based on the trained first model; the third determining module is configured to update the model parameters of the second model based on at least one second image sample, so as to obtain the trained second model.
Compared with the method in the related art, the method provided by the embodiment of the disclosure has at least the following improvements:
1) In the related art, modeling of pedestrian re-identification (ReID) is mainly based on a pose estimation algorithm or a human parsing algorithm for auxiliary training. In the embodiment of the present disclosure, occluded pedestrian re-identification is modeled by utilizing deep learning.
2) In the related art, in the modeling process of pedestrian re-identification, the robustness of a model to occlusion is mainly enhanced based on random erasing, which improves the robustness of the model to non-pedestrian occlusion (NPO) while ignoring the feature interference from non-target pedestrians (NTP). In the embodiment of the present disclosure, in the modeling process of pedestrian re-identification, a Feature Erasing and Diffusion Network (FED) is provided to handle NPO and NTP simultaneously. Specifically, NPO features are eliminated based on an Occlusion Erasing Module (OEM), supplemented by an NPO augmentation strategy that simulates NPO on the whole pedestrian image, so as to generate an accurate occlusion mask. Then, pedestrian features and other memorized features are diffused based on a Feature Diffusion Module (FDM) to synthesize NTP characteristics in the feature space. NPO occlusion interference is simulated at the picture level and NTP interference is simulated at the feature level, which can greatly improve the perception capability of the model on target pedestrians (TP) and reduce the influence of NPO and NTP.
The method provided by the embodiment of the present disclosure has at least the following beneficial effects: 1) the occlusion information of the picture and the features of other pedestrians are fully utilized to simulate non-pedestrian occlusion and non-target pedestrian interference, so that various influencing factors can be comprehensively analyzed and the perception capability of the model on TP is improved; 2) by utilizing deep learning, the result of pedestrian re-identification is more accurate, and the accuracy of pedestrian re-identification in real, complex scenes is improved.
To better illustrate the beneficial effects of the embodiments of the present disclosure, experimental data of the methods provided by the embodiments of the present disclosure and the methods in the related art are described below in comparison.
(1) Data set
The three datasets of Occluded-DukeMTMC (O-Duke), Occluded-REID (O-REID), Partial-REID (P-REID) are ReID datasets with occlusion, and the two datasets of Market-1501 and DukeMTMC-REID are ReID datasets with few occlusions.
(2) Evaluation index
To ensure a fair comparison with existing pedestrian ReID methods, all methods were evaluated under Cumulative Matching Characteristics (CMC) and mean Average Precision (mAP). The CMC curves were used to evaluate the accuracy of human retrieval. The mAP is the average of all the average accuracies. All experiments were performed in a single query.
(3) Initialization of model part parameters
The input image is adjusted to 256 × 128. The first model was trained in an end-to-end fashion with a stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 1e-4. The learning rate was initialized to 0.008 with a cosine learning rate decay. For each input branch, the batch size is 64, containing 16 identities and 4 samples per identity. All experiments were performed on two RTX 1080Ti GPUs. The temperature τ in the contrast loss was set to 0.05 and the number of heads in the FDM was set to 0.8.
For the NPO enhanced occlusion set, the occlusions are only clipped from the training data of O-Duke and used to augment all other data sets. This is because Market-1501 contains few occlusion images, whereas DukeMTMC-reID already contains much occlusion data in the training set.
(4) Results of the experiment
1) Comparison of the method provided by the embodiment of the present disclosure with existing methods on occluded ReID datasets
Table 1 compares the performance of each pedestrian ReID method on the three datasets O-Duke, O-REID and P-REID. Since O-REID and P-REID have no corresponding training set, the test is performed using a model trained on Market-1501. The compared pedestrian ReID methods include: the Part-based Convolutional Baseline (PCB), Deep Spatial feature Reconstruction (DSR), High-Order Re-Identification (HOReID), the Part-Aware Transformer (PAT), Transformer-based Object Re-Identification (TransReID) using a Vision Transformer without the sliding-window setting as the backbone, and the Transformer-based ViT baseline, which performs better on the O-REID and P-REID datasets than TransReID because TransReID uses many dataset-specific markers.
TABLE 1 comparison of Performance of the respective methods on O-Duke, OREID and P-REID datasets
Comparing FED with the existing methods, FED obtains the highest Rank-1 and mAP on both the O-Duke and O-REID datasets. Specifically, on the O-REID dataset, FED achieves 86.3%/79.3% Rank-1/mAP, exceeding the other methods by at least 4.7%/2.6%. On O-Duke, it achieves 68.1%/56.4% Rank-1/mAP, exceeding the other methods by at least 3.6%/0.7%. On the P-REID dataset, it achieves the highest mAP accuracy of 80.5%, exceeding the other methods by 3.9%. Thus, a good representation is obtained on occluded ReID datasets.
2) Comparison of the method provided by the embodiment of the present disclosure with existing methods on holistic person ReID datasets
Experiments were also performed on the holistic person ReID datasets, including Market-1501 and DukeMTMC-reID. The MSE loss is not calculated when training on the DukeMTMC-reID dataset, because there are a large number of NPO in the training set and an accurate occlusion mask cannot be obtained. The results are shown in Table 2. TransReID has no sliding-window setting and the image size is 256 × 128. TransReID achieves better performance than FED on the holistic person datasets, because TransReID is designed specifically for holistic person ReID and encodes camera information during the training process. Nevertheless, FED also achieves 84.9% Rank-1 accuracy on DukeMTMC-reID, exceeding other CNN-based methods and approaching TransReID.
TABLE 2 comparison of Performance of methods on Market-1501 and DukeMTMC-reiD datasets
3) Effectiveness of FED
In Table 3, ablation studies of the NPO augmentation strategy (NPO Aug), the OEM and the FDM are presented. The numbers 1-5 represent the baseline, baseline + NPO Aug, baseline + NPO Aug + OEM, baseline + NPO Aug + FDM and FED, respectively. Model 1 uses ViT as the feature extractor and is optimized by cross entropy loss (ID Loss) and triplet loss. Comparing model 1 (baseline) and model 2 (baseline + NPO Aug), there is a large improvement of 4.9% in Rank-1, indicating that the augmented images are realistic and valuable. Comparing model 2 (baseline + NPO Aug) and model 3 (baseline + NPO Aug + OEM), the OEM can further improve the representation by removing potential NPO information. Comparing model 2 (baseline + NPO Aug) and model 4 (baseline + NPO Aug + FDM), the FDM improves Rank-1 and mAP by 1.7% and 2.4%, respectively. This means that optimizing the network with the diffused features can greatly improve the model's perception of TP. Finally, FED achieves the highest accuracy, indicating that each component works both individually and cooperatively.
TABLE 3 FED effectiveness
4) K-nearest neighbor analysis of feature memory library
Here, the number of searched neighbors K in the feature memory bank search operation is analyzed. In Table 4, K is set to 2, 4, 6 and 8, and experiments are performed on DukeMTMC-reID, Market-1501 and Occluded-DukeMTMC. The performance on the two holistic person ReID datasets, DukeMTMC-reID and Market-1501, remains stable at various K values, fluctuating within 0.5%. For Market-1501, NPO and NTP are few, which fails to highlight the effectiveness of the FDM. For DukeMTMC-reID, where a large amount of training data carries NPO and NTP, the loss constraints keep the network highly accurate. For Occluded-DukeMTMC, since all training data are whole pedestrians, introducing the FDM can well simulate the multi-pedestrian situation in the test set. With increasing K, the FDM can better preserve the characteristics of TP and introduce realistic noise.
TABLE 4K neighbor analysis
5) Qualitative analysis of FED
Fig. 5H is a schematic diagram of the occlusion scores of pedestrian images provided by an embodiment of the present disclosure. Fig. 5H shows the occlusion scores output by the OEM for some pedestrian images, including images with NPO and with non-target pedestrians (NTP). As can be seen in Fig. 5H, for images 551 and 552 with vertical object occlusion, the occlusion scores are hardly affected, since occluding less than half of a roughly symmetric pedestrian is not critical for pedestrian ReID. For images 553 and 554 with horizontal occlusion, the OEM can accurately identify the NPO and mark it with a smaller occlusion score. For the multi-pedestrian images 555 and 556, the OEM identifies each stripe as valuable. Therefore, the subsequent FDM is critical to improving model performance.
6) Search result examples using feature and distribution characterization
Fig. 5I is a schematic diagram of image retrieval results according to an embodiment of the present disclosure. As shown in Fig. 5I, the retrieval results of TransReID and FED are compared. Images 561 and 562 are object-occlusion images; clearly, FED has a better recognition capability for NPO, and the target pedestrian can be accurately retrieved. Images 563 and 564 are multi-pedestrian images, for which FED has a stronger perception capability for TP and achieves higher retrieval accuracy.
Based on the foregoing embodiments, an embodiment of the present disclosure provides a model training device. Fig. 6 is a schematic diagram illustrating the composition structure of the model training device provided in the embodiment of the present disclosure. As shown in Fig. 6, the model training device 60 includes a first acquiring module 61, a feature extraction module 62, a first updating module 63, a first determining module 64, and a second updating module 65.
A first acquiring module 61, configured to acquire a first image sample containing a first object;
the feature extraction module 62 is configured to perform feature extraction on the first image sample by using a first network of a first model to be trained to obtain a first feature of the first object;
a first updating module 63, configured to update the first features based on second features of at least one second object by using a second network of the first model, respectively, to obtain first target features corresponding to the first features, where a similarity between each second object and the first object is not less than a first threshold;
a first determination module 64 for determining a target loss value based on the first target characteristic;
and a second updating module 65, configured to update the model parameters of the first model at least once based on the target loss value, so as to obtain the trained first model.
In some embodiments, the first image sample includes label information, the first model includes a first feature memory bank including at least one feature belonging to the at least one object, and the first determining module 64 is further configured to: determining a first loss value based on the first target feature and the tag information; determining a second loss value based on the first target feature and at least one feature of at least one object in the first feature repository; a target loss value is determined based on the first loss value and the second loss value.
In some embodiments, the first determining module 64 is further configured to: determining a first feature center of the first object and a second feature center of the at least one second object from the at least one feature of the at least one object in the first feature memory library; a second loss value is determined based on the first target feature, the first feature center, and each of the second feature centers.
In some embodiments, the first feature repository comprises feature sets belonging to at least one object, each feature set comprising at least one feature of the object, and the apparatus further comprises: and the third updating module is used for updating the feature set belonging to the first object in the first feature memory base based on the first target feature.
In some embodiments, the first obtaining module 61 is further configured to: obtain a first sub-image and a second sub-image containing the first object, wherein the second sub-image is an image obtained by at least performing occlusion processing on the first sub-image; the feature extraction module 62 is further configured to: perform feature extraction on the first sub-image by using the first network of the first model to be trained to obtain a first sub-feature of the first object, and perform feature extraction on the second sub-image to obtain a second sub-feature of the first object; the first updating module 63 is further configured to: respectively update the first sub-feature and the second sub-feature based on the second feature of at least one second object by using the second network of the first model, to obtain a first target sub-feature corresponding to the first sub-feature and a second target sub-feature corresponding to the second sub-feature; and the first determining module 64 is further configured to: determine a target loss value based on the first target sub-feature and the second target sub-feature.
In some embodiments, the first determining module 64 is further configured to: determining a first target loss value based on the first target sub-feature and the second target sub-feature; determining a second target loss value based on the first sub-feature and the second sub-feature; a target loss value is determined based on the first target loss value and the second target loss value.
In some embodiments, the first obtaining module 61 is further configured to: acquiring a first sub-image containing a first object; and at least carrying out occlusion processing on the first sub-image based on a preset occlusion set to obtain a second sub-image, wherein the occlusion set comprises at least one occlusion image.
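For illustration, a sketch of the occlusion processing based on a preset occlusion set: one occlusion image is chosen at random and pasted onto the first sub-image to obtain the second sub-image. The paste position and the occluder sizes are illustrative assumptions.

```python
# Hypothetical occlusion augmentation; the occlusion set is assumed to be a list
# of occluder tensors of shape (3, h, w) no larger than the pedestrian image.
import random
import torch

def occlude(first_sub_image, occlusion_set):
    """Paste one randomly chosen occluder onto the first sub-image."""
    second_sub_image = first_sub_image.clone()
    occluder = random.choice(occlusion_set)
    _, H, W = second_sub_image.shape
    _, h, w = occluder.shape
    top = random.randint(0, H - h)      # random vertical paste position
    left = random.randint(0, W - w)     # random horizontal paste position
    second_sub_image[:, top:top + h, left:left + w] = occluder
    return second_sub_image

first_sub_image = torch.rand(3, 256, 128)
occlusion_set = [torch.rand(3, 64, 128), torch.rand(3, 128, 64)]
second_sub_image = occlude(first_sub_image, occlusion_set)
```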
In some embodiments, the first network includes a first subnetwork and a second subnetwork, and the feature extraction module 62 is further configured to: respectively extracting the features of the first sub-image and the second sub-image by using a first sub-network of a first model to be trained to obtain a third sub-feature corresponding to the first sub-image and a fourth sub-feature corresponding to the second sub-image; a second sub-network of the first model is utilized, the first sub-feature is determined based on the third sub-feature, and the second sub-feature is determined based on the fourth sub-feature.
In some embodiments, the first determining module 64 is further configured to: determining a first target sub-loss value based on the first sub-feature and the second sub-feature; determining a second target sub-loss value based on the third sub-feature and the fourth sub-feature; a second target penalty value is determined based on the first target sub-penalty value and the second target sub-penalty value.
In some embodiments, the first sub-image includes label information, and the first determining module 64 is further configured to: determining a seventh sub-loss value based on the third sub-feature and the tag information; determining an eighth sub-loss value based on the fourth sub-feature and the tag information; a second target sub-penalty value is determined based on the seventh sub-penalty value and the eighth sub-penalty value.
In some embodiments, the second sub-network includes a third sub-network and a fourth sub-network, and the feature extraction module 62 is further configured to: determining, with a third sub-network of the first model, a first occlusion score based on the third sub-feature and a second occlusion score based on the fourth sub-feature; determining, with the fourth sub-network, the first sub-feature based on the third sub-feature and the first occlusion score, and the second sub-feature based on the fourth sub-feature and the second occlusion score.
In some embodiments, the third sub-network comprises a pooling sub-network and at least one occlusion erase sub-network, the first occlusion score comprises at least one first occlusion sub-score, and the second occlusion score comprises at least one second occlusion sub-score, and the feature extraction module 62 is further configured to: divide the third sub-feature into at least one third sub-portion feature and the fourth sub-feature into at least one fourth sub-portion feature by using the pooling sub-network; and determine, with each occlusion erase sub-network, each first occlusion sub-score based on each third sub-portion feature, and each second occlusion sub-score based on each fourth sub-portion feature.
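For illustration, a sketch of the pooling sub-network and the occlusion erase sub-networks: the backbone feature is divided into horizontal part features, and each part receives an occlusion sub-score in [0, 1] from its own small scoring head. The token layout and the head design are assumptions.

```python
# Hypothetical part pooling and per-part occlusion scoring; sizes are illustrative.
import torch
import torch.nn as nn

class OcclusionScorer(nn.Module):
    def __init__(self, feat_dim=256, num_parts=4):
        super().__init__()
        self.num_parts = num_parts
        # One occlusion erase head per part, each producing a score in [0, 1].
        self.erase_heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid()) for _ in range(num_parts)])

    def forward(self, token_features):                # (B, N, D) backbone tokens
        B, N, D = token_features.shape
        # Pooling sub-network: split the tokens into num_parts horizontal stripes.
        parts = token_features.view(B, self.num_parts, N // self.num_parts, D).mean(dim=2)
        # Each occlusion erase sub-network scores its own stripe.
        scores = torch.cat(
            [head(parts[:, i]) for i, head in enumerate(self.erase_heads)], dim=1)
        return parts, scores                           # (B, P, D), (B, P)

parts, scores = OcclusionScorer()(torch.randn(2, 16, 256))
```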
In some embodiments, the feature extraction module 62 is further configured to: determining, with the fourth sub-network, first sub-portion features based on each third sub-portion feature of the third sub-features and each first occlusion sub-score, and second sub-portion features based on each fourth sub-portion feature of the fourth sub-features and each second occlusion sub-score; a first sub-feature is determined based on each first sub-portion feature and a second sub-feature is determined based on each second sub-portion feature.
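For illustration, a sketch of the fourth sub-network: each part feature is weighted by its occlusion sub-score, so that occluded stripes contribute less to the final sub-feature. Weighting and concatenation is an assumption; the embodiment only fixes the inputs and outputs.

```python
# Hypothetical combination of part features and occlusion sub-scores.
import torch

def combine_parts(parts, scores):
    # parts: (B, P, D) part features; scores: (B, P) occlusion sub-scores.
    weighted = parts * scores.unsqueeze(-1)      # suppress occluded stripes
    return weighted.flatten(start_dim=1)         # (B, P * D) first / second sub-feature

sub_feature = combine_parts(torch.randn(2, 4, 256), torch.rand(2, 4))
```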
In some embodiments, the first sub-image includes label information, the first model includes a second feature memory bank including at least one feature belonging to the at least one object, the first determination module 64 is further configured to: determining an occlusion mask based on the first sub-image and the second sub-image; determining a third penalty value based on the first occlusion score, the second occlusion score, and the occlusion mask; determining a fourth loss value based on the first sub-feature, the second sub-feature and the label information; determining a fifth loss value based on the first sub-feature, the second sub-feature, and at least one feature of at least one object in the second feature memory; a first target sub-loss value is determined based on the third loss value, the fourth loss value, and the fifth loss value.
In some embodiments, the first determining module 64 is further configured to: dividing the first sub-image and the second sub-image into at least one first sub-portion image and at least one second sub-portion image; determining an occlusion sub-mask based on each first sub-portion image and each second sub-portion image; based on each occlusion sub-mask, an occlusion mask is determined.
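For illustration, a sketch of the occlusion-mask computation: the first and second sub-images are divided into the same horizontal sub-portion images, a sub-portion whose pixels differ between the two images is treated as occluded, and the per-portion indicators form the occlusion mask. The 0/1 convention and the tolerance are assumptions.

```python
# Hypothetical occlusion mask from the clean and occluded sub-images.
import torch

def occlusion_mask(first_sub_image, second_sub_image, num_parts=4, tol=1e-6):
    first_parts = first_sub_image.chunk(num_parts, dim=1)    # split along image height
    second_parts = second_sub_image.chunk(num_parts, dim=1)
    sub_masks = [
        float((a - b).abs().max() <= tol)                    # 1 if the stripe is untouched
        for a, b in zip(first_parts, second_parts)]
    return torch.tensor(sub_masks)                            # (num_parts,) occlusion mask

mask = occlusion_mask(torch.rand(3, 256, 128), torch.rand(3, 256, 128))
```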
In some embodiments, the first determining module 64 is further configured to: determining a first sub-penalty value based on the first occlusion score and the occlusion mask; determining a second sub-penalty value based on the second occlusion score and the occlusion mask; a third loss value is determined based on the first sub-loss value and the second sub-loss value.
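For illustration, a sketch of the third loss value: each predicted occlusion score is supervised by the corresponding entry of the occlusion mask. Binary cross-entropy is an assumed choice, as the embodiment does not name a specific loss.

```python
# Hypothetical occlusion-score supervision using the occlusion mask.
import torch
import torch.nn.functional as F

def third_loss(first_scores, second_scores, mask):
    # first_scores / second_scores: (B, P) predicted occlusion scores; mask: (P,)
    first_sub_loss = F.binary_cross_entropy(first_scores, mask.expand_as(first_scores))
    second_sub_loss = F.binary_cross_entropy(second_scores, mask.expand_as(second_scores))
    return first_sub_loss + second_sub_loss

loss = third_loss(torch.rand(2, 4), torch.rand(2, 4), torch.tensor([1., 1., 0., 1.]))
```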
In some embodiments, the first determining module 64 is further configured to: determining a third sub-loss value based on the first sub-feature and the tag information; determining a fourth sub-loss value based on the second sub-feature and the tag information; a fourth penalty value is determined based on the third sub-penalty value and the fourth sub-penalty value.
In some embodiments, the first determining module 64 is further configured to: determining a third feature center of the first object and a fourth feature center of the at least one second object from the at least one feature of the at least one object in the second feature memory library; determining a fifth sub-loss value based on the first sub-feature, the third feature center, and each fourth feature center; determining a sixth sub-loss value based on the second sub-feature, the third feature center, and each fourth feature center; a fifth loss value is determined based on the fifth sub-loss value and the sixth sub-loss value.
In some embodiments, the second network comprises a fifth sub-network and a sixth sub-network, the first updating module 63 is further configured to: aggregating the first sub-feature and the second sub-feature with a second feature of at least one second object respectively by using a fifth sub-network to obtain a first aggregated sub-feature corresponding to the first sub-feature and a second aggregated sub-feature corresponding to the second sub-feature; with the sixth sub-network, a first target sub-feature is determined based on the first aggregated sub-feature and a second target sub-feature is determined based on the second aggregated sub-feature.
In some embodiments, the first updating module 63 is further configured to: determining a first attention matrix based on the first sub-features and each second feature, wherein the first attention matrix is used for representing the correlation degree between the first sub-features and each second feature; determining a first aggregate sub-feature based on each second feature and each first attention matrix; determining a second attention matrix based on the second sub-features and each second feature, wherein the second attention matrix is used for characterizing the correlation degree between the second sub-features and each second feature; based on each second feature and each second attention matrix, a second aggregate sub-feature is determined.
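For illustration, a sketch of the attention-based aggregation: the attention matrix is computed from the correlation between a sub-feature and the features of the similar second objects, and the aggregated sub-feature is the attention-weighted sum of those features. The scaled dot-product form is an assumption.

```python
# Hypothetical attention aggregation over the features of similar second objects.
import torch
import torch.nn.functional as F

def aggregate(sub_feature, second_features):
    # sub_feature: (B, D); second_features: (K, D) features of the second objects.
    attention = F.softmax(
        sub_feature @ second_features.t() / second_features.size(1) ** 0.5, dim=1)  # (B, K)
    return attention @ second_features                      # (B, D) aggregated sub-feature

first_aggregated = aggregate(torch.randn(2, 256), torch.randn(5, 256))
```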
In some embodiments, the sixth sub-network comprises a seventh sub-network and an eighth sub-network, the first updating module 63 is further configured to: determining, with the seventh sub-network, a fifth sub-feature based on the first aggregated sub-feature and the occlusion mask, and a sixth sub-feature based on the second aggregated sub-feature and the occlusion mask; determining, with the eighth sub-network, the first target sub-feature based on the first sub-feature and the fifth sub-feature, and the second target sub-feature based on the second sub-feature and the sixth sub-feature.
Based on the above embodiments, an image recognition apparatus is provided in the embodiments of the present disclosure, fig. 7 is a schematic structural diagram of the image recognition apparatus provided in the embodiments of the present disclosure, and as shown in fig. 7, the image recognition apparatus 70 includes a second obtaining module 71 and a recognition module 72.
A second obtaining module 71, configured to obtain a first image and a second image;
a recognition module 72, configured to recognize the object in the first image and the object in the second image by using a trained target model, to obtain a recognition result, where the trained target model includes: a first model obtained by adopting the model training method; the recognition result represents that the object in the first image and the object in the second image are the same object or different objects.
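For illustration, a sketch of the recognition step with a trained model: features are extracted from the first image and the second image, their similarity is computed, and the pair is judged to be the same object when the similarity reaches a threshold. The cosine similarity, the threshold value, and the stand-in feature extractor are assumptions.

```python
# Hypothetical recognition step; the backbone below is only a stand-in for the
# trained first model, and the threshold is illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognize(model, first_image, second_image, threshold=0.6):
    feat_a = model(first_image.unsqueeze(0))      # (1, D) feature of the first image
    feat_b = model(second_image.unsqueeze(0))     # (1, D) feature of the second image
    similarity = F.cosine_similarity(feat_a, feat_b).item()
    return "same object" if similarity >= threshold else "different objects"

backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 256 * 128, 256))
result = recognize(backbone, torch.rand(3, 256, 128), torch.rand(3, 256, 128))
```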
The above description of the apparatus embodiments is similar to the above description of the method embodiments, and the apparatus embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present disclosure, reference is made to the description of the method embodiments of the present disclosure.
It should be noted that, in the embodiment of the present disclosure, if the method is implemented in the form of a software functional module and sold or used as a standalone product, the method may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present disclosure are not limited to any specific combination of hardware and software.
An embodiment of the present disclosure provides an electronic device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the above method when executing the computer program.
Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method. The computer readable storage medium may be transitory or non-transitory.
The disclosed embodiments provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program that when read and executed by a computer performs some or all of the steps of the above method. The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
It should be noted that fig. 8 is a schematic diagram of a hardware entity of an electronic device in an embodiment of the present disclosure, and as shown in fig. 8, the hardware entity of the electronic device 800 includes: a processor 801, a communication interface 802, and a memory 803, wherein:
the processor 801 generally controls the overall operation of the electronic device 800.
The communication interface 802 may enable the electronic device to communicate with other terminals or servers via a network.
The Memory 803 is configured to store instructions and applications executable by the processor 801, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 801 and modules in the electronic device 800, and may be implemented by a FLASH Memory (FLASH) or a Random Access Memory (RAM). Data may be transferred between the processor 801, the communication interface 802, and the memory 803 via the bus 804.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present disclosure, reference is made to the description of the embodiments of the method of the present disclosure.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present disclosure, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure. The above-mentioned serial numbers of the embodiments of the present disclosure are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, all the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated unit of the present disclosure may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods according to the embodiments of the present disclosure. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The above description is only an embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present disclosure, and all the changes or substitutions should be covered by the scope of the present disclosure.

Claims (26)

1. A method of model training, the method comprising:
obtaining a first image sample containing a first object;
performing feature extraction on the first image sample by using a first network of a first model to be trained to obtain a first feature of the first object;
updating the first feature by using a second network of the first model based on a second feature of at least one second object to obtain a first target feature corresponding to the first feature, wherein the similarity between each second object and the first object is not less than a first threshold value;
determining a target loss value based on the first target feature;
and updating the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
2. The method of claim 1, wherein the first image sample comprises label information, wherein the first model comprises a first feature memory comprising at least one feature belonging to at least one object;
determining a target loss value based on the first target feature, comprising:
determining a first loss value based on the first target feature and the tag information;
determining a second loss value based on the first target feature and at least one feature of at least one object in the first feature memory library;
determining the target loss value based on the first loss value and the second loss value.
3. The method of claim 2, wherein determining a second loss value based on the first target feature and at least one feature of at least one object in the first feature repository comprises:
determining a first feature center of the first object and a second feature center of at least one second object from at least one feature of at least one object in the first feature memory library;
determining the second loss value based on the first target feature, the first feature center, and each of the second feature centers.
4. The method according to claim 2 or 3, wherein the first feature memory base comprises feature sets belonging to at least one object, each feature set comprising at least one feature of the object; the method further comprises the following steps:
and updating the feature set belonging to the first object in the first feature memory bank based on the first target feature.
5. The method according to any one of claims 1 to 4,
the acquiring a first image sample containing a first object, comprising: acquiring a first sub-image and a second sub-image containing a first object, wherein the second sub-image is an image obtained by at least performing occlusion processing on the first sub-image;
the extracting features of the first image sample by using the first network of the first model to be trained to obtain the first features of the first object includes: performing feature extraction on the first sub-image by using a first network of a first model to be trained to obtain a first sub-feature of the first object, and performing feature extraction on the second sub-image to obtain a second sub-feature of the first object;
the updating, by the second network using the first model, the first feature based on a second feature of at least one second object to obtain a first target feature corresponding to the first feature includes: respectively updating the first sub-feature and the second sub-feature based on a second feature of at least one second object by using a second network of the first model to obtain a first target sub-feature corresponding to the first sub-feature and a second target sub-feature corresponding to the second sub-feature;
determining a target loss value based on the first target feature, comprising: determining the target loss value based on the first target sub-feature and the second target sub-feature.
6. The method of claim 5, wherein determining the target loss value based on the first target sub-feature and the second target sub-feature comprises:
determining a first target loss value based on the first target sub-feature and the second target sub-feature;
determining a second target loss value based on the first sub-feature and the second sub-feature;
determining the target loss value based on the first target loss value and the second target loss value.
7. The method of claim 6, wherein obtaining the first sub-image and the second sub-image containing the first object comprises:
acquiring a first sub-image containing a first object;
and at least carrying out occlusion processing on the first sub-image based on a preset occlusion set to obtain the second sub-image, wherein the occlusion set comprises at least one occlusion image.
8. The method of claim 6 or 7, wherein the first network comprises a first sub-network and a second sub-network;
the performing feature extraction on the first sub-image by using the first network of the first model to be trained to obtain a first sub-feature of the first object, and performing feature extraction on the second sub-image to obtain a second sub-feature of the first object, includes:
respectively extracting the features of the first sub-image and the second sub-image by using a first sub-network of a first model to be trained to obtain a third sub-feature corresponding to the first sub-image and a fourth sub-feature corresponding to the second sub-image;
determining, with a second sub-network of the first model, the first sub-feature based on the third sub-feature and the second sub-feature based on the fourth sub-feature.
9. The method of claim 8, wherein determining a second target loss value based on the first sub-feature and the second sub-feature comprises:
determining a first target sub-penalty value based on the first sub-feature and the second sub-feature;
determining a second target sub-penalty value based on the third sub-feature and the fourth sub-feature;
determining the second target penalty value based on the first target sub-penalty value and the second target sub-penalty value.
10. The method of claim 9, wherein the first sub-image comprises label information;
said determining a second target sub-penalty value based on said third sub-feature and said fourth sub-feature comprises:
determining a seventh sub-loss value based on the third sub-feature and the tag information;
determining an eighth sub-loss value based on the fourth sub-feature and the tag information;
determining the second target sub-penalty value based on the seventh sub-penalty value and the eighth sub-penalty value.
11. The method of claim 9 or 10, wherein the second sub-network comprises a third sub-network and a fourth sub-network;
the determining, with the second sub-network of the first model, the first sub-feature based on the third sub-feature and the second sub-feature based on the fourth sub-feature comprises:
determining, with a third sub-network of the first model, a first occlusion score based on the third sub-feature and a second occlusion score based on the fourth sub-feature;
determining, with the fourth sub-network, the first sub-feature based on the third sub-feature and the first occlusion score, and the second sub-feature based on the fourth sub-feature and the second occlusion score.
12. The method of claim 11, wherein the third sub-network comprises a pooling sub-network and at least one occlusion erase sub-network, wherein the first occlusion score comprises at least one first occlusion sub-score, and wherein the second occlusion score comprises at least one second occlusion sub-score;
the determining, with the third sub-network of the first model, a first occlusion score based on the third sub-feature and a second occlusion score based on the fourth sub-feature comprises:
dividing, with the pooling sub-network, the third sub-feature into at least one third sub-portion feature and the fourth sub-feature into at least one fourth sub-portion feature;
determining, with each said occlusion erase sub-network, each said first occlusion sub-score based on each said third sub-portion feature and each said second occlusion sub-score based on each said fourth sub-portion feature.
13. The method of claim 12, wherein said determining, with the fourth sub-network, the first sub-feature based on the third sub-feature and the first occlusion score and the second sub-feature based on the fourth sub-feature and the second occlusion score comprises:
determining, with the fourth sub-network, first sub-portion features based on each of the third sub-portion features and each of the first occlusion sub-scores of the third sub-features, and second sub-portion features based on each of the fourth sub-portion features and each of the second occlusion sub-scores of the fourth sub-features;
determining the first sub-feature based on each of the first sub-portion features, and determining the second sub-feature based on each of the second sub-portion features.
14. The method according to any one of claims 11 to 13, wherein the first sub-image comprises label information, the first model comprises a second feature repository comprising at least one feature belonging to at least one object;
said determining a first target sub-penalty value based on said first sub-feature and said second sub-feature comprises:
determining an occlusion mask based on the first sub-image and the second sub-image;
determining a third penalty value based on the first occlusion score, the second occlusion score, and the occlusion mask;
determining a fourth loss value based on the first sub-feature, the second sub-feature, and the tag information;
determining a fifth loss value based on the first sub-feature, the second sub-feature, and at least one feature of at least one object in the second feature repository;
determining the first target sub-loss value based on the third loss value, the fourth loss value, and the fifth loss value.
15. The method of claim 14, wherein determining an occlusion mask based on the first sub-image and the second sub-image comprises:
dividing the first sub-image and the second sub-image into at least one first sub-portion image and at least one second sub-portion image;
determining an occlusion sub-mask based on each of the first sub-portion images and each of the second sub-portion images;
determining the occlusion mask based on each of the occlusion sub-masks.
16. The method of claim 14 or 15, wherein determining a third penalty value based on the first occlusion score, the second occlusion score, and the occlusion mask comprises:
determining a first sub-penalty value based on the first occlusion score and the occlusion mask;
determining a second sub-penalty value based on the second occlusion score and the occlusion mask;
determining the third penalty value based on the first sub-penalty value and the second sub-penalty value.
17. The method of any of claims 14 to 16, wherein determining a fourth loss value based on the first sub-feature, the second sub-feature, and the tag information comprises:
determining a third sub-loss value based on the first sub-feature and the tag information;
determining a fourth sub-loss value based on the second sub-feature and the tag information;
determining the fourth penalty value based on the third sub-penalty value and the fourth sub-penalty value.
18. The method of any one of claims 14 to 17, wherein determining a fifth loss value based on the first sub-feature, the second sub-feature, and at least one feature of at least one object in the second feature repository comprises:
determining a third feature center of the first object and a fourth feature center of at least one second object from at least one feature of at least one object in the second feature memory library;
determining a fifth sub-loss value based on the first sub-feature, the third feature center, and each of the fourth feature centers;
determining a sixth sub-loss value based on the second sub-feature, the third feature center, and each of the fourth feature centers;
determining the fifth loss value based on the fifth sub-loss value and the sixth sub-loss value.
19. The method according to any of claims 14 to 18, wherein the second network comprises a fifth sub-network and a sixth sub-network;
the updating, by the second network using the first model, the first sub-feature and the second sub-feature based on a second feature of at least one second object, respectively, to obtain a first target sub-feature corresponding to the first sub-feature and a second target sub-feature corresponding to the second sub-feature, includes:
aggregating, by using the fifth sub-network, the first sub-feature and the second sub-feature with a second feature of at least one second object, respectively, to obtain a first aggregated sub-feature corresponding to the first sub-feature and a second aggregated sub-feature corresponding to the second sub-feature;
determining, with the sixth sub-network, the first target sub-feature based on the first aggregate sub-feature and the second target sub-feature based on the second aggregate sub-feature.
20. The method of claim 19, wherein the aggregating, by using the fifth sub-network, the first sub-feature and the second sub-feature with a second feature of at least one second object respectively to obtain a first aggregated sub-feature corresponding to the first sub-feature and a second aggregated sub-feature corresponding to the second sub-feature comprises:
determining a first attention matrix based on the first sub-feature and each of the second features, wherein the first attention matrix is used for characterizing the correlation degree between the first sub-feature and each of the second features;
determining the first aggregate sub-feature based on each of the second features and each of the first attention matrices;
determining a second attention matrix based on the second sub-features and each second feature, wherein the second attention matrix is used for characterizing the correlation degree between the second sub-features and each second feature;
determining the second aggregate sub-feature based on each of the second features and each of the second attention matrices.
21. The method of claim 19 or 20, wherein the sixth sub-network comprises a seventh sub-network and an eighth sub-network;
the determining, with the sixth sub-network, the first target sub-feature based on the first aggregate sub-feature and the second target sub-feature based on the second aggregate sub-feature comprises:
determining, with the seventh sub-network, a fifth sub-feature based on the first aggregated sub-feature and the occlusion mask, and a sixth sub-feature based on the second aggregated sub-feature and the occlusion mask;
determining, with the eighth sub-network, the first target sub-feature based on the first sub-feature and the fifth sub-feature, and the second target sub-feature based on the second sub-feature and the sixth sub-feature.
22. An image recognition method, characterized in that the method comprises:
acquiring a first image and a second image;
recognizing the object in the first image and the object in the second image by using a trained target model to obtain a recognition result, wherein the trained target model comprises: a first model obtained by the model training method according to any one of claims 1 to 21; the recognition result represents that the object in the first image and the object in the second image are the same object or different objects.
23. A model training apparatus, the apparatus comprising:
a first acquisition module for acquiring a first image sample containing a first object;
a feature extraction module for performing feature extraction on the first image sample by using a first network of a first model to be trained to obtain a first feature of the first object;
a first updating module, configured to update, by using a second network of the first model, the first features based on second features of at least one second object, respectively, to obtain first target features corresponding to the first features, where a similarity between each second object and the first object is not less than a first threshold;
a first determination module for determining a target loss value based on the first target feature;
and the second updating module is used for updating the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
24. An image recognition apparatus, characterized in that the apparatus comprises:
the second acquisition module is used for acquiring the first image and the second image;
a recognition module, configured to recognize, by using a trained target model, an object in the first image and an object in the second image to obtain a recognition result, where the trained target model includes: a first model obtained by the model training method according to any one of claims 1 to 21; the recognition result represents that the object in the first image and the object in the second image are the same object or different objects.
25. An electronic device comprising a processor and a memory, the memory storing a computer program operable on the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1 to 22.
26. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 22.
CN202210107742.9A 2022-01-28 2022-01-28 Model training and image recognition method and device, equipment and storage medium Pending CN114445681A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210107742.9A CN114445681A (en) 2022-01-28 2022-01-28 Model training and image recognition method and device, equipment and storage medium
PCT/CN2022/127109 WO2023142551A1 (en) 2022-01-28 2022-10-24 Model training and image recognition methods and apparatuses, device, storage medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210107742.9A CN114445681A (en) 2022-01-28 2022-01-28 Model training and image recognition method and device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114445681A true CN114445681A (en) 2022-05-06

Family

ID=81371764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210107742.9A Pending CN114445681A (en) 2022-01-28 2022-01-28 Model training and image recognition method and device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114445681A (en)
WO (1) WO2023142551A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115022282A (en) * 2022-06-06 2022-09-06 天津大学 Novel domain name generation model establishment and application
WO2023142551A1 (en) * 2022-01-28 2023-08-03 上海商汤智能科技有限公司 Model training and image recognition methods and apparatuses, device, storage medium and computer program product
WO2024021283A1 (en) * 2022-07-28 2024-02-01 深圳职业技术学院 Person re-identification method, apparatus, and device based on heterogeneous network feature interaction

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117372818B (en) * 2023-12-06 2024-04-12 深圳须弥云图空间科技有限公司 Target re-identification method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329785A (en) * 2020-11-25 2021-02-05 Oppo广东移动通信有限公司 Image management method, device, terminal and storage medium
CN113421192B (en) * 2021-08-24 2021-11-19 北京金山云网络技术有限公司 Training method of object statistical model, and statistical method and device of target object
CN113780243B (en) * 2021-09-29 2023-10-17 平安科技(深圳)有限公司 Training method, device, equipment and storage medium for pedestrian image recognition model
CN114445681A (en) * 2022-01-28 2022-05-06 上海商汤智能科技有限公司 Model training and image recognition method and device, equipment and storage medium

Also Published As

Publication number Publication date
WO2023142551A1 (en) 2023-08-03

Similar Documents

Publication Publication Date Title
CN111523621B (en) Image recognition method and device, computer equipment and storage medium
Lecoutre et al. Recognizing art style automatically in painting with deep learning
CN114445681A (en) Model training and image recognition method and device, equipment and storage medium
WO2020114378A1 (en) Video watermark identification method and apparatus, device, and storage medium
Klibisz et al. Fast, simple calcium imaging segmentation with fully convolutional networks
CN104616316B (en) Personage's Activity recognition method based on threshold matrix and Fusion Features vision word
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
CN106157330B (en) Visual tracking method based on target joint appearance model
CN110516707B (en) Image labeling method and device and storage medium thereof
AU2019200711A1 (en) Biometric verification
CN110222572A (en) Tracking, device, electronic equipment and storage medium
CN112818995B (en) Image classification method, device, electronic equipment and storage medium
CN111439267A (en) Method and device for adjusting cabin environment
Gündoğdu et al. The visual object tracking VOT2016 challenge results
Singh et al. A novel position prior using fusion of rule of thirds and image center for salient object detection
CN110334628B (en) Outdoor monocular image depth estimation method based on structured random forest
CN114943937A (en) Pedestrian re-identification method and device, storage medium and electronic equipment
Singh et al. Performance enhancement of salient object detection using superpixel based Gaussian mixture model
Sun et al. Perceptual multi-channel visual feature fusion for scene categorization
Islam et al. Large-scale geo-facial image analysis
CN112257628A (en) Method, device and equipment for identifying identities of outdoor competition athletes
CN112488985A (en) Image quality determination method, device and equipment
CN110059613B (en) Video image smoke and fire separation and detection method based on sparse representation
CN111275183A (en) Visual task processing method and device and electronic system
Zhang et al. Reidentification of Persons Using Clothing Features in Real‐Life Video

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination