WO2023142551A1 - Model training and image recognition methods and apparatuses, device, storage medium and computer program product


Info

Publication number
WO2023142551A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
feature
loss value
image
target
Application number
PCT/CN2022/127109
Other languages
English (en)
Chinese (zh)
Inventor
唐诗翔
朱烽
赵瑞
Original Assignee
上海商汤智能科技有限公司
Application filed by 上海商汤智能科技有限公司
Publication of WO2023142551A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/217: Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/22: Matching criteria, e.g. proximity measures

Definitions

  • The embodiments of the present disclosure are based on, and claim priority to, the Chinese patent application with application number 202210107742.9, filed on January 28, 2022 and entitled "Model training and image recognition method and device, equipment and storage medium".
  • The entire content of that Chinese patent application is hereby incorporated into the present disclosure by reference.
  • the present disclosure relates to but not limited to the field of computer technology, and in particular relates to a model training and image recognition method and device, device, storage medium and computer program product.
  • Object re-identification is also known as object re-ID.
  • Object re-identification is a technology that uses computer vision technology to determine whether a specific object exists in an image or video sequence.
  • Object re-identification is widely considered as a subproblem of image retrieval, i.e., given an image containing an object, retrieve images containing that object across devices. The differences between devices, shooting angles, environments and other factors will all affect the results of object re-identification.
  • Embodiments of the present disclosure provide methods and apparatuses for model training and image recognition, a device, a storage medium, and a computer program product.
  • An embodiment of the present disclosure provides a model training method, which includes:
  • acquiring a first image sample containing a first object; performing feature extraction on the first image sample by using a first network of a first model to be trained, to obtain a first feature of the first object;
  • updating the first feature based on a second feature of at least one second object by using a second network of the first model, to obtain a first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than a first threshold;
  • determining a target loss value based on the first target feature; and
  • updating the model parameters of the first model at least once based on the target loss value, to obtain the trained first model.
  • An embodiment of the present disclosure provides an image recognition method, the method comprising:
  • acquiring a first image and a second image; and using a trained target model to identify the object in the first image and the object in the second image to obtain a recognition result, wherein the trained target model includes the first model obtained by the above model training method, and the recognition result indicates that the object in the first image and the object in the second image are the same object or different objects.
  • An embodiment of the present disclosure provides a model training device, which includes:
  • a first acquisition part configured to acquire a first image sample containing a first object
  • the feature extraction part is configured to use the first network of the first model to be trained to perform feature extraction on the first image sample to obtain the first feature of the first object;
  • the first update part is configured to use the second network of the first model to update the first feature based on the second feature of at least one second object, to obtain the first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than the first threshold;
  • the first determination part is configured to determine a target loss value based on the first target feature
  • the second updating part is configured to update the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
  • An embodiment of the present disclosure provides an image recognition device, which includes:
  • a second acquisition part configured to acquire the first image and the second image
  • the identification part is configured to use the trained target model to identify the object in the first image and the object in the second image to obtain a recognition result, wherein the trained target model includes the first model obtained by using the above model training method, and the recognition result indicates that the object in the first image and the object in the second image are the same object or different objects.
  • An embodiment of the present disclosure provides an electronic device, including a processor and a memory, the memory stores a computer program that can run on the processor, and the above method is implemented when the processor executes the computer program.
  • An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the foregoing method is implemented.
  • An embodiment of the present disclosure provides a computer program product, where the computer program product includes a computer program or an instruction, and when the computer program or instruction is run on the electronic device, the electronic device is made to execute the above method.
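  • As an illustration of the image recognition method summarized above, the following sketch extracts a feature for each of the two images with a trained model and compares them; the cosine-similarity test, the `match_threshold` value and the function name `recognize` are assumptions for illustration and are not specified by the disclosure.

```python
import torch
import torch.nn.functional as F

def recognize(model, first_image, second_image, match_threshold=0.5):
    """Decide whether the objects in two images are the same object.

    `model` is assumed to be an nn.Module-like feature extractor mapping an
    image tensor to a feature vector (e.g. the trained first model); the
    cosine-similarity comparison below is an illustrative choice only.
    """
    model.eval()
    with torch.no_grad():
        f1 = F.normalize(model(first_image.unsqueeze(0)), dim=-1)
        f2 = F.normalize(model(second_image.unsqueeze(0)), dim=-1)
        similarity = (f1 * f2).sum(dim=-1).item()
    return "same object" if similarity >= match_threshold else "different objects"
```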
  • In the embodiments of the present disclosure, the first image sample containing the first object is acquired; feature extraction is performed on the first image sample by using the first network of the first model to be trained to obtain the first feature of the first object; the first feature is updated based on the second feature of at least one second object by using the second network of the first model to obtain the first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than the first threshold; the target loss value is determined based on the first target feature; and the model parameters of the first model are updated at least once based on the target loss value to obtain the trained first model.
  • In this way, the features of the second object are introduced as noise at the feature level of the first image sample containing the first object, and the overall network structure of the first model is trained, so that the robustness of the first model can be enhanced and the performance of the first model can be improved.
  • At the same time, when the target loss value does not satisfy the preset condition, the model parameters of the first model are updated at least once. Since the target loss value is determined based on the first target feature, the consistency of the trained first model's predictions for different image samples of the same object can be improved, which further enables the trained first model to more accurately re-identify objects in images containing multiple objects.
  • FIG. 1 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an implementation flow of a model training method provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of an implementation flow of a model training method provided by an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of an implementation flow of an image recognition method provided by an embodiment of the present disclosure.
  • FIG. 5A is a schematic diagram of the composition and structure of a model training system provided by an embodiment of the present disclosure
  • FIG. 5B is a schematic diagram of a model training system provided by an embodiment of the present disclosure.
  • FIG. 5C is a schematic diagram of determining an occlusion mask provided by an embodiment of the present disclosure.
  • FIG. 5D is a schematic diagram of a first network provided by an embodiment of the present disclosure.
  • FIG. 5E is a schematic diagram of a second subnetwork provided by an embodiment of the present disclosure.
  • FIG. 5F is a schematic diagram of a second network provided by an embodiment of the present disclosure.
  • FIG. 5G is a schematic diagram of obtaining a target loss value provided by an embodiment of the present disclosure.
  • FIG. 5H is a schematic diagram of an occlusion score of a pedestrian image provided by an embodiment of the present disclosure.
  • FIG. 5I is a schematic diagram of an image retrieval result provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of the composition and structure of a model training device provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of the composition and structure of an image recognition device provided by an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of a hardware entity of an electronic device in an embodiment of the present disclosure.
  • The embodiments of the present disclosure provide a model training method which introduces the features of the second object as noise at the feature level of the first image sample containing the first object and trains the overall network structure of the first model, so that the robustness of the first model can be enhanced and its performance improved; at the same time, when the target loss value does not satisfy the preset condition, the model parameters of the first model are updated at least once, and since the target loss value is determined based on the first target feature, the consistency of the trained first model's predictions for different image samples of the same object can be improved, thereby enabling the trained first model to more accurately re-identify objects in images containing multiple objects.
  • Both the model training method and the image recognition method provided by the embodiments of the present disclosure can be executed by an electronic device. The electronic device can be a notebook computer, a tablet computer, a desktop computer, a set-top box, or a mobile device (for example, a mobile phone, a portable music player, or a personal digital assistant), and can also be implemented as a server.
  • The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (Content Delivery Network, CDN) services, and big data and artificial intelligence platforms.
  • Fig. 1 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 1, the method includes steps S11 to S15, wherein:
  • Step S11 acquiring a first image sample including a first object.
  • the first image sample may be any suitable image containing at least the first object.
  • The content contained in the first image sample may be determined according to the actual application scenario; for example, the first image sample may contain only the first object, or contain the first object and at least one other object.
  • the first object may include, but is not limited to, people, animals, plants, objects, and the like.
  • the first image sample is a face image containing Zhang San.
  • the first image sample is an image including Li Si's whole person.
  • the first image sample may include at least one image.
  • the first image sample is any image in the training set.
  • the first image sample includes a first sub-image and a second sub-image, wherein the first sub-image is an image in the training set, and the second sub-image is an image obtained by augmenting the first sub-image.
  • the augmentation processing may include, but is not limited to, at least one of occlusion processing, scaling processing, cropping processing, size adjustment processing, filling processing, flipping processing, color dithering processing, grayscale processing, Gaussian blur processing, random erasing processing, and the like.
  • those skilled in the art may use appropriate augmentation processing on the first sub-image to obtain the second sub-image according to actual conditions, which is not limited in the embodiments of the present disclosure.
  • the first image sample includes a first sub-image and a plurality of second sub-images, wherein the first sub-image is an image in the training set, and each second sub-image is an augmentation process on the first sub-image image obtained after.
  • Step S12 using the first network of the first model to be trained, to perform feature extraction on the first image sample to obtain the first feature of the first object.
  • the first model may be any suitable model for object recognition based on image features.
  • the first model may include at least a first network.
  • the first feature may include, but not limited to, the original feature of the first image sample, or a feature obtained by processing the original feature.
  • the original feature may include but not limited to the face feature, body feature, etc. of the first object included in the image.
  • the first network may at least include a first sub-network, and the first sub-network is used to extract features of the first image using a feature extractor.
  • the feature extractor may include, but is not limited to, a recurrent neural network (Recurrent Neural Network, RNN), a convolutional neural network (Convolutional Neural Network, CNN), a Transformer-based feature extraction network, etc.
  • those skilled in the art may use an appropriate first network in the first model to obtain the first feature according to actual conditions, which is not limited in the embodiments of the present disclosure.
  • the third feature of the first image sample is extracted through the first sub-network, and the third feature is determined as the first feature of the first object.
  • the third feature may include, but not limited to, the original feature of the first image sample and the like.
  • the first network may further include a second sub-network for determining the first feature of the first object based on the third feature of the first image sample.
  • the second sub-network may include an occlusion erasure network, which is used to perform occlusion erasure processing on the input third feature to obtain the first feature of the first object.
  • Step S13 using the second network of the first model to update the first feature based on the second feature of at least one second object to obtain the first target feature corresponding to the first feature.
  • the similarity between each second object and the first object is not less than the first threshold.
  • the first threshold may be preset or obtained by statistics. During implementation, those skilled in the art may independently determine the setting manner of the first threshold according to actual needs, which is not limited in the embodiments of the present disclosure.
  • the similarity between the facial features of the second object and the first object is not less than the first threshold.
  • the similarity between the wearing features of the second object and the first object is not less than the first threshold.
  • For another example, neither the similarity between the appearance characteristics of the second object and those of the first object, nor the similarity between their clothing characteristics, is less than the first threshold.
  • the second feature can be obtained based on the training set, or can be pre-input.
  • the second object may include, but is not limited to, people, animals, plants, objects, and the like.
  • the similarity between each second object and the first object may be obtained based on the similarity between the second feature of each second object and the first feature of the first object. In some implementations, the similarity between each second object and the first object may be obtained based on the similarity between the feature center of each second object and the first feature of the first object.
  • the first model may include a second memory feature library
  • the second memory feature library may include at least one feature of at least one object. The feature center of the second object may be obtained based on at least one feature belonging to the second object in the second memory feature library.
  • features of multiple image samples of at least one object in the training set may be extracted, and the extracted features may be stored in the second memory feature library according to their identity.
  • the second network may include a fifth sub-network and a sixth sub-network, the fifth sub-network is used to aggregate the second feature with the first feature to obtain the first aggregated sub-feature; the sixth sub-network The network is used to update the first aggregation sub-feature to obtain the first target feature.
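  • The disclosure does not fix how the fifth and sixth sub-networks aggregate and update the features, so the following is only one plausible sketch: second objects whose feature centers are sufficiently similar to the first feature are mixed into it as noise, and a stand-in linear layer plays the role of the sixth sub-network. The names `aggregate_with_similar_objects` and `sixth_subnetwork`, the averaging step and the 0.5/0.5 mixing weights are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def aggregate_with_similar_objects(first_feature, feature_centers, first_threshold=0.5):
    """Sketch of the fifth sub-network's role: mix the first feature with the
    feature centers of second objects whose similarity to the first object is
    not less than the first threshold. The averaging is an assumption."""
    # cosine similarity between the first feature and every stored feature center
    sims = F.cosine_similarity(first_feature.unsqueeze(0), feature_centers, dim=-1)
    similar = feature_centers[sims >= first_threshold]  # second features acting as noise
    if similar.numel() == 0:
        return first_feature
    return 0.5 * first_feature + 0.5 * similar.mean(dim=0)

# hypothetical sixth sub-network: any learnable mapping from the aggregated
# feature to the first target feature, here a single linear layer
sixth_subnetwork = nn.Linear(256, 256)
first_target_feature = sixth_subnetwork(
    aggregate_with_similar_objects(torch.randn(256), torch.randn(100, 256)))
```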
  • Step S14 Determine the target loss value based on the first target feature.
  • the target loss value may include, but not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a comparison loss value, and the like.
  • Step S15 based on the target loss value, update the model parameters of the first model at least once to obtain the trained first model.
  • For example, the target loss value is compared with a threshold; if the target loss value is greater than the threshold, the model parameters of the first model are updated, and if the target loss value is not greater than the threshold, the first model is determined as the trained first model. For another example, the target loss value is compared with the previous target loss value; if the target loss value is greater than the previous target loss value, the model parameters of the first model are updated, and if the target loss value is not greater than the previous target loss value, the first model is determined as the trained first model.
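  • A minimal sketch of the training loop of steps S11 to S15, assuming the first of the stopping rules just described (stop once the target loss value is not greater than a threshold); the helper `compute_target_loss`, which stands in for steps S12 to S14, and the `max_steps` safeguard are hypothetical.

```python
import torch

def train_first_model(first_model, data_loader, compute_target_loss,
                      optimizer, loss_threshold=0.05, max_steps=10000):
    """Illustrative training loop for steps S11-S15 under the stated assumptions."""
    first_model.train()
    step = 0
    for image_sample, label in data_loader:            # step S11: next first image sample
        loss = compute_target_loss(first_model, image_sample, label)  # steps S12-S14
        if loss.item() <= loss_threshold:               # preset condition satisfied
            break
        optimizer.zero_grad()
        loss.backward()                                 # step S15: update model parameters
        optimizer.step()
        step += 1
        if step >= max_steps:
            break
    return first_model
```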
  • In the embodiments of the present disclosure, the first image sample containing the first object is acquired; feature extraction is performed on the first image sample by using the first network of the first model to be trained to obtain the first feature of the first object; the first feature is updated based on the second feature of at least one second object by using the second network of the first model to obtain the first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than the first threshold; the target loss value is determined based on the first target feature; and the model parameters of the first model are updated at least once based on the target loss value to obtain the trained first model.
  • In this way, the features of the second object are introduced as noise at the feature level of the first image sample containing the first object, and the overall network structure of the first model is trained, so that the robustness of the first model can be enhanced and the performance of the first model can be improved.
  • At the same time, when the target loss value does not satisfy the preset condition, the model parameters of the first model are updated at least once. Since the target loss value is determined based on the first target feature, the consistency of the trained first model's predictions for different image samples of the same object can be improved, which further enables the trained first model to more accurately re-identify objects in images containing multiple objects.
  • the first image sample includes label information
  • the first model includes a first feature memory library
  • the first feature memory library includes at least one feature belonging to at least one object
  • the above step S14 includes step S141 to step S143, wherein:
  • Step S141 Determine a first loss value based on the first target feature and label information.
  • tag information may include, but not limited to, tag values, identifiers, and the like.
  • the first loss value may include, but not limited to, a cross-entropy loss value and the like.
  • the first loss value can be calculated by the following formula (1-1):
  • L_1 = -\sum_{i}\log\frac{\exp(W_{y_i}^{\top} f_i)}{\sum_{j=1}^{ID_S}\exp(W_j^{\top} f_i)}    (1-1)
  • where W is a linear matrix, W_i and W_j are the elements of W, y_i represents the label information of the i-th object, f_i represents the first target feature of the i-th object, and ID_S represents the total number of objects in the training set.
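  • A hedged sketch of formula (1-1) as reconstructed above, computing the cross-entropy between the classification logits W^T f_i and the label information; realizing it with `torch.nn.functional.cross_entropy` and the tensor shapes in the usage line are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def id_cross_entropy_loss(target_features, labels, linear_matrix):
    """Cross-entropy loss of formula (1-1): classify each first target feature
    f_i into one of ID_S identities with a linear matrix W of shape (ID_S, dim)."""
    logits = target_features @ linear_matrix.t()   # W_j^T f_i for every identity j
    return F.cross_entropy(logits, labels)

# usage with hypothetical sizes: 32 samples, 128-d features, 751 identities
loss_1 = id_cross_entropy_loss(torch.randn(32, 128),
                               torch.randint(0, 751, (32,)),
                               torch.randn(751, 128))
```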
  • Step S142 Determine a second loss value based on the first target feature and at least one feature of at least one object in the first feature memory.
  • the second loss value may include but not limited to contrastive loss and the like.
  • Step S143 Determine a target loss value based on the first loss value and the second loss value.
  • the target loss value may include, but not limited to, the sum of the first loss value and the second loss value, the sum after weighting the first loss value and the second loss value respectively, and the like.
  • the target loss value can be calculated by the following formula (1-2):
  • L_{target} = L_1 + L_2    (1-2)
  • where L_1 is the first loss value and L_2 is the second loss value; as noted above, a weighted sum of the two loss values may be used instead.
  • step S142 includes step S1421 to step S1422, wherein:
  • Step S1421. From at least one feature of at least one object in the first feature memory, determine a first feature center of the first object and a second feature center of at least one second object.
  • the first feature center may be determined based on the features of the first object in the first feature memory and the first target feature.
  • Each second feature center may be determined based on each feature of each second object in the second feature memory.
  • the feature center of each object can be calculated by the following formula (1-3):
  • c_k \leftarrow m \cdot c_k + (1 - m) \cdot \frac{1}{|B_k|}\sum_{f_i' \in B_k} f_i'    (1-3)
  • where c_k represents the feature center of the k-th object, B_k represents the feature set belonging to the k-th object in the mini-batch, m is the set momentum coefficient for the update, and f_i' is the first feature of the i-th sample.
  • m can be 0.2.
  • In this way, when f_i' and B_k belong to the same object, the feature center c_k of that object changes; when f_i' and B_k do not belong to the same object, the feature center c_k remains consistent with the previous c_k.
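  • A sketch of the feature-center update of formula (1-3) under the reconstruction above; whether the momentum coefficient m weights the previous center or the new batch mean cannot be recovered from this excerpt, so the orientation used here is an assumption.

```python
import torch

def update_feature_centers(centers, features, labels, momentum=0.2):
    """Momentum update of formula (1-3): for every object k present in the
    mini-batch, move its feature center c_k towards the mean of the batch
    features B_k belonging to it; centers of absent objects stay unchanged.
    Here m weights the previous center (an assumption)."""
    for k in labels.unique():
        batch_mean = features[labels == k].mean(dim=0)            # mean of B_k
        centers[k] = momentum * centers[k] + (1 - momentum) * batch_mean
    return centers
```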
  • Step S1422. Determine a second loss value based on the first target feature, the first feature center and each second feature center.
  • the second loss value can be calculated by the following formula (1-4):
  • L_2 = -\sum_{i}\log\frac{\exp(f_i \cdot c_i / \tau)}{\sum_{j=1}^{ID_S}\exp(f_i \cdot c_j / \tau)}    (1-4)
  • where \tau is a predefined temperature parameter, c_i represents the first feature center of the i-th object, c_j represents each second feature center, f_i represents the first target feature of the i-th object, and ID_S represents the total number of objects in the training set.
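  • A sketch of the contrastive loss of formula (1-4): each first target feature is pulled towards its own feature center and pushed away from the other centers with a predefined temperature. The cosine normalization, the default temperature value and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def center_contrastive_loss(target_features, labels, centers, temperature=0.05):
    """Contrastive loss of formula (1-4) over ID_S feature centers; the
    positive for sample i is its own center c_i."""
    f = F.normalize(target_features, dim=-1)
    c = F.normalize(centers, dim=-1)
    logits = f @ c.t() / temperature        # similarity to every feature center
    return F.cross_entropy(logits, labels)  # softmax over centers, positive = own center
```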
  • step S15 includes step S151 or step S152, wherein:
  • Step S151 if the target loss value does not meet the preset condition, update the model parameters of the first model to obtain an updated first model; based on the updated first model, determine a trained first model.
  • the manner of updating the model parameters of the first model may include but not limited to at least one of gradient descent method, momentum update method, Newton momentum method and the like.
  • those skilled in the art may independently determine the update mode according to actual needs, which is not limited in the embodiments of the present disclosure.
  • Step S152 if the target loss value satisfies the preset condition, determine the updated first model as the trained first model.
  • the preset conditions may include, but are not limited to, the target loss value being smaller than a threshold, the change of the target loss value converging, and the like.
  • those skilled in the art may independently determine the preset conditions according to actual needs, which are not limited by the embodiments of the present disclosure.
  • determining the first model after training based on the updated first model in step S151 includes steps S1511 to S1515, wherein:
  • Step S1511 acquiring the next first image sample
  • Step S1512 Using the updated first network of the first model to be trained, perform feature extraction on the next first image sample to obtain the next first feature;
  • Step S1513 using the updated second network of the first model to update the next first feature based on the second feature of at least one second object, to obtain the next first target feature corresponding to the next first feature;
  • Step S1514 based on the next first target feature, determine the next target loss value
  • Step S1515 Based on the next target loss value, perform at least one next update on the model parameters of the updated first model to obtain the trained first model.
  • step S1511 to step S1515 correspond to the above step S11 to step S15 respectively, and for implementation, reference may be made to the implementation manner of the above step S11 to step S15.
  • In this way, the model parameters of the first model are updated again, and the trained first model is determined based on the first model after this next update, so that the performance of the trained first model can be further improved through continuous iterative updating.
  • the first feature memory library includes feature sets belonging to at least one object, each feature set includes at least one feature of the object to which it belongs, and the method further includes step S16, wherein:
  • Step S16 based on the first target feature, update the feature set belonging to the first object in the first feature storage.
  • the way of updating may include but not limited to adding the first target feature to the first feature storage, replacing a certain feature in the first feature storage with the first target feature, and so on.
  • In this way, the first feature center belonging to the first object can be accurately obtained, which further improves the recognition accuracy of the trained first model.
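  • A toy sketch of the first feature memory described above, keeping one bounded feature set per object; appending the first target feature and evicting the oldest entry is only one of the update strategies mentioned (adding a feature or replacing an existing one), and the class name and capacity are assumptions.

```python
from collections import defaultdict, deque

class FeatureMemory:
    """Toy first feature memory: one bounded feature set per object identity."""

    def __init__(self, max_per_object=16):
        # each identity maps to a deque that drops its oldest feature when full
        self.sets = defaultdict(lambda: deque(maxlen=max_per_object))

    def update(self, object_id, target_feature):
        # target_feature is assumed to be a tensor; detach it from the graph
        self.sets[object_id].append(target_feature.detach())

    def features_of(self, object_id):
        return list(self.sets[object_id])
```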
  • Fig. 2 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 2, the method includes steps S21 to S25, wherein:
  • Step S21 acquiring a first sub-image and a second sub-image containing the first object.
  • the second sub-image may be an image after at least occlusion processing is performed on the first sub-image.
  • the second sub-image may include at least one image.
  • the multiple images may be images obtained by at least performing occlusion processing on the first sub-image respectively.
  • Performing at least occlusion processing may include but not limited to only occlusion processing, or occlusion processing and other processing, and the like.
  • The other processing may include, but is not limited to, at least one of scaling, cropping, resizing, filling, flipping, color dithering, grayscale, Gaussian blur, and random erasing.
  • those skilled in the art may use an appropriate processing method on the first sub-image to obtain the second sub-image according to actual conditions, which is not limited in the embodiments of the present disclosure.
  • step S21 includes step S211 to step S212, wherein:
  • Step S211 acquiring a first sub-image including a first object.
  • the first sub-image may be any suitable image containing at least the first object.
  • The content contained in the first sub-image may be determined according to the actual application scene; for example, the first sub-image may contain only the first object, or contain the first object and at least one other object.
  • the first object may include, but is not limited to, people, animals, plants, objects, and the like.
  • the first sub-image is a face image containing Zhang San.
  • the first sub-image is an image including Li Si's whole person.
  • Step S212 based on the preset occlusion set, perform at least occlusion processing on the first sub-image to obtain a second sub-image.
  • the occlusion set includes at least one occlusion image.
  • the occlusion set may include, but is not limited to, one established based on at least one of a training set, other images, and the like.
  • the occlusion set includes at least a variety of occlusion object images, background images, etc., such as leaves, vehicles, trash cans, buildings, trees, flowers, and the like. For example, find image samples occluded by background and objects in the training set, and manually crop out the occluded parts to form an occlusion library.
  • a suitable image containing at least one object occlusion is selected, and the occlusion part is manually cut out to form an occlusion library.
  • those skilled in the art may choose an appropriate way to establish an occlusion set according to actual requirements, which is not limited by the embodiments of the present disclosure.
  • the position of the occluder may include, but not limited to, a specified position, a specified size, and the like.
  • For example, the specified position may be set to one of four preset positions, and the specified size may be one quarter to one half of the image area.
  • those skilled in the art may determine the position of the barrier according to actual needs, which is not limited by the embodiments of the present disclosure.
  • performing at least occlusion processing may include, but is not limited to, occlusion processing and other processing.
  • the occlusion image is randomly selected from the occlusion library, and the size of the occlusion image is adjusted based on the adjustment rules.
  • the adjustment rule may include but not limited to adjusting the size of the occluder image, adjusting the size of the first image sample, and the like.
  • For example, if the height of the occluder image exceeds twice its width, it is regarded as vertical occlusion: the height of the occluder image is adjusted to the height of the first image sample, and its width is adjusted to one quarter to one half of the width of the first image sample; otherwise, it is regarded as horizontal occlusion: the width of the occluder image is adjusted to the width of the first image sample, and its height is adjusted to one quarter to one half of the height of the first image sample.
  • those skilled in the art may determine the adjustment rule according to actual needs, which is not limited by the embodiments of the present disclosure.
  • For example, in the case of augmentation including occlusion processing, resizing processing, filling processing, and cropping processing, the resizing processing, filling processing, and cropping processing may be performed on the first image sample first, and then the occlusion processing may be performed to obtain the second sub-image.
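  • A sketch of the occlusion augmentation of step S212 using the adjustment rule described above; the random choice of occluder and paste position, the nearest-neighbour `resize` helper and the array layout (H x W x C) are assumptions for illustration.

```python
import random
import numpy as np

def resize(img, new_h, new_w):
    """Nearest-neighbour resize so the sketch stays dependency-free."""
    rows = np.linspace(0, img.shape[0] - 1, new_h).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, new_w).astype(int)
    return img[rows][:, cols]

def occlude(first_sub_image, occluder_images):
    """Paste a randomly chosen, rule-resized occluder onto the first sub-image
    (already resized/padded/cropped) to obtain the second sub-image."""
    h, w = first_sub_image.shape[:2]
    occ = random.choice(occluder_images)
    oh, ow = occ.shape[:2]
    if oh > 2 * ow:                                    # vertical occlusion
        new_h, new_w = h, random.randint(w // 4, w // 2)
    else:                                              # horizontal occlusion
        new_h, new_w = random.randint(h // 4, h // 2), w
    occ = resize(occ, new_h, new_w)
    top = random.randint(0, h - new_h)
    left = random.randint(0, w - new_w)
    second_sub_image = first_sub_image.copy()
    second_sub_image[top:top + new_h, left:left + new_w] = occ
    return second_sub_image
```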
  • the method also includes step S213, wherein:
  • Step S213 based on the first sub-image and the second sub-image, determine an occlusion mask.
  • the occlusion mask is used to represent the occlusion information of the image.
  • the occlusion mask can be used for training the first model on object occlusion.
  • the occlusion mask may be determined based on pixel differences between the first sub-image and the second sub-image.
  • the difference between the first sub-image and the second sub-image can be calculated based on the following formula (2-1):
  • d = \sum_{p}\left| x(p) - x'(p) \right|    (2-1)
  • where x represents the first sub-image, x' represents the second sub-image, and p indexes the pixels over which the difference is accumulated.
  • step S213 includes steps S2131 to S2133, wherein:
  • Step S2131 Divide the first sub-image and the second sub-image into at least one first sub-part image and at least one second sub-part image respectively.
  • Fine-grained occlusion masks tend to contain many false labels due to the misalignment of semantics (e.g., body parts) between different images, so the first sub-image and the second sub-image can be roughly divided horizontally into a plurality of parts, and the occlusion mask is determined based on the pixel differences between each part of the first sub-image and the corresponding part of the second sub-image; for example, the images may be divided into four parts, five parts, etc. During implementation, those skilled in the art may divide the first sub-image and the second sub-image according to actual requirements, which is not limited in the embodiments of the present disclosure.
  • Step S2132 based on each first sub-part image and each second sub-part image, determine an occlusion sub-mask.
  • The pixel difference between each first sub-part image and the corresponding second sub-part image can be obtained based on the above formula (2-1), and each occlusion sub-mask is determined based on the pixel difference of each part.
  • Step S2133 Determine an occlusion mask based on each occlusion sub-mask.
  • If d_i is not less than the first threshold, it indicates that there is occlusion in this part of the image, and the occlusion sub-mask mask_i can be set to 0; otherwise, it indicates that there is no occlusion in this part, and mask_i can be set to 1. The occlusion mask mask is then composed of the occlusion sub-masks of all parts.
  • the first sub-image and the second sub-image are divided into four parts, in the case that there is no occlusion in the first, second and third parts, and there is occlusion in the fourth part, then the occlusion mask mask at this time should be 1110.
  • those skilled in the art may determine the occlusion mask according to actual needs, which is not limited by the embodiments of the present disclosure.
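  • A sketch of steps S2131 to S2133 under the reconstruction of formula (2-1) above: both sub-images are split into horizontal parts, the per-part pixel difference is compared with a threshold, and each occlusion sub-mask is set to 1 (no occlusion) or 0 (occluded). The mean absolute difference and the threshold value are assumptions and depend on the pixel value range.

```python
import numpy as np

def occlusion_mask(first_sub_image, second_sub_image, num_parts=4, threshold=1.0):
    """Split both images into horizontal parts, compare the per-part pixel
    difference with a threshold, and build the occlusion mask."""
    parts_x = np.array_split(first_sub_image, num_parts, axis=0)
    parts_xp = np.array_split(second_sub_image, num_parts, axis=0)
    mask = []
    for px, pxp in zip(parts_x, parts_xp):
        d_i = np.abs(px.astype(float) - pxp.astype(float)).mean()
        mask.append(0 if d_i >= threshold else 1)
    return mask  # e.g. [1, 1, 1, 0] when only the fourth part is occluded
```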
  • Step S22 Using the first network of the first model to be trained, perform feature extraction on the first sub-image to obtain the first sub-feature of the first object, and perform feature extraction on the second sub-image to obtain the second sub-feature of the first object.
  • the first model may be any suitable model for object recognition based on image features.
  • the first model may include at least a first network.
  • the first sub-feature may include, but not limited to, the original feature of the first sub-image, or a feature obtained by processing the original feature.
  • the second sub-feature may include, but not limited to, the original feature of the second sub-image, or a feature obtained by processing the original feature.
  • the original features may include but not limited to facial features, body features, etc. of the objects contained in the image.
  • Step S23 using the second network of the first model, based on the second feature of at least one second object, to update the first sub-feature and the second sub-feature respectively, to obtain the first target sub-feature corresponding to the first sub-feature and the second target sub-feature corresponding to the second sub-feature.
  • the similarity between each second object and the first object is not less than the first threshold.
  • the first threshold may be preset or obtained by statistics. During implementation, those skilled in the art may independently determine the setting manner of the first threshold according to actual needs, which is not limited in the embodiments of the present disclosure.
  • the similarity between the facial features of the second object and the first object is not less than the first threshold.
  • the similarity between the wearing features of the second object and the first object is not less than the first threshold.
  • For another example, neither the similarity between the appearance characteristics of the second object and those of the first object, nor the similarity between their clothing characteristics, is less than the first threshold.
  • the second feature can be obtained based on the training set, or can be pre-input.
  • the second object may include, but is not limited to, people, animals, plants, objects, and the like.
  • the similarity between each second object and the first object may be obtained based on the similarity between the second feature of each second object and the first feature of the first object. In some implementations, the similarity between each second object and the first object may be obtained based on the similarity between the feature center of each second object and the first feature of the first object.
  • the first model may include a second memory feature library
  • the second memory feature library may include at least one feature of at least one object. The feature center of the second object may be obtained based on at least one feature belonging to the second object in the second memory feature library.
  • features of multiple image samples of at least one object in the training set may be extracted, and the extracted features may be stored in the second memory feature library according to their identity.
  • Step S24 Determine a target loss value based on the first target sub-feature and the second target sub-feature.
  • the target loss value may include, but not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a comparison loss value, and the like.
  • Step S25 based on the target loss value, update the model parameters of the first model at least once to obtain the trained first model.
  • the above-mentioned step S25 corresponds to the above-mentioned step S15, and the implementation manner of the above-mentioned step S15 can be referred to for implementation.
  • In the embodiments of the present disclosure, a first sub-image and a second sub-image containing the first object are acquired, where the second sub-image is an image obtained by performing at least occlusion processing on the first sub-image; feature extraction is performed on the first sub-image by using the first network of the first model to be trained to obtain the first sub-feature of the first object, and feature extraction is performed on the second sub-image to obtain the second sub-feature of the first object; the first sub-feature and the second sub-feature are respectively updated based on the second feature of at least one second object by using the second network of the first model, to obtain the first target sub-feature corresponding to the first sub-feature and the second target sub-feature corresponding to the second sub-feature, where the similarity between each second object and the first object is not less than the first threshold; a target loss value is determined based on the first target sub-feature and the second target sub-feature; and the model parameters of the first model are updated at least once based on the target loss value to obtain the trained first model.
  • In this way, when the target loss value does not satisfy the preset condition, the model parameters of the first model are updated at least once. Since the target loss value is determined based on the first target sub-feature and the second target sub-feature, the consistency of the trained first model's predictions for different image samples of the same object can be improved, so that the trained first model can more accurately re-identify objects in images containing object occlusion and/or multiple objects.
  • step S24 includes step S241 to step S243, wherein:
  • Step S241 Determine a first target loss value based on the first target sub-feature and the second target sub-feature.
  • the first target loss value may include, but not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a comparison loss value, and the like.
  • step S241 includes step S2411 to step S2413, wherein:
  • Step S2411 Based on the first target sub-feature, determine a third target sub-loss value.
  • step S2411 corresponds to the above-mentioned step S14, and the implementation manner of the above-mentioned step S14 can be referred to for implementation.
  • Step S2412. Based on the second target sub-feature, determine the fourth target sub-loss value.
  • step S2412 corresponds to the above-mentioned step S14, and the implementation of the above-mentioned step S14 can be referred to for implementation.
  • Step S2413 Determine the first target loss value based on the third target sub-loss value and the fourth target sub-loss value.
  • the first target loss value may include but not limited to the sum between the third target sub-loss value and the fourth target sub-loss value, the sum after weighting the third target sub-loss value and the fourth target sub-loss value, etc. .
  • those skilled in the art may determine the first target loss value according to actual needs, which is not limited by the embodiments of the present disclosure.
  • Step S242 Determine a second target loss value based on the first sub-feature and the second sub-feature.
  • the second target loss value may include but not limited to at least one of a mean square error loss value, a cross-entropy loss value, a comparison loss value, and the like.
  • Step S243 Determine a target loss value based on the first target loss value and the second target loss value.
  • the target loss value may include, but not limited to, the sum of the first target loss value and the second target loss value, the sum after weighting the first target loss value and the second target loss value respectively, and the like.
  • those skilled in the art may determine the target loss value according to actual needs, which is not limited by the embodiments of the present disclosure.
  • the target loss value is determined based on the first sub-feature, the second sub-feature, the first target sub-feature and the second target sub-feature. In this way, the accuracy of the target loss value can be improved, so as to accurately judge whether the first model is converged.
  • the first network includes a first subnet and a second subnet
  • step S22 includes steps S221 to S222, wherein:
  • Step S221. Using the first sub-network of the first model to be trained, perform feature extraction on the first sub-image and the second sub-image respectively, to obtain the third sub-feature corresponding to the first sub-image and the fourth sub-feature corresponding to the second sub-image.
  • the first network includes at least a first subnetwork, and the first subnetwork is used to extract features of the image using a feature extractor.
  • the feature extractor may include, but is not limited to, RNN, CNN, a Transformer-based feature extraction network, and the like.
  • those skilled in the art may use an appropriate first sub-network in the first model to obtain the third sub-feature according to actual conditions, which is not limited in the embodiments of the present disclosure.
  • a feature of the first sub-image is extracted through the first sub-network, and the feature is determined as a third sub-feature of the first object.
  • the third sub-feature may include but not limited to the original feature of the first sub-image and the like.
  • Step S222 using the second sub-network of the first model, determining the first sub-feature based on the third sub-feature, and determining the second sub-feature based on the fourth sub-feature.
  • the second sub-network may include an occlusion erasure network, which is used to perform occlusion erasure processing on input features and output unoccluded features.
  • the first sub-feature of the first object is obtained after occlusion and erasure processing is performed on the third sub-feature through the second sub-network.
  • the second sub-feature of the first object is obtained after the fourth sub-feature is occluded and erased through the second sub-network.
  • In this way, the overall network structure of the first model is trained by introducing the occluder image as noise at the picture level of the first image sample containing the first object, so that the robustness of the first model can be enhanced and the performance of the first model can be improved, which further enables the trained first model to more accurately re-identify objects in images containing object occlusions.
  • step S242 includes step S2421 to step S2423, wherein:
  • Step S2421 Based on the first sub-feature and the second sub-feature, determine a first target sub-loss value.
  • the first target sub-loss value may include but not limited to at least one of a mean square error loss value, a cross-entropy loss value, a comparison loss value, and the like.
  • Step S2422 Based on the third sub-feature and the fourth sub-feature, determine a second target sub-loss value.
  • the second target sub-loss value may include, but not limited to, at least one of a mean square error loss value, a cross-entropy loss value, a comparison loss value, and the like.
  • Step S2423 Determine a second target loss value based on the first target sub-loss value and the second target sub-loss value.
  • the second target loss value may include but not limited to the sum between the first target sub-loss value and the second target sub-loss value, the sum after weighting the first target sub-loss value and the second target sub-loss value, etc. .
  • those skilled in the art may determine the second target loss value according to actual needs, which is not limited by the embodiments of the present disclosure.
  • the second target loss value is determined based on the first sub-feature, the second sub-feature, the third sub-feature and the fourth sub-feature. In this way, the accuracy of the second target loss value can be improved, so as to accurately judge whether the first model converges.
  • the first sub-image includes label information
  • step S2422 includes steps S251 to S253, wherein:
  • Step S251. Determine a seventh sub-loss value based on the third sub-feature and label information.
  • tag information may include, but not limited to, tag values, identifiers, and the like.
  • the seventh sub-loss value may include but not limited to a cross-entropy loss value and the like.
  • the seventh sub-loss value can be calculated by the above formula (1-1), and at this time, f i in the formula (1-1) is the third sub-feature.
  • Step S252 Determine an eighth sub-loss value based on the fourth sub-feature and label information.
  • the eighth sub-loss value may include but not limited to a cross-entropy loss value and the like.
  • the eighth sub-loss value may be determined according to the above formula (1-1), at this time, f i in the formula (1-1) is the fourth sub-feature.
  • Step S253 Determine a second target sub-loss value based on the seventh sub-loss value and the eighth sub-loss value.
  • the second target sub-loss value may include, but not limited to, the sum between the seventh sub-loss value and the eighth sub-loss value, the sum after weighting the seventh sub-loss value and the eighth sub-loss value, and the like.
  • those skilled in the art may determine the second target sub-loss value according to actual requirements, which is not limited in the embodiments of the present disclosure.
  • the second target sub-loss value is determined based on the third sub-feature, the fourth sub-feature and label information. In this way, the accuracy of the second target sub-loss value can be improved, so as to accurately judge whether the first model is converged.
  • the second subnetwork includes a third subnetwork and a fourth subnetwork
  • step S222 includes steps S2221 to S2222, wherein:
  • Step S2221 using the third sub-network of the first model to determine the first occlusion score based on the third sub-feature, and determine the second occlusion score based on the fourth sub-feature.
  • the second sub-network includes at least a third sub-network
  • the third sub-network is used to perform semantic analysis based on features of the image to obtain an occlusion score corresponding to the image.
  • the third subnetwork includes a pooling subnetwork and at least one occlusion erasure subnetwork, the first occlusion score includes at least one first occlusion subscore, and the second occlusion score includes at least one second occlusion subscore;
  • the above step S2221 includes step S261 to step S262, wherein:
  • Step S261 Divide the third sub-feature into at least one third sub-part feature by using the pooling sub-network, and divide the fourth sub-feature into at least one fourth sub-part feature.
  • the pooling sub-network is used to divide the input feature to obtain at least one sub-part feature of the feature.
  • The number of third sub-part features may be the same as the number of parts into which the first sub-image is divided. For example, if the first sub-image is divided into four parts, the third sub-feature can be divided into four third sub-part features through the pooling sub-network, and each third sub-part feature corresponds to the feature f_i of one part.
  • Step S262. Using each occlusion erasure sub-network, determine a first occlusion sub-score based on each third sub-part feature, and determine a second occlusion sub-score based on each fourth sub-part feature.
  • each occlusion erasure sub-network is used to perform semantic analysis on the input feature to obtain the occlusion score of the image corresponding to the feature.
  • each occlusion erasing sub-network consists of two fully connected layers, a layer normalization and an activation function, wherein the layer normalization is located between the two fully connected layers, and the activation function is located at the end .
  • the activation function can be a sigmoid function.
  • The number of occlusion erasure sub-networks is the same as the number of parts into which the first sub-image is divided. For example, if the first sub-image is divided into four parts and the feature corresponding to each part is f_i, the third sub-network includes four occlusion erasure sub-networks, and each occlusion erasure sub-network is used to output the occlusion score corresponding to f_i.
  • For another example, if the first sub-image is divided into five parts and the feature corresponding to each part is f_i, the third sub-network includes five occlusion erasure sub-networks, and each occlusion erasure sub-network is used to output the occlusion score corresponding to f_i.
  • the occlusion score can be calculated by the following formula (2-2):
  • s_i = \mathrm{Sigmoid}\left( W_{rg} \, \mathrm{LN}\left( W_{cp} \, f_i \right) \right)    (2-2)
  • where W_{cp} is the matrix of the first fully connected layer, which compresses the channel dimension c to a quarter of the original, LN is layer normalization, W_{rg} is the matrix of the second fully connected layer, which compresses the layer-normalized feature to one dimension, and f_i represents the feature of the i-th part in the third sub-feature or the fourth sub-feature.
  • For example, the third sub-feature is divided into four third sub-part features through the pooling sub-network, and each third sub-part feature is input into the corresponding occlusion erasure sub-network; the first fully connected layer W_cp compresses the channel dimension to a quarter of the original, layer normalization is performed on the compressed feature, the second fully connected layer then compresses the layer-normalized feature to one dimension, and finally the Sigmoid function outputs the occlusion score corresponding to the third sub-part feature.
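  • A sketch of one occlusion erasure sub-network as described above (fully connected layer W_cp compressing the channels to a quarter, layer normalization, fully connected layer W_rg mapping to one dimension, Sigmoid); the class name, the channel size in the usage line and the use of PyTorch modules are assumptions.

```python
import torch
import torch.nn as nn

class OcclusionErasureHead(nn.Module):
    """One occlusion erasure sub-network: FC (c -> c/4), LayerNorm, FC (c/4 -> 1),
    Sigmoid, producing the occlusion score s_i for one part feature."""

    def __init__(self, channels):
        super().__init__()
        self.w_cp = nn.Linear(channels, channels // 4)
        self.norm = nn.LayerNorm(channels // 4)
        self.w_rg = nn.Linear(channels // 4, 1)

    def forward(self, part_feature):                 # part_feature: (batch, channels)
        return torch.sigmoid(self.w_rg(self.norm(self.w_cp(part_feature))))

# usage: four parts -> four heads, each producing one occlusion sub-score
heads = nn.ModuleList(OcclusionErasureHead(256) for _ in range(4))
```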
  • Step S2222. Using the fourth sub-network, determine the first sub-feature based on the third sub-feature and the first occlusion score, and determine the second sub-feature based on the fourth sub-feature and the second occlusion score.
  • the second subnetwork further includes a fourth subnetwork, and the fourth subnetwork is used to determine features after occlusion erasure.
  • step S2222 includes step S271 to step S272, wherein:
  • Step S271 using the fourth sub-network, determine each first sub-part feature based on each third sub-part feature of the third sub-feature and each first occlusion sub-score, and determine each second sub-part feature based on each fourth sub-part feature of the fourth sub-feature and each second occlusion sub-score.
  • the first sub-part feature or the second sub-part feature can be calculated by the following formula (2-3):
  • f_i' = s_i \cdot f_i    (2-3)
  • where s_i denotes the i-th occlusion score and f_i denotes the i-th third sub-part feature or fourth sub-part feature.
  • the second feature memory may be updated based on the first sub-feature.
  • the way of updating may include, but not limited to, adding the first sub-feature to the second feature storage, replacing a certain feature in the second feature storage with the first sub-feature, and so on.
  • Step S272 Determine the first sub-feature based on each first sub-part feature, and determine the second sub-feature based on each second sub-part feature.
  • The first sub-feature can be obtained by concatenating at least one first sub-part feature, and the second sub-feature can be obtained by concatenating at least one second sub-part feature.
  • In this way, the accuracy of the first sub-feature and the second sub-feature can be improved by using the pooling sub-network, the at least one occlusion erasure sub-network and the fourth sub-network.
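  • A sketch of steps S271 and S272 using formula (2-3): each part feature is weighted by its occlusion score and the weighted sub-part features are concatenated into the first (or second) sub-feature; the list-based interface is an assumption chosen to match the heads sketched above.

```python
import torch

def erase_occlusion(part_features, occlusion_scores):
    """Weight every part feature by its occlusion score, f_i' = s_i * f_i,
    then concatenate the weighted sub-part features into one sub-feature.

    `part_features` is a list of (batch, channels) tensors and
    `occlusion_scores` a matching list of (batch, 1) scores."""
    weighted = [s_i * f_i for s_i, f_i in zip(occlusion_scores, part_features)]
    return torch.cat(weighted, dim=-1)   # concatenation of the sub-part features
```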
  • the first sub-image includes label information
  • the first model includes a second feature memory
  • the second feature memory includes at least one feature belonging to at least one object
  • the above step S2421 includes steps S281 to S285, wherein:
  • Step S281. Determine an occlusion mask based on the first sub-image and the second sub-image.
  • step S281 corresponds to the above-mentioned step S213, and the implementation manner of the above-mentioned step S213 can be referred to for implementation.
  • Step S282. Determine a third loss value based on the first occlusion score, the second occlusion score and the occlusion mask.
  • the third loss value may include, but not limited to, a mean square error loss value and the like.
  • Step S283 Determine a fourth loss value based on the first sub-feature, the second sub-feature and label information.
  • the fourth loss value may include but not limited to a cross-entropy loss value and the like.
  • Step S284 Determine a fifth loss value based on the first sub-feature, the second sub-feature, and at least one feature of at least one object in the second feature memory.
  • the fifth loss value may include, but not limited to, a comparison loss value and the like.
  • Step S285 based on the third loss value, the fourth loss value and the fifth loss value, determine the first target sub-loss value.
  • the first target sub-loss value may include, but is not limited to, the sum of the third loss value, the fourth loss value and the fifth loss value, the sum after weighting the third loss value, the fourth loss value and the fifth loss value respectively, and the like.
  • those skilled in the art may determine the first target sub-loss value according to actual needs, which is not limited in the embodiments of the present disclosure.
  • the first target sub-loss value is determined based on the occlusion mask, the first sub-feature, the second sub-feature, label information and other object characteristics. In this way, the accuracy of the first target sub-loss value can be improved, so as to accurately judge whether the first model is converged.
  • step S282 includes step S2821 to step S2823, wherein:
  • Step S2821 Determine a first sub-loss value based on the first occlusion score and the occlusion mask.
  • the first sub-loss value may include, but not limited to, a mean square error loss value and the like.
  • the first sub-loss value can be calculated according to the following formula (2-4):
  • L_{sub1} = \frac{1}{N}\sum_{i=1}^{N}\left( s_i - mask_i \right)^2    (2-4)
  • where L_{sub1} denotes the first sub-loss value, N is the total number of occlusion erasure sub-networks, s_i represents the i-th occlusion score, and mask_i represents the i-th occlusion sub-mask in the occlusion mask.
  • the occlusion mask mask is 1110
  • mask 1 is 1
  • mask 4 is 0 at this time.
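  • A minimal sketch of such a mean-square-error term between the predicted occlusion scores and the occlusion mask is given below; the function name and tensor shapes are assumptions, not part of this disclosure:

```python
import torch

def occlusion_score_loss(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-squared error between predicted occlusion scores and the occlusion mask.

    scores: (N,) occlusion scores s_i from the N occlusion-erasing sub-networks.
    mask:   (N,) occlusion sub-masks mask_i (1 = visible, 0 = occluded).
    """
    return torch.mean((scores - mask) ** 2)

# Example matching the text: mask 1110 means the 4th part is occluded.
mask = torch.tensor([1.0, 1.0, 1.0, 0.0])
scores = torch.tensor([0.9, 0.8, 0.95, 0.2])
loss = occlusion_score_loss(scores, mask)
```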
  • Step S2822 Determine a second sub-loss value based on the second occlusion score and the occlusion mask.
  • the second sub-loss value may include, but not limited to, a mean square error loss value and the like.
  • the manner of determining the second sub-loss value may be the same as that of determining the first sub-loss value, see step S2821 for details.
  • Step S2823 Determine a third loss value based on the first sub-loss value and the second sub-loss value.
  • the third loss value may include, but not limited to, the sum of the first sub-loss value and the second sub-loss value, the sum after weighting the first sub-loss value and the second sub-loss value, and the like.
  • those skilled in the art may determine the third loss value according to actual requirements, which is not limited in the embodiments of the present disclosure.
  • the third loss value is determined based on the first occlusion score, the second occlusion score and the occlusion mask. In this way, the accuracy of the third loss value can be improved, so as to accurately judge whether the first model has converged.
  • step S283 includes step S2831 to step S2833, wherein:
  • Step S2831 Determine a third sub-loss value based on the first sub-feature and label information.
  • tag information may include, but not limited to, tag values, identifiers, and the like.
  • the third sub-loss value may include, but not limited to, a cross-entropy loss value and the like.
  • the third sub-loss value can be calculated by the above formula (1-1); in this case, f_i in formula (1-1) is the first sub-feature.
  • Step S2832 Determine a fourth sub-loss value based on the second sub-feature and label information.
  • the fourth sub-loss value may include but not limited to a cross-entropy loss value and the like.
  • the fourth sub-loss value can be calculated by the above formula (1-1); in this case, f_i in formula (1-1) is the second sub-feature.
  • Step S2833 Determine a fourth loss value based on the third sub-loss value and the fourth sub-loss value.
  • the fourth loss value may include, but not limited to, the sum between the third sub-loss value and the fourth sub-loss value, the sum after weighting the third sub-loss value and the fourth sub-loss value, and the like.
  • those skilled in the art may determine the fourth loss value according to actual requirements, which is not limited in the embodiments of the present disclosure.
  • the fourth loss value is determined based on the first sub-feature, the second sub-feature and the label information. In this way, the accuracy of the fourth loss value can be improved, so as to accurately judge whether the first model has converged.
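  • As an illustration of a cross-entropy identity loss of this kind, the following sketch assumes a linear classifier head over the sub-features; the dimensions and class count are invented for the example and are not taken from this disclosure:

```python
import torch
import torch.nn as nn

# A linear classifier head maps a sub-feature to per-identity logits;
# cross-entropy is then taken against the identity label from the label information.
num_ids, feat_dim = 751, 768          # assumed sizes
classifier = nn.Linear(feat_dim, num_ids)
ce = nn.CrossEntropyLoss()

def id_loss(sub_feature: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """sub_feature: (B, feat_dim); label: (B,) integer identity labels."""
    return ce(classifier(sub_feature), label)

f1 = torch.randn(2, feat_dim)          # first sub-features of a mini-batch
f2 = torch.randn(2, feat_dim)          # second sub-features of the same samples
labels = torch.tensor([3, 42])
# Sum of the third and fourth sub-loss values gives one form of the fourth loss value.
fourth_loss = id_loss(f1, labels) + id_loss(f2, labels)
```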
  • step S284 includes step S2841 to step S2844, wherein:
  • Step S2841 From at least one feature of at least one object in the second feature memory, determine a third feature center of the first object and a fourth feature center of at least one second object.
  • the third feature center may be determined based on the feature of the first object in the second feature memory library and the first sub-feature.
  • Each fourth feature center may be determined based on each feature of each second object in the second feature memory.
  • the feature center of each object can be calculated by the following formula (2-5), where c_x represents the feature center of the x-th object, B_k represents the feature set belonging to the k-th object in the mini-batch, m is the set update momentum coefficient, and f_i' is the first sub-feature of the i-th sample.
  • m can be 0.2.
  • when f_i' and B_k both belong to the same object, the feature center c_k belonging to that object will change; when f_i' and B_k do not belong to the same object, the feature center c_k remains consistent with the previous c_k.
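  • A minimal sketch of a momentum-style feature-center update consistent with the description above is given below; the exact form of formula (2-5) is not reproduced here, so the combination shown is an assumption:

```python
import torch

def update_center(center: torch.Tensor, batch_features: torch.Tensor, m: float = 0.2) -> torch.Tensor:
    """Momentum update of one object's feature center.

    center:         (D,) current center c_k stored in the feature memory.
    batch_features: (n, D) first sub-features f_i' in the mini-batch that belong to object k.
    m:              update momentum coefficient (0.2 in the text).
    If no feature of object k is present in the mini-batch, the center is
    returned unchanged, matching the behaviour described above.
    """
    if batch_features.numel() == 0:
        return center
    return m * center + (1.0 - m) * batch_features.mean(dim=0)
```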
  • Step S2842 based on the first sub-feature, the third feature center and each fourth feature center, determine the fifth sub-loss value.
  • the fifth sub-loss value may include but not limited to contrastive loss and the like.
  • the fifth sub-loss value can be calculated by the following formula (2-6), where a predefined temperature parameter is used, c_y represents the third feature center of the y-th object, c_z represents the z-th fourth feature center, f_i represents the first sub-feature of the i-th object, and ID_S represents the total number of objects in the training set.
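  • The contrastive term can be pictured with an InfoNCE-style sketch such as the following; the temperature value and the exact normalization are assumptions rather than the formula (2-6) of this disclosure:

```python
import torch
import torch.nn.functional as F

def center_contrastive_loss(f: torch.Tensor, pos_center: torch.Tensor,
                            neg_centers: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss pulling a sub-feature towards its own object's center.

    f:           (D,) first sub-feature of the sample.
    pos_center:  (D,) third feature center c_y of the sample's own object.
    neg_centers: (K, D) fourth feature centers c_z of the other objects.
    tau:         temperature parameter (value assumed here).
    """
    pos = torch.dot(f, pos_center) / tau        # similarity to own center
    neg = (neg_centers @ f) / tau               # similarities to the other centers
    logits = torch.cat([pos.unsqueeze(0), neg]) # positive logit first
    # Cross-entropy with target index 0 is -log(exp(pos) / sum(exp(all))).
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```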
  • Step S2843 based on the second sub-feature, the third feature center and each fourth feature center, determine the sixth sub-loss value.
  • the sixth sub-loss value may include but not limited to contrastive loss and the like.
  • the manner of determining the sixth sub-loss value may be the same as that of determining the fifth sub-loss value, see step S2842 for details.
  • Step S2844. Determine the fifth loss value based on the fifth sub-loss value and the sixth sub-loss value.
  • the fifth loss value may include, but is not limited to, the sum of the fifth sub-loss value and the sixth sub-loss value, a weighted sum of the fifth sub-loss value and the sixth sub-loss value, and the like.
  • those skilled in the art may determine the fifth loss value according to actual needs, which is not limited in the embodiments of the present disclosure.
  • the fifth loss value is determined based on the first sub-feature, the second sub-feature and the features of other objects. In this way, the accuracy of the fifth loss value can be improved, so as to accurately judge whether the first model has converged.
  • in some embodiments, the second network includes a fifth sub-network and a sixth sub-network, and step S23 includes steps S231 to S232, wherein:
  • Step S231. Use the fifth sub-network to aggregate the first sub-feature and the second sub-feature respectively with the second feature of at least one second object, to obtain the first aggregated sub-feature corresponding to the first sub-feature and the second aggregated sub-feature corresponding to the second sub-feature.
  • in some embodiments, the second network includes at least a fifth sub-network, and the fifth sub-network is used to aggregate the first sub-feature with the second feature of at least one second object to obtain the first aggregated sub-feature, and to aggregate the second sub-feature with the second feature of at least one second object to obtain the second aggregated sub-feature.
  • Step S232 Using the sixth sub-network, determine the first target sub-feature based on the first aggregated sub-feature, and determine the second target sub-feature based on the second aggregated sub-feature.
  • the second network further includes a sixth sub-network for determining the first target sub-feature based on the first aggregated sub-feature, and determining the second target sub-feature based on the second aggregated sub-feature.
  • the overall network structure of the first model is trained by introducing the features of the second object as noise at the feature level of the first image sample containing the first object, so that the robustness of the first model can be enhanced and the performance of the first model can be improved, thereby enabling the trained first model to more accurately re-identify objects in images containing multiple objects.
  • step S231 includes step S2311 to step S2314, wherein:
  • Step S2311 based on the first sub-feature and each second feature, determine a first attention matrix.
  • the first attention matrix is used to represent the degree of association between the first sub-feature and each second feature.
  • X second features belonging to at least one second object are determined, where X is a positive integer.
  • X can be 10.
  • the X second features that belong to second objects and are closest to the first sub-feature can be searched for in the second feature memory, and X first centers can be determined based on each second feature. When searching, the distance can be calculated based on the cosine distance between features.
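  • A minimal sketch of such a cosine-distance nearest-neighbour lookup in the feature memory is shown below; the function name and the default value X = 10 are assumptions used for illustration:

```python
import torch
import torch.nn.functional as F

def nearest_memory_features(query: torch.Tensor, memory: torch.Tensor, x: int = 10) -> torch.Tensor:
    """Return the X memory features closest to the query by cosine similarity.

    query:  (D,) first sub-feature of the first object.
    memory: (M, D) features of second objects stored in the second feature memory.
    """
    sims = F.cosine_similarity(memory, query.unsqueeze(0), dim=1)   # (M,) similarities
    top = sims.topk(k=min(x, memory.shape[0])).indices              # indices of the closest features
    return memory[top]                                              # (X, D) selected second features
```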
  • the network parameters of the fifth sub-network include a first prediction matrix and a second prediction matrix
  • step S2311 includes steps S2321 to S2323, wherein:
  • Step S2321 based on the first sub-feature and the first prediction matrix, determine the first prediction feature.
  • the first predictive feature can be calculated by the following formula (2-7):
  • f' represents the first sub-feature
  • Both d and d' are the feature dimensions of f'.
  • Step S2322. Based on each second feature and the second predictive matrix, determine a second predictive feature.
  • the second predictive feature can be calculated by the following formula (2-8):
  • Both d and d' are feature dimensions of the first sub-feature.
  • Step S2323 Determine a first attention matrix based on the first predictive feature and each second predictive feature.
  • the first attention matrix can be determined by the following formula (2-9), where X represents the total number of second features, i ∈ {1, 2, …, X}, and a predefined scaling factor is used.
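  • The role of the first and second prediction matrices can be pictured as a single-head scaled dot-product attention, as in the sketch below; the random matrices stand in for learned parameters, and the dimensions are assumptions rather than values from this disclosure:

```python
import torch

d, d_prime, X = 768, 256, 10          # assumed dimensions and number of second features
W1 = torch.randn(d, d_prime)          # first prediction matrix (query projection), learned in practice
W2 = torch.randn(d, d_prime)          # second prediction matrix (key projection), learned in practice

def attention_weights(f_prime: torch.Tensor, second_feats: torch.Tensor,
                      scale: float = d_prime ** 0.5) -> torch.Tensor:
    """Scaled dot-product attention between the first sub-feature and each second feature.

    f_prime:      (d,) first sub-feature f'.
    second_feats: (X, d) second features retrieved from the memory.
    Returns (X,) attention weights m_i that sum to 1.
    """
    f_q = f_prime @ W1                # first predictive feature, in the spirit of formula (2-7)
    f_c = second_feats @ W2           # second predictive features, in the spirit of formula (2-8)
    logits = (f_c @ f_q) / scale      # one logit per second feature
    return torch.softmax(logits, dim=0)
```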
  • Step S2312 based on each second feature and each first attention matrix, determine the first aggregation sub-feature.
  • the network parameters of the fifth sub-network also include a third prediction matrix
  • step S2312 includes steps S2331 to S2332, wherein:
  • Step S2331. Based on each second feature and the third predictive matrix, determine a third predictive feature.
  • the third predictive feature can be calculated by the following formula (2-10):
  • Both d and d' are feature dimensions of the first sub-feature.
  • Step S2332 based on each third predictive feature and each first attention matrix, determine the first aggregation sub-feature.
  • the first aggregation sub-feature can be determined by the following formula (2-11):
  • m i represents the i-th first attention matrix
  • f vi represents the i-th third predictive feature
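  • Continuing in the same spirit, a sketch of the attention-weighted aggregation with a value projection (the third prediction matrix) is given below; it is an illustration, not the exact formula (2-11):

```python
import torch

def aggregate(attention: torch.Tensor, second_feats: torch.Tensor, W3: torch.Tensor) -> torch.Tensor:
    """First aggregated sub-feature as an attention-weighted sum of value projections.

    attention:    (X,) first attention weights m_i (e.g. from the previous sketch).
    second_feats: (X, d) second features.
    W3:           (d, d_prime) third prediction matrix.
    """
    f_v = second_feats @ W3                            # third predictive features f_vi, shape (X, d_prime)
    return (attention.unsqueeze(1) * f_v).sum(dim=0)   # first aggregated sub-feature f_d, shape (d_prime,)
```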
  • Step S2313 Determine a second attention matrix based on the second sub-features and each second feature.
  • the second attention matrix is used to characterize the degree of association between the second sub-features and each second feature.
  • the manner of determining the second attention matrix may be the same as that of determining the first attention matrix, see step S2321 to step S2323.
  • Step S2314 based on each second feature and each second attention matrix, determine a second aggregation sub-feature.
  • the manner of determining the second aggregation sub-feature may be the same as that of determining the first aggregation sub-feature, see step S2331 to step S2332 for details.
  • each first center is divided into multiple parts by a multi-head operation, and an attention weight is assigned to each part, so as to ensure that more unique patterns similar to target objects and non-target objects can be aggregated. The robustness of the first model is thereby enhanced, so that the trained first model can more accurately re-identify objects in images containing multiple objects.
  • the sixth subnetwork includes the seventh subnetwork and the eighth subnetwork, and the above step S232 includes steps S2341 to S2343, wherein:
  • Step S2341 Determine an occlusion mask based on the first sub-image and the second sub-image.
  • the occlusion mask is used to represent the occlusion information of the image.
  • the occlusion mask may be determined based on pixel differences between the first sub-image and the second sub-image.
  • Step S2342 Using the seventh sub-network, determine the fifth sub-feature based on the first aggregation sub-feature and the occlusion mask, and determine the sixth sub-feature based on the second aggregation sub-feature and the occlusion mask.
  • the seventh sub-network may be an FFN 1 ( ⁇ ) neural network including two fully connected layers and an activation function.
  • the fifth sub-feature or the sixth sub-feature can be obtained by the following formula (2-12):
  • mask is the occlusion mask and f d is the first aggregated sub-feature or the second aggregated sub-feature.
  • Step S2343 Using the eighth sub-network, determine the first target sub-feature based on the first sub-feature and the fifth sub-feature, and determine the second target sub-feature based on the second sub-feature and the sixth sub-feature.
  • the eighth sub-network may be an FFN 2 ( ⁇ ) neural network including two fully connected layers and an activation function.
  • the first target sub-feature or the second target sub-feature can be obtained by the following formula (2-13):
  • f" is the fifth sub-feature or the sixth sub-feature
  • f' is the first sub-feature or the second sub-feature
  • in this way, the target feature is obtained, which can ensure that the features of other objects are added only to the human body part of the first object and not to the pre-identified occluded part of the object, in order to better simulate the features of multi-pedestrian images.
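  • A sketch of this two-FFN fusion is given below; how the occlusion mask is combined with the aggregated feature and how f' and f'' are fused are assumptions made for illustration only, not the exact formulas (2-12) and (2-13):

```python
import torch
import torch.nn as nn

d_model = 768  # assumed feature dimension

class FFN(nn.Module):
    """Two fully connected layers with an activation, as described for FFN1/FFN2."""
    def __init__(self, dim: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

ffn1, ffn2 = FFN(d_model), FFN(d_model)

def fuse(f_prime: torch.Tensor, f_d: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked FFN on the aggregated feature, then fusion with the original sub-feature.

    f_prime: (d,) first sub-feature f'.
    f_d:     (d,) first aggregated sub-feature.
    mask:    (d,) occlusion mask assumed here to be broadcast to the feature dimension.
    """
    f_double_prime = ffn1(mask * f_d)      # fifth sub-feature: diffusion limited to visible parts
    return ffn2(f_prime + f_double_prime)  # first target sub-feature (one plausible combination)
```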
  • Fig. 3 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 3, the method includes steps S31 to S37, wherein:
  • Step S31 acquiring a first image sample including a first object.
  • Step S32 using the first network of the first model to be trained, to perform feature extraction on the first image sample to obtain the first feature of the first object.
  • Step S33. Use the second network of the first model to update the first feature based on the second feature of at least one second object, to obtain the first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than the first threshold.
  • Step S34 Determine a target loss value based on the first target feature.
  • Step S35 based on the target loss value, update the model parameters of the first model at least once to obtain the trained first model.
  • steps S31 to S35 correspond to the above-mentioned steps S11 to S15 respectively, and for implementation, reference may be made to the specific implementation manners of the above-mentioned steps S11 to S15.
  • Step S36 Determine an initial second model based on the trained first model.
  • the network of the trained first model may be adjusted according to an actual usage scenario, and the adjusted first model may be determined as the initial second model.
  • the first model includes a first network and a second network, the second network in the trained first model can be removed, and the first network of the first model can be adjusted according to the actual scene , and determine the adjusted first model as the initial second model.
  • Step S37 based on at least one second image sample, update the model parameters of the second model to obtain a trained second model.
  • the second image sample may have label information, or may not have label information.
  • those skilled in the art may determine a suitable second image sample according to an actual application scenario, which is not limited here.
  • fine-tuning training may be performed on model parameters of the second model to obtain a trained second model.
  • an initial second model is determined based on the trained first model, and the model parameters of the second model are updated based on at least one second image sample to obtain the trained second model.
  • in this way, the model parameters of the trained first model can be migrated to the second model to suit various application scenarios, which can not only reduce the amount of calculation in practical applications, but also improve the training efficiency of the second model and the detection accuracy of the trained second model.
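  • A minimal sketch of this kind of transfer and fine-tuning is shown below; the class and attribute names (for example SecondModel and first_network) are hypothetical and only illustrate reusing the trained first network as a backbone:

```python
import torch
import torch.nn as nn

class SecondModel(nn.Module):
    """Second model built around the migrated first network (names are assumptions)."""
    def __init__(self, backbone: nn.Module, num_ids: int, feat_dim: int = 768):
        super().__init__()
        self.backbone = backbone                  # first network migrated from the trained first model
        self.head = nn.Linear(feat_dim, num_ids)  # task-specific head for the new scenario

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # Assumes the backbone outputs a (B, feat_dim) feature per image.
        return self.head(self.backbone(images))

# Hypothetical usage, assuming `first_model.first_network` exists after training:
# second_model = SecondModel(first_model.first_network, num_ids=702)
# optimizer = torch.optim.SGD(second_model.parameters(), lr=1e-3)
# ...then fine-tune on the second image samples with an ordinary training loop.
```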
  • Fig. 4 is an image recognition method provided by an embodiment of the present disclosure. As shown in Fig. 4, the method includes steps S41 to S42, wherein:
  • Step S41 acquiring a first image and a second image.
  • the first image and the second image may be any suitable images to be recognized. During implementation, those skilled in the art may select an appropriate image according to an actual application scenario, which is not limited by the embodiments of the present disclosure.
  • the first image may include an occluded image or an unoccluded image.
  • the sources of the first image and the second image may be the same or different.
  • both the first image and the second image are images captured by a camera.
  • the first image is an image captured by a camera
  • the second image may be a frame of an image in a video.
  • Step S42 using the trained target model, to recognize the object in the first image and the object in the second image, and obtain a recognition result.
  • the trained target model may include but not limited to at least one of the first model and the second model.
  • the recognition result indicates that the object in the first image and the object in the second image are the same object or different objects.
  • the first target feature corresponding to the first image and the second target feature corresponding to the second image are obtained respectively, and the recognition result is obtained based on the similarity between the first target feature and the second target feature.
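  • For illustration, such a similarity-based decision can be sketched as follows; the cosine similarity measure and the threshold value are assumptions, since a particular similarity metric is not fixed here:

```python
import torch
import torch.nn.functional as F

def same_object(target_feat_1: torch.Tensor, target_feat_2: torch.Tensor,
                threshold: float = 0.6) -> bool:
    """Decide whether two images show the same object by comparing their target features.

    target_feat_1, target_feat_2: (D,) first/second target features from the trained model.
    threshold:                    similarity threshold (value assumed for this sketch).
    """
    sim = F.cosine_similarity(target_feat_1.unsqueeze(0), target_feat_2.unsqueeze(0)).item()
    return sim >= threshold
```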
  • since the model training method in the above embodiments can introduce real noise at the feature level, or at both the image level and the feature level, to train the overall network structure of the target model, the robustness of the target model is enhanced and the performance of the target model is effectively improved. Therefore, recognizing images based on the first model and/or the second model obtained with the model training method in the above embodiments enables more accurate pedestrian re-identification.
  • FIG. 5A is a schematic diagram of the composition and structure of a model training system 50 provided by an embodiment of the present disclosure. As shown in FIG. 5A, the model training system 50 includes an augmentation part 51, an occlusion erasing part 52, a feature diffusion part 53, an updating part 54 and a feature memory part 55, wherein:
  • the augmentation part 51 is configured to at least perform occlusion processing on the first sub-image containing the first object to obtain the second sub-image.
  • the occlusion erasing part 52 is configured to use the first network of the first model to be trained to perform feature extraction on the first sub-image, obtain the first sub-feature of the first object, and perform feature extraction on the second sub-image, Get the second subfeature of the first object.
  • the feature diffusion part 53 is configured to use the second network of the first model to update the first sub-feature and the second sub-feature respectively based on the second feature of at least one second object, and obtain the first sub-feature corresponding to the first sub-feature A target sub-feature and a second target sub-feature corresponding to the second sub-feature, the similarity between each second object and the first object is not less than the first threshold.
  • the updating part 54 is configured to determine a target loss value based on the first target sub-feature and the second target sub-feature; based on the target loss value, update the model parameters of the first model at least once to obtain the trained first model.
  • the feature memory part 55 is configured to store at least one feature of at least one object.
  • the feature memory part 55 includes a first feature memory and a second feature memory, the first feature memory is used to store the first sub-feature of at least one object, and the second feature memory is used to store at least The first target subfeature of an object.
  • FIG. 5B is a schematic diagram of a model training system 500 provided by an embodiment of the present disclosure.
  • the model training system 500 performs augmentation processing on an input first image 501 to obtain a second image 502; after the first image 501 and the second image 502 are input to the occlusion erasing part 52, the first sub-feature f1' and the second sub-feature f2' are obtained respectively, and the second feature memory 552 is updated based on the first sub-feature f1'; after the first sub-feature f1', the second sub-feature f2' and at least one feature of at least one other object selected from the second feature memory 552 are input to the feature diffusion part 53, the first target sub-feature fd1' and the second target sub-feature fd2' are obtained respectively; the first feature memory 551 is updated based on the first target sub-feature fd1', and the network parameters of the first model are updated based on the target loss value.
  • the augmentation part 51 is further configured to: determine an occlusion mask based on the first sub-image and the second sub-image.
  • Fig. 5C is a schematic diagram of determining an occlusion mask provided by an embodiment of the present disclosure. As shown in Fig. 5C, a pixel comparison operation 503 is performed between the first sub-image 501 and the second sub-image 502, and after the pixel comparison operation 503 , perform a binarization operation 504 on the comparison result, and obtain a corresponding occlusion mask 505 after the binarization operation 504 .
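  • A minimal sketch of such a pixel-comparison-plus-binarization step is given below; the stripe-wise (per-part) granularity and the threshold are assumptions introduced for illustration:

```python
import torch

def occlusion_mask(first_img: torch.Tensor, second_img: torch.Tensor,
                   parts: int = 4, eps: float = 1e-3) -> torch.Tensor:
    """Per-part occlusion mask from pixel differences between the two sub-images.

    first_img, second_img: (C, H, W) tensors; the second image is the occluded copy.
    parts:                 number of horizontal stripes (body parts), assumed here.
    Returns a (parts,) tensor with 1 for unchanged (visible) stripes and 0 for occluded ones.
    """
    diff = (first_img - second_img).abs().mean(dim=0)          # (H, W) pixel comparison
    stripes = diff.chunk(parts, dim=0)                         # split along the height
    changed = torch.stack([(s.mean() > eps) for s in stripes]) # binarization per stripe
    return (~changed).float()                                  # 1 = visible, 0 = occluded
```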
  • the first network includes a first sub-network and a second sub-network
  • the occlusion erasing part 52 is further configured to: use the first sub-network of the first model to be trained to respectively perform the first sub-image Perform feature extraction with the second sub-image to obtain the third sub-feature corresponding to the first sub-image and the fourth sub-feature corresponding to the second sub-image; use the second sub-network of the first model to determine the first sub-feature based on the third sub-feature sub-features, and determine the second sub-features based on the fourth sub-features.
  • FIG. 5D is a schematic diagram of a first network 510 provided by an embodiment of the present disclosure.
  • the first network 510 includes a first sub-network 511 and a second sub-network 512.
  • the first sub-image 501 and the second sub-image 502 are input into the first sub-network 511 to obtain the third sub-feature f1 corresponding to the first sub-image 501 and the fourth sub-feature f2 corresponding to the second sub-image 502, and the third sub-feature f1 and the fourth sub-feature f2 are input into the second sub-network 512 to obtain the first sub-feature f1' and the second sub-feature f2'.
  • the second subnetwork includes a third subnetwork and a fourth subnetwork
  • the occlusion erasing part 52 is further configured to: use the third subnetwork of the first model to determine the first occlusion score, and determine the second occlusion score based on the fourth sub-feature; utilize the fourth sub-network, based on the third sub-feature and the first occlusion score, determine the first sub-feature, and based on the fourth sub-feature and the second occlusion score, Determine the second sub-feature.
  • FIG. 5E is a schematic diagram of a second subnetwork 512 provided by an embodiment of the present disclosure.
  • the second sub-network 512 includes a third sub-network 521 and a fourth sub-network 522; the third sub-feature f1 and the fourth sub-feature f2 are input into the third sub-network 521 to respectively obtain the first occlusion score s1 corresponding to the third sub-feature f1 and the second occlusion score s2 corresponding to the fourth sub-feature f2; the first occlusion score s1 and the third sub-feature f1 are input to the fourth sub-network 522 to obtain the first sub-feature f1', and the second occlusion score s2 and the fourth sub-feature f2 are input to the fourth sub-network 522 to obtain the second sub-feature f2'.
  • the second network includes a fifth sub-network and a sixth sub-network
  • the feature diffusion part 53 is further configured to: use the fifth sub-network to aggregate the first sub-feature and the second sub-feature respectively with the second feature of at least one second object, to obtain the first aggregated sub-feature corresponding to the first sub-feature and the second aggregated sub-feature corresponding to the second sub-feature; and use the sixth sub-network to determine the first target sub-feature based on the first aggregated sub-feature and determine the second target sub-feature based on the second aggregated sub-feature.
  • FIG. 5F is a schematic diagram of a second network 520 provided by an embodiment of the present disclosure.
  • the second network 520 includes a fifth sub-network 521 and a sixth sub-network 522. The first sub-feature f1' is input to the fifth sub-network 521, which searches the second feature memory 552 for the K nearest first centers belonging to second objects based on the first sub-feature f1'. Based on the first sub-feature f1' and the first prediction matrix W1, the first prediction feature f_q is determined; based on the first centers and the second prediction matrix W2, the second prediction features f_c are determined; and based on the first centers and the third prediction matrix W3, the third prediction features f_v are determined.
  • the first attention matrix m_i is determined based on the first prediction feature f_q and the second prediction feature f_c, and the first aggregated sub-feature f_d is determined based on the first attention matrix m_i and the third prediction feature f_v.
  • the first aggregated sub-feature f_d is then processed by the sixth sub-network 522 to obtain the first target sub-feature f_d'.
  • the feature diffusion part 53 is further configured to: determine a first attention matrix based on the first sub-feature and each second feature, and the first attention matrix is used to characterize the first sub-feature and each second feature The degree of association between the second features; based on each second feature and each first attention matrix, determine the first aggregation sub-feature; based on the second sub-feature and each second feature, determine the second attention matrix, The second attention matrix is used to characterize the degree of association between the second sub-features and each second feature; based on each second feature and each second attention matrix, the second aggregated sub-features are determined.
  • the network parameters of the fifth sub-network include a first prediction matrix and a second prediction matrix
  • the feature diffusion part 53 is further configured to: determine the first prediction feature based on the first sub-feature and the first prediction matrix ; Based on each second feature and the second predictive matrix, determine a second predictive feature; determine the first attention matrix based on the first predictive feature and each second predictive feature.
  • the network parameters of the fifth sub-network include a third predictive matrix
  • the feature diffusion part 53 is further configured to: determine a third predictive feature based on each second feature and the third predictive matrix; The third predictive feature and each of the first attention matrices determine a first aggregated sub-feature.
  • the sixth sub-network includes a seventh sub-network and an eighth sub-network
  • the feature diffusion part 53 is further configured to: use the seventh sub-network to determine the fifth sub-network based on the first aggregation sub-feature and the occlusion mask Sub-features, and determine the sixth sub-feature based on the second aggregated sub-feature and the occlusion mask; use the eighth sub-network, based on the first sub-feature and the fifth sub-feature, determine the first target sub-feature, and based on the second sub-feature and a sixth sub-feature to determine the second target sub-feature.
  • the updating part 54 is further configured to: determine the first target loss value based on the first target sub-feature and the second target sub-feature; determine the second target loss value based on the first sub-feature and the second sub-feature Loss value; determining the target loss value based on the first target loss value and the second target loss value; based on the target loss value, updating the model parameters of the first model at least once to obtain the trained first model.
  • the updating part 54 is further configured to: update the model parameters of the first model when the target loss value does not meet the preset condition, to obtain the updated first model, based on the updated The first model is to determine the first model after training; if the target loss value satisfies the preset condition, determine the updated first model as the first model after training.
  • the updating part 54 is further configured to: determine the first target sub-loss value based on the first sub-feature and the second sub-feature; determine the second target sub-loss value based on the third sub-feature and the fourth sub-feature; and determine the second target loss value based on the first target sub-loss value and the second target sub-loss value.
  • in some embodiments, the first sub-image includes label information, the first model includes a second feature memory, and the second feature memory includes at least one feature belonging to at least one object; the updating part 54 is further configured to: determine the third loss value based on the first occlusion score, the second occlusion score and the occlusion mask; determine the fourth loss value based on the first sub-feature, the second sub-feature and the label information; determine the fifth loss value based on the first sub-feature, the second sub-feature and at least one feature of at least one object in the second feature memory; and determine the first target sub-loss value based on the third loss value, the fourth loss value and the fifth loss value.
  • the updating part 54 is further configured to: determine the first sub-loss value based on the first occlusion score and the occlusion mask; determine the second sub-loss value based on the second occlusion score and the occlusion mask; The first sub-loss value and the second sub-loss value determine the third loss value.
  • the updating part 54 is further configured to: determine a third sub-loss value based on the first sub-feature and label information; determine a fourth sub-loss value based on the second sub-feature and label information; The sub-loss value and the fourth sub-loss value determine the fourth loss value.
  • the updating part 54 is further configured to: determine the third feature center of the first object and the fourth feature center of at least one second object from at least one feature of at least one object in the second feature memory; determine the fifth sub-loss value based on the first sub-feature, the third feature center and each fourth feature center; determine the sixth sub-loss value based on the second sub-feature, the third feature center and each fourth feature center; and determine the fifth loss value based on the fifth sub-loss value and the sixth sub-loss value.
  • the updating part 54 is further configured to: determine the seventh sub-loss value based on the third sub-feature and label information; determine the eighth sub-loss value based on the fourth sub-feature and label information; The sub-loss value and the eighth sub-loss value determine the second target sub-loss value.
  • FIG. 5G is a schematic diagram of obtaining a target loss value 540 provided by an embodiment of the present disclosure.
  • the target loss value 540 mainly includes the loss values of three parts: the feature extraction part, the occlusion erasing part 52 and the feature diffusion part 53, wherein:
  • the loss values for this part of feature extraction include:
  • the loss values for this part of the occlusion erasure part 52 include:
  • the loss values for this part of the characteristic diffusion part 53 include:
  • the ninth sub-loss value Loss11 (corresponding to the above-mentioned first loss value) determined based on the first target sub-feature fd1' and the label information of the first sub-image 501, and the tenth sub-loss value Loss12 (corresponding to the above-mentioned first loss value) determined based on the second target sub-feature fd2' and the label information of the first sub-image 501;
  • the eleventh sub-loss value Loss21 (corresponding to the above-mentioned second loss value) determined based on the first target sub-feature fd1' and the first feature memory 551, and the twelfth sub-loss value Loss22 (corresponding to the above-mentioned second loss value) determined based on the second target sub-feature fd2' and the first feature memory 551.
  • the model training system further includes: a second determination part and a third determination part; the second determination part is configured to determine an initial second model based on the trained first model; the third determination part, It is configured to update the model parameters of the second model based on at least one second image sample to obtain the trained second model.
  • the method provided by the embodiment of the present disclosure has at least the following improvements:
  • in the related art, pedestrian re-identification (ReID) modeling is mainly based on pose estimation algorithms or human parsing algorithms for auxiliary training.
  • the modeling of pedestrian re-identification in the embodiment of the present disclosure uses deep learning to perform occluded pedestrian re-identification.
  • a Feature Erasing and Diffusion Network (FED) is proposed to simultaneously process non-pedestrian occlusions (NPO) and non-target pedestrians (NTP). Specifically, an Occlusion Erasing Module (OEM) eliminates NPO features, supplemented by an NPO augmentation strategy that simulates NPO on holistic pedestrian images and generates accurate occlusion masks.
  • then, the pedestrian features and other memorized features are diffused to synthesize NTP characteristics in the feature space, which realizes the simulation of NPO interference at the image level and NTP interference at the feature level.
  • TP denotes target pedestrians.
  • the method provided by the embodiments of the present disclosure has at least the following beneficial effects: 1) it makes full use of the occlusion information of the image and the features of other pedestrians to simulate the interference of non-pedestrian occlusions and non-target pedestrians, can better comprehensively analyze various influencing factors, and improves the model's perception of TP; 2) it uses deep learning to make the results of pedestrian re-identification more accurate, and improves the accuracy of pedestrian re-identification in real and complex scenes.
  • Abbreviations: Occluded-DukeMTMC (O-Duke), Occluded-REID (O-REID), Partial-REID (P-REID), Cumulative Matching Characteristic (CMC), mean Average Precision (mAP).
  • Table 1 shows the performance comparison of each pedestrian ReID method on the three data sets of O-Duke, O-REID and P-REID. Since there is no corresponding training set for O-REID and P-REID, the model trained on Market-1501 is used for testing.
  • the pedestrian ReID methods compared include: Part-based Convolutional Baseline (PCB), Deep Spatial feature Reconstruction (DSR), High-Order re-identification (HOReID), Part-Aware Transformer (PAT), and Transformer-based Object Re-Identification (TransReID), which adopts a Vision Transformer without sliding-window settings.
  • FED achieves the highest Rank-1 and mAP on both O-Duke and O-REID datasets. Especially on the O-REID dataset, it reached 86.3%/79.3% on Rank-1/mAP, surpassing other methods by at least 4.7%/2.6%. On O-Duke, it reaches 68.1%/56.4% on Rank-1/mAP, surpassing other methods by at least 3.6%/0.7%. On the P-REID dataset, the highest mAP accuracy is achieved, reaching 80.5%, which exceeds other methods by 3.9%. Therefore, a good performance is achieved on the occluded ReID dataset.
  • the ablation results of NPO Augmentation (NPO Aug), the Occlusion Erasing Module (OEM) and the Feature Diffusion Module (FDM) are shown.
  • Numbers 1 to 5 represent baseline, baseline+NPO Aug, baseline+NPO Aug+OEM, baseline+NPO Aug+FDM and FED, respectively.
  • Model 1 uses ViT as the feature extractor and is optimized by cross-entropy loss (ID Loss) and Triplet Loss.
  • ID Loss refers to the cross-entropy loss, and Triplet Loss refers to the triplet loss.
  • FDM improves Rank-1 and mAP by 1.7% and 2.4%, respectively. This means that optimizing a network with diffusion features can greatly improve the model's perception of TP. In the end, FED achieved the highest accuracy, showing that each component works both individually and together.
  • the number of searches K in the feature memory search operation is analyzed.
  • K is set to 2, 4, 6 and 8, and experiments are performed on DukeMTMC-reID, Market-1501 and Occluded-DukeMTMC.
  • the performance on the two holistic person ReID datasets, DukeMTMC-reID and Market-1501, is stable across the various K values, varying within 0.5%.
  • for Market-1501, NPO and NTP are few, failing to highlight the effectiveness of FDM.
  • for DukeMTMC-reID, a large amount of training data comes with NPO and NTP, and the loss constraints can make the network achieve high accuracy.
  • for Occluded-DukeMTMC, since all the training data are holistic pedestrians, introducing FDM can effectively simulate the multi-pedestrian situation in the test set. As K increases, FDM can better preserve the characteristics of TP and introduce realistic noise.
  • FIG. 5H is a schematic diagram of occlusion scores of pedestrian images provided by an embodiment of the present disclosure.
  • the occlusion scores of some pedestrian images from the OEM are shown, including images with NPO and with non-target pedestrians (NTP). From FIG. 5H, it can be seen that for images 551 and 552 with vertical object occlusion, the occlusion score is hardly affected, because symmetric pedestrians with less than half occlusion are not a critical issue for pedestrian ReID.
  • the OEM can accurately identify NPO and flag it with a small occlusion score.
  • for images 555 and 556 with multi-pedestrian occlusion, the OEM identifies each stripe as valuable. Therefore, the subsequent FDM is crucial for improving the model performance.
  • FIG. 5I is a schematic diagram of an image retrieval result provided by an embodiment of the present disclosure. As shown in FIG. 5I , it shows the retrieval results of TransReID and FED.
  • Figure 561 and Figure 562 are object occlusion images. It is obvious that FED has a better recognition ability for NPO and can accurately retrieve the target pedestrian.
  • Figure 563 and Figure 564 are multi-pedestrian images, and FED has a stronger perception of TP and achieves higher retrieval accuracy.
  • FIG. 6 is a schematic diagram of the composition and structure of a model training device provided by an embodiment of the present disclosure.
  • the model training device 60 includes a first acquisition part 61 , feature extraction part 62 , first update part 63 , first determination part 64 and second update part 65 .
  • a first acquiring part 61 configured to acquire a first image sample containing a first object
  • the feature extraction part 62 is configured to use the first network of the first model to be trained to perform feature extraction on the first image sample to obtain the first feature of the first object;
  • the first updating part 63 is configured to use the second network of the first model to update the first features respectively based on the second features of at least one second object to obtain the first target features corresponding to the first features, each the similarity between the second object and the first object is not less than a first threshold;
  • the first determining part 64 is configured to determine a target loss value based on the first target feature
  • the second updating part 65 is configured to update the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
  • in some embodiments, the first image sample includes label information, the first model includes a first feature memory, and the first feature memory includes at least one feature belonging to at least one object; the first determination part 64 is further configured to: determine the first loss value based on the first target feature and the label information; determine the second loss value based on the first target feature and at least one feature of at least one object in the first feature memory; and determine the target loss value based on the first loss value and the second loss value.
  • the first determining part 64 is further configured to: determine the first feature center of the first object and the at least one second object from at least one feature of at least one object in the first feature memory library second feature center; determining a second loss value based on the first target feature, the first feature center, and each second feature center.
  • the first feature memory includes feature sets belonging to at least one object, each feature set includes at least one feature of the object to which it belongs, and the device further includes: a third updating part configured to update, based on the first target feature, the feature set belonging to the first object in the first feature memory.
  • the first acquisition part 61 is further configured to: acquire the first sub-image and the second sub-image containing the first object, the second sub-image is an image obtained by at least performing occlusion processing on the first sub-image
  • the feature extraction part 62 is also configured to: use the first network of the first model to be trained to perform feature extraction on the first sub-image, obtain the first sub-feature of the first object, and perform feature extraction on the second sub-image Extract to obtain the second sub-feature of the first object;
  • the first updating part 63 is also configured to: use the second network of the first model to update the first sub-feature and the second sub-feature respectively based on the second feature of at least one second object, to obtain the first target sub-feature corresponding to the first sub-feature and the second target sub-feature corresponding to the second sub-feature;
  • the first determining part 64 is also configured to: determine the target loss value based on the first target sub-feature and the second target sub-feature.
  • the first determination part 64 is further configured to: determine the first target loss value based on the first target sub-feature and the second target sub-feature; determine the second target loss value based on the first sub-feature and the second sub-feature; and determine the target loss value based on the first target loss value and the second target loss value.
  • the first acquisition part 61 is further configured to: acquire the first sub-image containing the first object; based on the preset occlusion set, at least perform occlusion processing on the first sub-image to obtain the second sub-image , the occlusion set includes at least one occlusion image.
  • in some embodiments, the first network includes a first sub-network and a second sub-network; the feature extraction part 62 is further configured to: use the first sub-network of the first model to be trained to perform feature extraction on the first sub-image and the second sub-image respectively, to obtain the third sub-feature corresponding to the first sub-image and the fourth sub-feature corresponding to the second sub-image; and use the second sub-network of the first model to determine the first sub-feature based on the third sub-feature and determine the second sub-feature based on the fourth sub-feature.
  • the first determining part 64 is further configured to: determine the first target sub-loss value based on the first sub-feature and the second sub-feature; determine the second target sub-loss value based on the third sub-feature and the fourth sub-feature; and determine the second target loss value based on the first target sub-loss value and the second target sub-loss value.
  • the first sub-image includes label information
  • the first determining part 64 is further configured to: determine a seventh sub-loss value based on the third sub-feature and label information; based on the fourth sub-feature and label information, Determine an eighth sub-loss value; determine a second target sub-loss value based on the seventh sub-loss value and the eighth sub-loss value.
  • the second sub-network includes a third sub-network and a fourth sub-network
  • the feature extraction part 62 is further configured to: use the third sub-network of the first model to determine the first occlusion score based on the third sub-feature, and determine the second occlusion score based on the fourth sub-feature; and use the fourth sub-network to determine the first sub-feature based on the third sub-feature and the first occlusion score, and determine the second sub-feature based on the fourth sub-feature and the second occlusion score.
  • the third subnetwork includes a pooling subnetwork and at least one occlusion erasure subnetwork
  • the first occlusion score includes at least one first occlusion subscore
  • the second occlusion score includes at least one second occlusion subscore
  • the feature extraction part 62 is also configured to: divide the third sub-feature into at least one third sub-part feature by using the pooling sub-network, and divide the fourth sub-feature into at least one fourth sub-part feature; use each The occlusion erasure sub-network determines each first occlusion subscore based on each third subsection feature, and determines each second occlusion subscore based on each fourth subsection feature.
  • the feature extraction part 62 is further configured to: use the fourth sub-network to determine the first sub-part feature based on each third sub-part feature and each first occlusion sub-score of the third sub-feature , and based on each fourth sub-part feature and each second occlusion sub-score of the fourth sub-feature, determine the second sub-part feature; based on each first sub-part feature, determine the first sub-feature, and based on each The second sub-part feature, to determine the second sub-feature.
  • in some embodiments, the first sub-image includes label information, the first model includes a second feature memory, and the second feature memory includes at least one feature belonging to at least one object; the first determining part 64 is further configured to: determine the occlusion mask based on the first sub-image and the second sub-image; determine the third loss value based on the first occlusion score, the second occlusion score and the occlusion mask; determine the fourth loss value based on the first sub-feature, the second sub-feature and the label information; determine the fifth loss value based on the first sub-feature, the second sub-feature and at least one feature of at least one object in the second feature memory; and determine the first target sub-loss value based on the third loss value, the fourth loss value and the fifth loss value.
  • the first determining part 64 is further configured to: divide the first sub-image and the second sub-image into at least one first sub-part image and at least one second sub-part image; An occlusion sub-mask is determined for a sub-partial image and each second sub-partial image; based on each occlusion sub-mask, an occlusion mask is determined.
  • the first determining part 64 is further configured to: determine the first sub-loss value based on the first occlusion score and the occlusion mask; determine the second sub-loss value based on the second occlusion score and the occlusion mask ; Determine a third loss value based on the first sub-loss value and the second sub-loss value.
  • the first determining part 64 is further configured to: determine a third sub-loss value based on the first sub-feature and label information; determine a fourth sub-loss value based on the second sub-feature and label information; The third sub-loss value and the fourth sub-loss value determine the fourth loss value.
  • the first determination part 64 is further configured to: determine the third feature center of the first object and the fourth feature center of at least one second object from at least one feature of at least one object in the second feature memory; determine the fifth sub-loss value based on the first sub-feature, the third feature center and each fourth feature center; determine the sixth sub-loss value based on the second sub-feature, the third feature center and each fourth feature center; and determine the fifth loss value based on the fifth sub-loss value and the sixth sub-loss value.
  • in some embodiments, the second network includes a fifth sub-network and a sixth sub-network; the first updating part 63 is further configured to: use the fifth sub-network to aggregate the first sub-feature and the second sub-feature respectively with the second feature of at least one second object, to obtain the first aggregated sub-feature corresponding to the first sub-feature and the second aggregated sub-feature corresponding to the second sub-feature; and use the sixth sub-network to determine the first target sub-feature based on the first aggregated sub-feature, and determine the second target sub-feature based on the second aggregated sub-feature.
  • the first updating part 63 is further configured to: determine a first attention matrix based on the first sub-feature and each second feature, and the first attention matrix is used to characterize the first sub-feature and each second feature A degree of association between the second features; based on each second feature and each first attention matrix, determine the first aggregation sub-feature; based on the second sub-feature and each second feature, determine the second attention matrix , the second attention matrix is used to characterize the degree of association between the second sub-feature and each second feature; based on each second feature and each second attention matrix, the second aggregation sub-feature is determined.
  • in some embodiments, the sixth sub-network includes a seventh sub-network and an eighth sub-network; the first updating part 63 is further configured to: use the seventh sub-network to determine the fifth sub-feature based on the first aggregated sub-feature and the occlusion mask, and determine the sixth sub-feature based on the second aggregated sub-feature and the occlusion mask; and use the eighth sub-network to determine the first target sub-feature based on the first sub-feature and the fifth sub-feature, and determine the second target sub-feature based on the second sub-feature and the sixth sub-feature.
  • FIG. 7 is a schematic diagram of the composition and structure of an image recognition device provided by an embodiment of the present disclosure.
  • the image recognition device 70 includes a second acquisition part 71 and identification part 72.
  • a second acquiring part 71 configured to acquire the first image and the second image
  • the identification part 72 is configured to use the trained target model to identify the object in the first image and the object in the second image to obtain a recognition result, wherein the trained target model includes the first model obtained by using the above-mentioned model training method, and the recognition result indicates that the object in the first image and the object in the second image are the same object or different objects.
  • a "part" may be a part of a circuit, a part of a processor, a part of a program or software, etc., of course it may also be a unit, a module or a non-modular one.
  • An embodiment of the present disclosure provides an electronic device, including a memory and a processor, the memory stores a computer program that can run on the processor, and the processor implements the above method when executing the computer program.
  • An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the foregoing method is implemented.
  • Computer readable storage media may be transitory or non-transitory.
  • An embodiment of the present disclosure provides a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program. When the computer program is read and executed by a computer, part or all of the steps in the above method are implemented.
  • the computer program product can be specifically realized by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in one embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and the like.
  • FIG. 8 is a schematic diagram of a hardware entity of an electronic device in an embodiment of the present disclosure.
  • the hardware entity of the electronic device 800 includes: a processor 801, a communication interface 802, and a memory 803, wherein:
  • the processor 801 generally controls the overall operation of the electronic device 800 .
  • the communication interface 802 can enable the electronic device to communicate with other terminals or servers through the network.
  • the memory 803 is configured to store instructions and applications executable by the processor 801, and can also cache data to be processed or already processed by the processor 801 and various modules in the electronic device 800 (for example, image data, audio data, voice communication data and video communication data); it can be realized by flash memory (FLASH) or random access memory (RAM). Data transmission may be performed between the processor 801, the communication interface 802 and the memory 803 through the bus 804.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are schematic.
  • the division of the units is a logical function division.
  • the coupling, or direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical or other forms of.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed to multiple network units; Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • all the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may be used as a single unit, or two or more units may be integrated into one unit; the above-mentioned integrated unit can be realized in the form of hardware or in the form of hardware plus a software functional unit.
  • the essence of the technical solution of the present disclosure or the part that contributes to related technologies can be embodied in the form of software products, which are stored in a storage medium and include several instructions to make a An electronic device (which may be a personal computer, a server, or a network device, etc.) executes all or part of the methods described in various embodiments of the present disclosure.
  • the aforementioned storage medium includes various media capable of storing program codes such as removable storage devices, ROMs, magnetic disks or optical disks.
  • Embodiments of the present disclosure provide a model training and image recognition method, device, storage medium, and computer program product.
  • the model training method includes: acquiring a first image sample containing a first object; using the first network of the first model to be trained to perform feature extraction on the first image sample to obtain the first feature of the first object; using the second network of the first model to update the first feature based on the second feature of at least one second object to obtain the first target feature corresponding to the first feature, where the similarity between each second object and the first object is not less than the first threshold; determining the target loss value based on the first target feature; and updating the model parameters of the first model at least once based on the target loss value to obtain the trained first model.
  • on the one hand, the above scheme can enhance the robustness of the first model and improve the performance of the first model; on the other hand, it can improve the consistency of the trained first model's predictions for different image samples of the same object, thereby enabling the trained first model to more accurately re-identify objects in images containing multiple objects.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure relate to model training and image recognition methods and apparatuses, a device, a storage medium and a computer program product. The model training method comprises: acquiring a first image sample containing a first object; performing feature extraction on the first image sample by using a first network of a first model to be trained, so as to obtain a first feature of the first object; updating the first feature on the basis of a second feature of at least one second object by using a second network of the first model, so as to obtain a first target feature corresponding to the first feature, a degree of similarity between each second object and the first object being not lower than a first threshold; determining a target loss value on the basis of the first target feature; and updating a model parameter of the first model at least once on the basis of the target loss value, so as to obtain a trained first model.
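Once trained, such a model is typically applied to re-identification by comparing features. The sketch below is a minimal illustration rather than the disclosed procedure: it ranks a gallery of images against a query image by the cosine similarity of features produced by a trained extractor. The toy extractor and the function name rank_gallery are assumptions introduced only for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def rank_gallery(extractor: nn.Module, query: torch.Tensor, gallery: torch.Tensor):
    """Return gallery indices sorted by descending cosine similarity to the query image."""
    extractor.eval()
    q = F.normalize(extractor(query.unsqueeze(0)), dim=-1)   # (1, D)
    g = F.normalize(extractor(gallery), dim=-1)              # (N, D)
    sims = (q @ g.t()).squeeze(0)                            # (N,)
    return sims.argsort(descending=True), sims


if __name__ == "__main__":
    # Stand-in extractor: any module mapping (N, 3, H, W) images to (N, D) features,
    # e.g. the first network of the trained first model.
    extractor = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 16))
    query, gallery = torch.randn(3, 64, 64), torch.randn(10, 3, 64, 64)
    order, sims = rank_gallery(extractor, query, gallery)
    print(order[:5].tolist(), sims[order[:5]].tolist())
```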
PCT/CN2022/127109 2022-01-28 2022-10-24 Procédés et appareils d'entraînement de modèle et de reconnaissance d'image, dispositif, support de stockage et produit-programme informatique WO2023142551A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210107742.9A CN114445681A (zh) 2022-01-28 2022-01-28 模型训练及图像识别方法和装置、设备及存储介质
CN202210107742.9 2022-01-28

Publications (1)

Publication Number Publication Date
WO2023142551A1 true WO2023142551A1 (fr) 2023-08-03

Family

ID=81371764

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/127109 WO2023142551A1 (fr) 2022-01-28 2022-10-24 Procédés et appareils d'entraînement de modèle et de reconnaissance d'image, dispositif, support de stockage et produit-programme informatique

Country Status (2)

Country Link
CN (1) CN114445681A (fr)
WO (1) WO2023142551A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445681A (zh) * 2022-01-28 2022-05-06 上海商汤智能科技有限公司 模型训练及图像识别方法和装置、设备及存储介质
CN115022282B (zh) * 2022-06-06 2023-07-21 天津大学 一种新型域名生成模型建立及应用
CN115393953B (zh) * 2022-07-28 2023-08-08 深圳职业技术学院 基于异构网络特征交互的行人重识别方法、装置及设备

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329785A (zh) * 2020-11-25 2021-02-05 Oppo广东移动通信有限公司 图像管理方法、装置、终端及存储介质
CN113421192A (zh) * 2021-08-24 2021-09-21 北京金山云网络技术有限公司 对象统计模型的训练方法、目标对象的统计方法和装置
CN113780243A (zh) * 2021-09-29 2021-12-10 平安科技(深圳)有限公司 行人图像识别模型的训练方法、装置、设备以及存储介质
CN114445681A (zh) * 2022-01-28 2022-05-06 上海商汤智能科技有限公司 模型训练及图像识别方法和装置、设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117372818A (zh) * 2023-12-06 2024-01-09 深圳须弥云图空间科技有限公司 目标重识别方法及装置
CN117372818B (zh) * 2023-12-06 2024-04-12 深圳须弥云图空间科技有限公司 目标重识别方法及装置

Also Published As

Publication number Publication date
CN114445681A (zh) 2022-05-06

Similar Documents

Publication Publication Date Title
WO2023142551A1 (fr) Procédés et appareils d'entraînement de modèle et de reconnaissance d'image, dispositif, support de stockage et produit-programme informatique
Cheng et al. Low-resolution face recognition
WO2021077984A1 (fr) Procédé et appareil de reconnaissance d'objets, dispositif électronique et support de stockage lisible
CN110163115B (zh) 一种视频处理方法、装置和计算机可读存储介质
Su et al. Multi-type attributes driven multi-camera person re-identification
Kao et al. Hierarchical aesthetic quality assessment using deep convolutional neural networks
CN102549603B (zh) 基于相关性的图像选择
Bianco et al. Predicting image aesthetics with deep learning
US20230087863A1 (en) De-centralised learning for re-indentification
CN107003977A (zh) 用于组织存储在移动计算设备上的照片的系统、方法和装置
Dai et al. Cross-view semantic projection learning for person re-identification
Douze et al. The 2021 image similarity dataset and challenge
WO2020224221A1 (fr) Procédé et appareil de suivi, dispositif électronique et support d'informations
CN110516707B (zh) 一种图像标注方法及其装置、存储介质
CN112508094A (zh) 垃圾图片的识别方法、装置及设备
CN112818995B (zh) 图像分类方法、装置、电子设备及存储介质
US20130343618A1 (en) Searching for Events by Attendants
Wieschollek et al. Transfer learning for material classification using convolutional networks
CN114299304B (zh) 一种图像处理方法及相关设备
Guehairia et al. Deep random forest for facial age estimation based on face images
Zhang et al. Multi-level and multi-scale horizontal pooling network for person re-identification
CN111666976A (zh) 基于属性信息的特征融合方法、装置和存储介质
Deng et al. A deep multi-feature distance metric learning method for pedestrian re-identification
Zhang et al. Complementary networks for person re-identification
Islam et al. Large-scale geo-facial image analysis

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22923349

Country of ref document: EP

Kind code of ref document: A1