WO2023115911A1 - Object re-identification method and apparatus, electronic device, storage medium, and computer program product - Google Patents


Info

Publication number
WO2023115911A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
category
loss
sample
network
Prior art date
Application number
PCT/CN2022/104715
Other languages
French (fr)
Chinese (zh)
Inventor
王皓琦
王新江
钟志权
张伟
Original Assignee
上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Publication of WO2023115911A1 publication Critical patent/WO2023115911A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/11: Region-based segmentation
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/762: using clustering, e.g. of similar faces in social networks
    • G06V 10/764: using classification, e.g. of video objects
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to an object re-identification method and device, electronic equipment, storage media and computer program products.
  • Re-identification technology is widely used in various projects, such as re-identification of people, vehicles, and objects.
  • in practical applications, new situations may appear at any time, and correspondingly, previously unseen data will be generated.
  • traditional re-identification algorithms require a large number of annotated samples for training; when the data set shifts or the domain shifts, new data or samples from the new domain must be re-labeled, which consumes considerable manpower and material resources.
  • unsupervised re-identification methods in the related art often produce re-identification results of low accuracy due to influences such as scene variation.
  • Embodiments of the present disclosure propose an object re-identification method and device, electronic equipment, a storage medium, and a computer program product, aiming to improve the accuracy of re-identification results through a re-identification model obtained through unsupervised training.
  • an object re-identification method including:
  • inputting an image to be recognized and an image set into a re-identification network to obtain a re-identification result, where, if a target candidate image exists in the image set, the re-identification result includes the target candidate image, and an object included in the target candidate image matches the target object;
  • the re-identification network is obtained through two-stage training: the first-stage training process is implemented according to at least one sample image and the first category label of each of the sample images, and the second-stage training process is implemented according to the at least one sample image and the pseudo-label and first category label of each of the sample images, where the pseudo-label of each sample image is determined based on the re-identification network after the first-stage training process, and the first category label represents the category of the corresponding image.
  • an object re-identification device including:
  • An image determination module configured to determine an image to be recognized including a target object
  • a set determination module configured to determine a set of images comprising at least one candidate image, each of said candidate images comprising an object
  • a re-identification module configured to input the image to be recognized and the image set into a re-identification network to obtain a re-identification result, where, if a target candidate image exists in the image set, the re-identification result includes the target candidate image, and the object included in the target candidate image matches the target object;
  • the re-identification network is obtained through two-stage training: the first-stage training process is implemented according to at least one sample image and the first category label of each of the sample images, and the second-stage training process is implemented according to the at least one sample image and the pseudo-label and first category label of each of the sample images, where the pseudo-label of each sample image is determined based on the re-identification network after the first-stage training process, and the first category label represents the category of the corresponding image.
  • an electronic device including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to call the instructions stored in the memory, to perform the above method.
  • a computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the foregoing method is implemented.
  • a computer program product includes a non-transitory computer-readable storage medium storing a computer program, and when the computer program is read and executed by a computer, the computer performs part or all of the steps of the methods described in the embodiments of the present disclosure.
  • the computer program product may be a software installation package.
  • the performance of the re-identification network is improved through two-stage training, thereby improving the accuracy of the recognition result.
  • FIG. 1 shows a flowchart of an object re-identification method according to an embodiment of the present disclosure
  • Fig. 2 shows a flow chart of training a re-identification network according to an embodiment of the present disclosure
  • Fig. 3 shows a schematic diagram of a preset image according to an embodiment of the present disclosure
  • Fig. 4 shows a schematic diagram of a sample image according to an embodiment of the present disclosure
  • Fig. 5 shows a schematic diagram of determining a sample graph according to an embodiment of the present disclosure
  • FIG. 6 shows a schematic diagram of a first-stage training process of a re-identification network according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic diagram of a second-stage training process of a re-identification network according to an embodiment of the present disclosure
  • Fig. 8 shows a schematic diagram of an object re-identification device according to an embodiment of the present disclosure
  • Fig. 9 shows a schematic diagram of an electronic device according to an embodiment of the present disclosure.
  • Fig. 10 shows a schematic diagram of another electronic device according to an embodiment of the present disclosure.
  • the object re-identification method in the embodiment of the present disclosure may be executed by an electronic device such as a terminal device or a server.
  • the terminal device may be any mobile or fixed terminal such as user equipment (User Equipment, UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device.
  • the server can be a single server or a server cluster composed of multiple servers. Any such electronic device can implement the object re-identification method of the embodiments of the present disclosure by having its processor call computer-readable instructions stored in the memory.
  • the object re-identification method of the embodiment of the present disclosure can be applied to re-identify any object, such as a person, a vehicle, and an animal.
  • the re-identification method can search for images or video frames containing a specific object among multiple images or video frame sequences, and can be applied to scenarios such as searching for a specific person in images collected by multiple cameras, or tracking objects such as pedestrians and vehicles.
  • Fig. 1 shows a flowchart of an object re-identification method according to an embodiment of the present disclosure.
  • the object re-identification method of the embodiment of the present disclosure may include the following steps S10 to S30.
  • Step S10 determining the image to be recognized including the target object.
  • the image to be recognized may be an image directly obtained by capturing the target object, or an image obtained by cropping, from a captured image, the area where the target object is located.
  • the image to be recognized may be collected by an image acquisition device built into or connected to the electronic device, or directly received from another device.
  • the target object can be any movable or non-movable object such as people, animals, vehicles or even furniture.
  • Step S20 determining an image set including at least one candidate image, each of which includes an object.
  • an image set used as a basis for re-identification of the image to be recognized is determined, including at least one candidate image for matching with the image to be recognized.
  • the image collection may be pre-stored in the electronic device, or in a database connected to the electronic device.
  • each candidate image is obtained by capturing an object of the same kind as the target object, and may be an image obtained by directly capturing the object, or an image obtained by cropping, from a captured image, the area where the object is located. That is, the object in each candidate image is of the same kind as the target object: for example, when the target object is a person, the object in each candidate image is also a person; when the target object is a vehicle, the object in each candidate image is also a vehicle.
  • each candidate image in the image set also has a corresponding second category label, which is used to characterize the category of the object in the candidate image.
  • the second category label may be identity information such as the object's name, phone number, and ID card number.
  • the second category label may be the vehicle's license plate number, vehicle owner information, driving certificate number, and the like.
  • Step S30 input the image to be recognized and the set of images into a re-recognition network to obtain a re-recognition result.
  • the image to be recognized and the image set are input into the re-identification network, which determines, among the multiple candidate images, the candidate image whose object matches the target object and uses it as the target candidate image to obtain the re-identification result. That is, when a target candidate image whose included object matches the target object exists, the target candidate image may be included in the re-identification result.
  • the re-identification result may also include the category of the target object. That is, after the target candidate image is determined, the second category label corresponding to the target candidate image is also determined as the second category label of the image to be recognized.
  • the detailed process of determining the re-identification result through the re-identification network may be as follows: input the image to be recognized and the image set into the re-identification network; extract, through the re-identification network, the target object features of the image to be recognized and the candidate object features of each candidate image; then determine the similarity between each candidate image and the image to be recognized according to the target object features and each set of candidate object features. In response to the similarity between a candidate image and the image to be recognized satisfying a preset condition, it is determined that the object in that candidate image matches the target object, and that candidate image is used as the target candidate image.
  • the target object features can be obtained by cropping the region where the target object is located in the image to be recognized and extracting features of that region through the feature extraction layer of the re-identification network. The candidate object features can likewise be obtained by cropping the area where the object is located in each candidate image and extracting features of that area through the same feature extraction layer.
  • the features of the target object and each candidate object can be represented by vectors, and the similarity can be obtained by calculating the distance between the two corresponding vectors in the feature space. The similarity can be calculated by the following Formula 1 (cosine similarity):

    similarity(A, B) = ( Σ_{i=1}^{n} A_i · B_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )    (Formula 1)

    where similarity(A, B) is the similarity between A and B; A is the target object feature vector; B is the candidate object feature vector; n is the number of elements in the target object feature and the candidate object feature; and i indexes the current element, that is, the position of the current element within the feature vectors.
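As an illustrative sketch (not part of the disclosure), the similarity of Formula 1 can be computed in plain Python; the function name is an assumption:

```python
import math

def cosine_similarity(a, b):
    """Similarity between target-object feature a and candidate-object feature b,
    computed as the cosine of the angle between the two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same number of elements n")
    dot = sum(x * y for x, y in zip(a, b))      # sum over A_i * B_i
    norm_a = math.sqrt(sum(x * x for x in a))   # Euclidean norm of A
    norm_b = math.sqrt(sum(y * y for y in b))   # Euclidean norm of B
    return dot / (norm_a * norm_b)
```

Identical vectors yield a similarity of 1, and orthogonal vectors yield 0, which matches the intended use of picking the closest candidate in feature space.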
  • the preset condition may be that the similarity value is the largest and greater than a similarity threshold, that is, the candidate image with the largest similarity value that also exceeds the similarity threshold is determined as the target candidate image.
  • the second category label of the target candidate image is determined as the second category label of the image to be recognized, and a re-identification result including the target candidate image and the corresponding second category label is determined.
  • when no candidate image has a similarity satisfying the preset condition, the category of the target object in the current image to be recognized is a new category, and the re-identification result is determined to be a new category.
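The selection rule above (largest similarity, and above the similarity threshold, otherwise a new category) can be sketched as follows; the function name and the threshold value are illustrative assumptions, not values from the disclosure:

```python
def select_target_candidate(sims, threshold=0.5):
    """Pick the candidate with the largest similarity; if even the best
    similarity does not exceed the threshold, return None to signal that
    the target object belongs to a new category.
    sims: list of (candidate_id, similarity) pairs."""
    if not sims:
        return None
    best_id, best_sim = max(sims, key=lambda pair: pair[1])
    return best_id if best_sim > threshold else None
```

When a candidate id is returned, its second category label would also be assigned to the image to be recognized, as described above.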
  • the re-identification network in the embodiment of the present disclosure is obtained through two-stage training.
  • the first-stage training process is implemented according to at least one sample image and the first category label of each sample image
  • the second-stage training process is implemented according to the at least one sample image, the pseudo-label of each sample image, and the first category label; the pseudo-label of each sample image is determined based on the re-identification network after the first-stage training process, and the first category label represents the category of the corresponding image.
  • the sample images are images that have not been manually labeled.
  • Fig. 2 shows a flow chart of training a re-identification network according to an embodiment of the present disclosure.
  • the training process of the re-identification network in the embodiment of the present disclosure may include the following steps S40 to S80.
  • the electronic device that executes steps S40 to S80 may be an electronic device that executes the object re-identification method, or other electronic devices such as terminals or servers.
  • Step S40 determining at least one preset image including the object.
  • each preset image is obtained by capturing at least one object, and each preset image has at least one image frame for marking an area where the object is located and a first category label corresponding to each image frame.
  • each preset image has at least one image frame, which is used to mark the area where the object in the preset image is located.
  • the image frame can be annotated by any object annotation method.
  • the preset image may be input into a pre-trained object recognition model to recognize the position of each object included in the preset image and output at least one image frame representing the position of the object.
  • the first category label represents the category of the image region in the corresponding image frame, and can be determined according to the captured object. For example, when two people are captured by an image acquisition device to obtain a preset image, the position of each person in the preset image can be identified to obtain two corresponding image frames, and the image frames are assigned the first category labels Person 1 and Person 2, respectively.
  • At least one preset image may be determined through random sampling, that is, random sampling is performed on a set of preset images to obtain at least one preset image including an object.
  • the preset image set may be pre-stored in the electronic device for training the re-recognition network, or stored in other devices, and the electronic device for training the re-recognition network directly extracts at least one preset image from other electronic devices.
  • Fig. 3 shows a schematic diagram of a preset image according to an embodiment of the present disclosure.
  • the preset image 30 may include at least one object, and the preset image 30 also has an image frame for marking the position of the object.
  • the preset image 30 may have an image frame for representing the location of the face of at least one person.
  • the preset image 30 includes characters 1 and 2
  • the preset image 30 has a first image frame 31 representing the area where the face of character 1 is located, and a second image frame 32 representing the area where the face of character 2 is located.
  • the first category label corresponding to the first image frame 31 in the preset image 30 can be directly preset as character 1, and the first category label corresponding to the second image frame 32 as character 2.
  • Step S50 Determine at least one sample image corresponding to each preset image according to the corresponding at least one image frame.
  • At least one sample image corresponding to each preset image is determined according to an image frame corresponding to each preset image.
  • Each sample image is obtained by cropping a part of the preset image.
  • multiple sample images may be obtained by cutting out each image frame of the preset image.
  • at least one round of data augmentation may be performed on each preset image, and after each augmentation the area within at least one image frame may be cropped as a sample image. The augmentation may include translating the image frame, flipping the image frame, shrinking the image frame, and so on, so that the sample images cropped after each augmentation cover different regions of the object.
  • data preprocessing can be performed on each preset image before data augmentation. The preprocessing may include any processing methods such as format conversion, image brightness adjustment, and overall noise reduction, and at least one processing method may be pre-selected as required.
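A sketch of the augment-then-crop step described above, assuming simple translate / flip / shrink operations on a NumPy image array; the function name, shift offsets, and shrink ratio are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

def augmented_crops(image, box, shifts=((0, 0), (4, 0), (0, 4)), shrink=0.9):
    """Produce several sample crops from one annotated image frame by
    translating, flipping, and shrinking the frame before cropping.
    `image` is an H x W array; `box` is (top, left, bottom, right)."""
    h, w = image.shape[:2]
    top, left, bottom, right = box
    crops = []
    # translate the frame, clamping it to the image bounds
    for dy, dx in shifts:
        t = max(0, min(h - 1, top + dy))
        l = max(0, min(w - 1, left + dx))
        b = max(t + 1, min(h, bottom + dy))
        r = max(l + 1, min(w, right + dx))
        crops.append(image[t:b, l:r])
    # horizontally flip the original frame's content
    crops.append(np.fliplr(image[top:bottom, left:right]))
    # shrink the frame toward its centre for a tighter crop
    pad_y = int((bottom - top) * (1 - shrink) / 2)
    pad_x = int((right - left) * (1 - shrink) / 2)
    crops.append(image[top + pad_y:bottom - pad_y, left + pad_x:right - pad_x])
    return crops
```

Each augmented crop covers a slightly different region of the same object, which is the stated purpose of the augmentation step.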
  • Fig. 4 shows a schematic diagram of a sample image according to an embodiment of the present disclosure.
  • multiple sample images corresponding to at least one object are obtained by cropping the preset image 30 .
  • the preset image 30 includes character 1 and character 2
  • the image frame on the preset image 30 is the first image frame 31 representing the area where the face of character 1 is located, and the second image frame 32 representing the area where the face of character 2 is located
  • at least one first object sample image 33 corresponding to the person 1 in the preset image 30 and at least one second object sample image 34 corresponding to the person 2 in the preset image 30 may be determined.
  • the content in the image frame is used to obtain the corresponding first object sample image 33 and second object sample image 34 .
  • Fig. 5 shows a schematic diagram of determining a sample graph according to an embodiment of the present disclosure.
  • the preset image set 50 including at least one preset image can be determined first, and random sampling 51 is performed on the preset image set 50 to obtain at least one preset image 52 including an object.
  • the order of randomly sampling preset images from the preset image set and extracting sample images from the preset images can be changed; that is, the preset images can be randomly sampled first and the sample images extracted afterwards, or sample images can first be extracted for each preset image in the set and the random sampling performed afterwards.
  • the sequence of image preprocessing and data enhancement during sample image extraction can also be adjusted.
  • the embodiments of the present disclosure can obtain multiple sample images corresponding to each object through data augmentation, greatly expanding the number of sample images.
  • the image processing can also be performed in parallel by a GPU (Graphics Processing Unit), so as to shorten the image processing time and reduce unnecessary background noise.
  • by randomly selecting preset images, the embodiments of the present disclosure alleviate the problem that the training loss is difficult to calculate when there are too many sample categories; random sampling also makes the extracted preset images representative, so that they reflect the characteristics of the preset image set.
  • Step S60 performing a first-stage training on the re-identification network according to the sample image and the corresponding first category label.
  • the re-identification network may be trained in the first stage directly according to each sample image and the first category label.
  • the re-identification network can output a first predicted category for an input sample image, which represents the network's prediction of the category of the object in the sample image. Since each sample image includes only one object, the real image category of the sample image is also its real object category; losses can be calculated from the real image category and the real object category of the sample image, each against the first predicted category, to obtain the total re-identification network loss used for network adjustment.
  • manual labeling of sample images is not required before training the re-identification network: the first category label of the image frame corresponding to a sample image can be directly used as the first category label of that sample image. That is, the category of the area where each object is located in the preset image is used as the real image category of the sample image cropped from that area.
  • the actual second category label need not be annotated according to the object category in each sample image.
  • the first category label of each sample image is directly used as the second category label representing the object category, and the real object category in the sample image is corrected in the second-stage training process.
  • for example, assume the first category labels of the image frames are "Person 1", "Person 2" and "Person 3", and a sample image is extracted from each corresponding image frame.
  • in supervised training, the identity of each person would be identified in detail, and the corresponding second category labels could be annotated as "Zhang San", "Li Si" and "Wang Wu".
  • in the embodiments of the present disclosure, however, the second category labels are simply "Person 1", "Person 2" and "Person 3".
  • the first-stage training process of the re-identification network includes determining the first category label corresponding to each sample image as the second category label, then inputting each sample image into the re-identification network and outputting the corresponding first predicted category.
  • a first network loss is determined according to the first category label, the second category label and the first predicted category corresponding to each sample image, and the re-identification network is adjusted according to the first network loss.
  • the first loss may be determined according to the first category label and the first predicted category corresponding to each sample image, and the second loss may be determined according to the second category label and the first predicted category corresponding to each sample image. The first network loss is then determined according to the first loss and the second loss, and the re-identification network is adjusted according to the first network loss.
  • the first network loss can be obtained by calculating the weighted sum of the first loss and the second loss.
  • the first loss may be a triplet loss, and the second loss may be a cross-entropy classification loss. That is, the first loss can be obtained by calculating the triplet loss over the first category labels and first predicted categories of the sample images, and the second loss can be obtained by calculating the cross-entropy classification loss over the second category labels and first predicted categories of the sample images.
  • the triplet loss is inversely proportional to the distance between samples of the same object category and proportional to the distance between samples of different object categories.
  • the triplet loss can be reduced through network adjustment, drawing samples of the same object category closer together and pushing samples of different object categories farther apart.
  • the cross-entropy classification loss is inversely proportional to the distance between samples of the same image category, and can be reduced through network adjustment to draw samples of the same image category closer together.
  • the triplet loss and the cross-entropy classification loss can be calculated by the following Formulas 2 and 3, respectively:

    L_th = Σ_a [ α + max_p d(f_a, f_p) − min_n d(f_a, f_n) ]_+    (Formula 2)

    where L_th is the triplet loss; P × K is the total number of sample images over which the sum runs; a is any sample image; p is, among the sample images with the same first category label as a, the one whose feature vector is farthest from the feature vector of a in the feature space; n is, among the sample images whose first category label differs from that of a, the one whose feature vector is closest to the feature vector of a; d(f_a, f_p) is the distance between the corresponding feature vectors; [x]_+ denotes max(x, 0); and α is a preset correction parameter.

    L = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M} y_ic · log(p_ic)    (Formula 3)

    where L is the cross-entropy classification loss; N is the number of sample images; M is the number of second category labels; p_ic is the predicted probability that sample image i belongs to the first predicted category c; and y_ic takes the value 1 when the second category label of sample i is c, and 0 otherwise.
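Formulas 2 and 3 can be sketched in NumPy as follows, assuming a batch-hard reading of the triplet loss (farthest positive, nearest negative per anchor); the function names and the margin value are illustrative assumptions:

```python
import numpy as np

def triplet_loss(features, labels, margin=0.3):
    """Batch-hard triplet loss (Formula 2): for each anchor a, take the
    farthest positive p (same label) and the nearest negative n (different
    label). `margin` plays the role of the preset correction parameter α;
    0.3 is an illustrative value."""
    n = len(labels)
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    per_anchor = []
    for a in range(n):
        pos = dists[a][same[a] & (np.arange(n) != a)]
        neg = dists[a][~same[a]]
        if pos.size and neg.size:
            per_anchor.append(max(0.0, margin + pos.max() - neg.min()))
    return float(np.mean(per_anchor))

def cross_entropy_loss(probs, labels):
    """Cross-entropy classification loss (Formula 3): probs[i, c] is the
    predicted probability that sample i belongs to class c, and labels[i]
    is the second category label of sample i."""
    idx = np.arange(len(labels))
    return float(-np.mean(np.log(probs[idx, labels])))
```

The first network loss would then be a weighted sum of these two values, as described above.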
  • a first-stage adjustment may be performed on the re-identification network until the first network loss satisfies a first preset condition.
  • the first preset condition may be that the first network loss is smaller than a preset first threshold.
  • the re-identification network adjusted in the first stage can obtain a well-distributed feature space. That is to say, by adjusting the feature extraction layer of the re-identification network, the network can extract similar feature vectors for images of the same image category, and likewise for images of the same object category.
  • Fig. 6 shows a schematic diagram of a first-stage training process of a re-identification network according to an embodiment of the present disclosure.
  • in the first-stage training, each sample image 60 has a corresponding first category label 61 and second category label 62.
  • Each sample image 60 is input into the re-identification network 63 to obtain a first predicted category 64 , and a first loss 65 is calculated according to the first predicted category 64 and the first category label 61 of each sample image 60 .
  • the second loss 66 is calculated according to the first predicted category 64 and the second category label 62 of each sample image 60, and the re-identification network 63 is jointly adjusted according to the first loss 65 and the second loss 66.
  • the adjustment method may be to calculate the weighted sum of the first loss 65 and the second loss 66 to obtain the first network loss, and perform a first-stage adjustment on the re-identification network 63 until the first network loss meets the first preset condition.
  • Step S70 Determine the pseudo-label of the sample image according to the re-identification network whose training in the first stage is completed.
  • after the first-stage training, the re-identification network has a relatively well-distributed feature space, according to which the pseudo-label of each sample image can be determined. The pseudo-label of each sample image represents the category of the object in the sample image during the second-stage training. Pseudo-labels can be labels of any content, provided each pseudo-label uniquely represents one object category.
  • the pseudo-labels can be determined from the re-identification network trained in the first stage by inputting each sample image into the network and obtaining the feature vector extracted for each sample image.
  • the feature vectors of each sample image are clustered, and identification information uniquely corresponding to each cluster obtained after clustering is determined.
  • the identification information corresponding to each cluster is used as the pseudo-label of the sample image corresponding to each feature vector contained therein.
  • the clustering process can be realized based on the k-means clustering algorithm.
  • the unique identification information corresponding to each cluster can be preset or generated according to preset rules.
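  • The clustering step above can be sketched as follows; this minimal k-means implementation (a library routine such as sklearn.cluster.KMeans could equally be used, and the function name is an assumption) returns the cluster index of each sample as its pseudo-label:

```python
import numpy as np

def assign_pseudo_labels(features, k, n_iter=50, seed=0):
    """Cluster sample feature vectors with k-means; the index of the
    cluster each sample falls into serves as its pseudo-label."""
    rng = np.random.default_rng(seed)
    # initialise centres from k distinct samples
    centers = features[rng.choice(len(features), size=k, replace=False)]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iter):
        # assign every feature vector to its nearest cluster centre
        dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each centre to the mean of its assigned vectors
        for i in range(k):
            if (labels == i).any():
                centers[i] = features[labels == i].mean(axis=0)
    return labels
```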
  • Step S80 performing a second-stage training on the re-identification network obtained after the first-stage training according to the sample images and corresponding first category labels and pseudo-labels.
  • in the second-stage training process, the first category label corresponding to each sample image is used as the real image category, and the corresponding pseudo-label serves as the real object category.
  • the second-stage training of the re-identification network is performed based on the current real image category of each sample image, the real object category, and the sample image category predicted by the re-identification network. That is to say, each sample image can be input into the re-identification network obtained after the first-stage training to output the corresponding second predicted category.
  • a second network loss is determined according to the first class label, pseudo-label and second predicted class corresponding to each sample image, and the re-identification network is adjusted according to the second network loss.
  • the second-stage training process calculates losses from the real image category and the real object category of each sample image together with the second predicted category, and combines them into a total loss used to adjust the re-identification network. That is to say, adjusting the re-identification network in the second stage may include determining a third loss according to the first category label and the second predicted category corresponding to each sample image, and determining a fourth loss according to the pseudo-label and the second predicted category corresponding to each sample image. A second network loss is then determined from the third loss and the fourth loss, and the re-identification network is adjusted according to the second network loss. The second network loss may be obtained by calculating a weighted sum of the third loss and the fourth loss.
  • the third loss may be a triplet loss
  • the fourth loss may be a cross-entropy classification loss. That is, the third loss can be obtained by calculating the triplet loss over the first category label and the second predicted category of each sample image, and the fourth loss can be obtained by calculating the cross-entropy classification loss over the pseudo-label and the second predicted category of each sample image.
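  • The cross-entropy classification loss (the fourth loss) can be sketched as follows; here logits stand for the network's raw category scores, an illustrative assumption:

```python
import numpy as np

def cross_entropy_loss(logits, labels):
    """Mean cross-entropy between predicted category scores and
    integer ground-truth labels (e.g. the pseudo-labels)."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```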
  • the triplet loss is inversely proportional to the distance between samples of the same object category and proportional to the distance between samples of different object categories.
  • the triplet loss can be reduced by network conditioning, bringing the distance between samples of the same object category closer and the distance between samples of different object categories farther away.
  • the cross-entropy classification loss is inversely proportional to the distance between samples of the same image category, and the cross-entropy classification loss can be reduced through network adjustment to pull in the distance between samples of the same image category.
  • the calculation process of the third loss may be the same as that of the first loss
  • the calculation process of the fourth loss may be the same as that of the second loss.
  • the re-identification network can be adjusted in the second stage until the second network loss satisfies the second preset condition.
  • the second preset condition may be that the second network loss is smaller than a preset second threshold.
  • the re-identification network adjusted in the second stage can obtain a feature space with a more reasonable distribution. That is to say, by adjusting the feature extraction layer of the re-identification network, the network can more accurately extract similar feature vectors for images of the same image category, and likewise for images of the same object category.
  • Fig. 7 shows a schematic diagram of a second-stage training process of a re-identification network according to an embodiment of the present disclosure.
  • Each sample image 70 has a corresponding first category label 71 and pseudo-label 72.
  • Each sample image 70 is input into the re-identification network 73 to obtain the second predicted category 74 , and the third loss 75 is calculated according to the second predicted category 74 and the first category label 71 of each sample image 70 .
  • the fourth loss 76 is calculated according to the second predicted category 74 and the pseudo-label 72 of each sample image 70 , and the re-identification network 73 is jointly adjusted according to the third loss 75 and the fourth loss 76 .
  • the adjustment method may be to calculate the weighted sum of the third loss 75 and the fourth loss 76 to obtain the second network loss, and perform a second-stage adjustment on the re-identification network 73 until the second network loss meets the second preset condition.
  • a high-accuracy re-identification network can be obtained through unlabeled data training at low cost and quickly.
  • the re-identification network can accurately extract similar feature vectors for images of the same image category as well as for images of the same object category, thereby obtaining a reasonably distributed feature space.
  • through the two-stage training, the accuracy of the re-identification network is improved, and the image to be recognized can be accurately re-identified through the re-identification network to obtain an accurate re-identification result.
  • embodiments of the present disclosure further provide an object re-identification method, which is described in detail below:
  • the network input data include each sample image and its type, together with the bounding-box coordinates output by the detection network.
  • Each sample image has its own independent sample label, and the type of the sample image is called its category label. No manual annotation is provided for this label at the beginning; instead, some samples are randomly selected and their sample labels are used as category labels.
  • the data input by the network is divided into training data, query data, and gallery data.
  • each sample in the training data first undergoes specific data preprocessing, and data enhancement is applied using the defect bounding box output by the detection network to generate more equivalent data for each sample; these equivalent data share the sample label and category label of the original sample.
  • the algorithm feeds data and labels into a deep learning neural network, which is characterized by two loss functions.
  • the cross-entropy classification loss function uses the category labels (i.e., the randomly selected sample labels) as ground truth, and the triplet loss function uses the sample labels as ground truth.
  • after the first learning, a feature space is obtained; k-means clustering is performed on this feature space, all samples are assigned pseudo-labels accordingly, and the second learning is carried out.
  • the cross-entropy classification loss function uses the pseudo labels as ground truth, and the triplet loss function uses the sample labels as ground truth.
  • after the second learning, the feature space distribution is more comprehensive and reasonable. The query data and gallery data are preprocessed and input into the network, the similarity between the query data and the gallery data is calculated in the feature space, and this similarity is used to judge whether the query data belong to certain categories.
  • Data enhancement methods include translation, flipping, and reducing the proportion of the defect box within the crop so as to capture a larger field of view.
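  • These enhancements can be sketched as follows; the shift amount, flip flag, and expansion ratio are illustrative parameters (the disclosure does not fix their values), and the function name is an assumption:

```python
import numpy as np

def augment(image, box, shift=(4, 0), flip=True, expand=0.2):
    """Translate and flip the image, and enlarge the defect bounding
    box so the crop captures a larger field of view.
    `box` is (x1, y1, x2, y2) in pixel coordinates."""
    h, w = image.shape[:2]
    out = np.roll(image, shift, axis=(0, 1))  # translation (wrap-around)
    if flip:
        out = out[:, ::-1]                    # horizontal flip
    x1, y1, x2, y2 = box
    dx, dy = expand * (x2 - x1), expand * (y2 - y1)
    bigger = (max(0, x1 - dx), max(0, y1 - dy),
              min(w, x2 + dx), min(h, y2 + dy))
    return out, bigger
```

Each augmented copy would keep the sample label and category label of the original sample, as described above.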
  • a very important step in the algorithm is random sampling, in which the sample label of each sampled image is used as its category label for preliminary training.
  • This method avoids the cost of large-scale manual labeling, and alleviates the problem that an excessive number of category labels makes the cross-entropy classification loss difficult to compute.
  • the randomly selected sample labels are still representative, reflecting the appearance and characteristics of the data set.
  • the triplet loss function with hard sample mining takes, for each sample, the maximum distance to a positive sample (i.e., the hardest positive sample) and the minimum distance to a negative sample (i.e., the hardest negative sample) as the optimization target of the loss function, the goal being to reduce the distance between positive samples and increase the distance between negative samples so as to obtain a better feature space and ensure effective learning.
  • adding the background image (unblemished image) as a negative sample to the training sample can make the neural network better compare the difference between positive and negative samples, thereby improving the effect.
  • the triplet loss function is determined by the following formula four: L_tri = Σ_a max(0, max_{p∈P(a)} d(a, p) − min_{n∈N(a)} d(a, n) + α), where P(a) and N(a) are the positive and negative sample sets of anchor a, d(·,·) is the distance in feature space, and α is the margin.
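  • A sketch of the hard-mining triplet loss described above, operating on a batch of feature vectors (the margin value and function name are illustrative assumptions):

```python
import numpy as np

def batch_hard_triplet_loss(features, labels, margin=0.3):
    """For every anchor, take the farthest same-label sample (hardest
    positive) and the nearest different-label sample (hardest negative),
    and penalise when their gap is within the margin."""
    n = len(features)
    dists = np.linalg.norm(features[:, None] - features[None], axis=2)
    same = labels[:, None] == labels[None]
    idx = np.arange(n)
    losses = []
    for a in range(n):
        pos = dists[a][same[a] & (idx != a)]   # distances to positives
        neg = dists[a][~same[a]]               # distances to negatives
        if len(pos) and len(neg):
            losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.mean(losses)) if losses else 0.0
```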
  • k is an adjustable parameter defined by the user, and the number of categories can be further reduced according to the distribution of the data so that samples of the same category are drawn closer together. Moreover, after clustering, every sample can be assigned a pseudo-label, namely the label of the cluster to which it belongs. The samples that were not randomly drawn before are thereby added to the cross-entropy classification training, further optimizing the feature space distribution.
  • the k-means clustering optimization is determined by the following formula five: argmin_S Σ_{i=1}^{k} Σ_{x∈S_i} ‖x − μ_i‖², where S = {S_1, …, S_k} is the partition of all randomly sampled samples into k clusters, and μ_i is the mean of the samples in cluster i.
  • Stage 1 training uses randomly sampled sample labels as the true value of the cross-entropy loss function
  • stage 2 uses the pseudo-labels of all samples obtained after clustering as the true value of the cross-entropy loss function.
  • the number of labels obtained by random sampling and clustering can remain constant, or grow only slightly, as the data volume surges, ensuring that the computational cost of the cross-entropy does not increase sharply with the data.
  • After the model is trained, only the image to be detected and the image gallery need to be input into the network to obtain the matching result between the image to be detected and the samples in the gallery, from which it can be determined whether the image to be detected belongs to a certain type of sample and, if so, to which type.
  • the similarity is compared by calculating the cosine distance between samples in the feature space, and the similarity can be determined by the following formula six: cos(A, B) = (A · B) / (‖A‖ ‖B‖). If a gallery sample whose similarity with the query sample exceeds a preset threshold exists, the query sample is classified into that class; otherwise it is classified into a new class that does not belong to the gallery samples.
  • A and B represent the feature matrices of the samples.
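  • The similarity comparison can be sketched as follows; the threshold value and function names are illustrative assumptions, and a query that matches no gallery class is reported here as None:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_query(query, gallery, gallery_labels, threshold=0.8):
    """Assign the query to the class of its most similar gallery sample
    when that similarity exceeds the threshold; otherwise report a new
    class (returned as None)."""
    sims = [cosine_similarity(query, g) for g in gallery]
    best = int(np.argmax(sims))
    return gallery_labels[best] if sims[best] > threshold else None
```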
  • the object re-identification method in the embodiment of the present application can achieve the following technical effects:
  • This technique uses an unsupervised re-identification network to assist a classification network to learn to quickly classify images.
  • the network can re-identify and classify images misjudged as a certain category, improve the distribution of the feature space on the basis of the original classification, and refine the feature level, so that the network learns not merely a fuzzy category but the comparison and distribution between the samples of each category, which improves the original classification precision and recall.
  • the network uses rendering technology to process each image, and the data enhancement method greatly expands the sample size. Processing each image separately under such a mechanism is very time-consuming; rendering first and then cropping the box, together with GPU-accelerated optimization, can greatly shorten the image processing time and also reduce unnecessary background noise.
  • embodiments of the present disclosure also provide object re-identification apparatuses, electronic devices, computer-readable storage media, and programs, all of which can be used to implement any object re-identification method provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding records in the method sections.
  • Fig. 8 shows a schematic diagram of an object re-identification device according to an embodiment of the present disclosure.
  • the object re-identification apparatus of the embodiment of the present disclosure may include an image determination module 80 , a set determination module 81 and a re-identification module 82 .
  • An image determining module 80 configured to determine an image to be recognized including a target object
  • a set determination module 81 configured to determine an image set comprising at least one candidate image, each of which includes an object;
  • the re-identification module 82 is configured to input the image to be recognized and the image set into a re-identification network to obtain a re-identification result, and if there is a target candidate image in the image set, the re-identification result includes The target candidate image, the object included in the target candidate image matches the target object;
  • the re-identification network is obtained through two-stage training, the first-stage training process is implemented according to at least one sample image and the first category label of each of the sample images, and the second-stage training process is based on the at least one sample image,
  • and the pseudo-label and the first category label of each of the sample images; the pseudo-label of each of the sample images is determined based on the re-identification network after the first stage of the training process, and the first category label represents the category of the corresponding image.
  • each of the candidate images has a corresponding second category label, and the second category label represents the category of the object in the corresponding image;
  • the device also includes:
  • the label determination module is configured to determine the second category label corresponding to the target candidate image as the second category label of the image to be recognized.
  • the training process of the re-identification network includes:
  • determining at least one preset image including an object, each of the preset images having at least one image frame for marking the area where the object is located, and a first category label corresponding to each of the image frames;
  • the second-stage training is performed on the re-identification network obtained after the first-stage training according to the sample images and the corresponding first category labels and pseudo-labels.
  • the determining at least one preset image including an object includes:
  • Random sampling is performed on the set of preset images to obtain at least one preset image including an object.
  • the determining at least one sample image corresponding to each preset image according to the corresponding at least one image frame includes:
  • image preprocessing is performed on the preset images.
  • the first-stage training of the re-identification network according to the sample image and the corresponding first category label includes:
  • a first network loss is determined according to the first category label, the second category label and the first predicted category corresponding to each of the sample images, and the re-identification network is adjusted according to the first network loss.
  • the determining a first network loss according to the first category label, the second category label, and the first predicted category corresponding to each of the sample images, and adjusting the re-identification network according to the first network loss includes:
  • a first network loss is determined based on the first loss and the second loss, and the re-identification network is adjusted based on the first network loss.
  • the determining the pseudo-label of the sample image according to the re-identification network whose training is completed in the first stage includes:
  • Each of the sample images is input into the re-identification network that has been trained in the first stage to obtain a feature vector after feature extraction is performed on each of the sample images;
  • Clustering the feature vectors of each of the sample images, and determining identification information uniquely corresponding to each cluster obtained after clustering;
  • the identification information corresponding to each of the clusters is used as the pseudo-label of the sample image corresponding to each feature vector included in that cluster.
  • the clustering process is implemented based on a k-means clustering algorithm.
  • the second-stage training of the re-identification network obtained after the first-stage training according to the sample image and the corresponding first category label and pseudo-label includes:
  • a second network loss is determined according to the first class label, pseudo-label and second predicted class corresponding to each of the sample images, and the re-identification network is adjusted according to the second network loss.
  • the determining a second network loss according to the first category label, pseudo-label and second predicted category corresponding to each of the sample images, and adjusting the re-identification network according to the second network loss includes:
  • a second network loss is determined based on the third loss and the fourth loss, and the re-identification network is adjusted based on the second network loss.
  • the first loss and/or the third loss is a triplet loss
  • the second loss and/or the fourth loss is a cross-entropy classification loss
  • the re-identification module 82 includes:
  • the image input submodule is configured to input the image to be recognized and the image set into a re-identification network, and extract, through the re-identification network, the target object features of the image to be recognized and the candidate object features of each of the candidate images;
  • a similarity matching submodule configured to determine the similarity between each of the candidate images and the image to be recognized according to the characteristics of the target object and the characteristics of each of the candidate objects;
  • the result output submodule is configured to, in response to the similarity between the candidate image and the image to be recognized satisfying a preset condition, determine that the object in the candidate image matches the target object, and use the candidate image as the target candidate image Get re-identification results.
  • the preset condition includes that the similarity value is the largest and greater than a similarity threshold.
  • an electronic device including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to call the instructions stored in the memory, to perform the above method.
  • the functions or modules included in the apparatus provided by the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments, and for detailed implementation, refer to the descriptions of the above method embodiments.
  • Embodiments of the present disclosure also provide a computer-readable storage medium, on which computer program instructions are stored, and the above-mentioned method is implemented when the computer program instructions are executed by a processor.
  • Computer readable storage media may be volatile or nonvolatile computer readable storage media.
  • An embodiment of the present disclosure also proposes an electronic device, including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to execute the above method.
  • An embodiment of the present disclosure also provides a computer program product, including computer-readable codes, or a non-volatile computer-readable storage medium carrying computer-readable codes, when the computer-readable codes are stored in a processor of an electronic device When running in the electronic device, the processor in the electronic device executes the above method.
  • Electronic devices may be provided as terminals, servers, or other forms of devices.
  • FIG. 9 shows a schematic diagram of an electronic device 800 according to an embodiment of the present disclosure.
  • the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
  • electronic device 800 may include one or more of the following components: processing component 802, memory 804, power supply component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and a communication component 816 .
  • the processing component 802 generally controls the overall operations of the electronic device 800, such as those associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802 .
  • the memory 804 is configured to store various types of data to support operations at the electronic device 800 . Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like.
  • the memory 804 can be implemented by any type of volatile or non-volatile storage device or their combination, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • the power supply component 806 provides power to various components of the electronic device 800 .
  • Power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 800 .
  • the multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action.
  • multimedia component 808 includes a front camera and/or rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.
  • the audio component 810 is configured to output and/or input audio signals.
  • the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in operation modes, such as call mode, recording mode and voice recognition mode. Received audio signals may be further stored in memory 804 or sent via communication component 816 .
  • the audio component 810 also includes a speaker for outputting audio signals.
  • the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.
  • Sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of electronic device 800 .
  • the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor component 814 can also detect a change in position of the electronic device 800 or one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes of the electronic device 800.
  • Sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • Sensor assembly 814 may also include an optical sensor, such as a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications.
  • the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
  • the communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices.
  • the electronic device 800 can access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof.
  • the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the methods described above.
  • a non-volatile computer-readable storage medium such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to implement the above method.
  • FIG. 10 shows a schematic diagram of another electronic device 1900 according to an embodiment of the present disclosure.
  • electronic device 1900 may be provided as a server.
  • electronic device 1900 includes processing component 1922 , which further includes one or more processors, and memory resources represented by memory 1932 for storing instructions executable by processing component 1922 , such as application programs.
  • the application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions.
  • the processing component 1922 is configured to execute instructions to perform the above method.
  • Electronic device 1900 may also include a power supply component 1926 configured to perform power management of electronic device 1900, a wired or wireless network interface 1950 configured to connect electronic device 1900 to a network, and an input-output (I/O) interface 1958 .
  • the electronic device 1900 can operate based on the operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server TM ), the graphical user interface-based operating system (Mac OS X TM ) introduced by Apple Inc., and the multi-user and multi-process computer operating system (Unix TM ), a free and open source Unix-like operating system (Linux TM ), an open source Unix-like operating system (FreeBSD TM ), or the like.
  • a non-transitory computer-readable storage medium such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to implement the above method.
  • the present disclosure can be a system, method and/or computer program product.
  • a computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves having instructions recorded thereon, and any suitable combination of the above.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
  • Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet service provider).
  • an electronic circuit such as a programmable logic circuit, field programmable gate array (FPGA), or programmable logic array (PLA)
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium, and these instructions cause a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
  • the computer program product can be specifically realized by means of hardware, software or a combination thereof.
  • in an optional embodiment, the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present invention relate to an object re-identification method and apparatus, an electronic device, a storage medium, and a computer program product. The method comprises: determining an image to be identified that comprises a target object and an image set comprising a candidate image, each candidate image comprising at least one object; and inputting the image to be identified and the image set into a re-identification network to obtain a target candidate image comprising an object matched with the target object. The re-identification network is obtained by means of two-stage training, the first-stage training process is implemented according to a sample image and a corresponding first category label, and the second-stage training process is implemented according to the sample image, a corresponding pseudo label, and the first category label, the pseudo label of each sample image being determined according to the re-identification network after first training. According to the present invention, the performance of the re-identification network is improved by means of two-stage training, thereby improving the accuracy of an identification result.

Description

Object Re-identification Method and Apparatus, Electronic Device, Storage Medium, and Computer Program Product
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese Patent Application No. 202111601354.8, filed on December 24, 2021 and entitled "Object Re-identification Method and Apparatus, Electronic Device, and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to an object re-identification method and apparatus, an electronic device, a storage medium, and a computer program product.
Background
Re-identification technology is widely applied in various projects, such as the re-identification of people, vehicles, and objects. In real, open-world applications, new situations may arise at any time, producing data that has never been seen before. Traditional re-identification algorithms require large amounts of annotated samples for training, and when the dataset or domain shifts, samples from the new data or new domain must be re-annotated, which consumes substantial manpower and material resources. Meanwhile, unsupervised re-identification methods in the related art often produce low-accuracy re-identification results due to the influence of scenes and other factors.
Summary
Embodiments of the present disclosure provide an object re-identification method and apparatus, an electronic device, a storage medium, and a computer program product, aiming to improve the accuracy of re-identification results through a re-identification model obtained by unsupervised training.
According to a first aspect of the embodiments of the present disclosure, an object re-identification method is provided, including:
determining an image to be recognized that includes a target object;
determining an image set including at least one candidate image, each of the candidate images including an object;
inputting the image to be recognized and the image set into a re-identification network to obtain a re-identification result, where, in a case that a target candidate image exists in the image set, the re-identification result includes the target candidate image, and an object included in the target candidate image matches the target object;
wherein the re-identification network is obtained through two-stage training: the first-stage training process is implemented according to at least one sample image and a first category label of each sample image, and the second-stage training process is implemented according to the at least one sample image, a pseudo label of each sample image, and the first category label, the pseudo label of each sample image being determined based on the re-identification network obtained after the first-stage training process, and the first category label representing the category of the corresponding image.
According to a second aspect of the embodiments of the present disclosure, an object re-identification apparatus is provided, including:
an image determination module configured to determine an image to be recognized that includes a target object;
a set determination module configured to determine an image set including at least one candidate image, each of the candidate images including an object;
a re-identification module configured to input the image to be recognized and the image set into a re-identification network to obtain a re-identification result, where, in a case that a target candidate image exists in the image set, the re-identification result includes the target candidate image, and an object included in the target candidate image matches the target object;
wherein the re-identification network is obtained through two-stage training: the first-stage training process is implemented according to at least one sample image and a first category label of each sample image, and the second-stage training process is implemented according to the at least one sample image, a pseudo label of each sample image, and the first category label, the pseudo label of each sample image being determined based on the re-identification network obtained after the first-stage training process, and the first category label representing the category of the corresponding image.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to call the instructions stored in the memory to perform the above method.
According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored; when the computer program instructions are executed by a processor, the above method is implemented.
According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided. The computer program product includes a non-transitory computer-readable storage medium storing a computer program; when the computer program is read and executed by a computer, some or all of the steps of the methods described in the embodiments of the present disclosure are implemented. The computer program product may be a software installation package.
In the embodiments of the present disclosure, the performance of the re-identification network is improved through two-stage training, thereby improving the accuracy of recognition results.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Description of Drawings
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the embodiments of the present disclosure.
Fig. 1 shows a flowchart of an object re-identification method according to an embodiment of the present disclosure;
Fig. 2 shows a flowchart of training a re-identification network according to an embodiment of the present disclosure;
Fig. 3 shows a schematic diagram of a preset image according to an embodiment of the present disclosure;
Fig. 4 shows a schematic diagram of a sample image according to an embodiment of the present disclosure;
Fig. 5 shows a schematic diagram of determining a sample graph according to an embodiment of the present disclosure;
Fig. 6 shows a schematic diagram of the first-stage training process of a re-identification network according to an embodiment of the present disclosure;
Fig. 7 shows a schematic diagram of the second-stage training process of a re-identification network according to an embodiment of the present disclosure;
Fig. 8 shows a schematic diagram of an object re-identification apparatus according to an embodiment of the present disclosure;
Fig. 9 shows a schematic diagram of an electronic device according to an embodiment of the present disclosure;
Fig. 10 shows a schematic diagram of another electronic device according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features, and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings indicate elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as superior to or better than other embodiments.
The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
In addition, numerous details are given in the following specific embodiments in order to better illustrate the present disclosure. Those skilled in the art will understand that the present disclosure may be practiced without certain of these details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
In a possible implementation, the object re-identification method of the embodiments of the present disclosure may be executed by an electronic device such as a terminal device or a server. The terminal device may be any mobile or fixed terminal, such as user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device. The server may be a single server or a server cluster composed of multiple servers. Any electronic device can implement the object re-identification method of the embodiments of the present disclosure by having its processor call computer-readable instructions stored in a memory.
The object re-identification method of the embodiments of the present disclosure can be applied to re-identifying any object, such as people, vehicles, and animals. The re-identification method can search multiple images or video frame sequences for images or video frames that include a specific object, and can be applied to scenarios such as searching for a specific person among images collected by multiple cameras, or tracking objects such as pedestrians and vehicles.
Fig. 1 shows a flowchart of an object re-identification method according to an embodiment of the present disclosure. As shown in Fig. 1, the object re-identification method of the embodiment of the present disclosure may include the following steps S10 to S30.
Step S10: determining an image to be recognized that includes a target object.
In a possible implementation, the image to be recognized may be an image obtained by directly capturing the target object, or an image obtained by cropping, from a captured image, the region where the target object is located. The image to be recognized may be collected by an image acquisition device built into or connected to the electronic device, or received directly from another device. The target object may be any movable or immovable object, such as a person, an animal, a vehicle, or even furniture.
Step S20: determining an image set including at least one candidate image, each of which includes an object.
In a possible implementation, an image set used as the basis for re-identifying the image to be recognized is determined, which includes at least one candidate image for matching against the image to be recognized. Optionally, the image set may be pre-stored in the electronic device, or in a database connected to the electronic device. Each candidate image is obtained by capturing an object of the same kind as the target object, and may be an image obtained by directly capturing the object, or an image obtained by cropping the region where the object is located from a captured image. That is, the object in each candidate image is of the same kind as the target object. For example, when the target object is a person, the objects in the candidate images are also people; when the target object is a vehicle, the objects in the candidate images are also vehicles.
Optionally, each candidate image in the image set also has a corresponding second category label, which is used to represent the category of the object in the candidate image. For example, when the object in the candidate image is a person, the second category label may be identity information such as the person's name, phone number, or ID card number. When the object in the candidate image is a vehicle, the second category label may be the vehicle's license plate number, owner information, registration certificate number, and the like.
Step S30: inputting the image to be recognized and the image set into a re-identification network to obtain a re-identification result.
In a possible implementation, the image to be recognized and the image set are input into the re-identification network, which determines, among the multiple candidate images, a candidate image whose object matches the target object, and takes that candidate image as the target candidate image to obtain the re-identification result. That is, when a target candidate image whose object matches the target object exists, the re-identification result may include the target candidate image. Optionally, in addition to the target candidate image, the re-identification result may also include the category of the target object; that is, after the target candidate image is determined, the second category label corresponding to the target candidate image is also determined as the second category label of the image to be recognized.
In some embodiments, the detailed process of determining the re-identification result through the re-identification network may be as follows: the image to be recognized and the image set are input into the re-identification network, and the target object feature of the image to be recognized and the candidate object feature of each candidate image are extracted through the re-identification network; then the similarity between each candidate image and the image to be recognized is determined according to the target object feature and each candidate object feature. In response to the similarity between a candidate image and the image to be recognized satisfying a preset condition, it is determined that the object in the candidate image matches the target object, and the candidate image is taken as the target candidate image.
Optionally, when the image to be recognized is an image obtained by directly capturing the target object, the target object feature may be obtained by cropping the region where the target object is located in the image to be recognized and extracting the features of that region through the feature extraction layer of the re-identification network. Similarly, when a candidate image is an image obtained by directly capturing an object, the candidate object feature may also be obtained by cropping the region where the object is located in the candidate image and extracting the features of that region through the feature extraction layer of the re-identification network. The target object feature and each candidate object feature can be represented by vectors, and the similarity can be obtained by computing the distance between the two corresponding vectors in the feature space. The similarity can be calculated by the following Formula 1:
$$\mathrm{similarity}(A,B)=\frac{A\cdot B}{\|A\|\,\|B\|}=\frac{\sum_{i=1}^{n}A_{i}\times B_{i}}{\sqrt{\sum_{i=1}^{n}A_{i}^{2}}\times\sqrt{\sum_{i=1}^{n}B_{i}^{2}}}\qquad\text{(Formula 1)}$$
where similarity(A, B) is the similarity between A and B, A is the target object feature, B is the candidate object feature, n is the number of elements in the target object feature and the candidate object feature, and i denotes the position of the current element within the target object feature and the candidate object feature, i.e., which element the current element is.
In a possible implementation, the preset condition may be that the similarity value is the largest and greater than a similarity threshold; that is, the candidate image whose similarity value is the largest and exceeds the similarity threshold is determined as the target candidate image. In some embodiments, the second category label of the target candidate image is determined as the second category label of the image to be recognized, and a re-identification result including the target candidate image and the corresponding second category label is determined. Optionally, when no similarity value satisfies the preset condition, i.e., when there is no target candidate image whose object matches the target object, the category of the target object in the current image to be recognized may be determined to be a new category, and the re-identification result is determined to be the new category.
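The similarity computation of Formula 1 and the "largest value above a threshold" matching condition described above can be sketched as follows. This is a minimal illustration rather than the disclosed network implementation; the feature vectors, function names, and threshold value are assumptions for demonstration only.

```python
import math

def cosine_similarity(a, b):
    # Formula 1: sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_target(query_feature, candidate_features, threshold=0.8):
    """Return the index of the candidate whose similarity to the query is the
    largest and exceeds the threshold; otherwise return None, corresponding to
    treating the target object as a new category."""
    sims = [cosine_similarity(query_feature, f) for f in candidate_features]
    best = max(range(len(sims)), key=lambda i: sims[i])
    return best if sims[best] > threshold else None
```

In practice the features would come from the re-identification network's feature extraction layer; here plain lists of floats stand in for those vectors.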
In a possible implementation, the re-identification network of the embodiments of the present disclosure is obtained through two-stage training. The first-stage training process is implemented according to at least one sample image and the first category label of each sample image; the second-stage training process is implemented according to the at least one sample image, the pseudo label of each sample image, and the first category label. The pseudo label of each sample image is determined based on the re-identification network obtained after the first-stage training process, and the first category label represents the category of the corresponding image. The sample images are sample images that have not been manually annotated.
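The text above does not yet specify how pseudo labels are derived from the stage-one network. A common approach in unsupervised re-identification is to extract features of the unlabeled sample images with the trained network and cluster them, using cluster indices as pseudo labels; the greedy cosine-similarity clustering below is a sketch of that idea, and both the clustering scheme and the threshold are assumptions, not the disclosed method.

```python
import math

def _cosine(a, b):
    # Same similarity measure as Formula 1.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def assign_pseudo_labels(features, sim_threshold=0.9):
    """Greedy clustering sketch: each feature (as extracted by the stage-one
    network) joins the most similar existing cluster if its similarity to that
    cluster's first member exceeds the threshold; otherwise it starts a new
    cluster. The cluster index serves as the pseudo label for stage two."""
    anchors, labels = [], []
    for f in features:
        best, best_sim = None, sim_threshold
        for idx, anchor in enumerate(anchors):
            s = _cosine(f, anchor)
            if s >= best_sim:
                best, best_sim = idx, s
        if best is None:
            anchors.append(list(f))       # first member of a new cluster
            labels.append(len(anchors) - 1)
        else:
            labels.append(best)
    return labels
```

Real systems typically use a more robust clustering algorithm (e.g., density-based clustering) over the whole feature set; the greedy version keeps the sketch self-contained.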
图2示出根据本公开实施例的一种训练重识别网络的流程图。如图2所示,本公开实施例重识别网络的训练过程可以包括以下步骤S40至步骤S80。可选地,执行步骤S40至S80的电子设备可以为执行对象重识别方法的电子设备,或者为其他的终端或服务器等电子设备。Fig. 2 shows a flow chart of training a re-identification network according to an embodiment of the present disclosure. As shown in FIG. 2 , the training process of the re-identification network in the embodiment of the present disclosure may include the following steps S40 to S80. Optionally, the electronic device that executes steps S40 to S80 may be an electronic device that executes the object re-identification method, or other electronic devices such as terminals or servers.
步骤S40、确定至少一个包括对象的预设图像。Step S40, determining at least one preset image including the object.
在一种可能的实现方式中,每个预设图像通过采集至少一个对象得到,每个预设图像具有至少一个用于标注对象所在区域的图像框和每个图像框对应的第一类别标签。其中,每个预设图像具有至少一个图像框,用于标注预设图像中对象所在的区域。该图像框可以通过任意对象标注方式标注得到。例如,可以将预设图像输入预先训练得到的对象识别模型,以识别预设图像中包括的对象位置,输出至少一个表征对象位置的图像框。第一类别标签表征对应图像框内图像区域的类别,可以根据采集的对象确定。例如,当通过图像采集装置采集两个人物得到预设图像时,可以识别预设图像中每个人物所在位置得到两个对应的图像框,并为每个图像框分配对应的第一类别标签为人物1和人物2。In a possible implementation manner, each preset image is obtained by capturing at least one object, and each preset image has at least one image frame for marking an area where the object is located and a first category label corresponding to each image frame. Wherein, each preset image has at least one image frame, which is used to mark the area where the object in the preset image is located. The image frame can be annotated by any object annotation method. For example, the pre-trained object recognition model may be input into the pre-trained image to recognize the position of the object included in the pre-set image, and output at least one image frame representing the position of the object. The first category label represents the category of the image region in the corresponding image frame, and can be determined according to the collected object. For example, when two characters are collected by an image acquisition device to obtain a preset image, the position of each character in the preset image can be identified to obtain two corresponding image frames, and each image frame is assigned a corresponding first category label as Person 1 and Person 2.
可选地,至少一个预设图像可以通过随机抽样确定,即对预设图像集合进行随机抽样得到至少一个包括对象的预设图像。其中,预设图像集合可以预先存储在训练重识别网络的电子设备中,或存储在其他设备中,由训练重识别网络的电子设备在其他电子设备中直接抽取至少一个预设图像。Optionally, at least one preset image may be determined through random sampling, that is, random sampling is performed on a set of preset images to obtain at least one preset image including an object. Wherein, the preset image set may be pre-stored in the electronic device for training the re-recognition network, or stored in other devices, and the electronic device for training the re-recognition network directly extracts at least one preset image from other electronic devices.
Fig. 3 shows a schematic diagram of a preset image according to an embodiment of the present disclosure. As shown in Fig. 3, the preset image 30 may include at least one object, and the preset image 30 also has image frames for marking object positions. For example, when the preset image 30 is obtained by capturing at least one person, the preset image 30 may have an image frame representing the location of each person's face. When the preset image 30 includes Person 1 and Person 2, the preset image 30 has a first image frame 31 representing the region where the face of Person 1 is located, and a second image frame 32 representing the region where the face of Person 2 is located. Optionally, since the preset image 30 is obtained by capturing two persons, the first category label corresponding to the first image frame 31 can be directly preset as Person 1, and the first category label corresponding to the second image frame 32 as Person 2, while the positions of the two persons are being marked.
Step S50: Determine at least one sample image corresponding to each preset image according to the corresponding at least one image frame.
In a possible implementation, after the at least one preset image is determined, at least one sample image corresponding to each preset image is determined according to the image frames of that preset image. Each sample image is obtained by cropping a partial region of a preset image, and multiple sample images may be cropped from each image frame of a preset image. Optionally, data augmentation may be applied to each preset image at least once, and after each augmentation the region inside at least one image frame is cropped out as a sample image. The augmentation may include translating the image frame, flipping it, shrinking its scale, and so on, so that the sample images cropped after each augmentation cover different regions of the object.
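As an illustration, the frame-level augmentation described above (translate the frame, flip the crop, shrink the frame, then cut out the region) can be sketched as follows; the jitter magnitudes, function names, and array-based image representation are assumptions for illustration, not details from the disclosure:

```python
import random
import numpy as np

def jitter_box(box, img_w, img_h, max_shift=10, shrink=0.9):
    """Randomly translate and shrink an image frame (x1, y1, x2, y2),
    clamped to the image bounds. Shift range and shrink factor are assumed."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 + random.randint(-max_shift, max_shift)
    cy = (y1 + y2) / 2 + random.randint(-max_shift, max_shift)
    w, h = (x2 - x1) * shrink, (y2 - y1) * shrink
    nx1, ny1 = max(0, int(cx - w / 2)), max(0, int(cy - h / 2))
    nx2, ny2 = min(img_w, int(cx + w / 2)), min(img_h, int(cy + h / 2))
    return nx1, ny1, nx2, ny2

def crop_samples(image, box, n_augment=4):
    """Crop one sample image per augmentation round from an H x W x C array."""
    h, w = image.shape[:2]
    samples = [image[box[1]:box[3], box[0]:box[2]]]   # original frame content
    for _ in range(n_augment):
        x1, y1, x2, y2 = jitter_box(box, w, h)
        crop = image[y1:y2, x1:x2]
        if random.random() < 0.5:
            crop = crop[:, ::-1]                      # horizontal flip
        samples.append(crop)
    return samples
```

Each call yields the original frame content plus several jittered crops that share the frame's first category label.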
In some embodiments, because different preset images differ in format and attributes, data preprocessing may be performed on each preset image before data augmentation, so that the resulting sample images conform to the format required for training the re-identification network. The preprocessing may include any processing such as format conversion, image brightness adjustment, and overall denoising, and at least one processing method may be pre-selected for the preprocessing as required.
Fig. 4 shows a schematic diagram of sample images according to an embodiment of the present disclosure. In a possible implementation, after the preset image 30 is determined, multiple sample images corresponding to at least one object are obtained by cropping the preset image 30. When the preset image 30 includes Person 1 and Person 2, and the image frames on the preset image 30 are the first image frame 31 representing the region where the face of Person 1 is located and the second image frame 32 representing the region where the face of Person 2 is located, at least one first object sample image 33 corresponding to Person 1 and at least one second object sample image 34 corresponding to Person 2 can be determined in the preset image 30.
Optionally, before the sample images are extracted, data preprocessing is first performed on the preset image 30, and operations such as translation, flipping, and scaling are then applied to the first image frame 31 and the second image frame 32 respectively; after each operation, the content inside the image frame is cropped out, yielding the corresponding first object sample images 33 and second object sample images 34.
Fig. 5 shows a schematic diagram of determining sample images according to an embodiment of the present disclosure. As shown in Fig. 5, when determining the sample images used for training the re-identification network, an embodiment of the present disclosure may first determine a preset image set 50 including at least one preset image, and perform random sampling 51 on the preset image set 50 to obtain at least one preset image 52 that includes an object. Image preprocessing 53 and data augmentation 54 are performed on the preset images 52 in sequence, and the region inside the image frame of each preset image 52 is cropped out to obtain the sample images 55.
In a possible implementation, the order of randomly sampling preset images from the preset image set and extracting sample images from the preset images can be swapped: the preset images may be randomly sampled first and the sample images extracted afterwards, or sample images may first be extracted from every preset image in the set and then randomly sampled. Optionally, the order of image preprocessing and data augmentation during sample-image extraction can also be adjusted.
Based on the above, the embodiments of the present disclosure can obtain multiple sample images for each object through data augmentation, greatly expanding the number of sample images. In some embodiments, image processing can also be performed in parallel on a GPU (Graphics Processing Unit), reducing image-processing time while removing unnecessary background noise. Meanwhile, determining the sample images by randomly sampling preset images alleviates the problem that too many sample categories make the training loss hard to compute, and random sampling keeps the sampled preset images representative, so that they reflect the characteristics of the preset image set.
Step S60: Perform first-stage training on the re-identification network according to the sample images and the corresponding first category labels.
In a possible implementation, after the multiple sample images are determined, the first-stage training of the re-identification network can be performed directly according to each sample image and its first category label. During training, the re-identification network outputs a first predicted category for each input sample image, representing the network's prediction of the category of the object in that sample image. Since a sample image includes only one object, the true image category of the sample image is also the true object category; losses can be computed between the first predicted category and the true image category and the true object category respectively, and combined into a total network loss used to adjust the network.
Optionally, to improve training efficiency and reduce manual annotation cost, no manual annotation of the sample images is required before training: the first category label of the image frame corresponding to a sample image can be used directly as the first category label of that sample image, that is, the category of the region where each object is located in the preset image is used as the true image category of the sample image cropped from that region. In some embodiments, since most preset images contain only one object and the remaining preset images contain only a small number of objects, an actual second category label need not be annotated according to the object category in each sample image in order to improve annotation efficiency. During the first-stage training of the re-identification network, the first category label of each sample image is used directly as the second category label representing its object category, and the true object categories of the sample images are corrected during the second-stage training.
For example, when a preset image has image frames for three person objects whose first category labels are "Person 1", "Person 2", and "Person 3", the sample images inside each image frame are extracted separately. Manually annotating these sample images would require identifying each person in detail, e.g. annotating the corresponding second category labels as "Zhang San", "Li Si", and "Wang Wu". To save annotation time and improve the efficiency of training the re-identification network, the identity of the person in each sample image need not be recognized; instead, the second category labels of the sample images corresponding to each image frame are quickly set to "Person 1", "Person 2", and "Person 3" by inheriting the first category labels.
Based on the above way of determining the second category labels, the first-stage training of the re-identification network includes determining the first category label of each sample image as its second category label, then inputting each sample image into the re-identification network and outputting the corresponding first predicted category. A first network loss is determined according to the first category label, second category label, and first predicted category of each sample image, and the re-identification network is adjusted according to the first network loss. Specifically, a first loss is determined according to the first category label and first predicted category of each sample image, a second loss is determined according to the second category label and first predicted category of each sample image, and the first network loss is then determined from the first loss and the second loss and used to adjust the re-identification network. The first network loss may be obtained by computing a weighted sum of the first loss and the second loss.
In a possible implementation, the first loss may be a triplet loss and the second loss may be a cross-entropy classification loss. That is, the first loss can be obtained by computing the triplet loss between the first category label and the first predicted category of each sample image, and the second loss by computing the cross-entropy classification loss between the second category label and the first predicted category of each sample image. The triplet loss grows with the distance between samples of the same object category and shrinks as the distance between samples of different object categories grows; reducing it through network adjustment therefore pulls samples of the same object category closer together and pushes samples of different object categories further apart. The cross-entropy classification loss grows with the distance between samples of the same image category; reducing it through network adjustment pulls samples of the same image category closer together.
Optionally, the triplet loss and the cross-entropy classification loss can be computed by the following Formula 2 and Formula 3, respectively:
$$L_{th}=\sum_{i=1}^{P}\sum_{a=1}^{K}\left[\alpha+\max_{p=1\ldots K} d\big(f_a^i,f_p^i\big)-\min_{\substack{j=1\ldots P\\ j\neq i}}\min_{n=1\ldots K} d\big(f_a^i,f_n^j\big)\right]_{+} \quad\text{(Formula 2)}$$

$$L=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{ic}\log\big(p_{ic}\big) \quad\text{(Formula 3)}$$
In Formula 2, the triplet loss is $L_{th}$; $P\times K$ is the total number of sample images; $a$ is any sample image; $p$ is, among the sample images with the same first category label as $a$, the one whose feature vector is farthest from the feature vector of $a$ in the feature space; $n$ is, among the sample images with a first category label different from that of $a$, the one whose feature vector is closest to the feature vector of $a$ in the feature space; and $\alpha$ is a preset margin parameter. In Formula 3, the cross-entropy classification loss is $L$; $N$ is the number of sample images; $M$ is the number of second category labels; $p_{ic}$ is the predicted probability that sample image $i$ belongs to first predicted category $c$; and $y_{ic}$ is 1 when the second category label of sample $i$ is $c$, and 0 otherwise.
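To make Formulas 2 and 3 concrete, the NumPy sketch below computes the batch-hard triplet loss and the cross-entropy classification loss from a batch of feature vectors and classifier logits, and combines them into a first network loss; the loss weights `w1` and `w2` are illustrative assumptions, since the disclosure does not fix the weighting:

```python
import numpy as np

def triplet_loss(features, labels, alpha=0.3):
    """Formula 2: for each anchor a, use the farthest same-label sample
    (hardest positive) and the closest different-label sample (hardest negative)."""
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    total = 0.0
    for a in range(len(labels)):
        hardest_pos = dist[a][same[a]].max()
        hardest_neg = dist[a][~same[a]].min()
        total += max(0.0, alpha + hardest_pos - hardest_neg)   # hinge
    return total

def cross_entropy_loss(logits, targets):
    """Formula 3: mean negative log-probability of each sample's true category."""
    z = logits - logits.max(axis=1, keepdims=True)             # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(targets)), targets].mean()

def first_network_loss(features, logits, labels, w1=1.0, w2=1.0):
    """Weighted sum of the two losses (weights are illustrative)."""
    return w1 * triplet_loss(features, labels) + w2 * cross_entropy_loss(logits, labels)
```

When positives of each label lie close together and far from the other labels, the hinge in the triplet term is inactive and only the classification term contributes.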
After the first network loss is determined, the re-identification network can undergo first-stage adjustment until the first network loss satisfies a first preset condition, which may be that the first network loss is smaller than a preset first threshold. Given the properties of the triplet loss and the cross-entropy classification loss, the re-identification network after the first-stage adjustment obtains a reasonably distributed feature space. That is, by adjusting the feature extraction layers of the re-identification network, the network can extract similar feature vectors for images of the same image category, and likewise extract similar feature vectors for images of the same object category.
Fig. 6 shows a schematic diagram of the first-stage training process of the re-identification network according to an embodiment of the present disclosure. As shown in Fig. 6, after the sample images 60 are determined, the first category label 61 and the second category label 62 of each sample image 60 are determined from the first category label 61 acquired when the preset image corresponding to the sample image 60 was captured. Each sample image 60 is input into the re-identification network 63 to obtain a first predicted category 64, and a first loss 65 is computed from the first predicted category 64 and the first category label 61 of each sample image 60. Meanwhile, a second loss 66 is computed from the first predicted category 64 and the second category label 62 of each sample image 60, and the re-identification network 63 is jointly adjusted according to the first loss 65 and the second loss 66. Optionally, the adjustment computes a weighted sum of the first loss 65 and the second loss 66 to obtain the first network loss, and performs first-stage adjustment on the re-identification network 63 until the first network loss satisfies the first preset condition.
Step S70: Determine pseudo-labels of the sample images according to the re-identification network after the first-stage training is completed.
In a possible implementation, after the first-stage training of the re-identification network, the pseudo-label of each sample image can be determined according to the reasonably distributed feature space obtained by that training. The pseudo-label of each sample image is used to represent the category of the object in that sample image during the second-stage training. A pseudo-label can be a label with any content, and each pseudo-label uniquely represents one class of objects.
Optionally, the pseudo-labels may be determined from the first-stage-trained re-identification network as follows: input each sample image into the re-identification network after the first-stage training to obtain the feature vector extracted from each sample image; cluster the feature vectors of all sample images, and determine identification information uniquely corresponding to each cluster obtained; then use the identification information of each cluster as the pseudo-label of the sample images whose feature vectors it contains. The clustering may be implemented with the k-means clustering algorithm, and the unique identification information of each cluster may be preset or generated according to a preset rule.
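A minimal sketch of this pseudo-labeling step, assuming the feature vectors have already been extracted by the first-stage network; the plain NumPy k-means below stands in for any k-means implementation, with the cluster index serving as the pseudo-label:

```python
import numpy as np

def kmeans_pseudo_labels(features, k, n_iter=50, seed=0):
    """Cluster feature vectors with k-means and return one pseudo-label
    (the cluster index) per sample."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each feature to its nearest cluster center
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned features
        for i in range(k):
            if np.any(labels == i):
                centers[i] = features[labels == i].mean(axis=0)
    return labels
```

The choice of k is user-defined, as the disclosure notes; samples whose feature vectors fall in the same cluster receive the same pseudo-label.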
Step S80: Perform second-stage training on the re-identification network obtained after the first-stage training, according to the sample images and the corresponding first category labels and pseudo-labels.
In a possible implementation, after the pseudo-label of each sample image is obtained, the first category label used for each sample image in the first-stage training is taken as the true image category, and the pseudo-label of each sample image is taken as the true object category. In some embodiments, the second-stage training of the re-identification network is performed based on the current true image category and true object category of each sample image and the sample-image category predicted by the network. That is, each sample image is input into the re-identification network obtained after the first-stage training, and the corresponding second predicted category is output. A second network loss is determined according to the first category label, pseudo-label, and second predicted category of each sample image, and the re-identification network is adjusted according to the second network loss.
Optionally, as in the first-stage training, the second-stage training computes losses between the second predicted category and, respectively, the true image category and the true object category of each sample image, yielding a total network loss used to adjust the network. That is, the second-stage adjustment may include determining a third loss according to the first category label and second predicted category of each sample image, and a fourth loss according to the pseudo-label and second predicted category of each sample image; the second network loss is then determined from the third loss and the fourth loss and used to adjust the re-identification network. The second network loss may be obtained by computing a weighted sum of the third loss and the fourth loss.
In a possible implementation, the third loss may be a triplet loss and the fourth loss may be a cross-entropy classification loss. That is, the third loss can be obtained by computing the triplet loss between the first category label and the second predicted category of each sample image, and the fourth loss by computing the cross-entropy classification loss between the pseudo-label and the second predicted category of each sample image. As before, reducing the triplet loss through network adjustment pulls samples of the same object category closer together and pushes samples of different object categories further apart, and reducing the cross-entropy classification loss pulls samples of the same image category closer together. Optionally, the third loss is computed in the same way as the first loss, and the fourth loss in the same way as the second loss.
After the second network loss is determined by computing the weighted sum of the third loss and the fourth loss, the re-identification network can undergo second-stage adjustment until the second network loss satisfies a second preset condition, which may be that the second network loss is smaller than a preset second threshold. Given the properties of the triplet loss and the cross-entropy classification loss, the network after the second-stage adjustment obtains an even more reasonably distributed feature space. That is, by adjusting the feature extraction layers, the re-identification network can more accurately extract similar feature vectors for images of the same image category, and likewise more accurately extract similar feature vectors for images of the same object category.
Fig. 7 shows a schematic diagram of the second-stage training process of the re-identification network according to an embodiment of the present disclosure. As shown in Fig. 7, after the sample images 70 are determined, the first category label 71 and the pseudo-label 72 of each sample image 70 are determined, the first category label 71 being the one acquired when the corresponding preset image was captured. Each sample image 70 is input into the re-identification network 73 to obtain a second predicted category 74, and a third loss 75 is computed from the second predicted category 74 and the first category label 71 of each sample image 70. Meanwhile, a fourth loss 76 is computed from the second predicted category 74 and the pseudo-label 72 of each sample image 70, and the re-identification network 73 is jointly adjusted according to the third loss 75 and the fourth loss 76. Optionally, the adjustment computes a weighted sum of the third loss 75 and the fourth loss 76 to obtain the second network loss, and performs second-stage adjustment on the re-identification network 73 until the second network loss satisfies the second preset condition.
With this training method, a high-accuracy re-identification network can be trained quickly and at low cost from unlabeled data. The network can accurately extract similar feature vectors for images of the same image category, and likewise for images of the same object category, yielding a reasonably distributed feature space. In some embodiments, the two-stage training improves the accuracy of the re-identification network, so that images to be recognized can be accurately re-identified and accurate re-identification results obtained.
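The two-stage schedule described above can be summarized in a short driver sketch; `train_step`, `extract`, and `cluster` are hypothetical callables standing in for one training pass of the network, the feature extractor, and the clustering step, none of which are named in the disclosure:

```python
def two_stage_training(samples, frame_labels, train_step, extract, cluster):
    """Two-stage schedule: stage 1 trains with the frame (first category) labels
    standing in for object labels; stage 2 re-trains with clustering pseudo-labels.
    train_step(samples, image_labels, object_labels) runs one training pass,
    extract(samples) returns feature vectors, and cluster(features) returns
    one pseudo-label per sample (all hypothetical callables)."""
    # Stage 1: the first category label doubles as the second category label.
    train_step(samples, frame_labels, frame_labels)
    # Pseudo-labels from the stage-1 feature space.
    pseudo = cluster(extract(samples))
    # Stage 2: frame labels as image categories, pseudo-labels as object categories.
    train_step(samples, frame_labels, pseudo)
    return pseudo
```

The driver makes the label-flow explicit: only the second (object-category) labels change between the two stages.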
Based on the foregoing embodiments, an embodiment of the present disclosure further provides an object re-identification method, described in detail below:
The network input data includes the sample images and their types, as well as the bounding-box coordinates output by a detection network. Each sample image has its own independent sample label, while the type a sample image belongs to is called its category label; this label initially requires no manual annotation, as some samples are randomly selected and their sample labels are used as category labels. The network input data is divided into training data, query data, and gallery data.
In the re-identification pipeline, each sample in the training data first undergoes specific data preprocessing, and more equivalent data is generated for each sample by applying data augmentation to the defect bounding boxes output by the detection network; this equivalent data shares a common sample label and a common category label with the original sample. In the first step, the algorithm feeds the data and labels into a deep-learning neural network characterized by two loss functions: the cross-entropy classification loss function uses the randomly selected sample labels, serving as category labels, as ground truth, while the triplet loss function uses the sample labels as ground truth. After this training, a feature space is obtained; k-means clustering is performed on this feature space, all samples are assigned pseudo-labels accordingly, and a second round of learning is carried out. In the second round, the cross-entropy classification loss function uses the pseudo-labels as ground truth, and the triplet loss function uses the sample labels as ground truth. After the second round, the feature-space distribution is more comprehensive and reasonable. The query data and gallery data are preprocessed and input into the network, the similarity between the query data and the gallery data in the feature space is computed, and this similarity is used to judge whether the query data belongs to certain categories.
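As a sketch of the final retrieval step, the snippet below compares query features against gallery features by cosine similarity in the learned feature space; the similarity threshold and the decision to return the single best match are assumptions for illustration:

```python
import numpy as np

def retrieve(query_feats, gallery_feats, gallery_labels, threshold=0.5):
    """Return, for each query feature, the label of the most similar gallery
    feature, or None when the best cosine similarity is below the threshold."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = q @ g.T                       # cosine similarity matrix
    best = sim.argmax(axis=1)
    results = []
    for i, j in enumerate(best):
        results.append(gallery_labels[j] if sim[i, j] >= threshold else None)
    return results
```

A query whose best similarity falls below the threshold is treated as belonging to none of the gallery categories.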
(1) Image preprocessing
Uniform image preprocessing is applied to each image, and frame cropping and data augmentation are then performed on the processed image, which greatly improves image-processing speed. The augmentation methods include translating the frame up, down, left, and right, flipping it, and shrinking the defect-frame scale so as to crop a larger field of view.
(2) Random sampling
A key step of the algorithm is random sampling, using the sample labels of the sampled items as category labels for preliminary training. This frees the samples from the cost of large-scale manual annotation and alleviates the difficulty of computing the cross-entropy classification loss when there are too many category labels. Meanwhile, the randomly selected sample labels remain representative, reflecting the appearance and characteristics of the data set.
(3) Cross-entropy classification loss function
Using the category labels rather than the sample labels as the ground truth of the cross-entropy classification loss function lets the model learn quickly when there are a large number of samples, and brings samples with the same category label closer to one another.
(4) Triplet loss
A triplet loss with hard-sample mining is adopted: for each sample, the maximum distance to a positive sample (the hardest positive) and the minimum distance to a negative sample (the hardest negative) are taken as the optimization targets of the loss, so that distances between positive samples keep shrinking while distances to negative samples keep growing, yielding a better feature space and guaranteeing effective learning. In addition, background (defect-free) images are added to the training set as negative samples, which helps the neural network contrast the differences between positive and negative samples and thus improves performance. The triplet loss is determined by the following Formula 4:
$$\mathcal{L}_{\text{tri}} = \sum_{a}\Big[\max_{p \in \mathcal{P}(a)} d(f_a, f_p) \;-\; \min_{n \in \mathcal{N}(a)} d(f_a, f_n) + \alpha\Big]_{+} \qquad \text{(Formula 4)}$$
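A numpy sketch of batch-hard triplet mining as described for Formula 4 — hardest positive and hardest negative per anchor inside one batch. The margin name `alpha`, the distance choice (Euclidean) and the function names are illustrative assumptions:

```python
import numpy as np

def batch_hard_triplet_loss(features, labels, alpha=0.3):
    """Triplet loss with hard-sample mining.

    For each anchor, take the farthest positive and the nearest
    negative in the batch; the hinge keeps positives closer than
    negatives by at least the margin alpha.
    """
    # pairwise Euclidean distances between all features in the batch
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)

    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(labels)):
        pos = dist[i][same[i] & (np.arange(len(labels)) != i)]
        neg = dist[i][~same[i]]
        if len(pos) == 0 or len(neg) == 0:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, pos.max() - neg.min() + alpha))
    return float(np.mean(losses))
```

Adding defect-free background crops to the batch as negatives, as the text suggests, simply means they enter `features` with a dedicated background label.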
(5) Clustering
k-means clustering is performed in the feature space of the model obtained after the first training stage. k is a user-defined, adjustable parameter; according to the data distribution, it can further reduce the number of categories so that samples of the same category move closer together. After clustering, every sample can be assigned a pseudo-label, namely the label of the cluster it belongs to. Samples that were not drawn during the earlier random sampling are thereby added to the cross-entropy classification training, further optimizing the feature-space distribution. The k-means objective is determined by the following Formula 5, where S is the set of all randomly sampled samples and μ_i is the mean of cluster i.
$$\underset{S}{\arg\min}\;\sum_{i=1}^{k}\sum_{x \in S_i}\left\lVert x - \mu_i\right\rVert^{2} \qquad \text{(Formula 5)}$$
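The pseudo-labeling step can be sketched with a plain Lloyd's-algorithm k-means over the extracted feature vectors; cluster indices serve directly as pseudo-labels. The farthest-point initialization is an assumption made to keep the sketch deterministic, not a detail from the application:

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=50):
    """Assign each sample a pseudo-label = the index of its k-means cluster.

    Minimizes the Formula 5 objective: sum over clusters i of
    ||x - mu_i||^2 for x in cluster S_i.
    """
    # farthest-point initialization (deterministic for this sketch)
    centers = [features[0]]
    for _ in range(1, k):
        d = np.min([((features - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    centers = np.array(centers, dtype=float)

    for _ in range(iters):
        # assignment step: each sample goes to its nearest center
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # update step: recompute each cluster mean
        new_centers = np.array([
            features[labels == i].mean(0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```

In the second training stage these returned labels would replace the sampled category labels as the cross-entropy ground truth.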
(6) Comparison of stage-one and stage-two training
Common point: both stages use the sample labels of all samples as the ground truth of the triplet loss, which pulls positive samples closer together and pushes negative samples further apart.
Difference: stage-one training uses the randomly sampled sample labels as the ground truth of the cross-entropy loss, whereas stage two uses the pseudo-labels of all samples obtained after clustering. The number of labels produced by random sampling and clustering remains constant, or grows only slightly, as the data volume surges, ensuring that the computational cost of the cross-entropy loss does not explode with the data.
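The comparison above reduces to which label feeds the cross-entropy term; a schematic of the per-stage total loss (the equal weighting and the loss-function arguments are assumptions, not details from the application):

```python
def stage_loss(stage, logits, features, sample_labels,
               sampled_category_labels, pseudo_labels,
               cross_entropy, triplet):
    """Total loss for a training stage.

    The triplet term always uses sample labels as ground truth;
    the cross-entropy term switches from the randomly sampled
    category labels (stage 1) to the clustering pseudo-labels
    (stage 2).
    """
    ce_target = sampled_category_labels if stage == 1 else pseudo_labels
    return cross_entropy(logits, ce_target) + triplet(features, sample_labels)
```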
(7) Query matching
Once the model is trained, the image to be detected and the gallery are simply fed into the network to obtain the matching results between the image to be detected and the gallery samples. By setting a threshold, it can be determined whether the image to be detected belongs to a certain category of samples, and if so, which one. Similarity is compared by computing the cosine distance between samples in the feature space, as given by the following Formula 6. If some gallery sample exceeds the similarity threshold for the query, the query is assigned to that sample's category; otherwise it is assigned to a new category absent from the gallery. A and B denote the feature matrices of the samples.
$$\text{similarity} = \cos(\theta) = \frac{A\cdot B}{\lVert A\rVert\,\lVert B\rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}} \qquad \text{(Formula 6)}$$
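Formula 6 and the threshold rule can be sketched as follows; the threshold value of 0.8 and the label strings are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Formula 6: A.B / (||A|| * ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_query(query_feat, gallery_feats, gallery_labels, threshold=0.8):
    """Return the category of the best gallery match if it exceeds the
    threshold; otherwise None, signalling a new, unseen category."""
    sims = [cosine_similarity(query_feat, g) for g in gallery_feats]
    best = int(np.argmax(sims))
    return gallery_labels[best] if sims[best] >= threshold else None
```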
Therefore, the object re-identification method in the embodiments of the present application can achieve the following technical effects:
1) The technique uses an unsupervised re-identification network to help a classification network learn to classify images quickly.
2) Improved accuracy. The network can re-identify and re-classify images misjudged as a certain category, improve the feature-space distribution on top of the original classification, and refine the feature granularity, so that the network learns not merely a coarse, fuzzy category but the contrast and distribution between individual samples, thereby improving the precision and recall of the original classification.
3) Cost-effectiveness. When a category to be judged is encountered in a project running in real time, there is no need to retrain a huge classification model; the re-identification model recognizes the category accurately and quickly. Moreover, unlike conventional re-identification techniques, this model learns without manual annotation, saving substantial labor and material costs.
4) Ability to recognize new categories. When an unseen sample appears, an ordinary classification network assigns it to one of the known categories, whereas the re-identification network filters out low-confidence samples, thereby recognizing new categories.
5) Fast image processing. The network renders every image, and data augmentation greatly expands the sample size; processing each image separately under such a scheme would be very time-consuming. Rendering first and cropping afterwards, together with GPU-accelerated optimizations, greatly shortens image processing time and also reduces unnecessary background noise.
It can be understood that the above method embodiments mentioned in the present disclosure may be combined with one another to form combined embodiments without departing from their principles and logic. Those skilled in the art will appreciate that, in the above methods of the embodiments, the actual execution order of the steps should be determined by their functions and possible internal logic.
In addition, embodiments of the present disclosure further provide an object re-identification apparatus, an electronic device, a computer-readable storage medium and a program, all of which can be used to implement any object re-identification method provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding records in the method section.
Fig. 8 shows a schematic diagram of an object re-identification apparatus according to an embodiment of the present disclosure. As shown in Fig. 8, the object re-identification apparatus of the embodiment of the present disclosure may include an image determination module 80, a set determination module 81 and a re-identification module 82.
The image determination module 80 is configured to determine an image to be recognized that includes a target object;
the set determination module 81 is configured to determine an image set including at least one candidate image, each candidate image including an object;
the re-identification module 82 is configured to input the image to be recognized and the image set into a re-identification network to obtain a re-identification result, where, in a case where a target candidate image exists in the image set, the re-identification result includes the target candidate image, and the object included in the target candidate image matches the target object;
wherein the re-identification network is obtained through two-stage training: the first training stage is implemented according to at least one sample image and a first category label of each sample image, and the second training stage is implemented according to the at least one sample image and the pseudo-label and first category label of each sample image; the pseudo-label of each sample image is determined based on the re-identification network obtained after the first training stage, and the first category label represents the category of the corresponding image.
In a possible implementation, each candidate image has a corresponding second category label, the second category label representing the category of the object in the corresponding image;
the apparatus further includes:
a label determination module, configured to determine the second category label corresponding to the target candidate image as the second category label of the image to be recognized.
In a possible implementation, the training process of the re-identification network includes:
determining at least one preset image including an object, each preset image having at least one image frame marking the region where an object is located, and a first category label corresponding to each image frame;
determining at least one sample image corresponding to each preset image according to the corresponding at least one image frame;
performing first-stage training on the re-identification network according to the sample images and the corresponding first category labels;
determining pseudo-labels of the sample images according to the re-identification network obtained after the first training stage;
performing second-stage training on the re-identification network obtained after the first-stage training according to the sample images and the corresponding first category labels and pseudo-labels.
In a possible implementation, determining at least one preset image including an object includes:
performing random sampling on a preset image set to obtain at least one preset image including an object.
In a possible implementation, determining at least one sample image corresponding to each preset image according to the corresponding at least one image frame includes:
performing data enhancement on each preset image at least once, and after each data enhancement, cropping the region within at least one image frame as a sample image.
In a possible implementation, before data enhancement is performed on each preset image, image preprocessing is performed on the preset image.
In a possible implementation, performing first-stage training on the re-identification network according to the sample images and the corresponding first category labels includes:
determining the first category label corresponding to each sample image as a second category label;
inputting each sample image into the re-identification network, and outputting a first predicted category corresponding to the sample image;
determining a first network loss according to the first category label, the second category label and the first predicted category corresponding to each sample image, and adjusting the re-identification network according to the first network loss.
In a possible implementation, determining the first network loss according to the first category label, the second category label and the first predicted category corresponding to each sample image, and adjusting the re-identification network according to the first network loss, includes:
determining a first loss according to the first category label and the first predicted category corresponding to each sample image;
determining a second loss according to the second category label and the first predicted category corresponding to each sample image;
determining the first network loss according to the first loss and the second loss, and adjusting the re-identification network according to the first network loss.
In a possible implementation, determining the pseudo-labels of the sample images according to the re-identification network obtained after the first training stage includes:
inputting each sample image into the re-identification network obtained after the first training stage to obtain a feature vector extracted from each sample image;
clustering the feature vectors of the sample images, and determining identification information uniquely corresponding to each cluster obtained after clustering;
using the identification information corresponding to each cluster as the pseudo-label of the sample image corresponding to each feature vector included in the cluster.
In a possible implementation, the clustering process is implemented based on a k-means clustering algorithm.
In a possible implementation, performing second-stage training on the re-identification network obtained after the first-stage training according to the sample images and the corresponding first category labels and pseudo-labels includes:
inputting each sample image into the re-identification network obtained after the first-stage training, and outputting a corresponding second predicted category;
determining a second network loss according to the first category label, the pseudo-label and the second predicted category corresponding to each sample image, and adjusting the re-identification network according to the second network loss.
In a possible implementation, determining the second network loss according to the first category label, the pseudo-label and the second predicted category corresponding to each sample image, and adjusting the re-identification network according to the second network loss, includes:
determining a third loss according to the first category label and the second predicted category corresponding to each sample image;
determining a fourth loss according to the pseudo-label and the second predicted category corresponding to each sample image;
determining the second network loss according to the third loss and the fourth loss, and adjusting the re-identification network according to the second network loss.
In a possible implementation, the first loss and/or the third loss is a triplet loss, and the second loss and/or the fourth loss is a cross-entropy classification loss.
In a possible implementation, the re-identification module 82 includes:
an image input sub-module, configured to input the image to be recognized and the image set into the re-identification network, and extract, through the re-identification network, a target object feature of the image to be recognized and a candidate object feature of each candidate image;
a similarity matching sub-module, configured to determine the similarity between each candidate image and the image to be recognized according to the target object feature and each candidate object feature;
a result output sub-module, configured to, in response to the similarity between a candidate image and the image to be recognized satisfying a preset condition, determine that the object in the candidate image matches the target object, and obtain the re-identification result with the candidate image as the target candidate image.
In a possible implementation, the preset condition includes that the similarity value is the largest and is greater than a similarity threshold.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
In some embodiments, the functions or modules of the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for detailed implementation, refer to the descriptions of the above method embodiments.
Embodiments of the present disclosure further provide a computer-readable storage medium on which computer program instructions are stored; the above method is implemented when the computer program instructions are executed by a processor. The computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium.
Embodiments of the present disclosure further provide an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
Embodiments of the present disclosure further provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.
The electronic device may be provided as a terminal, a server, or a device in another form.
Fig. 9 shows a schematic diagram of an electronic device 800 according to an embodiment of the present disclosure. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device or a personal digital assistant.
Referring to Fig. 9, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operations and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to complete all or some of the steps of the above method. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the electronic device 800. Examples of such data include instructions for any application or method operated on the electronic device 800, contact data, phonebook data, messages, pictures, videos and the like. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk or an optical disc.
The power supply component 806 provides power to the various components of the electronic device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the electronic device 800.
The multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in an operation mode, such as a call mode, a recording mode or a voice recognition mode. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 can detect the on/off state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor component 814 can also detect a change in position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include an optical sensor, such as a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as wireless fidelity (WiFi), the second-generation mobile communication technology (2G) or the third-generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components for performing the above method.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to implement the above method.
Fig. 10 shows a schematic diagram of another electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server. Referring to Fig. 10, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute instructions to perform the above method.
The electronic device 1900 may also include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical-user-interface-based operating system (Mac OS X™) introduced by Apple, the multi-user multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™) or the like.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to implement the above method.
本公开可以是系统、方法和/或计算机程序产品。计算机程序产品可以包括计算机可读存储介质,其上载有用于使处理器实现本公开的各个方面的计算机可读程序指令。The present disclosure can be a system, method and/or computer program product. A computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.
A computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, and a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, as well as any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
Computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and a conventional procedural programming language such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In scenarios involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, for example, a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized by utilizing state information of the computer-readable program instructions; the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium having the instructions stored therein comprises an article of manufacture including instructions which implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, causing a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or acts, or by a combination of special-purpose hardware and computer instructions.
The computer program product may be implemented in hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
The embodiments of the present disclosure have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (15)

  1. An object re-identification method, wherein the method comprises:
    determining an image to be recognized that includes a target object;
    determining an image set including at least one candidate image, each candidate image including an object;
    inputting the image to be recognized and the image set into a re-identification network to obtain a re-identification result, wherein, in a case that a target candidate image exists in the image set, the re-identification result includes the target candidate image, an object included in the target candidate image matching the target object;
    wherein the re-identification network is obtained through two-stage training; a first-stage training process is performed according to at least one sample image and a first category label of each sample image; a second-stage training process is performed according to the at least one sample image and the pseudo-label and first category label of each sample image; the pseudo-label of each sample image is determined based on the re-identification network obtained after the first-stage training process; and the first category label represents the category of the corresponding image.
  2. The method according to claim 1, wherein the training process of the re-identification network comprises:
    determining at least one preset image including an object, each preset image having at least one image frame for marking a region where an object is located, and a first category label corresponding to each image frame;
    determining at least one sample image corresponding to each preset image according to the corresponding at least one image frame;
    performing first-stage training on the re-identification network according to the sample images and the corresponding first category labels;
    determining a pseudo-label of each sample image according to the re-identification network obtained after the first-stage training; and
    performing second-stage training on the re-identification network obtained after the first-stage training according to the sample images and the corresponding first category labels and pseudo-labels.
  3. The method according to claim 2, wherein the determining at least one preset image including an object comprises:
    randomly sampling a preset image set to obtain at least one preset image including an object.
  4. The method according to claim 2 or 3, wherein the determining at least one sample image corresponding to each preset image according to the corresponding at least one image frame comprises:
    performing data enhancement on each preset image at least once, and after each data enhancement, cropping the region within at least one image frame as a sample image;
    and/or,
    before performing data enhancement on each preset image, performing image preprocessing on the preset image.
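The cropping step of claim 4 — cutting each labelled image frame out of a preset image to obtain sample images — can be sketched as follows. The `(x1, y1, x2, y2)` box format and the NumPy array representation are illustrative assumptions, and the data-enhancement and preprocessing steps the claim mentions are omitted:

```python
import numpy as np

def crop_image_frames(preset_image, image_frames):
    """Crop each annotated image frame out of a preset image.

    preset_image: H x W (x C) NumPy array.
    image_frames: list of (x1, y1, x2, y2) pixel boxes (an assumed format);
    each crop would serve as one sample image for re-identification training.
    """
    return [preset_image[y1:y2, x1:x2] for (x1, y1, x2, y2) in image_frames]

# A 10x10 toy "image" with two annotated boxes.
image = np.arange(100).reshape(10, 10)
samples = crop_image_frames(image, [(0, 0, 4, 4), (2, 3, 7, 9)])
```

In a real pipeline the same crop would be taken after every augmented copy of the preset image, so one box can yield several sample images.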
  5. The method according to any one of claims 2 to 4, wherein each candidate image has a corresponding second category label, the second category label representing the category of the object in the corresponding image;
    the method further comprises:
    determining the second category label corresponding to the target candidate image as the second category label of the image to be recognized;
    and/or,
    the performing first-stage training on the re-identification network according to the sample images and the corresponding first category labels comprises:
    determining the first category label corresponding to each sample image as its second category label;
    inputting each sample image into the re-identification network, and outputting a first predicted category corresponding to the sample image;
    determining a first network loss according to the first category label, the second category label, and the first predicted category corresponding to each sample image, and adjusting the re-identification network according to the first network loss.
  6. The method according to claim 5, wherein the determining a first network loss according to the first category label, the second category label, and the first predicted category corresponding to each sample image, and adjusting the re-identification network according to the first network loss comprises:
    determining a first loss according to the first category label and the first predicted category corresponding to each sample image;
    determining a second loss according to the second category label and the first predicted category corresponding to each sample image;
    determining the first network loss according to the first loss and the second loss, and adjusting the re-identification network according to the first network loss.
  7. The method according to any one of claims 2 to 6, wherein the determining a pseudo-label of each sample image according to the re-identification network obtained after the first-stage training comprises:
    inputting each sample image into the re-identification network obtained after the first-stage training to obtain a feature vector resulting from feature extraction on each sample image;
    clustering the feature vectors of the sample images, and determining identification information uniquely corresponding to each cluster obtained by the clustering;
    using the identification information corresponding to each cluster as the pseudo-label of the sample image corresponding to each feature vector included in the cluster;
    wherein the clustering process is implemented based on a k-means clustering algorithm.
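The pseudo-label step of claim 7 — cluster the stage-one feature vectors and use each cluster's identifier as a pseudo-label — could look like the sketch below. The toy k-means implementation, the feature dimensionality, and the choice of k are all assumptions; the claim only specifies that a k-means clustering algorithm is used:

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=10, seed=0):
    """Cluster feature vectors and return one pseudo-label (cluster id) per sample.

    A toy stand-in for the clustering step: in the claimed method, `features`
    would be extracted by the re-identification network after stage-one training.
    """
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest center (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return labels

# Two well-separated blobs of toy features -> two distinct pseudo-labels.
feats = np.vstack([np.zeros((5, 8)), np.ones((5, 8)) * 10.0])
pseudo = kmeans_pseudo_labels(feats, k=2)
```

Each returned cluster id plays the role of the "identification information uniquely corresponding to each cluster" and is attached to every sample image whose feature vector fell in that cluster.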
  8. The method according to any one of claims 2 to 7, wherein the performing second-stage training on the re-identification network obtained after the first-stage training according to the sample images and the corresponding first category labels and pseudo-labels comprises:
    inputting each sample image into the re-identification network obtained after the first-stage training, and outputting a corresponding second predicted category;
    determining a second network loss according to the first category label, the pseudo-label, and the second predicted category corresponding to each sample image, and adjusting the re-identification network according to the second network loss.
  9. The method according to claim 8, wherein the determining a second network loss according to the first category label, the pseudo-label, and the second predicted category corresponding to each sample image, and adjusting the re-identification network according to the second network loss comprises:
    determining a third loss according to the first category label and the second predicted category corresponding to each sample image;
    determining a fourth loss according to the pseudo-label and the second predicted category corresponding to each sample image;
    determining the second network loss according to the third loss and the fourth loss, and adjusting the re-identification network according to the second network loss.
  10. The method according to any one of claims 6 to 9, wherein the first loss and/or the third loss is a triplet loss, and the second loss and/or the fourth loss is a cross-entropy classification loss.
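Claim 10 names the two loss types: a triplet loss on embeddings and a cross-entropy classification loss on predicted categories. A minimal NumPy sketch of both is shown below; the margin value, the single-sample formulation, and the simple summation into a combined loss are assumptions, as the patent does not fix these details:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    # Pull the anchor toward a same-identity sample and push it away
    # from a different-identity sample, with a hinge margin.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

def cross_entropy_loss(logits, label):
    # Softmax cross-entropy for one sample, computed stably in log space.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# Toy embeddings and logits; a combined network loss might simply sum the two.
anchor, pos, neg = np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([5.0, 0.0])
combined = triplet_loss(anchor, pos, neg) + cross_entropy_loss(np.array([2.0, 0.5]), 0)
```

With an easy negative (far from the anchor) the triplet term vanishes, so the combined value is driven by the classification term alone.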
  11. The method according to any one of claims 1 to 10, wherein the inputting the image to be recognized and the image set into a re-identification network to obtain a re-identification result comprises:
    inputting the image to be recognized and the image set into the re-identification network, and extracting, through the re-identification network, a target object feature of the image to be recognized and a candidate object feature of each candidate image;
    determining a similarity between each candidate image and the image to be recognized according to the target object feature and each candidate object feature;
    in response to the similarity between a candidate image and the image to be recognized satisfying a preset condition, determining that the object in the candidate image matches the target object, and using the candidate image as the target candidate image to obtain the re-identification result, wherein the preset condition includes the similarity value being the largest and greater than a similarity threshold.
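The matching rule of claim 11 — take the candidate with the highest similarity and accept it only when that similarity also exceeds a threshold — can be sketched as follows. Cosine similarity and the 0.5 threshold are illustrative assumptions; the patent does not specify the similarity measure or the threshold value:

```python
import numpy as np

def re_identify(query_feature, candidate_features, threshold=0.5):
    """Return the index of the matching candidate image, or None if no match.

    query_feature: 1-D feature vector of the image to be recognized.
    candidate_features: 2-D array, one feature vector per candidate image.
    """
    q = query_feature / np.linalg.norm(query_feature)
    g = candidate_features / np.linalg.norm(candidate_features, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity to every candidate
    best = int(np.argmax(sims))       # largest similarity value...
    return best if sims[best] > threshold else None  # ...and above the threshold

query = np.array([1.0, 0.0])
gallery = np.array([[0.9, 0.1], [0.0, 1.0]])
match = re_identify(query, gallery)                     # candidate 0 matches
no_match = re_identify(query, np.array([[0.0, 1.0]]))   # best sim below threshold
```

Returning `None` when even the best similarity falls below the threshold mirrors the case where no target candidate image exists in the image set.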
  12. An object re-identification apparatus, wherein the apparatus comprises:
    an image determination module, configured to determine an image to be recognized that includes a target object;
    a set determination module, configured to determine an image set including at least one candidate image, each candidate image including an object;
    a re-identification module, configured to input the image to be recognized and the image set into a re-identification network to obtain a re-identification result, wherein, in a case that a target candidate image exists in the image set, the re-identification result includes the target candidate image, an object included in the target candidate image matching the target object;
    wherein the re-identification network is obtained through two-stage training; a first-stage training process is performed according to at least one sample image and a first category label of each sample image; a second-stage training process is performed according to the at least one sample image and the pseudo-label and first category label of each sample image; the pseudo-label of each sample image is determined based on the re-identification network obtained after the first-stage training process; and the first category label represents the category of the corresponding image.
  13. An electronic device, comprising:
    a processor; and
    a memory for storing processor-executable instructions;
    wherein the processor is configured to invoke the instructions stored in the memory to execute the method according to any one of claims 1 to 11.
  14. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 11.
  15. A computer program product, wherein the computer program product comprises a non-transitory computer-readable storage medium storing a computer program, and the computer program, when read and executed by a computer, implements the method according to any one of claims 1 to 11.
PCT/CN2022/104715 2021-12-24 2022-07-08 Object re-identification method and apparatus, electronic device, storage medium, and computer program product WO2023115911A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111601354.8A CN114332503A (en) 2021-12-24 2021-12-24 Object re-identification method and device, electronic equipment and storage medium
CN202111601354.8 2021-12-24

Publications (1)

Publication Number Publication Date
WO2023115911A1

Family

ID=81012974

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104715 WO2023115911A1 (en) 2021-12-24 2022-07-08 Object re-identification method and apparatus, electronic device, storage medium, and computer program product

Country Status (2)

Country Link
CN (1) CN114332503A (en)
WO (1) WO2023115911A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332503A (en) * 2021-12-24 2022-04-12 商汤集团有限公司 Object re-identification method and device, electronic equipment and storage medium
CN117058489B (en) * 2023-10-09 2023-12-29 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of multi-label recognition model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180374233A1 (en) * 2017-06-27 2018-12-27 Qualcomm Incorporated Using object re-identification in video surveillance
CN111783646A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of pedestrian re-identification model
CN111967294A (en) * 2020-06-23 2020-11-20 南昌大学 Unsupervised domain self-adaptive pedestrian re-identification method
CN112069929A (en) * 2020-08-20 2020-12-11 之江实验室 Unsupervised pedestrian re-identification method and device, electronic equipment and storage medium
CN113095174A (en) * 2021-03-29 2021-07-09 深圳力维智联技术有限公司 Re-recognition model training method, device, equipment and readable storage medium
CN114332503A (en) * 2021-12-24 2022-04-12 商汤集团有限公司 Object re-identification method and device, electronic equipment and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665135A (en) * 2023-07-28 2023-08-29 中国华能集团清洁能源技术研究院有限公司 Thermal runaway risk early warning method and device for battery pack of energy storage station and electronic equipment
CN116665135B (en) * 2023-07-28 2023-10-20 中国华能集团清洁能源技术研究院有限公司 Thermal runaway risk early warning method and device for battery pack of energy storage station and electronic equipment

Also Published As

Publication number Publication date
CN114332503A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
WO2021155632A1 (en) Image processing method and apparatus, and electronic device and storage medium
US11120078B2 (en) Method and device for video processing, electronic device, and storage medium
WO2023115911A1 (en) Object re-identification method and apparatus, electronic device, storage medium, and computer program product
WO2021128578A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN108629354B (en) Target detection method and device
WO2021056808A1 (en) Image processing method and apparatus, electronic device, and storage medium
US11455491B2 (en) Method and device for training image recognition model, and storage medium
CN113792207B (en) Cross-modal retrieval method based on multi-level feature representation alignment
CN106228556B (en) image quality analysis method and device
US11222231B2 (en) Target matching method and apparatus, electronic device, and storage medium
CN110532956B (en) Image processing method and device, electronic equipment and storage medium
CN111259148A (en) Information processing method, device and storage medium
CN113326768B (en) Training method, image feature extraction method, image recognition method and device
CN111259967B (en) Image classification and neural network training method, device, equipment and storage medium
CN111582383B (en) Attribute identification method and device, electronic equipment and storage medium
TW202141352A (en) Character recognition method, electronic device and computer readable storage medium
CN112150457A (en) Video detection method, device and computer readable storage medium
WO2022141969A1 (en) Image segmentation method and apparatus, electronic device, storage medium, and program
CN112381091A (en) Video content identification method and device, electronic equipment and storage medium
CN111178115B (en) Training method and system for object recognition network
CN110110742B (en) Multi-feature fusion method and device, electronic equipment and storage medium
CN111797746A (en) Face recognition method and device and computer readable storage medium
WO2021061045A2 (en) Stacked object recognition method and apparatus, electronic device and storage medium
WO2023092975A1 (en) Image processing method and apparatus, electronic device, storage medium, and computer program product
CN112801116B (en) Image feature extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22909242

Country of ref document: EP

Kind code of ref document: A1