CN114332503A - Object re-identification method and device, electronic equipment and storage medium

Info

Publication number: CN114332503A
Application number: CN202111601354.8A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 王皓琦; 王新江; 钟志权; 张伟
Current Assignee: Sensetime Group Ltd
Original Assignee: Sensetime Group Ltd
Application filed by: Sensetime Group Ltd
Related application: PCT/CN2022/104715 (WO2023115911A1)
Prior art keywords: image, network, determining, sample image, class
Legal status: Pending


Classifications

    • G06T 7/11 — Image analysis; Segmentation; Edge detection; Region-based segmentation
    • G06V 10/74 — Image or video recognition or understanding using pattern recognition or machine learning; Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/764 — Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774 — Processing image or video features in feature spaces; Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/762 — Image or video recognition or understanding using pattern recognition or machine learning, using clustering, e.g. of similar faces in social networks

Abstract

The present disclosure relates to an object re-recognition method and apparatus, an electronic device, and a storage medium. An image to be recognized that includes a target object is determined, together with an image set of candidate images, each candidate image including at least one object. The image to be recognized and the image set are input into a re-recognition network to obtain a target candidate image whose object matches the target object. The re-recognition network is obtained through two-stage training: the first stage is trained on the sample images and their corresponding first class labels, and the second stage is trained on the sample images, their corresponding pseudo labels, and the first class labels, where the pseudo label of each sample image is determined from the re-recognition network obtained after the first stage of training. Through two-stage training, the method and the apparatus safeguard the performance of the re-recognition network and improve the accuracy of the recognition results.

Description

Object re-identification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an object re-identification method and apparatus, an electronic device, and a storage medium.
Background
Re-recognition technology is widely used in tasks such as the re-recognition of persons, vehicles, and articles. In real, open-world applications, new situations can arise at any time and, accordingly, previously unseen data is generated. Traditional re-recognition algorithms require a large number of sample labels for training, and when the data set shifts or the domain changes, new data or samples in the new domain must be re-annotated, which consumes considerable manpower and material resources. Meanwhile, unsupervised re-recognition methods in the related art often produce low-accuracy re-recognition results due to the influence of scene changes and similar factors.
Disclosure of Invention
The disclosure provides an object re-recognition method and device, electronic equipment and a storage medium, and aims to improve the accuracy of a re-recognition result through a re-recognition model obtained through unsupervised training.
According to a first aspect of the present disclosure, there is provided an object re-identification method, including:
determining an image to be recognized including a target object;
determining a set of images comprising at least one candidate image, each of the candidate images comprising an object;
inputting the image to be recognized and the image set into a re-recognition network to obtain a re-recognition result, wherein under the condition that a target candidate image exists in the image set, the re-recognition result comprises the target candidate image, and an object included in the target candidate image is matched with the target object;
the re-recognition network is obtained through two-stage training, the first-stage training process is achieved according to at least one sample image and a first class label of each sample image, the second-stage training process is achieved according to the at least one sample image, a pseudo label of each sample image and the first class label, the pseudo label of each sample image is determined based on the re-recognition network after the first-stage training process is finished, and the first class label represents the class of the corresponding image.
In one possible implementation, each of the candidate images has a corresponding second class label, which characterizes a class of an object in the corresponding image;
the method further comprises the following steps:
and determining a second class label corresponding to the target candidate image as the second class label of the image to be identified.
In one possible implementation, the training process of the re-recognition network includes:
determining at least one preset image comprising an object, wherein each preset image is provided with at least one image frame for marking the area where the object is located and a first class label corresponding to each image frame;
determining at least one sample image corresponding to each preset image according to the corresponding at least one image frame;
performing first-stage training on the re-recognition network according to the sample image and the corresponding first class label;
determining a pseudo label of the sample image according to the re-recognition network after the training of the first stage is finished;
and performing second-stage training on the re-recognition network obtained after the first-stage training according to the sample image, the corresponding first class label and the pseudo label.
In one possible implementation, the determining at least one preset image including an object includes:
and randomly sampling the preset image set to obtain at least one preset image comprising the object.
In a possible implementation manner, the determining, according to the corresponding at least one image frame, at least one sample image corresponding to each preset image includes:
and performing data enhancement on each preset image at least once, and intercepting at least one area in the image frame after each data enhancement as a sample image.
In a possible implementation manner, before performing data enhancement on each preset image, image preprocessing is performed on the preset images.
In a possible implementation manner, the performing first-stage training on the re-recognition network according to each sample image and the corresponding first class label includes:
determining the first class label corresponding to each sample image as a second class label;
inputting each sample image into the re-recognition network, and outputting a first prediction category corresponding to the sample image;
and determining a first network loss according to the first class label, the second class label and the first prediction class corresponding to each sample image, and adjusting the re-identification network according to the first network loss.
In one possible implementation, the determining a first network loss according to the first class label, the second class label, and the first prediction class corresponding to each sample image, and adjusting the re-identification network according to the first network loss includes:
determining a first loss according to the first class label and the first prediction class corresponding to each sample image;
determining a second loss according to the second class label and the first prediction class corresponding to each sample image;
determining a first network loss based on the first loss and the second loss, and adjusting the re-identified network based on the first network loss.
In one possible implementation, the determining the pseudo label of the sample image according to the re-recognition network at the end of the first stage training includes:
inputting each sample image into the re-recognition network after the training of the first stage is finished, and obtaining a feature vector after feature extraction is carried out on each sample image;
clustering the characteristic vectors of each sample image, and determining identification information uniquely corresponding to each clustered cluster obtained after clustering;
and taking the identification information corresponding to each cluster as the pseudo label of each sample image whose feature vector is included in that cluster.
In one possible implementation, the clustering process is implemented based on a k-means clustering algorithm.
In a possible implementation manner, the performing, according to the sample image, the corresponding first class label and the pseudo label, the second-stage training on the re-recognition network obtained after the first-stage training includes:
inputting each sample image into the re-recognition network obtained after the first-stage training, and outputting a corresponding second prediction category;
and determining a second network loss according to the first class label, the pseudo label and the second prediction class corresponding to each sample image, and adjusting the re-identification network according to the second network loss.
In one possible implementation manner, the determining a second network loss according to the first class label, the pseudo label, and the second prediction class corresponding to each sample image, and adjusting the re-identification network according to the second network loss includes:
determining a third loss according to the first class label and the second prediction class corresponding to each sample image;
determining a fourth loss according to the pseudo label corresponding to each sample image and the second prediction category;
determining a second network loss based on the third loss and the fourth loss, and adjusting the re-identified network based on the second network loss.
In one possible implementation, the first loss and/or the third loss is a triplet loss, and the second loss and/or the fourth loss is a cross-entropy classification loss.
In a possible implementation manner, the inputting the image to be recognized and the image set into a re-recognition network to obtain a re-recognition result includes:
inputting the image to be recognized and the image set into a re-recognition network, and extracting the target object characteristics of the image to be recognized and the candidate object characteristics of each candidate image through the re-recognition network;
determining the similarity of each candidate image and the image to be identified according to the target object characteristics and each candidate object characteristic;
and in response to the fact that the similarity between the candidate image and the image to be recognized meets a preset condition, determining that an object in the candidate image is matched with the target object, and taking the candidate image as a target candidate image to obtain a re-recognition result.
In a possible implementation manner, the preset condition is that the similarity value is maximum and is greater than a similarity threshold value.
According to a second aspect of the present disclosure, there is provided an object re-recognition apparatus including:
the image determining module is used for determining an image to be recognized comprising a target object;
a set determination module for determining a set of images comprising at least one candidate image, each of said candidate images comprising an object;
the re-recognition module is used for inputting the image to be recognized and the image set into a re-recognition network to obtain a re-recognition result, and under the condition that a target candidate image exists in the image set, the re-recognition result comprises the target candidate image, and an object comprised in the target candidate image is matched with the target object;
the re-recognition network is obtained through two-stage training, the first-stage training process is achieved according to at least one sample image and a first class label of each sample image, the second-stage training process is achieved according to the at least one sample image, a pseudo label of each sample image and the first class label, the pseudo label of each sample image is determined based on the re-recognition network after the first-stage training process is finished, and the first class label represents the class of the corresponding image.
In one possible implementation, each of the candidate images has a corresponding second class label, which characterizes a class of an object in the corresponding image;
the device further comprises:
and the label determining module is used for determining that the second class label corresponding to the target candidate image is the second class label of the image to be identified.
In one possible implementation, the training process of the re-recognition network includes:
determining at least one preset image comprising an object, wherein each preset image is provided with at least one image frame for marking the area where the object is located and a first class label corresponding to each image frame;
determining at least one sample image corresponding to each preset image according to the corresponding at least one image frame;
performing first-stage training on the re-recognition network according to the sample image and the corresponding first class label;
determining a pseudo label of the sample image according to the re-recognition network after the training of the first stage is finished;
and performing second-stage training on the re-recognition network obtained after the first-stage training according to the sample image, the corresponding first class label and the pseudo label.
In one possible implementation, the determining at least one preset image including an object includes:
and randomly sampling the preset image set to obtain at least one preset image comprising the object.
In a possible implementation manner, the determining, according to the corresponding at least one image frame, at least one sample image corresponding to each preset image includes:
and performing data enhancement on each preset image at least once, and intercepting at least one area in the image frame after each data enhancement as a sample image.
In a possible implementation manner, before performing data enhancement on each preset image, image preprocessing is performed on the preset images.
In one possible implementation manner, the performing, according to the sample image and the corresponding first class label, a first stage training on the re-recognition network includes:
determining the first class label corresponding to each sample image as a second class label;
inputting each sample image into the re-recognition network, and outputting a first prediction category corresponding to the sample image;
and determining a first network loss according to the first class label, the second class label and the first prediction class corresponding to each sample image, and adjusting the re-identification network according to the first network loss.
In one possible implementation, the determining a first network loss according to the first class label, the second class label, and the first prediction class corresponding to each sample image, and adjusting the re-identification network according to the first network loss includes:
determining a first loss according to the first class label and the first prediction class corresponding to each sample image;
determining a second loss according to the second class label and the first prediction class corresponding to each sample image;
determining a first network loss based on the first loss and the second loss, and adjusting the re-identified network based on the first network loss.
In one possible implementation, the determining the pseudo label of the sample image according to the re-recognition network at the end of the first stage training includes:
inputting each sample image into the re-recognition network after the training of the first stage is finished, and obtaining a feature vector after feature extraction is carried out on each sample image;
clustering the characteristic vectors of each sample image, and determining identification information uniquely corresponding to each clustered cluster obtained after clustering;
and taking the identification information corresponding to each cluster as the pseudo label of each sample image whose feature vector is included in that cluster.
In one possible implementation, the clustering process is implemented based on a k-means clustering algorithm.
In a possible implementation manner, the performing, according to the sample image, the corresponding first class label and the pseudo label, the second-stage training on the re-recognition network obtained after the first-stage training includes:
inputting each sample image into the re-recognition network obtained after the first-stage training, and outputting a corresponding second prediction category;
and determining a second network loss according to the first class label, the pseudo label and the second prediction class corresponding to each sample image, and adjusting the re-identification network according to the second network loss.
In one possible implementation manner, the determining a second network loss according to the first class label, the pseudo label, and the second prediction class corresponding to each sample image, and adjusting the re-identification network according to the second network loss includes:
determining a third loss according to the first class label and the second prediction class corresponding to each sample image;
determining a fourth loss according to the pseudo label corresponding to each sample image and the second prediction category;
determining a second network loss based on the third loss and the fourth loss, and adjusting the re-identified network based on the second network loss.
In one possible implementation, the first loss and/or the third loss is a triplet loss, and the second loss and/or the fourth loss is a cross-entropy classification loss.
In one possible implementation, the re-identification module includes:
the image input sub-module is used for inputting the image to be recognized and the image set into a re-recognition network, and extracting the target object characteristics of the image to be recognized and the candidate object characteristics of each candidate image through the re-recognition network;
the similarity matching submodule is used for determining the similarity between each candidate image and the image to be identified according to the target object characteristics and each candidate object characteristic;
and the result output sub-module is used for responding that the similarity between the candidate image and the image to be recognized meets a preset condition, determining that an object in the candidate image is matched with the target object, and taking the candidate image as a target candidate image to obtain a re-recognition result.
In a possible implementation manner, the preset condition is that the similarity value is maximum and is greater than a similarity threshold value.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiment of the disclosure, the performance of the re-recognition network is ensured through two-stage training, and the accuracy of the recognition result is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow chart of a method of object re-identification in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flow diagram for training a re-recognition network in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a preset image according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a sample image according to an embodiment of the present disclosure;
FIG. 5 illustrates a schematic diagram of a determination of a sample graph according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating a first stage training process for a re-recognition network according to an embodiment of the present disclosure;
FIG. 7 is a diagram illustrating a second stage training process for a re-recognition network according to an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of an object re-identification apparatus according to an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of an electronic device in accordance with an embodiment of the disclosure;
FIG. 10 shows a schematic diagram of another electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
In a possible implementation manner, the object re-identification method of the embodiment of the disclosure may be executed by an electronic device such as a terminal device or a server. The terminal device may be any mobile or fixed terminal such as a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, and a wearable device. The server may be a single server or a server cluster of multiple servers. Any electronic device may implement the object re-recognition method of the embodiments of the present disclosure by way of a processor invoking computer readable instructions stored in a memory.
The object re-identification method of the embodiments of the present disclosure can be applied to the re-identification of any object, such as a person, a vehicle, or an animal. The re-identification method can search multiple images or video frame sequences for images or frames that include a specific object, and can be applied to scenarios such as searching for a specific person among images acquired by multiple cameras, or tracking objects such as pedestrians and vehicles.
Fig. 1 shows a flowchart of an object re-identification method according to an embodiment of the present disclosure. As shown in fig. 1, the object re-recognition method of the embodiment of the present disclosure may include the following steps S10-S30.
Step S10, determining an image to be recognized including the target object.
In a possible implementation manner, the image to be recognized may be an image obtained by directly capturing the target object, or an image obtained by cropping, from such a captured image, the area where the target object is located. The image to be recognized may be acquired by an image acquisition device built into or connected to the electronic device, or may be received directly from another device. The target object may be any movable or immovable object, such as a person, an animal, a vehicle, or even a piece of furniture.
Step S20, determining a set of images comprising at least one candidate image, each candidate image comprising an object.
In one possible implementation manner, an image set used as the basis for re-recognition of the image to be recognized is determined, where the image set includes at least one candidate image used for matching against the image to be recognized. Optionally, the image set may be stored in advance in the electronic device or in a database connected to the electronic device. Each candidate image is obtained by capturing an object of the same kind as the target object, and may be an image obtained by directly capturing the object or an image obtained by cropping the area where the object is located from such a captured image. That is, the object in each candidate image is of the same kind as the target object. For example, when the target object is a person, the object in each candidate image is also a person; when the target object is a vehicle, the object in each candidate image is also a vehicle.
Optionally, each candidate image in the image set further has a corresponding second class label for characterizing the class of the object in the candidate image. For example, when the object in the candidate image is a person, the second category label may be identity information such as the name, telephone number, and identity document number of the object. When the object in the candidate image is a vehicle, the second category label may be a license plate number, owner information, a driving certificate number, and the like of the vehicle.
And step S30, inputting the image to be recognized and the image set into a re-recognition network to obtain a re-recognition result.
In a possible implementation manner, the image to be recognized and the image set are input into a re-recognition network, a candidate image in which an object included in the candidate image is matched with a target object is determined through the re-recognition network, and the candidate image is used as a target candidate image to obtain a re-recognition result. That is, in the case where there is a target candidate image in which the included object matches the target object, the target candidate image may be included in the re-recognition result. Optionally, the re-recognition result may include a category of the target object in addition to the target candidate image. Namely, after the target candidate image is determined, the second class label corresponding to the target candidate image is also determined to be the second class label of the image to be identified.
Further, the specific process of determining the re-recognition result through the re-recognition network may be to input the image to be recognized and the image set into the re-recognition network, and extract the target object feature of the image to be recognized and the candidate object feature of each candidate image through the re-recognition network. And determining the similarity between each candidate image and the image to be identified according to the target object characteristics and each candidate object characteristic. And in response to the fact that the similarity between the candidate image and the image to be recognized meets a preset condition, determining that an object in the candidate image is matched with a target object, and taking the candidate image as a target candidate image.
Optionally, when the image to be recognized is an image obtained by directly acquiring the target object, the target object feature may be obtained by intercepting an area where the target object is located in the image to be recognized and extracting features of the area through a feature extraction layer of the re-recognition network. Similarly, when the candidate image is an image obtained by directly acquiring the object, the candidate object feature may also be obtained by intercepting the region where the object is located in the candidate image and extracting the feature of the region through a feature extraction layer of the re-recognition network. The target object feature and each candidate object feature can be represented by a vector, and the similarity can be obtained by calculating the distance between the two corresponding vectors in the feature space. The similarity can be calculated by the following formula one:
$$\mathrm{similarity}(A,B)=\frac{\sum_{i=1}^{n}A_i\,B_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\;\sqrt{\sum_{i=1}^{n}B_i^{2}}}\qquad\text{(Formula 1)}$$
where similarity(A, B) is the similarity between A and B, A is the target object feature, B is the candidate object feature, A_i and B_i are their respective elements, n is the number of elements in the target object feature and in the candidate object feature, and i denotes the position of the current element, that is, the i-th element.
In a possible implementation manner, the preset condition may be that the similarity value is maximum and is greater than a similarity threshold, that is, the candidate image with the maximum similarity value and greater than the similarity threshold is determined as the target candidate image. Further, the second class label of the target candidate image is determined to be the second class label of the image to be recognized, and a re-recognition result comprising the target candidate image and the corresponding second class label is determined. Alternatively, when there is no similarity value satisfying the preset condition, that is, there is no target candidate image in which the included object matches the target object, the category of the target object in the current image to be recognized may be determined as a new category, and the re-recognition result may be determined as the new category.
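For illustration only, a minimal Python sketch of the matching procedure described above is given below. The feature extractor, the function names, and the example threshold value are assumptions of this sketch rather than part of the disclosure; the preset condition (maximum similarity that also exceeds a similarity threshold) follows the description above.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Formula 1: dot product of the two feature vectors divided by the
    # product of their L2 norms (small epsilon avoids division by zero).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def re_identify(extract_features, query_image, candidate_images, threshold=0.6):
    """Return (index, similarity) of the target candidate image, or None when
    no candidate satisfies the preset condition.  `extract_features`, the
    candidate list format and the threshold are illustrative assumptions."""
    query_feat = extract_features(query_image)            # target object feature
    best_idx, best_sim = None, -1.0
    for idx, candidate in enumerate(candidate_images):
        sim = cosine_similarity(query_feat, extract_features(candidate))
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    # Preset condition: maximum similarity that also exceeds the threshold.
    if best_idx is not None and best_sim > threshold:
        return best_idx, best_sim
    return None  # no match: the target object may then be treated as a new category
```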
In one possible implementation, the re-recognition network of the embodiments of the present disclosure is obtained through two-stage training. The first-stage training process is realized according to at least one sample image and a first class label of each sample image, the second-stage training process is realized according to at least one sample image, a pseudo label of each sample image and the first class label, the pseudo label of each sample image is determined based on a re-recognition network after the first-stage training process is finished, and the first class label represents the class of the corresponding image. The sample image is a sample image without manual labeling.
FIG. 2 illustrates a flow diagram for training a re-recognition network according to an embodiment of the present disclosure. As shown in FIG. 2, the training process of re-identifying the network according to the embodiment of the present disclosure may include the following steps S40-S80. Alternatively, the electronic device performing steps S40-S80 may be an electronic device performing the object re-recognition method, or another electronic device such as a terminal or a server.
Step S40, determining at least one preset image including the object.
In a possible implementation manner, each preset image is obtained by acquiring at least one object, and each preset image has at least one image frame for labeling an area where the object is located and a first category label corresponding to each image frame. Each preset image is provided with at least one image frame and is used for marking the area where the object in the preset image is located. The image frame can be obtained by labeling in any object labeling mode. For example, the preset image may be input into a pre-trained object recognition model to recognize the object position included in the preset image, and at least one image frame representing the object position may be output. The first class label characterizes a class of an image region within the corresponding image frame, which may be determined from the acquired object. For example, when two persons are captured by the image capturing device to obtain a preset image, the position of each person in the preset image may be identified to obtain two corresponding image frames, and each image frame is assigned with the corresponding first category tags as person 1 and person 2.
Alternatively, the at least one preset image may be determined by random sampling, i.e. randomly sampling the preset image set to obtain the at least one preset image including the object. The preset image set may be stored in the electronic device for training the re-recognition network in advance, or stored in other devices, and the electronic device for training the re-recognition network directly extracts at least one preset image from the other electronic devices.
Fig. 3 shows a schematic diagram of a preset image according to an embodiment of the present disclosure. As shown in fig. 3, the preset image 30 may include at least one object therein, and the preset image 30 further has an image frame for marking the position of the object. For example, when the preset image 30 is an image obtained by capturing at least one person, the preset image 30 may have an image frame for representing a position where the face of the at least one person is located. When the person 1 and the person 2 are included in the preset image 30, the preset image 30 has a first image frame 31 representing the area where the face of the person 1 is located, and a second image frame 32 representing the area where the face of the person 2 is located. Optionally, since the preset image 30 is obtained by capturing two people, the first category label corresponding to the first image frame 31 in the preset image 30 may be directly set as the person 1, and the first category label corresponding to the second image frame 32 may be set as the person 2 in the process of marking the positions of the two people.
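A minimal sketch of how a preset image and its annotations might be represented is given below; the container and field names are illustrative assumptions, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PresetImage:
    """Illustrative container for one preset image: the image path, the image
    frames (bounding boxes) marking each object's region, and the first class
    label attached to each frame (e.g. "person 1", "person 2")."""
    path: str
    frames: List[Tuple[float, float, float, float]]  # (x0, y0, x1, y1) per object
    first_class_labels: List[str]                     # one label per image frame
```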
Step S50, determining at least one sample image corresponding to each of the preset images according to the corresponding at least one image frame.
In a possible implementation manner, after at least one preset image is determined, at least one sample image corresponding to each preset image is determined according to an image frame corresponding to each preset image. Each sample image is obtained by cutting out a partial region in a preset image. Wherein, a plurality of sample images can be obtained by cutting according to each image frame of the preset image. Optionally, each preset image may be subjected to at least one data enhancement, and a region in at least one image frame is cut out after each data enhancement as a sample image. The process of data enhancement may include translating the image frame, flipping the image frame, and scaling down the image frame, etc., so that the sample image captured after each data enhancement can include different regions of the object.
Further, because the formats and attributes of different preset images are different, in order to ensure that the obtained sample image conforms to the format required by the training re-recognition network, data preprocessing can be performed on each preset image before data enhancement is performed. The data preprocessing process may include any processing modes such as format conversion, image brightness adjustment, overall noise reduction, and the like, and at least one processing mode may be selected in advance for data preprocessing as required.
FIG. 4 shows a schematic diagram of a sample image according to an embodiment of the disclosure. In one possible implementation, after the preset image 30 is determined, a plurality of sample images corresponding to at least one object are cut out from the preset image 30. When the preset image 30 includes the person 1 and the person 2, and the image frames of the preset image 30 are the first image frame 31 representing the area where the face of the person 1 is located and the second image frame 32 representing the area where the face of the person 2 is located, at least one first object sample image 33 corresponding to the person 1 in the preset image 30 and at least one second object sample image 34 corresponding to the person 2 in the preset image 30 can be determined.
Optionally, before the sample image is extracted, data preprocessing is performed on the preset image 30, then operations such as translation, flipping, and size scaling are performed on the first image frame 31 and the second image frame 32, and after each operation, the content in the image frames is captured, so as to obtain the corresponding first object sample image 33 and second object sample image 34.
Fig. 5 illustrates a schematic diagram of a determination of a sample graph according to an embodiment of the disclosure. As shown in fig. 5, when determining a sample image for training a re-recognition network according to the embodiment of the present disclosure, a preset image set 50 including at least one preset image may be determined first, and the preset image set 50 is randomly sampled 51 to obtain at least one preset image 52 including an object. The preset images 52 are subjected to image preprocessing 53 and data enhancement 54 in sequence, and the area in the image frame of each preset image 52 is cut out to obtain a sample image 55.
In a possible implementation manner, the sequence of the process of randomly sampling in the preset image set to obtain the preset image and the process of extracting the sample image in the preset image may be changed, that is, the preset image may be randomly sampled first and then the sample image is extracted, or each preset image in the preset image set may be randomly sampled after the sample image is extracted. Optionally, the order of image pre-processing and data enhancement in the sample image extraction process may also be adjusted.
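The sample-generation pipeline of Fig. 5 can be sketched as follows, reusing the illustrative PresetImage container from the earlier sketch. The sampling count, the number of enhancements per image frame, and the particular enhancement operations (frame translation and horizontal flipping) are assumptions of this sketch, not requirements of the disclosure.

```python
import random
from PIL import Image, ImageOps

def jitter_frame(frame, image_size, max_shift=0.1):
    """Illustrative data enhancement: translate an (x0, y0, x1, y1) image
    frame by a small random offset, clipped to the image bounds."""
    x0, y0, x1, y1 = frame
    dx = random.uniform(-max_shift, max_shift) * (x1 - x0)
    dy = random.uniform(-max_shift, max_shift) * (y1 - y0)
    return (int(max(0, x0 + dx)), int(max(0, y0 + dy)),
            int(min(image_size[0], x1 + dx)), int(min(image_size[1], y1 + dy)))

def generate_samples(preset_images, num_presets=1000, enhancements_per_frame=4):
    """Randomly sample preset images, preprocess each one, apply data
    enhancement to every image frame, and crop the enhanced frame regions
    as sample images paired with their first class labels."""
    samples = []  # (cropped sample image, first class label)
    chosen = random.sample(preset_images, min(num_presets, len(preset_images)))
    for preset in chosen:
        image = Image.open(preset.path).convert("RGB")   # preprocessing: format conversion
        for frame, label in zip(preset.frames, preset.first_class_labels):
            for _ in range(enhancements_per_frame):
                crop = image.crop(jitter_frame(frame, image.size))
                if random.random() < 0.5:                # enhancement: horizontal flip
                    crop = ImageOps.mirror(crop)
                samples.append((crop, label))
    return samples
```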
In this manner, the embodiment of the disclosure can acquire a plurality of sample images for each object through data enhancement, greatly expanding the number of sample images. Furthermore, a GPU can be used to perform the image processing in parallel, which shortens the image processing time and reduces unnecessary background noise. Meanwhile, determining the sample images from randomly extracted preset images avoids the training process becoming computationally intractable due to an excessive number of sample classes, and random sampling ensures that the extracted preset images are representative and reflect the characteristics of the preset image set.
And step S60, performing first-stage training on the re-recognition network according to the sample image and the corresponding first class label.
In a possible implementation manner, after a plurality of sample images are determined, the re-recognition network may be directly subjected to first-stage training according to each sample image and its first class label. During training, the re-recognition network outputs a first prediction category for each input sample image, which characterizes the network's prediction of the category of the object in that image. Because each sample image includes only one object, its real image category also serves as its real object category; a loss can therefore be calculated between the first prediction category and each of the real image category and the real object category, and the total re-recognition network loss is obtained for network adjustment.
Optionally, in order to improve the efficiency of the training process and reduce the manual labeling cost, sample images do not need to be labeled manually before the re-recognition network is trained; the first class label of the image frame corresponding to a sample image may be used directly as the first class label of that sample image. That is, the class of the region where each object is located in the preset image is used as the real image class of the sample image cropped from that region. Further, most preset images include only one object, and the remaining preset images include only a small number of objects. Therefore, to improve labeling efficiency, an actual second class label need not be annotated according to the object class of each sample image. Instead, during the first-stage training of the re-recognition network, the first class label of each sample image is taken directly as the second class label representing the object class, and the real object class of each sample image is corrected during the second-stage training.
For example, when a preset image has image frames for three person objects whose first category labels are "person 1", "person 2", and "person 3", the sample images in each image frame are extracted respectively. If the sample images corresponding to each image frame were labeled manually, the identity of each person would be specifically identified and the corresponding second category labels could be annotated as "Zhang San", "Li Si", and "Wang Wu". To save labeling time and improve the efficiency of the re-recognition network training process, the identity of the person in each sample image is instead left unrecognized, and the second category labels of the sample images corresponding to each image frame are rapidly set to "person 1", "person 2", and "person 3" simply by inheriting the first category labels.
Based on this way of determining the second class label, the first-stage training of the re-recognition network comprises determining the first class label corresponding to each sample image as its second class label, inputting each sample image into the re-recognition network, and outputting the corresponding first prediction class. A first network loss is then determined according to the first class label, the second class label, and the first prediction class corresponding to each sample image, and the re-recognition network is adjusted according to the first network loss. Specifically, a first loss may be determined according to the first class label and the first prediction class corresponding to each sample image, a second loss may be determined according to the second class label and the first prediction class corresponding to each sample image, the first network loss may be determined according to the first loss and the second loss, and the re-recognition network may be adjusted according to the first network loss. The first network loss may be obtained by calculating a weighted sum of the first loss and the second loss.
In one possible implementation, the first loss may be a triplet loss and the second loss may be a cross-entropy classification loss. That is, the first loss may be obtained by calculating a triplet loss over the first class label and the first prediction class of each sample image, and the second loss may be obtained by calculating a cross-entropy classification loss over the second class label and the first prediction class of each sample image. The triplet loss increases with the distance between samples of the same object class and decreases with the distance between samples of different object classes; reducing the triplet loss through network adjustment therefore pulls samples of the same object class closer together and pushes samples of different object classes farther apart. The cross-entropy classification loss increases with the distance between samples of the same image class, so reducing it through network adjustment pulls samples of the same image class closer together.
Alternatively, the triplet loss and the cross-entropy classification loss may be computed by the following Formula 2 and Formula 3, respectively:
$$L_{th}=\sum_{a=1}^{P\times K}\Big[\,d\big(f_a,f_p\big)-d\big(f_a,f_n\big)+\alpha\,\Big]_{+}\qquad\text{(Formula 2)}$$
$$L=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M}y_{ic}\,\log\big(p_{ic}\big)\qquad\text{(Formula 3)}$$
where the triplet loss in Formula 2 is L_th, P × K is the total number of sample images, a is any sample image with feature vector f_a, p is, among the sample images having the same first class label as a, the sample image whose feature vector f_p is farthest from f_a in the feature space, and n is, among the sample images having a different first class label from a, the sample image whose feature vector f_n is closest to f_a in the feature space; d(·,·) denotes the distance between two feature vectors, [·]_+ denotes max(·, 0), and α is a preset correction parameter. The cross-entropy classification loss in Formula 3 is L, N is the number of sample images, M is the number of second class labels, p_ic is the predicted probability that sample image i belongs to the first prediction class c, and y_ic takes the value 1 when the second class label of sample i is c and 0 otherwise.
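A possible PyTorch rendering of the two losses is sketched below. The batch-hard mining strategy, the averaging over anchors (a constant rescaling of Formula 2), and the margin value are assumptions consistent with the description above rather than a definitive implementation.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(features: torch.Tensor, labels: torch.Tensor,
                            margin: float = 0.3) -> torch.Tensor:
    """Formula 2 in batch-hard form: for every anchor, take its farthest
    same-label sample (hardest positive) and its nearest different-label
    sample (hardest negative) in the feature space."""
    dist = torch.cdist(features, features)                               # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)                    # same first class label?
    hardest_pos = (dist * same.float()).max(dim=1).values                # farthest positive
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values # nearest negative
    return F.relu(hardest_pos - hardest_neg + margin).mean()             # averaged over anchors

def classification_loss(logits: torch.Tensor, class_ids: torch.Tensor) -> torch.Tensor:
    """Formula 3: cross-entropy over the second class labels (first stage)
    or over the pseudo labels (second stage)."""
    return F.cross_entropy(logits, class_ids)
```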
After determining the first network loss, a first stage adjustment may be made to the re-identified network until the first network loss satisfies a first predetermined condition. The first preset condition may be that the first network loss is smaller than a preset first threshold. Based on the characteristics of triple loss and cross entropy classification loss, the re-recognition network adjusted in the first stage can obtain a feature space with reasonable distribution. That is, the feature extraction layer of the re-recognition network can be adjusted so that the re-recognition network can extract similar feature vectors for images of the same image category, and can also extract similar feature vectors for images of the same object category.
FIG. 6 shows a schematic diagram of a first stage training process of a re-recognition network according to an embodiment of the present disclosure. As shown in fig. 6, after the sample image 60 is determined, according to a first category label 61 obtained while a preset image corresponding to the sample image 60 is acquired, a first category label 61 and a second category label 62 corresponding to the sample image 60 are determined in the embodiment of the disclosure. Each sample image 60 is input into the re-recognition network 63 to obtain a first prediction category 64, and a first loss 65 is calculated according to the first prediction category 64 and the first category label 61 of each sample image 60. Meanwhile, a second loss 66 is calculated from the first prediction category 64 and the second category label 62 of each sample image 60, and the re-recognition network 63 is jointly adjusted according to the first loss 65 and the second loss 66. Optionally, the adjustment may be to calculate a weighted sum of the first loss 65 and the second loss 66 to obtain a first network loss, and perform a first stage adjustment on the re-recognition network 63 until the first network loss satisfies a first preset condition.
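A single first-stage update could then be sketched as follows, reusing the loss functions above. The loss weights and the assumption that the network returns a (features, logits) pair are illustrative, not taken from the disclosure.

```python
def first_stage_step(network, optimizer, images, first_class_ids, second_class_ids,
                     w_triplet=1.0, w_ce=1.0):
    """One first-stage update: the first network loss is a weighted sum of the
    triplet loss (first loss) and the cross-entropy classification loss
    (second loss); in the first stage the second class labels simply inherit
    the first class labels."""
    features, logits = network(images)   # assumed (features, logits) output convention
    loss = (w_triplet * batch_hard_triplet_loss(features, first_class_ids)
            + w_ce * classification_loss(logits, second_class_ids))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```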
And step S70, determining the pseudo label of the sample image according to the re-recognition network after the training of the first stage is finished.
In a possible implementation manner, after the first-stage training process is performed on the re-recognition network, the pseudo label of each sample image can be determined according to the feature space which is reasonably distributed after the re-recognition network is trained. And the pseudo label of each sample image is used for characterizing the class of the object in the sample image in the second stage training process. Pseudo-tags may be tags of any content, each pseudo-tag being used to uniquely characterize a class of objects.
Optionally, the pseudo labels may be determined from the re-recognition network after the first-stage training by inputting each sample image into the network obtained at the end of the first-stage training to obtain a feature vector extracted from each sample image, clustering the feature vectors of all sample images, and determining identification information uniquely corresponding to each cluster obtained after clustering. The identification information corresponding to each cluster is then taken as the pseudo label of every sample image whose feature vector belongs to that cluster. The clustering process can be implemented based on a k-means clustering algorithm, and the identification information uniquely corresponding to each cluster can be preset or generated according to a preset rule.
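A possible sketch of this pseudo-label generation step, using scikit-learn's k-means implementation, is given below; the number of clusters, the data-loading convention, and the (features, logits) output convention are assumptions of the sketch.

```python
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def assign_pseudo_labels(network, sample_loader, num_clusters):
    """Run every sample image through the stage-one network, collect the
    extracted feature vectors, cluster them with k-means, and use each
    cluster index as the pseudo label of the samples it contains."""
    network.eval()
    feature_batches = []
    for images, _ in sample_loader:          # loader yields (images, labels) batches
        features, _ = network(images)        # assumed (features, logits) output
        feature_batches.append(features.cpu())
    all_features = torch.cat(feature_batches).numpy()
    return KMeans(n_clusters=num_clusters, n_init=10).fit_predict(all_features)
```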
And step S80, performing second-stage training on the re-recognition network obtained after the first-stage training according to the sample image, the corresponding first class label and the pseudo label.
In a possible implementation manner, after the pseudo label of each sample image is obtained, the first class label corresponding to each sample image in the first-stage training process is used as a real image class, and the pseudo label corresponding to each sample image is used as a real object class. And further, performing second-stage training of the re-recognition network based on the real image class, the real object class and the sample image class predicted by the re-recognition network of each current sample image. That is, each sample image may be input to the re-recognition network obtained after the first stage training, and the corresponding second prediction class may be output. And determining a second network loss according to the first class label, the pseudo label and the second prediction class corresponding to each sample image, and adjusting the re-identification network according to the second network loss.
Optionally, as in the first stage training process of the re-recognition network, the second stage training process calculates a loss according to the real image class and the real object class of the sample image and the second prediction class, respectively, to obtain a total re-recognition network loss for network adjustment. That is, the second stage of adjusting the re-identification network may include determining a third loss according to the first class label and the second prediction class corresponding to each sample image, and determining a fourth loss according to the pseudo label and the second prediction class corresponding to each sample image. And determining a second network loss according to the third loss and the fourth loss, and adjusting the re-identification network according to the second network loss. Wherein the second network loss may be obtained by calculating a weighted sum of the third loss and the fourth loss.
In one possible implementation, the third loss may be a triplet loss and the fourth loss may be a cross-entropy classification loss. That is, the third loss may be obtained by calculating a triplet loss over the first class label and the second prediction class of each sample image, and the fourth loss may be obtained by calculating a cross-entropy classification loss over the pseudo label and the second prediction class of each sample image. The triplet loss increases with the distance between samples of the same object class and decreases with the distance between samples of different object classes; reducing the triplet loss through network adjustment therefore pulls samples of the same object class closer together and pushes samples of different object classes farther apart. The cross-entropy classification loss increases with the distance between samples of the same image class, so reducing it through network adjustment pulls samples of the same image class closer together. Optionally, the calculation of the third loss may be the same as that of the first loss, and the calculation of the fourth loss may be the same as that of the second loss, which are not described herein again.
After the second network loss is determined by calculating a weighted sum of the third loss and the fourth loss, the second-stage adjustment may be performed on the re-recognition network until the second network loss satisfies a second preset condition. The second preset condition may be that the second network loss is smaller than a preset second threshold. Based on the characteristics of the triplet loss and the cross-entropy classification loss, the re-recognition network adjusted in the second stage can obtain a feature space with a more reasonable distribution. That is, by adjusting the feature extraction layer of the re-recognition network, the re-recognition network can more accurately extract similar feature vectors for images of the same image class, and likewise for images of the same object class.
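A minimal sketch of one second-stage adjustment step under these definitions is given below; it assumes the common practice of computing the triplet loss on feature vectors grouped by the first class labels, and the names `reid_net`, `triplet_loss_fn`, `optimizer`, the weights `w3`/`w4` and the stopping threshold are illustrative assumptions, not values fixed by this disclosure.

```python
# Illustrative sketch of one second-stage adjustment step (assumptions:
# `reid_net(images)` returns (features, logits); `triplet_loss_fn` is a
# hypothetical batch-hard triplet loss taking (features, labels)).
import torch
import torch.nn.functional as F

def second_stage_step(reid_net, images, first_class_labels, pseudo_labels,
                      triplet_loss_fn, optimizer, w3=1.0, w4=1.0):
    features, logits = reid_net(images)                   # second prediction per sample
    # Third loss: triplet loss computed with the first class labels.
    third_loss = triplet_loss_fn(features, first_class_labels)
    # Fourth loss: cross-entropy classification loss with the pseudo labels.
    fourth_loss = F.cross_entropy(logits, pseudo_labels)
    # Second network loss: weighted sum of the third and fourth losses.
    second_network_loss = w3 * third_loss + w4 * fourth_loss
    optimizer.zero_grad()
    second_network_loss.backward()
    optimizer.step()
    return second_network_loss.item()

# Training would continue until the second network loss satisfies the second
# preset condition, e.g. falling below an assumed second threshold.
```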
FIG. 7 is a diagram illustrating a second-stage training process of a re-recognition network according to an embodiment of the disclosure. As shown in fig. 7, after the sample image 70 is determined, the first class label 71 obtained when the preset image corresponding to the sample image 70 was acquired and the pseudo label 72 corresponding to the sample image 70 are determined in the embodiment of the disclosure. Each sample image 70 is input into the re-recognition network 73 to obtain a second prediction category 74, and a third loss 75 is calculated according to the second prediction category 74 and the first class label 71 of each sample image 70. Meanwhile, a fourth loss 76 is calculated from the second prediction category 74 and the pseudo label 72 of each sample image 70, and the re-recognition network 73 is adjusted jointly according to the third loss 75 and the fourth loss 76. Optionally, the adjustment may be to calculate a weighted sum of the third loss 75 and the fourth loss 76 to obtain a second network loss, and to perform the second-stage adjustment on the re-recognition network 73 until the second network loss satisfies a second preset condition.
Based on the above training method, a high-accuracy re-recognition network can be obtained quickly and at low cost from label-free data; the re-recognition network can accurately extract similar feature vectors for images of the same image class and also for images of the same object class, so that a reasonably distributed feature space is obtained. Furthermore, the re-recognition network obtained through the two-stage training can accurately re-recognize the image to be recognized and yield an accurate re-recognition result.
It is understood that the above method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principle and logic thereof; details are omitted here for brevity. Those skilled in the art will appreciate that, in the above methods of the specific embodiments, the specific execution order of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides an object re-identification apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any object re-identification method provided by the present disclosure; for the corresponding technical solutions, reference may be made to the descriptions in the method section, and details are not repeated here.
Fig. 8 shows a schematic diagram of an object re-recognition apparatus according to an embodiment of the present disclosure. As shown in fig. 8, the object re-recognition apparatus of the embodiment of the present disclosure may include an image determination module 80, a set determination module 81, and a re-recognition module 82.
An image determination module 80 for determining an image to be recognized including a target object;
a set determination module 81 for determining a set of images comprising at least one candidate image, each of said candidate images comprising an object;
a re-recognition module 82, configured to input the image to be recognized and the image set into a re-recognition network to obtain a re-recognition result, where, in a case that a target candidate image exists in the image set, the re-recognition result includes the target candidate image, and an object included in the target candidate image matches the target object;
the re-recognition network is obtained through two-stage training, the first-stage training process is achieved according to at least one sample image and a first class label of each sample image, the second-stage training process is achieved according to the at least one sample image, a pseudo label of each sample image and the first class label, the pseudo label of each sample image is determined based on the re-recognition network after the first-stage training process is finished, and the first class label represents the class of the corresponding image.
In one possible implementation, each of the candidate images has a corresponding second class label, which characterizes a class of an object in the corresponding image;
the device further comprises:
and the label determining module is used for determining the second class label corresponding to the target candidate image as the second class label of the image to be identified.
In one possible implementation, the training process of the re-recognition network includes:
determining at least one preset image comprising an object, wherein each preset image is provided with at least one image frame for marking the area where the object is located and a first class label corresponding to each image frame;
determining at least one sample image corresponding to each preset image according to the corresponding at least one image frame;
performing first-stage training on the re-recognition network according to the sample image and the corresponding first class label;
determining a pseudo label of the sample image according to the re-recognition network after the training of the first stage is finished;
and performing second-stage training on the re-recognition network obtained after the first-stage training according to the sample image, the corresponding first class label and the pseudo label.
In one possible implementation, the determining at least one preset image including an object includes:
and randomly sampling the preset image set to obtain at least one preset image comprising the object.
In a possible implementation manner, the determining, according to the corresponding at least one image frame, at least one sample image corresponding to each preset image includes:
and performing data enhancement on each preset image at least once, and intercepting at least one area in the image frame after each data enhancement as a sample image.
In a possible implementation manner, before performing data enhancement on each preset image, image preprocessing is performed on the preset images.
In one possible implementation manner, the performing, according to the sample image and the corresponding first class label, a first stage training on the re-recognition network includes:
determining the first class label corresponding to each sample image as a second class label;
inputting each sample image into the re-recognition network, and outputting a first prediction category corresponding to the sample image;
and determining a first network loss according to the first class label, the second class label and the first prediction class corresponding to each sample image, and adjusting the re-identification network according to the first network loss.
In one possible implementation, the determining a first network loss according to the first class label, the second class label, and the first prediction class corresponding to each sample image, and adjusting the re-identification network according to the first network loss includes:
determining a first loss according to the first class label and the first prediction class corresponding to each sample image;
determining a second loss according to the second class label and the first prediction class corresponding to each sample image;
determining a first network loss based on the first loss and the second loss, and adjusting the re-recognition network based on the first network loss.
In one possible implementation, the determining the pseudo label of the sample image according to the re-recognition network at the end of the first stage training includes:
inputting each sample image into the re-recognition network after the training of the first stage is finished, and obtaining a feature vector after feature extraction is carried out on each sample image;
clustering the feature vectors of the sample images, and determining identification information uniquely corresponding to each cluster obtained after clustering;
and taking the identification information corresponding to each cluster as the pseudo label of each sample image whose feature vector is included in the cluster.
In one possible implementation, the clustering process is implemented based on a k-means clustering algorithm.
In a possible implementation manner, the performing, according to the sample image, the corresponding first class label and the pseudo label, the second-stage training on the re-recognition network obtained after the first-stage training includes:
inputting each sample image into the re-recognition network obtained after the first-stage training, and outputting a corresponding second prediction category;
and determining a second network loss according to the first class label, the pseudo label and the second prediction class corresponding to each sample image, and adjusting the re-identification network according to the second network loss.
In one possible implementation manner, the determining a second network loss according to the first class label, the pseudo label, and the second prediction class corresponding to each sample image, and adjusting the re-identification network according to the second network loss includes:
determining a third loss according to the first class label and the second prediction class corresponding to each sample image;
determining a fourth loss according to the pseudo label corresponding to each sample image and the second prediction category;
determining a second network loss based on the third loss and the fourth loss, and adjusting the re-recognition network based on the second network loss.
In one possible implementation, the first loss and/or the third loss is a triplet loss, and the second loss and/or the fourth loss is a cross-entropy classification loss.
In one possible implementation, the re-identification module 82 includes:
the image input sub-module is used for inputting the image to be recognized and the image set into a re-recognition network, and extracting the target object characteristics of the image to be recognized and the candidate object characteristics of each candidate image through the re-recognition network;
the similarity matching submodule is used for determining the similarity between each candidate image and the image to be identified according to the target object characteristics and each candidate object characteristic;
and the result output sub-module is used for, in response to the similarity between a candidate image and the image to be recognized meeting a preset condition, determining that an object in the candidate image matches the target object, and taking the candidate image as a target candidate image to obtain a re-recognition result.
In a possible implementation manner, the preset condition is that the similarity value is maximum and is greater than a similarity threshold value.
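For illustration, a minimal sketch of this matching step is given below; cosine similarity and the hypothetical `extract_features` call are assumptions of the sketch, since the disclosure does not fix the similarity measure, and `sim_threshold` is an illustrative value.

```python
# Illustrative sketch of the re-recognition (matching) step (assumptions:
# `candidate_images` is a tensor batch (N, C, H, W); cosine similarity is used).
import torch
import torch.nn.functional as F

@torch.no_grad()
def re_identify(reid_net, image_to_recognize, candidate_images, sim_threshold=0.5):
    # Target object feature of the image to be recognized, and candidate object features.
    target_feat = F.normalize(reid_net.extract_features(image_to_recognize.unsqueeze(0)), dim=1)
    cand_feats = F.normalize(reid_net.extract_features(candidate_images), dim=1)
    similarities = cand_feats @ target_feat.squeeze(0)   # one similarity per candidate image
    best_sim, best_idx = similarities.max(dim=0)
    # Preset condition: the similarity value is maximum and greater than the threshold.
    if best_sim.item() > sim_threshold:
        return candidate_images[int(best_idx)]           # target candidate image
    return None                                          # no matching candidate in the set
```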
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
The disclosed embodiments also provide a computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 9 shows a schematic diagram of an electronic device 800 according to an embodiment of the disclosure. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
Referring to fig. 9, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor assembly 814 may also detect a change in the position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 10 shows a schematic diagram of another electronic device 1900 according to an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 10, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical-user-interface-based operating system from Apple Inc. (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open source Unix-like operating system (Linux™), the open source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK), or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (18)

1. An object re-recognition method, the method comprising:
determining an image to be recognized including a target object;
determining a set of images comprising at least one candidate image, each of the candidate images comprising an object;
inputting the image to be recognized and the image set into a re-recognition network to obtain a re-recognition result, wherein under the condition that a target candidate image exists in the image set, the re-recognition result comprises the target candidate image, and an object included in the target candidate image is matched with the target object;
the re-recognition network is obtained through two-stage training, the first-stage training process is achieved according to at least one sample image and a first class label of each sample image, the second-stage training process is achieved according to the at least one sample image, a pseudo label of each sample image and the first class label, the pseudo label of each sample image is determined based on the re-recognition network after the first-stage training process is finished, and the first class label represents the class of the corresponding image.
2. The method of claim 1, wherein each of the candidate images has a corresponding second class label characterizing a class of objects in the corresponding image;
the method further comprises the following steps:
and determining a second class label corresponding to the target candidate image as the second class label of the image to be identified.
3. The method according to claim 1 or 2, wherein the training process of the re-recognition network comprises:
determining at least one preset image comprising an object, wherein each preset image is provided with at least one image frame for marking the area where the object is located and a first class label corresponding to each image frame;
determining at least one sample image corresponding to each preset image according to the corresponding at least one image frame;
performing first-stage training on the re-recognition network according to the sample image and the corresponding first class label;
determining a pseudo label of the sample image according to the re-recognition network after the training of the first stage is finished;
and performing second-stage training on the re-recognition network obtained after the first-stage training according to the sample image, the corresponding first class label and the pseudo label.
4. The method of claim 3, wherein determining at least one preset image comprising an object comprises:
and randomly sampling the preset image set to obtain at least one preset image comprising the object.
5. The method according to claim 3 or 4, wherein the determining at least one sample image corresponding to each of the preset images according to the corresponding at least one image frame comprises:
and performing data enhancement on each preset image at least once, and intercepting at least one area in the image frame after each data enhancement as a sample image.
6. The method of claim 5, wherein image preprocessing is performed on the preset images before data enhancement is performed on each of the preset images.
7. The method of any of claims 3-6, wherein the first stage training of the re-recognition network based on the sample images and corresponding first class labels comprises:
determining the first class label corresponding to each sample image as a second class label;
inputting each sample image into the re-recognition network, and outputting a first prediction category corresponding to the sample image;
and determining a first network loss according to the first class label, the second class label and the first prediction class corresponding to each sample image, and adjusting the re-identification network according to the first network loss.
8. The method of claim 7, wherein determining a first network loss according to the first class label, the second class label, and the first prediction class corresponding to each of the sample images, and adjusting the re-recognition network according to the first network loss comprises:
determining a first loss according to the first class label and the first prediction class corresponding to each sample image;
determining a second loss according to the second class label and the first prediction class corresponding to each sample image;
determining a first network loss based on the first loss and the second loss, and adjusting the re-recognition network based on the first network loss.
9. The method according to any one of claims 3-8, wherein the determining the pseudo label of the sample image according to the re-recognition network at the end of the first stage training comprises:
inputting each sample image into the re-recognition network after the training of the first stage is finished, and obtaining a feature vector after feature extraction is carried out on each sample image;
clustering the feature vectors of the sample images, and determining identification information uniquely corresponding to each cluster obtained after clustering;
and taking the identification information corresponding to each cluster as the pseudo label of each sample image whose feature vector is included in the cluster.
10. The method of claim 9, wherein the clustering process is implemented based on a k-means clustering algorithm.
11. The method according to any one of claims 3 to 10, wherein the performing of the second-stage training on the re-recognition network obtained after the first-stage training according to the sample image and the corresponding first class label and pseudo label comprises:
inputting each sample image into the re-recognition network obtained after the first-stage training, and outputting a corresponding second prediction category;
and determining a second network loss according to the first class label, the pseudo label and the second prediction class corresponding to each sample image, and adjusting the re-identification network according to the second network loss.
12. The method of claim 11, wherein determining a second network loss according to the first class label, the pseudo label, and the second prediction class corresponding to each of the sample images, and adjusting the re-recognition network according to the second network loss comprises:
determining a third loss according to the first class label and the second prediction class corresponding to each sample image;
determining a fourth loss according to the pseudo label corresponding to each sample image and the second prediction category;
determining a second network loss based on the third loss and the fourth loss, and adjusting the re-recognition network based on the second network loss.
13. The method according to any of claims 8-12, wherein the first and/or third losses are triplet losses and the second and/or fourth losses are cross-entropy classification losses.
14. The method according to any one of claims 1 to 13, wherein the inputting the image to be recognized and the image set into a re-recognition network to obtain a re-recognition result comprises:
inputting the image to be recognized and the image set into a re-recognition network, and extracting the target object characteristics of the image to be recognized and the candidate object characteristics of each candidate image through the re-recognition network;
determining the similarity of each candidate image and the image to be identified according to the target object characteristics and each candidate object characteristic;
and in response to the fact that the similarity between the candidate image and the image to be recognized meets a preset condition, determining that an object in the candidate image is matched with the target object, and taking the candidate image as a target candidate image to obtain a re-recognition result.
15. The method according to claim 14, wherein the preset condition is that the similarity value is maximum and greater than a similarity threshold value.
16. An object re-recognition apparatus, characterized in that the apparatus comprises:
the image determining module is used for determining an image to be recognized comprising a target object;
a set determination module for determining a set of images comprising at least one candidate image, each of said candidate images comprising an object;
the re-recognition module is used for inputting the image to be recognized and the image set into a re-recognition network to obtain a re-recognition result, and under the condition that a target candidate image exists in the image set, the re-recognition result comprises the target candidate image, and an object comprised in the target candidate image is matched with the target object;
the re-recognition network is obtained through two-stage training, the first-stage training process is achieved according to at least one sample image and a first class label of each sample image, the second-stage training process is achieved according to the at least one sample image, a pseudo label of each sample image and the first class label, the pseudo label of each sample image is determined based on the re-recognition network after the first-stage training process is finished, and the first class label represents the class of the corresponding image.
17. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any one of claims 1 to 15.
18. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 15.
CN202111601354.8A 2021-12-24 2021-12-24 Object re-identification method and device, electronic equipment and storage medium Pending CN114332503A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111601354.8A CN114332503A (en) 2021-12-24 2021-12-24 Object re-identification method and device, electronic equipment and storage medium
PCT/CN2022/104715 WO2023115911A1 (en) 2021-12-24 2022-07-08 Object re-identification method and apparatus, electronic device, storage medium, and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111601354.8A CN114332503A (en) 2021-12-24 2021-12-24 Object re-identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114332503A true CN114332503A (en) 2022-04-12

Family

ID=81012974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111601354.8A Pending CN114332503A (en) 2021-12-24 2021-12-24 Object re-identification method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114332503A (en)
WO (1) WO2023115911A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665135B (en) * 2023-07-28 2023-10-20 中国华能集团清洁能源技术研究院有限公司 Thermal runaway risk early warning method and device for battery pack of energy storage station and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395385B2 (en) * 2017-06-27 2019-08-27 Qualcomm Incorporated Using object re-identification in video surveillance
CN111967294B (en) * 2020-06-23 2022-05-20 南昌大学 Unsupervised domain self-adaptive pedestrian re-identification method
CN111783646B (en) * 2020-06-30 2024-01-23 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of pedestrian re-identification model
CN112069929B (en) * 2020-08-20 2024-01-05 之江实验室 Unsupervised pedestrian re-identification method and device, electronic equipment and storage medium
CN113095174A (en) * 2021-03-29 2021-07-09 深圳力维智联技术有限公司 Re-recognition model training method, device, equipment and readable storage medium
CN114332503A (en) * 2021-12-24 2022-04-12 商汤集团有限公司 Object re-identification method and device, electronic equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023115911A1 (en) * 2021-12-24 2023-06-29 上海商汤智能科技有限公司 Object re-identification method and apparatus, electronic device, storage medium, and computer program product
CN117058489A (en) * 2023-10-09 2023-11-14 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of multi-label recognition model
CN117058489B (en) * 2023-10-09 2023-12-29 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of multi-label recognition model

Also Published As

Publication number Publication date
WO2023115911A1 (en) 2023-06-29

Similar Documents

Publication Publication Date Title
US11120078B2 (en) Method and device for video processing, electronic device, and storage medium
CN111310616B (en) Image processing method and device, electronic equipment and storage medium
CN110781957B (en) Image processing method and device, electronic equipment and storage medium
KR20210047336A (en) Image processing method and apparatus, electronic device and storage medium
CN110472091B (en) Image processing method and device, electronic equipment and storage medium
CN109615006B (en) Character recognition method and device, electronic equipment and storage medium
CN109934275B (en) Image processing method and device, electronic equipment and storage medium
CN114332503A (en) Object re-identification method and device, electronic equipment and storage medium
CN109145150B (en) Target matching method and device, electronic equipment and storage medium
CN110781813B (en) Image recognition method and device, electronic equipment and storage medium
CN110532956B (en) Image processing method and device, electronic equipment and storage medium
CN111539410B (en) Character recognition method and device, electronic equipment and storage medium
CN111340048B (en) Image processing method and device, electronic equipment and storage medium
CN111582383B (en) Attribute identification method and device, electronic equipment and storage medium
CN109101542B (en) Image recognition result output method and device, electronic device and storage medium
CN111259967A (en) Image classification and neural network training method, device, equipment and storage medium
CN111242303A (en) Network training method and device, and image processing method and device
CN113326768A (en) Training method, image feature extraction method, image recognition method and device
CN110909203A (en) Video analysis method and device, electronic equipment and storage medium
CN110633715B (en) Image processing method, network training method and device and electronic equipment
CN111652107A (en) Object counting method and device, electronic equipment and storage medium
CN110781842A (en) Image processing method and device, electronic equipment and storage medium
CN110929545A (en) Human face image sorting method and device
CN112598676A (en) Image segmentation method and device, electronic equipment and storage medium
CN111275055A (en) Network training method and device, and image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination