CN108875487B - Training of pedestrian re-recognition network and pedestrian re-recognition based on training - Google Patents

Training of pedestrian re-recognition network and pedestrian re-recognition based on training Download PDF

Info

Publication number
CN108875487B
CN108875487B (application CN201710906719.5A)
Authority
CN
China
Prior art date
Legal status
Active
Application number
CN201710906719.5A
Other languages
Chinese (zh)
Other versions
CN108875487A (en)
Inventor
罗浩 (Luo Hao)
张弛 (Zhang Chi)
Current Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd, Beijing Megvii Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority to CN201710906719.5A priority Critical patent/CN108875487B/en
Publication of CN108875487A publication Critical patent/CN108875487A/en
Application granted granted Critical
Publication of CN108875487B publication Critical patent/CN108875487B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The invention provides a method, an apparatus, a system and a storage medium for training a pedestrian re-identification network and for pedestrian re-identification based on that training. The training method comprises: pre-training a reference network with a classification loss; and tuning the pre-trained reference network with a combined classification loss and quintuple loss to obtain the pedestrian re-identification network. By training with both the classification loss and a distance loss, the training method, apparatus, system and storage medium accelerate the training process and improve accuracy. In addition, the distance-loss stage adopts a quintuple method which, compared with the conventional triplet, improved-triplet and quadruplet methods, markedly shortens training time and further improves accuracy.

Description

Training of pedestrian re-recognition network and pedestrian re-recognition based on training
Technical Field
The invention relates to the technical field of pedestrian re-identification, in particular to training of a pedestrian re-identification network, and a pedestrian re-identification method, device, system and storage medium based on the training.
Background
Pedestrian re-identification, also known as person re-identification (ReID), is a technique that uses computer vision to determine whether a specific pedestrian is present in an image or video sequence. Given a monitored pedestrian image, the pedestrian's images are retrieved across devices. It aims to compensate for the visual limitations of fixed cameras, can be combined with pedestrian detection and pedestrian tracking, and is widely applicable to intelligent video surveillance, intelligent security and similar fields.
Existing pedestrian re-identification methods fall into two groups by training approach. The first treats each pedestrian as a category and converts pedestrian re-identification into an image classification problem. The second extracts a feature from each pedestrian picture, computes the distance between the features of two pictures, and trains the feature-extraction network by minimizing the distance between picture features of the same person while maximizing the distance between picture features of different pedestrians; current methods of this kind include the triplet, improved triplet and quadruplet methods.
However, models trained on the classification loss alone rarely reach high accuracy, while models trained on a distance loss are usually more accurate than the former but take a very long time to train.
Disclosure of Invention
To solve these problems, the invention provides a training scheme for a pedestrian re-identification network that combines the advantages of the two methods: by combining the classification loss with a distance loss, it both accelerates the training process and improves accuracy. The scheme is briefly described below; further details are given in the detailed description with reference to the drawings.
According to an aspect of the present invention, there is provided a training method for a pedestrian re-recognition network, the training method including: pre-training a reference network by using classification loss; and optimizing the pre-trained reference network by combining the classification loss and the quintuple loss to obtain a pedestrian re-identification network.
In an embodiment of the present invention, pre-training the reference network with the classification loss includes: inputting a sample picture to the reference network; comparing the prediction vector output by the reference network for the sample picture with the label vector of the sample picture to obtain a classification loss; adjusting parameters of the reference network based on the classification loss; and repeating these steps until the classification accuracy and the classification loss are substantially unchanged.
In one embodiment of the invention, the reference network is a residual network.
In one embodiment of the invention, a pre-processing operation is performed on the sample picture prior to inputting the sample picture to the reference network.
In one embodiment of the present invention, tuning the pre-trained reference network with the combined classification loss and quintuple loss comprises: inputting the five sample pictures of a quintuple according to preset requirements and order; calculating a classification loss from the prediction vector output by the reference network for each sample picture; calculating a quintuple loss from the feature vectors output by the reference network for the five sample pictures; and calculating a final loss, as the loss of the pedestrian re-identification network, based on the calculated classification loss and the calculated quintuple loss.
In one embodiment of the present invention, the calculated classification loss is an average of the classification losses of the five sample pictures.
In one embodiment of the invention, the quintuple loss is defined as:
l_qt = d(positive sample 1, positive sample 2) - d(negative sample 1, negative sample 21) + d(negative sample 21, negative sample 22) - d(negative sample 1, positive sample 2) + a
where l_qt is the quintuple loss; positive sample 1, positive sample 2, negative sample 1, negative sample 21 and negative sample 22 are the five sample pictures, positive sample 1 and positive sample 2 being two different pictures of a first pedestrian, negative sample 1 a picture of a second pedestrian, and negative sample 21 and negative sample 22 two different pictures of a third pedestrian; d is the distance between the feature vectors of two pictures; and a is a constant parameter set as required.
In one embodiment of the invention, the final penalty is a weighted sum of the calculated classification penalty and the calculated quintuple penalty.
According to another aspect of the present invention, there is provided a training apparatus for a pedestrian re-recognition network, the training apparatus including: the pre-training module is used for pre-training the reference network by utilizing the classification loss; and the tuning module is used for tuning the pre-trained reference network by combining the classification loss and the quintuple loss to obtain a pedestrian re-identification network.
In one embodiment of the present invention, the pre-training of the reference network by the pre-training module further comprises: inputting a sample picture to the reference network; comparing a prediction vector output by the reference network for the sample picture with a label vector of the sample picture to obtain a classification loss; adjusting a parameter of the reference network based on the classification loss; and repeating the above operations until the classification accuracy and the classification loss are substantially unchanged.
In one embodiment of the invention, the reference network is a residual network.
In an embodiment of the invention, the pre-training module is further configured to: performing a pre-processing operation on the sample picture prior to inputting the sample picture to the reference network.
In one embodiment of the present invention, tuning the pre-trained reference network by the tuning module comprises: inputting the five sample pictures of a quintuple according to preset requirements and order; calculating a classification loss from the prediction vector output by the reference network for each sample picture; calculating a quintuple loss from the feature vectors output by the reference network for the five sample pictures; and calculating a final loss, as the loss of the pedestrian re-identification network, based on the calculated classification loss and the calculated quintuple loss.
In one embodiment of the present invention, the calculated classification loss is an average of the classification losses of the five sample pictures.
In one embodiment of the invention, the quintuple loss is defined as:
l_qt = d(positive sample 1, positive sample 2) - d(negative sample 1, negative sample 21) + d(negative sample 21, negative sample 22) - d(negative sample 1, positive sample 2) + a
where l_qt is the quintuple loss; positive sample 1, positive sample 2, negative sample 1, negative sample 21 and negative sample 22 are the five sample pictures, positive sample 1 and positive sample 2 being two different pictures of a first pedestrian, negative sample 1 a picture of a second pedestrian, and negative sample 21 and negative sample 22 two different pictures of a third pedestrian; d is the distance between the feature vectors of two pictures; and a is a constant parameter set as required.
In one embodiment of the invention, the final penalty is a weighted sum of the calculated classification penalty and the calculated quintuple penalty.
According to another aspect of the present invention, a pedestrian re-identification method is provided, wherein the pedestrian re-identification method performs pedestrian re-identification by using a pedestrian re-identification network trained by the training method of the pedestrian re-identification network described in any one of the above.
According to another aspect of the present invention, there is provided a pedestrian re-identification apparatus for implementing the above-described pedestrian re-identification method.
According to a further aspect of the present invention, there is provided a computing system comprising a storage device and a processor, the storage device having stored thereon a computer program for execution by the processor, the computer program, when executed by the processor, performing the method of training a pedestrian re-identification network as defined in any one of the above or performing the method of pedestrian re-identification.
According to a further aspect of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed, performs the method of training a pedestrian re-recognition network of any one of the above or performs the method of pedestrian re-recognition.
According to the training method, apparatus, system and storage medium for the pedestrian re-identification network of the embodiments of the invention, training with both the classification loss and a distance loss accelerates the training process and improves accuracy. In addition, the distance-loss stage adopts a quintuple method which, compared with the conventional triplet, improved-triplet and quadruplet methods, markedly shortens training time and further improves accuracy.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail embodiments of the present invention with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 illustrates a schematic block diagram of an example electronic device for implementing a training method, apparatus, system and storage medium for a pedestrian re-recognition network in accordance with embodiments of the present invention;
FIG. 2 shows a schematic flow diagram of a method of training a pedestrian re-identification network in accordance with an embodiment of the invention;
FIG. 3 shows a schematic diagram of reference network pre-training according to an embodiment of the invention;
FIG. 4 is a diagram illustrating tuning after pre-training of a reference network according to an embodiment of the invention;
FIG. 5 shows a schematic block diagram of a training apparatus for a pedestrian re-identification network in accordance with an embodiment of the present invention; and
FIG. 6 shows a schematic block diagram of a training system for a pedestrian re-identification network in accordance with an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of embodiments of the invention and not all embodiments of the invention, with the understanding that the invention is not limited to the example embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the invention described herein without inventive step, shall fall within the scope of protection of the invention.
First, an example electronic device 100 for implementing a training method, an apparatus, a system, and a storage medium of a pedestrian re-recognition network according to an embodiment of the present invention is described with reference to fig. 1.
As shown in FIG. 1, electronic device 100 includes one or more processors 102, one or more memory devices 104, an input device 106, an output device 108, and an image capture device 110, which are interconnected via a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device may have other components and structures as desired.
The processor 102 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform desired functions.
The storage 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the client functionality (implemented by the processor) and/or other desired functionality in the embodiments of the invention described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to an external (e.g., user), and may include one or more of a display, a speaker, and the like.
The image capture device 110 may capture images (e.g., photographs, videos, etc.) desired by the user and store the captured images in the storage device 104 for use by other components. The image capture device 110 may be a camera. It should be understood that the image capture device 110 is merely an example, and the electronic device 100 may not include one. In that case, the sample picture may be acquired by another image acquisition apparatus and the acquired picture transmitted to the electronic device 100.
Exemplary electronic devices for implementing the training method, apparatus, system and storage medium of the pedestrian re-recognition network according to embodiments of the present invention may be implemented as, for example, smart phones, tablet computers, and the like.
Next, a training method 200 of a pedestrian re-recognition network according to an embodiment of the present invention will be described with reference to fig. 2. As shown in fig. 2, the training method 200 of the pedestrian re-identification network may include the following steps:
in step S210, the reference network is pre-trained with classification loss.
In one embodiment, the network model may first be pre-trained with the classification loss. The network converges quickly, within a few tens of training iterations, whereas a distance-loss-based method needs at least ten times as much training time to reach the same performance, so pre-training the network model with the classification loss greatly shortens training time.
In one embodiment, the network model pre-trained with the classification loss is referred to as the reference network, and the tuning steps described subsequently are performed on this pre-trained reference network. Illustratively, the reference network may be a residual network, such as a ResNet50 pre-trained on the ImageNet large-scale image recognition dataset. When the reference network is the residual network, the sample picture may be preprocessed before being input to the reference network for training.
For example, the sample picture can be resized to 224 × 224 pixels in the BGR channel format, and the mean value of each channel over all ImageNet images is subtracted from that channel, expressed by the formulas:

new B channel = original B channel - 104.00698793
new G channel = original G channel - 116.66876762
new R channel = original R channel - 122.67891434
The above pre-treatment process is merely exemplary and not required. In other examples, other custom convolutional networks or other suitable networks may also be employed as the reference network, and accordingly, other suitable pre-processing processes may be implemented before the sample picture is input to the reference network.
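The mean-subtraction step above can be sketched per pixel as follows; this is a minimal pure-Python illustration in which `preprocess_pixel` is a hypothetical name, and a real pipeline would first resize the picture to 224 × 224 with an image library:

```python
# ImageNet per-channel means (B, G, R) as given in the formulas above.
IMAGENET_BGR_MEANS = (104.00698793, 116.66876762, 122.67891434)

def preprocess_pixel(bgr):
    """Subtract the ImageNet channel mean from one BGR pixel."""
    return tuple(c - m for c, m in zip(bgr, IMAGENET_BGR_MEANS))

pixel = preprocess_pixel((110.0, 120.0, 130.0))
```

In practice the same subtraction is applied to every pixel of the resized picture before it is fed to the reference network.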
In one embodiment, pre-training the reference network with the classification loss in step S210 may further include: inputting the sample picture into the reference network; comparing the prediction vector output by the reference network for the sample picture with the label vector of the sample picture to obtain a classification loss; adjusting parameters of the reference network based on the classification loss; and repeating these steps until the classification accuracy and the classification loss are substantially unchanged.
In particular, the structure and pre-training process of the reference network may be further understood in conjunction with fig. 3. As shown in fig. 3, an input image (e.g., a preprocessed sample picture) is fed to the reference network (shown as the residual network ResNet50 in fig. 3); after the normalized classification layer (Softmax), the network outputs a prediction vector for each sample picture. The value of the i-th element of the prediction vector represents the probability that the picture shows the i-th person (i = 1, 2, 3, …, N, where N is a natural number), so the elements of the vector sum to 1.
The prediction vector can then be compared with the label vector of the sample picture (i.e., the annotated label, e.g., a manually annotated label) to obtain the classification loss. Since the label vector is a one-hot vector, only one element is 1 and the others are 0; the position of that 1 indicates which person the picture belongs to, i.e., the ID information. The classification loss is the difference between the prediction vector output by the reference network and the label vector (computed, for example, as the cross-entropy loss). The classification loss is then back-propagated through the reference network to adjust its parameters.
A forward computation of the prediction vector followed by a backward update of the network parameters constitutes one complete iteration, and training stops when the classification accuracy and the classification loss are substantially unchanged. This stage usually needs only a few tens of iterations for the network to converge, which greatly shortens training time.
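The classification loss computed at each iteration can be sketched in a few lines; cross-entropy is the example form mentioned above, and the prediction vector is assumed to already be a Softmax probability vector:

```python
import math

def classification_loss(prediction, label):
    """Cross-entropy between a prediction vector (probabilities summing to 1)
    output by the Softmax layer and a one-hot label vector."""
    eps = 1e-12  # numerical floor so log(0) cannot occur
    return -sum(y * math.log(max(p, eps)) for p, y in zip(prediction, label))

# A prediction concentrated on the correct person yields a low loss.
good = classification_loss([0.9, 0.05, 0.05], [1.0, 0.0, 0.0])
bad = classification_loss([0.1, 0.8, 0.1], [1.0, 0.0, 0.0])
```

One forward computation of this loss plus one backward parameter update would form the complete iteration described above.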
Referring back now to fig. 2, the subsequent steps of the training method 200 of the pedestrian re-identification network according to an embodiment of the present invention are continuously described.
In step S220, the pre-trained reference network is optimized to obtain a pedestrian re-recognition network by combining the classification loss and the quintuple loss.
In one embodiment, quintuple refers to five sample pictures of three different pedestrians selected according to certain requirements and order, which is specified as follows:
(1) picture 1: the first picture of pedestrian 1, named positive sample 1;
(2) picture 2: the second picture of the pedestrian 1, different from the picture 1, is named as a positive sample 2;
(3) picture 3: the first picture of the pedestrian 2 is named as a negative sample 1;
(4) picture 4: the first picture of the pedestrian 3, named negative example 21;
(5) picture 5: the second picture of the pedestrian 3, different from the picture 4, is named negative example 22.
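The selection rule above can be sketched as a sampling function; `sample_quintuple` and the dict layout are illustrative assumptions, and each pedestrian id is assumed to have at least two pictures:

```python
import random

def sample_quintuple(pictures_by_id):
    """Draw one quintuple (pos1, pos2, neg1, neg21, neg22) from a dict
    mapping pedestrian id -> list of pictures, following the order above.
    Assumes at least three ids, each with at least two pictures."""
    ped1, ped2, ped3 = random.sample(list(pictures_by_id), 3)
    pos1, pos2 = random.sample(pictures_by_id[ped1], 2)    # two different pictures of pedestrian 1
    neg1 = random.choice(pictures_by_id[ped2])             # one picture of pedestrian 2
    neg21, neg22 = random.sample(pictures_by_id[ped3], 2)  # two different pictures of pedestrian 3
    return pos1, pos2, neg1, neg21, neg22

data = {pid: [(pid, i) for i in range(2)] for pid in ('a', 'b', 'c')}
quintuple = sample_quintuple(data)
```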
In one embodiment, the quintuple loss may be defined as follows:
l_qt = d(positive sample 1, positive sample 2) - d(negative sample 1, negative sample 21) + d(negative sample 21, negative sample 22) - d(negative sample 1, positive sample 2) + a
where l_qt is the quintuple loss; positive sample 1, positive sample 2, negative sample 1, negative sample 21 and negative sample 22 are the five sample pictures, positive sample 1 and positive sample 2 being two different pictures of a first pedestrian, negative sample 1 a picture of a second pedestrian, and negative sample 21 and negative sample 22 two different pictures of a third pedestrian; a is a constant parameter set as required (for example, 2, or any other value chosen according to actual needs); and d is the distance between the feature vectors of two pictures, e.g., d(positive sample 1, positive sample 2) is the distance between the feature vectors of the first and second pictures of pedestrian 1, d(negative sample 1, negative sample 21) is the distance between the first picture of pedestrian 2 and the first picture of pedestrian 3, and so on. In one example, d may be the Euclidean distance. In other examples, the quintuple loss may be calculated based on other distances between feature vectors, such as the cosine distance or the Mahalanobis distance.
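The formula above translates directly into a small function taking the distance function d and the five samples; a = 2 is the example margin value mentioned in the text, and the scalar "features" with absolute difference as d in the usage line are purely illustrative:

```python
def quintuple_loss(d, pos1, pos2, neg1, neg21, neg22, a=2.0):
    """Quintuple loss l_qt as defined above; d is the distance between two
    feature vectors and a is the constant margin parameter."""
    return (d(pos1, pos2)       # pull pictures of the same pedestrian together
            - d(neg1, neg21)    # push pedestrian 2 away from pedestrian 3
            + d(neg21, neg22)   # pull pictures of pedestrian 3 together
            - d(neg1, pos2)     # push pedestrian 2 away from pedestrian 1
            + a)

# Illustrative usage with scalar "features" and absolute difference as d.
dist = lambda x, y: abs(x - y)
value = quintuple_loss(dist, 0.0, 0.1, 5.0, 9.0, 9.2, a=2.0)
```

Minimizing this loss drives the same-pedestrian distances down and the different-pedestrian distances up, as the comments indicate.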
The distance (e.g., the two-dimensional Euclidean distance) between the feature vectors of different pictures (also referred to as picture content feature vectors) defines the similarity between the pictures. The sample pictures are input to the pre-trained reference network, and the fully connected layer Fc (which may also be called the feature layer, as shown in fig. 3) outputs a feature vector for each sample picture. Assume the feature vectors extracted from picture 1 and picture 2 after passing through the network are f1 and f2, respectively. The feature vectors may first be normalized (regularized), expressed by the formula:

f_n = f / |f|

where |f| denotes the modulus of the vector f. Let f_n1 and f_n2 denote the normalized f1 and f2, respectively; the two-dimensional Euclidean distance between the normalized vectors is then defined as:

d = ||f_n1 - f_n2||_2
based on the distance d, the quintuple loss can be calculated.
In one embodiment, tuning the pre-trained reference network with the combined classification loss and quintuple loss in step S220 may further comprise: inputting the five sample pictures of a quintuple according to preset requirements and order; calculating a classification loss from the prediction vector output by the reference network for each sample picture; calculating a quintuple loss from the feature vectors output by the reference network for the five sample pictures; and calculating a final loss, as the loss of the pedestrian re-identification network, based on the calculated classification loss and the calculated quintuple loss.
An exemplary process for tuning the pre-trained reference network using the joint classification penalty and quintuple penalty described above is described below in conjunction with FIG. 4.
As shown in fig. 4, the quintuple of sample pictures (positive sample 1, positive sample 2, negative sample 1, negative sample 21 and negative sample 22) is input to the pre-trained reference network. To distinguish it from the pre-training stage of step S210, the network of the pre-training stage is named the ID network (IDNet) and the network of the tuning stage of step S220 the quintuple-ID network (Quintuplet-IDNet); in reality, however, the two stages use the same network structure. As shown in fig. 4, positive sample 1, positive sample 2, negative sample 1, negative sample 21 and negative sample 22 are input to the Quintuplet-IDNet, that is, to the IDNet.
After the sample pictures are input into the Quintuplet-IDNet, for each sample picture the Fc layer outputs a corresponding feature vector and the Softmax layer outputs a corresponding prediction vector. For example, as shown in fig. 4, the feature vectors corresponding to positive sample 1, positive sample 2, negative sample 1, negative sample 21 and negative sample 22 are feature 1, feature 2, feature 3, feature 4 and feature 5, respectively; the corresponding prediction vectors are ID1, ID2, ID3, ID4 and ID5, respectively.
Then, a classification loss may be calculated from the prediction vector of each sample picture, in the same way as described in step S210. Since five pictures are input, the final classification loss may be the average of the classification losses of the five pictures. The quintuple loss can then be calculated from the feature vectors of the five sample pictures, as described above.
Finally, a final loss may be calculated from the calculated classification loss and the calculated quintuple loss as the loss of the pedestrian re-identification network. Illustratively, the final loss is a weighted sum of the two, expressed as:
loss = λ · l_ID + (1 − λ) · l_qt
where λ is a weight parameter in the range 0–1 that can be tuned as needed. Illustratively, λ may be set to 0.5.
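The weighted combination above is a one-liner; λ = 0.5 follows the illustrative value in the text, and `final_loss` is a hypothetical name:

```python
def final_loss(l_id, l_qt, lam=0.5):
    """loss = lam * l_ID + (1 - lam) * l_qt, with the weight lam in [0, 1]."""
    assert 0.0 <= lam <= 1.0, "lam is a weight parameter in the range 0-1"
    return lam * l_id + (1.0 - lam) * l_qt
```

For example, `final_loss(2.0, 4.0)` combines an average classification loss of 2.0 with a quintuple loss of 4.0 at equal weight.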
The pre-trained IDNet is thus tuned through the combined classification loss and quintuple loss, and the tuned Quintuplet-IDNet serves as the final pedestrian re-identification network for pedestrian re-identification.
Based on the trained pedestrian re-identification network, after a query picture (probe) and a set of pedestrian images to be searched (gallery) are input, the feature vector of each picture is obtained by forward propagation through the network, and a similarity ranking is obtained by computing the distance between the probe's feature vector and the feature vector of each gallery picture. When the minimum distance between a gallery picture and the probe is below a set threshold, that (most similar) gallery picture and the probe are considered to show the same pedestrian, completing the pedestrian re-identification task.
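The probe/gallery matching procedure above can be sketched as follows; `rank_gallery` and its signature are illustrative, and the scalar "features" with absolute difference as d in the usage lines stand in for real feature vectors:

```python
def rank_gallery(probe_feature, gallery_features, d, threshold):
    """Rank gallery pictures by distance to the probe feature; report the
    nearest gallery id as the same pedestrian only when its distance is
    below `threshold`, as described above."""
    ranked = sorted(gallery_features.items(), key=lambda kv: d(probe_feature, kv[1]))
    best_id, best_feature = ranked[0]
    match = best_id if d(probe_feature, best_feature) < threshold else None
    return ranked, match

# Illustrative usage with scalar "features" and absolute difference as d.
dist = lambda x, y: abs(x - y)
ranked, match = rank_gallery(1.2, {'a': 1.0, 'b': 5.0}, dist, threshold=0.5)
```

When no gallery picture is close enough, `match` is `None` and no re-identification is declared.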
Based on the above description, the training method for a pedestrian re-identification network according to the embodiments of the invention trains with both the classification loss and a distance loss, so the final network combines the advantages of the classification-loss-based and distance-loss-based methods: the training process is accelerated and accuracy is improved. In addition, the distance-loss stage adopts a quintuple method; compared with the conventional triplet, improved-triplet and quadruplet methods, training time is roughly halved, the intra-class distance is further reduced and the inter-class distance further enlarged, so accuracy is further improved.
The training method of the pedestrian re-recognition network according to the embodiment of the invention is exemplarily described above. Illustratively, the training method of the pedestrian re-recognition network according to the embodiment of the present invention may be implemented in a device, an apparatus or a system having a memory and a processor.
In addition, the training method of the pedestrian re-identification network according to the embodiment of the invention has a high processing speed and can be conveniently deployed on terminal devices such as smart phones, tablet computers and personal computers. Alternatively, the training method can be deployed at a server side (or cloud side), or distributed between a server side (or cloud side) and a personal terminal.
A training apparatus for a pedestrian re-recognition network provided according to another aspect is described below with reference to fig. 5. Fig. 5 shows a schematic block diagram of a training apparatus 500 of a pedestrian re-recognition network according to an embodiment of the present invention.
As shown in fig. 5, the training apparatus 500 for a pedestrian re-recognition network according to an embodiment of the present invention includes a pre-training module 510 and an adjusting and optimizing module 520. The various modules may respectively perform the various steps/functions of the training method of the pedestrian re-identification network described above in connection with fig. 2. Only the main functions of the modules of the training apparatus 500 for the pedestrian re-identification network will be described below, and details that have been described above will be omitted.
The pre-training module 510 is used to pre-train the reference network with the classification loss. The tuning module 520 is used for tuning the pre-trained reference network to obtain the pedestrian re-identification network by combining the classification loss and the quintuple loss.
In one embodiment, the pre-training module 510 may pre-train the network model using the classification loss, which can greatly reduce the training time since the network can converge rapidly within a few dozen training iterations.
In one embodiment, the network model pre-trained by the pre-training module 510 using the classification loss is referred to as the reference network, and the subsequent tuning by the tuning module 520 is performed on this pre-trained reference network. Illustratively, the reference network may be a residual network, such as a ResNet50 pre-trained on the large-scale image recognition dataset ImageNet. When the reference network is a residual network, the pre-training module 510 may pre-process each sample picture before inputting it into the reference network for training. In other examples, the pre-training module 510 may also use another customized convolutional network or other suitable network as the reference network and, accordingly, perform the corresponding pre-processing before inputting the sample picture into the reference network.
In one embodiment, the pre-training module 510 pre-training the reference network with the classification loss may further include: inputting the sample picture into a reference network; comparing a prediction vector output by a reference network for the sample picture with a label vector of the sample picture to obtain a classification loss; adjusting a parameter of the reference network based on the classification loss; and repeating the above operations until the classification accuracy and the classification loss are substantially unchanged. The structure and the pre-training process of the reference network can be further understood with reference to the above description in conjunction with fig. 3, and are not described herein again for brevity.
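The loss computed in the comparison step above — prediction vector versus label vector — is typically a softmax cross-entropy; the text does not name the exact loss, so the following is a hedged sketch under that assumption, with an illustrative one-hot label encoding:

```python
import numpy as np

def classification_loss(pred_logits, label_vec):
    """Softmax cross-entropy between the network's prediction vector for a
    sample picture and the one-hot label vector of that picture."""
    z = np.asarray(pred_logits, dtype=float)
    z = z - z.max()                       # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(-(np.asarray(label_vec) * np.log(probs + 1e-12)).sum())

# A confident correct prediction yields a small loss; a wrong one a large loss.
label = [0.0, 1.0, 0.0]
good = classification_loss([0.0, 5.0, 0.0], label)
bad = classification_loss([5.0, 0.0, 0.0], label)
```

During pre-training this loss would be back-propagated to adjust the reference network's parameters, repeating until the classification accuracy and loss are substantially unchanged, as described above.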
In one embodiment, a quintuple refers to five sample pictures of three different pedestrians selected according to certain requirements and in a certain order. In one embodiment, the tuning module 520 tuning the pre-trained reference network in conjunction with the classification loss and the quintuple loss may further comprise: inputting the five sample pictures of a quintuple according to preset requirements and order; calculating a classification loss for the prediction vector output by the reference network for each sample picture; calculating a quintuple loss for the feature vectors output by the reference network for the five sample pictures; and calculating a final loss as the loss of the pedestrian re-identification network based on the calculated classification loss and the calculated quintuple loss. The process of tuning the pre-trained reference network by joint classification loss and quintuple loss can be understood with reference to fig. 4, and for brevity, is not described again here.
Based on the above description, the training device of the pedestrian re-identification network according to the embodiment of the invention trains with a combination of classification loss and distance loss, so that the finally trained network has the advantages of both the classification-loss-based and distance-loss-based methods: the training process is accelerated and the precision is improved. In addition, a quintuple method is adopted in the distance-loss step; compared with the traditional triplet, improved triplet and quadruplet methods, it can shorten the training time to about half, further shorten the intra-class distance, further lengthen the inter-class distance, and thereby further improve the precision.
FIG. 6 shows a schematic block diagram of a training system 600 for a pedestrian re-identification network in accordance with an embodiment of the present invention. The training system 600 for a pedestrian re-identification network includes a storage device 610 and a processor 620.
The storage device 610 stores program codes for implementing corresponding steps in the training method of the pedestrian re-recognition network according to the embodiment of the present invention. The processor 620 is configured to run the program codes stored in the storage device 610 to execute the corresponding steps of the training method of the pedestrian re-identification network according to the embodiment of the present invention, and is configured to implement the corresponding modules in the training device of the pedestrian re-identification network according to the embodiment of the present invention.
In one embodiment, the program code, when executed by the processor 620, causes the training system 600 for a pedestrian re-identification network to perform the steps of: pre-training a reference network by using classification loss; and optimizing the pre-trained reference network by combining the classification loss and the quintuple loss to obtain a pedestrian re-identification network.
In one embodiment, the pre-training of the reference network with classification loss performed by the training system 600 of the pedestrian re-identification network when the program code is executed by the processor 620 comprises: inputting a sample picture to the reference network; comparing a prediction vector output by the reference network for the sample picture with a label vector of the sample picture to obtain a classification loss; adjusting a parameter of the reference network based on the classification loss; and repeating the steps until the classification accuracy and the classification loss are basically unchanged.
In one embodiment, the reference network is a residual network.
In one embodiment, a pre-processing operation is performed on the sample picture prior to inputting the sample picture to the reference network.
In one embodiment, the tuning of the pre-trained reference network by joint classification loss and quintuple loss, performed by the training system 600 of the pedestrian re-identification network when the program code is executed by the processor 620, comprises: inputting five sample pictures of a quintuple according to preset requirements and order; calculating a classification loss for the prediction vector output by the reference network for each sample picture; calculating a quintuple loss for the feature vectors output by the reference network for the five sample pictures; and calculating a final loss as the loss of the pedestrian re-identification network based on the calculated classification loss and the calculated quintuple loss.
In one embodiment, the calculated classification loss is an average of the classification losses of the five sample pictures.
In one embodiment, the quintuple loss is defined as:
l_qt = d(positive sample 1, positive sample 2) - d(negative sample 1, negative sample 21) + d(negative sample 21, negative sample 22) - d(negative sample 1, positive sample 2) + a
where l_qt is the quintuple loss; positive sample 1, positive sample 2, negative sample 1, negative sample 21 and negative sample 22 are the five sample pictures, positive sample 1 and positive sample 2 being two different pictures of a first pedestrian, negative sample 1 being a picture of a second pedestrian, and negative sample 21 and negative sample 22 being two different pictures of a third pedestrian; d is the distance between the feature vectors of two pictures; and a is a constant parameter set as required.
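Under this definition, and assuming Euclidean distance for d over already-extracted feature vectors, the quintuple loss can be transcribed term by term as below. The margin value and whether a hinge (max(0, ·)) is applied in practice are not specified by the text, so this sketch omits the hinge:

```python
import numpy as np

def d(x, y):
    """Distance between the feature vectors of two pictures (Euclidean here)."""
    return float(np.linalg.norm(np.asarray(x, dtype=float)
                                - np.asarray(y, dtype=float)))

def quintuple_loss(pos1, pos2, neg1, neg21, neg22, a=0.5):
    """l_qt as defined above: pos1/pos2 are two pictures of the first
    pedestrian, neg1 a picture of the second, neg21/neg22 two pictures of
    the third; a is the constant parameter (0.5 is an arbitrary example)."""
    return (d(pos1, pos2) - d(neg1, neg21)
            + d(neg21, neg22) - d(neg1, pos2) + a)
```

Minimizing this loss pulls same-pedestrian pairs (pos1/pos2, neg21/neg22) together while pushing different-pedestrian pairs (neg1 versus neg21, neg1 versus pos2) apart.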
In one embodiment, the final loss is a weighted sum of the calculated classification loss and the calculated quintuple loss.
Furthermore, according to an embodiment of the present invention, there is also provided a storage medium on which program instructions are stored, which, when executed by a computer or a processor, are used for executing the corresponding steps of the training method of the pedestrian re-identification network according to an embodiment of the present invention, and for implementing the corresponding modules in the training device of the pedestrian re-identification network according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media. The computer-readable storage medium can be any combination of one or more computer-readable storage media, for example, one computer-readable storage medium containing computer-readable program code for pre-training a reference network with classification loss and another containing computer-readable program code for tuning the pre-trained reference network with joint classification loss and quintuple loss to arrive at a pedestrian re-identification network.
In one embodiment, the computer program instructions may implement the functional modules of the training apparatus of the pedestrian re-identification network according to the embodiment of the present invention when executed by a computer, and/or may execute the training method of the pedestrian re-identification network according to the embodiment of the present invention.
In one embodiment, the computer program instructions, when executed by a computer or processor, cause the computer or processor to perform the steps of: pre-training a reference network by using classification loss; and optimizing the pre-trained reference network by combining the classification loss and the quintuple loss to obtain a pedestrian re-identification network.
In one embodiment, the computer program instructions, which when executed by a computer or processor, cause the computer or processor to perform the pre-training of a reference network with classification losses comprises: inputting a sample picture to the reference network; comparing a prediction vector output by the reference network for the sample picture with a label vector of the sample picture to obtain a classification loss; adjusting a parameter of the reference network based on the classification loss; and repeating the steps until the classification accuracy and the classification loss are basically unchanged.
In one embodiment, the reference network is a residual network.
In one embodiment, a pre-processing operation is performed on the sample picture prior to inputting the sample picture to the reference network.
In one embodiment, the computer program instructions, when executed by a computer or processor, cause the computer or processor to perform the tuning of the pre-trained reference network by joint classification loss and quintuple loss, comprising: inputting five sample pictures of a quintuple according to preset requirements and order; calculating a classification loss for the prediction vector output by the reference network for each sample picture; calculating a quintuple loss for the feature vectors output by the reference network for the five sample pictures; and calculating a final loss as the loss of the pedestrian re-identification network based on the calculated classification loss and the calculated quintuple loss.
In one embodiment, the calculated classification loss is an average of the classification losses of the five sample pictures.
In one embodiment, the quintuple loss is defined as:
l_qt = d(positive sample 1, positive sample 2) - d(negative sample 1, negative sample 21) + d(negative sample 21, negative sample 22) - d(negative sample 1, positive sample 2) + a
where l_qt is the quintuple loss; positive sample 1, positive sample 2, negative sample 1, negative sample 21 and negative sample 22 are the five sample pictures, positive sample 1 and positive sample 2 being two different pictures of a first pedestrian, negative sample 1 being a picture of a second pedestrian, and negative sample 21 and negative sample 22 being two different pictures of a third pedestrian; d is the distance between the feature vectors of two pictures; and a is a constant parameter set as required.
In one embodiment, the final loss is a weighted sum of the calculated classification loss and the calculated quintuple loss.
The modules in the training apparatus of the pedestrian re-identification network according to the embodiment of the present invention may be implemented by a processor of an electronic device for training of the pedestrian re-identification network according to the embodiment of the present invention running computer program instructions stored in a memory, or may be implemented when computer instructions stored in a computer-readable storage medium of a computer program product according to the embodiment of the present invention are run by a computer.
According to the training method, device and system for the pedestrian re-identification network and the storage medium of the embodiments of the invention, training combines classification loss and distance loss, so that the finally trained network has the advantages of both the classification-loss-based and distance-loss-based methods: the training process is accelerated and the precision is improved. In addition, a quintuple method is adopted in the distance-loss step; compared with the traditional triplet, improved triplet and quadruplet methods, it can shorten the training time to about half, further shorten the intra-class distance, further lengthen the inter-class distance, and thereby further improve the precision.
The foregoing exemplarily describes a training method, apparatus, system and storage medium for a pedestrian re-recognition network according to the embodiments of the present invention. The invention also provides a pedestrian re-identification method, which adopts the pedestrian re-identification network trained by the training method of the pedestrian re-identification network described above to carry out pedestrian re-identification. The invention also provides a pedestrian re-identification device which is used for implementing the pedestrian re-identification method. The invention also provides a pedestrian re-identification system which comprises a storage device and a processor, wherein the storage device is stored with a computer program run by the processor, and the computer program executes the pedestrian re-identification method when being run by the processor. The invention also provides a storage medium, wherein a computer program is stored on the storage medium, and the computer program executes the pedestrian re-identification method when running. A person skilled in the art can understand the method, the apparatus, the system and the storage medium for pedestrian re-identification according to the embodiments of the present invention based on the foregoing training method, the apparatus, the system and the storage medium for pedestrian re-identification network according to the embodiments of the present invention, and details are not described herein for brevity.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the foregoing illustrative embodiments are merely exemplary and are not intended to limit the scope of the invention thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the method of the present invention should not be construed to reflect the intent: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some of the modules according to embodiments of the present invention. The present invention may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second, third, etc., does not indicate any ordering; these words may be interpreted as names.
The above description is only for the specific embodiment of the present invention or the description thereof, and the protection scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (17)

1. A training method of a pedestrian re-recognition network is characterized by comprising the following steps:
pre-training a reference network by using classification loss; and
optimizing the pre-trained reference network by combining the classification loss and the quintuple loss to obtain a pedestrian re-identification network;
the tuning of the pre-trained reference network by the joint classification loss and the quintuple loss comprises:
inputting five sample pictures of quintuple according to preset requirements and sequence;
calculating a classification loss for each of the prediction vectors output by the reference network for the sample picture;
calculating quintuple loss for the feature vectors output by the five sample pictures based on the reference network; and
calculating a final loss as a loss of the pedestrian re-identification network based on the calculated classification loss and the calculated quintuple loss.
2. The training method of claim 1, wherein the pre-training the reference network with the classification loss comprises:
inputting a sample picture to the reference network;
comparing a prediction vector output by the reference network for the sample picture with a label vector of the sample picture to obtain a classification loss;
adjusting a parameter of the reference network based on the classification loss; and
the above steps are repeated until the classification accuracy and the classification loss are substantially unchanged.
3. Training method according to claim 2, characterized in that the reference network is a residual network.
4. A training method as claimed in claim 3, wherein the sample picture is subjected to a pre-processing operation prior to being input to the reference network.
5. Training method according to claim 1, wherein the calculated classification loss is an average of the classification losses of the five sample pictures.
6. Training method according to claim 1, characterized in that the quintuple loss is defined as:
l_qt = d(positive sample 1, positive sample 2) - d(negative sample 1, negative sample 21) + d(negative sample 21, negative sample 22) - d(negative sample 1, positive sample 2) + a
where l_qt is the quintuple loss; positive sample 1, positive sample 2, negative sample 1, negative sample 21 and negative sample 22 are the five sample pictures, positive sample 1 and positive sample 2 being two different pictures of a first pedestrian, negative sample 1 being a picture of a second pedestrian, and negative sample 21 and negative sample 22 being two different pictures of a third pedestrian; d is the distance between the feature vectors of two pictures; and a is a constant parameter set as required.
7. A training method as claimed in any one of claims 1-6, wherein the final loss is a weighted sum of the calculated classification loss and the calculated quintuple loss.
8. A training device for a pedestrian re-recognition network, the training device comprising:
the pre-training module is used for pre-training the reference network by utilizing the classification loss; and
the tuning module is used for tuning the pre-trained reference network in combination with classification loss and quintuple loss to obtain a pedestrian re-identification network;
the tuning of the pre-trained reference network by the tuning module comprises:
inputting five sample pictures of quintuple according to preset requirements and sequence;
calculating a classification loss for each of the prediction vectors output by the reference network for the sample picture;
calculating quintuple loss for the feature vectors output by the five sample pictures based on the reference network; and
calculating a final loss as a loss of the pedestrian re-identification network based on the calculated classification loss and the calculated quintuple loss.
9. The training apparatus of claim 8, wherein the pre-training of the reference network by the pre-training module further comprises:
inputting a sample picture to the reference network;
comparing a prediction vector output by the reference network for the sample picture with a label vector of the sample picture to obtain a classification loss;
adjusting a parameter of the reference network based on the classification loss; and
the above operations are repeated until the classification accuracy and the classification loss are substantially unchanged.
10. Training apparatus according to claim 9, wherein the reference network is a residual network.
11. The training device of claim 10, wherein the pre-training module is further configured to: performing a pre-processing operation on the sample picture prior to inputting the sample picture to the reference network.
12. The training device of claim 8, wherein the calculated classification loss is an average of the classification losses of the five sample pictures.
13. The training apparatus of claim 8, wherein the quintuple loss is defined as:
l_qt = d(positive sample 1, positive sample 2) - d(negative sample 1, negative sample 21) + d(negative sample 21, negative sample 22) - d(negative sample 1, positive sample 2) + a
where l_qt is the quintuple loss; positive sample 1, positive sample 2, negative sample 1, negative sample 21 and negative sample 22 are the five sample pictures, positive sample 1 and positive sample 2 being two different pictures of a first pedestrian, negative sample 1 being a picture of a second pedestrian, and negative sample 21 and negative sample 22 being two different pictures of a third pedestrian; d is the distance between the feature vectors of two pictures; and a is a constant parameter set as required.
14. A training apparatus as claimed in any of claims 8-13, characterized in that the final loss is a weighted sum of the calculated classification loss and the calculated quintuple loss.
15. A pedestrian re-identification method, characterized in that the pedestrian re-identification method adopts a pedestrian re-identification network trained by the training method of the pedestrian re-identification network according to any one of claims 1 to 7 to perform pedestrian re-identification.
16. A computing system, characterized in that the system comprises a storage device and a processor, the storage device having stored thereon a computer program for execution by the processor, the computer program, when executed by the processor, performing a training method of a pedestrian re-identification network according to any one of claims 1-7 or performing a pedestrian re-identification method according to claim 15.
17. A storage medium, characterized in that the storage medium has stored thereon a computer program run by a processor, which computer program, when run by the processor, performs a training method of a pedestrian re-recognition network according to any one of claims 1 to 7 or performs a pedestrian re-recognition method according to claim 15.
CN201710906719.5A 2017-09-29 2017-09-29 Training of pedestrian re-recognition network and pedestrian re-recognition based on training Active CN108875487B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710906719.5A CN108875487B (en) 2017-09-29 2017-09-29 Training of pedestrian re-recognition network and pedestrian re-recognition based on training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710906719.5A CN108875487B (en) 2017-09-29 2017-09-29 Training of pedestrian re-recognition network and pedestrian re-recognition based on training

Publications (2)

Publication Number Publication Date
CN108875487A CN108875487A (en) 2018-11-23
CN108875487B true CN108875487B (en) 2021-06-15

Family

ID=64325776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710906719.5A Active CN108875487B (en) 2017-09-29 2017-09-29 Training of pedestrian re-recognition network and pedestrian re-recognition based on training

Country Status (1)

Country Link
CN (1) CN108875487B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321965B (en) * 2019-07-10 2021-06-18 腾讯科技(深圳)有限公司 Training method of object re-recognition model, and object re-recognition method and device
CN110688888B (en) * 2019-08-02 2022-08-05 杭州未名信科科技有限公司 Pedestrian attribute identification method and system based on deep learning
CN110728216A (en) * 2019-09-27 2020-01-24 西北工业大学 Unsupervised pedestrian re-identification method based on pedestrian attribute adaptive learning
CN111611846A (en) * 2020-03-31 2020-09-01 北京迈格威科技有限公司 Pedestrian re-identification method and device, electronic equipment and storage medium
CN111597876A (en) * 2020-04-01 2020-08-28 浙江工业大学 Cross-modal pedestrian re-identification method based on hard quintuplets
CN111738172B (en) * 2020-06-24 2021-02-12 中国科学院自动化研究所 Cross-domain target re-identification method based on adversarial feature learning and self-similarity clustering
CN111814655B (en) * 2020-07-03 2023-09-01 浙江大华技术股份有限公司 Target re-identification method, network training method thereof and related device
CN114550022A (en) 2020-11-25 2022-05-27 京东方科技集团股份有限公司 Model training method and device, electronic equipment and readable storage medium
CN114067356B (en) * 2021-10-21 2023-05-09 电子科技大学 Pedestrian re-recognition method based on combined local guidance and attribute clustering

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975959A (en) * 2016-06-14 2016-09-28 广州视源电子科技股份有限公司 Face characteristic extraction modeling method based on neural network, face identification method, face characteristic extraction modeling device and face identification device
CN106778527A (en) * 2016-11-28 2017-05-31 中通服公众信息产业股份有限公司 An improved neural-network pedestrian re-identification method based on triplet loss
WO2017123477A1 (en) * 2016-01-11 2017-07-20 Flir Systems, Inc. Vehicle based radar upsampling
CN107038448A (en) * 2017-03-01 2017-08-11 中国科学院自动化研究所 Target detection model building method
CN107103281A (en) * 2017-03-10 2017-08-29 中山大学 Face recognition method based on aggregated loss metric learning
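Several of the cited documents, e.g. CN106778527A, build on the triplet loss. As background only, and as an assumption about the general technique rather than the specific sample loss claimed in this patent, the standard margin-based triplet loss can be sketched as:

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 0.3) -> float:
    # Standard margin-based triplet loss: pull the anchor embedding
    # towards the positive sample (same pedestrian) and push it away
    # from the negative sample (different pedestrian) until the
    # anchor-negative distance exceeds the anchor-positive distance
    # by at least the margin.
    d_ap = float(np.linalg.norm(anchor - positive))
    d_an = float(np.linalg.norm(anchor - negative))
    return max(0.0, d_ap - d_an + margin)
```

Triplets whose negative is already far enough away (loss 0) contribute no gradient, which is why hard-sample mining is common in the cited re-identification literature.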

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203533B (en) * 2016-07-26 2019-09-20 厦门大学 Deep learning face verification method based on combined training
CN106250870B (en) * 2016-08-16 2019-10-08 电子科技大学 A pedestrian re-identification method based on joint local and global similarity metric learning

Also Published As

Publication number Publication date
CN108875487A (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN108875487B (en) Training of pedestrian re-recognition network and pedestrian re-recognition based on training
CN108710847B (en) Scene recognition method and device and electronic equipment
US11373390B2 (en) Generating scene graphs from digital images using external knowledge and image reconstruction
CN111797893B (en) Neural network training method, image classification system and related equipment
US8792722B2 (en) Hand gesture detection
US8750573B2 (en) Hand gesture detection
WO2020061489A1 (en) Training neural networks for vehicle re-identification
CN112101437B (en) Fine granularity classification model processing method based on image detection and related equipment thereof
US11250292B2 (en) Method and apparatus for generating information
CN109871821B (en) Pedestrian re-identification method, device, equipment and storage medium of self-adaptive network
CN110555428B (en) Pedestrian re-identification method, device, server and storage medium
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN111327949B (en) Video time sequence action detection method, device, equipment and storage medium
Lam et al. Evaluation of multiple features for violent scenes detection
Zhang et al. Fast face detection on mobile devices by leveraging global and local facial characteristics
CN106096028A Image-recognition-based historical relic retrieval method and device
CN113869282B (en) Face recognition method, hyper-resolution model training method and related equipment
CN110427915B (en) Method and apparatus for outputting information
Li et al. 3D-DETNet: a single stage video-based vehicle detector
CN111881777A (en) Video processing method and device
CN113361549A (en) Model updating method and related device
CN116611491A (en) Training method and device of target detection model, electronic equipment and storage medium
CN110490058B (en) Training method, device and system of pedestrian detection model and computer readable medium
CN113255604B (en) Pedestrian re-identification method, device, equipment and medium based on deep learning network
Wang et al. Object tracking based on Huber loss function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant