CN112446270A - Training method of pedestrian re-identification network, and pedestrian re-identification method and device - Google Patents

Training method of pedestrian re-identification network, and pedestrian re-identification method and device

Info

Publication number
CN112446270A
CN112446270A
Authority
CN
China
Prior art keywords
image
pedestrian
training
anchor point
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910839017.9A
Other languages
Chinese (zh)
Inventor
魏龙辉
张天宇
谢凌曦
田奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201910839017.9A priority Critical patent/CN112446270A/en
Priority to PCT/CN2020/113041 priority patent/WO2021043168A1/en
Publication of CN112446270A publication Critical patent/CN112446270A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The application provides a training method for a pedestrian re-identification network, a pedestrian re-identification method, and an apparatus. The application relates to the field of artificial intelligence, and in particular to the field of computer vision. The method comprises the following steps: acquiring M training images and label data of the M training images; initializing network parameters of a pedestrian re-identification network to obtain initial values of the network parameters; inputting a batch of the M training images into the pedestrian re-identification network for feature extraction to obtain a feature vector of each training image in the batch, determining a loss function according to the feature vectors of the batch of training images, and obtaining a pedestrian re-identification network meeting a preset requirement according to the function value of the loss function. A pedestrian re-identification network with better performance can be trained using only annotation data labeled per single image capturing device.

Description

Training method of pedestrian re-identification network, and pedestrian re-identification method and device
Technical Field
The present application relates to the field of computer vision, and more particularly, to a training method of a pedestrian re-recognition network, a pedestrian re-recognition method, and an apparatus.
Background
Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis and the military. It studies how to use cameras/image capturing devices and computers to acquire the data and information about a photographed object that we need. Figuratively speaking, computer vision equips a computer with eyes (a camera/image capturing device) and a brain (an algorithm) so that the computer can recognize, track and measure objects in place of human eyes, thereby enabling the computer to perceive the environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of making an artificial system "perceive" from images or multidimensional data. Generally speaking, computer vision uses various imaging systems instead of visual organs to obtain input information, and then uses a computer instead of a brain to process and interpret that information. The ultimate research goal of computer vision is to enable a computer to observe and understand the world through vision as a human does, and to adapt to the environment autonomously.
The monitoring field often involves the problem of pedestrian re-identification, a technique that uses computer vision to determine whether a particular pedestrian is present in an image or video sequence.
In the conventional scheme, a pedestrian re-identification network is trained with training data and annotation data that span multiple image capturing devices, so that the network learns to distinguish images of different pedestrians and can then perform pedestrian identification. However, the training data in the conventional scheme includes images of the same pedestrian captured by different image capturing devices, and these images need to be manually labeled so that the images of the same pedestrian captured by different devices are associated with each other (i.e., the pedestrian is associated across image capturing devices). In many scenes, associating pedestrians across image capturing devices is very difficult, and the difficulty grows greatly as the number of people and the number of image capturing devices increase. As a result, the economic cost of such data annotation is high and the labeling is very time-consuming.
Disclosure of Invention
The application provides a training method for a pedestrian re-identification network, a pedestrian re-identification method, and an apparatus, which aim to train a pedestrian re-identification network with better performance when only annotation data labeled per single image capturing device is available.
In a first aspect, a training method for a pedestrian re-identification network is provided, and the method includes:
step 1: acquiring training data;
the training data in the step 1 comprise M training images and labeling data of the M training images, wherein M is an integer greater than 1;
step 2: initializing the network parameters of the pedestrian re-identification network to obtain initial values of the network parameters of the pedestrian re-identification network;
repeating the following steps 3 to 5 until the pedestrian re-identification network meets the preset requirement;
step 3: inputting a batch of training images in the M training images into the pedestrian re-identification network for feature extraction to obtain a feature vector of each training image in the batch of training images;
step 4: determining a function value of a loss function according to the feature vectors of the batch of training images;
step 5: updating the network parameters of the pedestrian re-identification network according to the function value of the loss function.
In step 1, each of the M training images in the training data includes a pedestrian. The labeling data of each training image includes the bounding box in which the pedestrian is located in that image and pedestrian identification information; different pedestrians correspond to different pedestrian identification information, and among the M training images, training images with the same pedestrian identification information come from the same image capturing device. The M training images may be all of the training images used for training the pedestrian re-identification network; in a specific training process, a batch of training images may be selected from the M training images each time and input into the pedestrian re-identification network for processing.
The image capturing device may specifically be a video camera, a still camera, or the like capable of acquiring an image of a pedestrian.
The pedestrian identification information in step 1 may also be referred to as pedestrian identity information. It is information used to represent the identity of a pedestrian, and each pedestrian may correspond to unique pedestrian identification information. The pedestrian identification information may be represented in various ways, as long as it indicates the identity of the pedestrian; for example, it may be a pedestrian identity (ID), i.e., a unique ID may be assigned to each pedestrian.
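A minimal sketch of what the label data for one training image might look like is given below: a bounding box for the pedestrian, a pedestrian ID that is unique per identity, and the ID of the (single) image capturing device the image came from. The field names and values are illustrative assumptions, not taken from the patent.

    from dataclasses import dataclass

    @dataclass
    class TrainingLabel:
        image_path: str
        bbox: tuple          # (x, y, width, height) of the pedestrian in the image
        pedestrian_id: int   # unique per pedestrian; identities are labeled per camera only
        camera_id: int       # images with the same pedestrian_id share the same camera_id

    label = TrainingLabel("cam3/frame_000127.jpg", (52, 18, 64, 160), pedestrian_id=41, camera_id=3)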
In step 2, the network parameters of the pedestrian re-identification network may be set randomly to obtain the initial values of the network parameters of the pedestrian re-identification network.
In step 3, the batch of training images may include N anchor images, where the N anchor images are any N training images in the batch, and each of the N anchor images corresponds to one hardest positive sample image, one first hardest negative sample image and one second hardest negative sample image.
The hardest positive sample image, the first hardest negative sample image and the second hardest negative sample image corresponding to each anchor image are described below.
The hardest positive sample image corresponding to each anchor image: the training image in the batch that has the same pedestrian identification information as the anchor image and whose feature vector is farthest from the feature vector of the anchor image;
The first hardest negative sample image corresponding to each anchor image: the training image in the batch that comes from the same image capturing device as the anchor image, has pedestrian identification information different from that of the anchor image, and whose feature vector is closest to the feature vector of the anchor image;
The second hardest negative sample image corresponding to each anchor image: the training image in the batch that comes from a different image capturing device than the anchor image, has pedestrian identification information different from that of the anchor image, and whose feature vector is closest to the feature vector of the anchor image.
In step 4, the function value of the loss function is obtained by averaging the function values of N first loss functions, where the function value of each of the N first loss functions is calculated according to the first difference and the second difference corresponding to the respective anchor image among the N anchor images.
N is a positive integer, and N is less than M. When N is 1, there is only one function value of the first loss function, and the function value of the first loss function may be directly used as the function value of the loss function in step 4.
Optionally, the function value of each first loss function is a sum of the first difference value and the second difference value corresponding to each anchor point image.
Optionally, the function value of each first loss function is a sum of the first difference value, the second difference value and another constant term corresponding to each anchor point image.
The meaning of the first difference and the second difference and the respective distances forming the first difference and the second difference are explained below.
The first difference corresponding to each anchor image: the difference between the hardest positive sample distance corresponding to the anchor image and the second hardest negative sample distance corresponding to the anchor image;
The second difference corresponding to each anchor image: the difference between the second hardest negative sample distance corresponding to the anchor image and the first hardest negative sample distance corresponding to the anchor image;
The hardest positive sample distance corresponding to each anchor image: the distance between the feature vector of the hardest positive sample image corresponding to the anchor image and the feature vector of the anchor image;
The second hardest negative sample distance corresponding to each anchor image: the distance between the feature vector of the second hardest negative sample image corresponding to the anchor image and the feature vector of the anchor image;
The first hardest negative sample distance corresponding to each anchor image: the distance between the feature vector of the first hardest negative sample image corresponding to the anchor image and the feature vector of the anchor image.
In addition, in the present application, saying that several training images come from the same image capturing device means that those training images were captured by the same image capturing device.
In this application, the hardest negative sample images from both a different image capturing device and the same image capturing device are taken into account when constructing the loss function, and the first difference and the second difference are made as small as possible during training. In this way, the interference of image-capturing-device information with the image information can be eliminated as much as possible, so that the trained pedestrian re-identification network can extract features from images more accurately.
Specifically, during training of the pedestrian re-identification network, the network parameters are optimized so that the first difference and the second difference become as small as possible, i.e., the difference between the hardest positive sample distance and the second hardest negative sample distance, and the difference between the second hardest negative sample distance and the first hardest negative sample distance, are both made as small as possible. The pedestrian re-identification network thus learns to distinguish, as far as possible, the features of the hardest positive sample image from those of the second hardest negative sample image, and the features of the second hardest negative sample image from those of the first hardest negative sample image, so that the trained network extracts image features more accurately.
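A hedged sketch of how steps 2 to 5 could be repeated until a preset requirement is met, reusing the mine_hard_samples and reid_loss sketches above. The backbone (ResNet50), the optimizer, the learning rate, the feature dimension, the number of epochs and the loss threshold are illustrative assumptions, not specified by the patent.

    import torch
    from torchvision.models import resnet50

    def train_reid_network(dataloader, num_epochs=120, loss_threshold=0.01):
        net = resnet50(num_classes=128)          # step 2: randomly initialized feature extractor
        optimizer = torch.optim.Adam(net.parameters(), lr=3.5e-4)
        for epoch in range(num_epochs):
            for images, pids, camids in dataloader:               # step 3: a batch of the M training images
                feats = net(images)                               # feature vector of each training image
                d_ap, d_an_same, d_an_cross = mine_hard_samples(feats, pids, camids)
                loss = reid_loss(d_ap, d_an_same, d_an_cross)     # step 4: function value of the loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()                                  # step 5: update network parameters
                if loss.item() <= loss_threshold:                 # one possible preset requirement
                    return net
        return net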
With reference to the first aspect, in certain implementations of the first aspect, the pedestrian re-identification network meets the preset requirement when at least one of the following conditions (1) to (3) is satisfied:
(1) the training times of the pedestrian re-identification network are greater than or equal to the preset times;
(2) the function value of the loss function is smaller than or equal to a preset threshold value;
(3) the recognition performance of the pedestrian re-recognition network meets the preset requirement.
The preset threshold may be set empirically, and when the preset threshold is set too large, the pedestrian recognition effect of the pedestrian re-recognition network obtained by training may not be good enough, and when the preset threshold is set too small, the function value of the loss function may be difficult to converge during training.
Optionally, a value range of the preset threshold is [0, 0.01].
Specifically, the value of the preset threshold may be 0.01.
With reference to the first aspect, in certain implementations of the first aspect, the determining that the function value of the loss function is smaller than or equal to a preset threshold includes: the first difference is smaller than a first preset threshold value, and the second difference is smaller than a second preset threshold value.
The first preset threshold and the second preset threshold may also be determined empirically, where the pedestrian recognition effect of the pedestrian re-recognition network obtained by training may not be good enough when the first preset threshold and the second preset threshold are set too large, and the function value of the loss function may not be converged during training when the first preset threshold and the second preset threshold are set too small.
Optionally, a value range of the first preset threshold is [0, 0.4].
Optionally, a value range of the second preset threshold is [0, 0.4].
Specifically, both the first preset threshold and the second preset threshold may be 0.1.
With reference to the first aspect, in certain implementations of the first aspect, the M training images are training images from multiple image capturing devices, and label data of the training images from different image capturing devices are individually labeled.
That is to say, the images from each image capturing device can be labeled independently, without considering whether the same pedestrian appears across different image capturing devices. Specifically, if a plurality of images captured by image capturing device A include pedestrian X, then after the training images captured by device A are labeled, there is no need to search the images captured by other image capturing devices for images containing pedestrian X. Avoiding the search for the same pedestrian among images captured by different image capturing devices saves a large amount of labeling time and reduces the complexity of labeling.
In a second aspect, there is provided a pedestrian re-identification method, the method comprising: acquiring an image to be identified; processing the image to be recognized by utilizing a pedestrian re-recognition network to obtain a feature vector of the image to be recognized, wherein the pedestrian re-recognition network is obtained by training according to the training method of the first aspect; and comparing the feature vector of the image to be recognized with the feature vector of the existing pedestrian image to obtain the recognition result of the image to be recognized.
In the application, the pedestrian re-identification network trained by the training method of the first aspect can better extract the features, so that the pedestrian re-identification network trained by the training method of the first aspect can obtain a better pedestrian identification result when identifying the pedestrian.
With reference to the second aspect, in some implementation manners of the second aspect, the obtaining an identification result of the image to be identified by comparing the feature vector of the image to be identified with the feature vector of the existing pedestrian image includes: and outputting the target pedestrian image and the attribute information of the target pedestrian image.
The target pedestrian image may be a pedestrian image in which a feature vector in an existing pedestrian image is most similar to a feature vector of an image to be recognized, and the attribute information of the target pedestrian image includes shooting time and a shooting position of the target pedestrian image. In addition, the attribute information of the target pedestrian image may further include identity information of a pedestrian and the like.
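A hedged sketch of the second-aspect method: extract the feature vector of the image to be recognized and compare it with the feature vectors of existing pedestrian images, returning the most similar image together with its shooting time and position. The gallery format, the use of Euclidean distance and the function names are illustrative assumptions.

    import torch

    def recognize(net, query_image, gallery):
        """gallery: list of dicts with keys 'feature', 'image', 'time', 'position'."""
        net.eval()
        with torch.no_grad():
            q = net(query_image.unsqueeze(0)).squeeze(0)      # feature vector of the image to be recognized
        feats = torch.stack([g["feature"] for g in gallery])
        dists = torch.norm(feats - q, dim=1)                  # distance to every existing pedestrian image
        best = gallery[int(torch.argmin(dists))]              # target pedestrian image: most similar features
        return best["image"], {"time": best["time"], "position": best["position"]}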
In a third aspect, there is provided a training device for a pedestrian re-identification network, comprising means for performing the method of the first aspect.
In a fourth aspect, there is provided a pedestrian re-identification apparatus comprising means for performing the method of the second aspect.
In a fifth aspect, there is provided a training device for a pedestrian re-recognition network, the device comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of the first aspect when the program stored in the memory is executed.
In a sixth aspect, there is provided a pedestrian re-recognition apparatus comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of the second aspect when the program stored in the memory is executed.
In a seventh aspect, a computer device is provided, where the computer device includes the training apparatus of the pedestrian re-recognition network in the third aspect.
In the seventh aspect, the computer device may specifically be a server or a cloud device.
In an eighth aspect, an electronic device is provided, which includes the pedestrian re-recognition apparatus of the fourth aspect.
In the eighth aspect, the electronic device may be a mobile terminal (e.g., a smart phone), a tablet computer, a notebook computer, an augmented reality/virtual reality device, an in-vehicle terminal device, and the like.
In a ninth aspect, there is provided a computer readable storage medium having stored program code comprising instructions for performing the steps of the method of any one of the first or second aspects.
A tenth aspect provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of the first or second aspects.
In an eleventh aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform any one of the methods in the first aspect or the second aspect.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, and when the instructions are executed, the processor is configured to execute any one of the methods in the first aspect or the second aspect.
The chip may specifically be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
It should be understood that, in the present application, the method of the first aspect may specifically refer to the method of the first aspect and any one of various implementations of the first aspect, and the method of the second aspect may specifically refer to the method of the second aspect and any one of various implementations of the second aspect.
Drawings
Fig. 1 is a schematic structural diagram of a system architecture provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of pedestrian re-identification using a convolutional neural network model provided by an embodiment of the present application;
fig. 3 is a schematic diagram of a chip hardware structure according to an embodiment of the present disclosure;
FIG. 4 is a diagram of a system architecture provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of one possible application scenario of an embodiment of the present application;
FIG. 6 is a schematic flow chart illustrating an overall training method of a pedestrian re-identification network according to an embodiment of the present application;
FIG. 7 is a schematic flow chart diagram of a training method of a pedestrian re-identification network of an embodiment of the present application;
FIG. 8 is a schematic diagram of a process for determining a function value of a loss function;
FIG. 9 is a schematic flow chart diagram of a pedestrian re-identification method of an embodiment of the present application;
FIG. 10 is a schematic block diagram of a training apparatus of a pedestrian re-identification network of an embodiment of the present application;
FIG. 11 is a schematic block diagram of a training apparatus of a pedestrian re-identification network of an embodiment of the present application;
fig. 12 is a schematic block diagram of a pedestrian re-identification apparatus of an embodiment of the present application;
fig. 13 is a schematic block diagram of a pedestrian re-identification apparatus of an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The scheme of the application can be applied to the fields of city monitoring, safe cities and the like.
Specifically, the present application can be applied to a person-search scenario in an intelligent monitoring system, and the application in this scenario is introduced below.
Person search in an intelligent monitoring system:
taking an intelligent monitoring system deployed in a certain park as an example, the intelligent monitoring system can acquire images of pedestrians shot by various image shooting devices to form an image library. Next, the pedestrian re-recognition network (also referred to as a pedestrian re-recognition model) may be trained by using the images in the image library, so as to obtain a trained pedestrian re-recognition network.
Then, the trained pedestrian re-identification network can be used to extract feature vectors from the collected pedestrian images. When a pedestrian has a suspicious trajectory, or in other situations where a pedestrian needs to be tracked across cameras, the feature vector extracted by the pedestrian re-identification network from the collected pedestrian image can be compared with the feature vectors of the images in the image library; the pedestrian images with the most similar feature vectors are returned, together with basic information such as the shooting time and location of those images. After subsequent checking and screening, the person-search process is completed.
In the present application, the pedestrian re-identification network may be a neural network (model), and for better understanding of the present application, the related terms and concepts of the neural network will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes $x_s$ and an intercept of 1 as input, and its output may be as shown in formula (1):

$h_{W,b}(x) = f(W^T x + b) = f\left(\sum_{s=1}^{n} W_s x_s + b\right) \qquad (1)$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which is used to introduce a non-linear transformation into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field; the local receptive field may be a region composed of several neural units.
(2) Deep neural network
Deep Neural Networks (DNNs), also called multi-layer neural networks, can be understood as neural networks with multiple hidden layers. The DNNs are divided according to the positions of different layers, and the neural networks inside the DNNs can be divided into three categories: input layer, hidden layer, output layer. Generally, the first layer is an input layer, the last layer is an output layer, and the middle layers are hidden layers. The layers are all connected, that is, any neuron of the ith layer is necessarily connected with any neuron of the (i + 1) th layer.
Although a DNN appears complex, the work of each layer is actually not complex: it is simply the following linear relational expression: $\vec{y} = \alpha(W \cdot \vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called coefficients), and $\alpha()$ is the activation function. Each layer merely performs such a simple operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large. These parameters are defined in the DNN as follows. Taking the coefficient $W$ as an example, assume that in a three-layer DNN the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^3_{24}$. The superscript 3 represents the layer in which the coefficient $W$ is located, while the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the $k$-th neuron at layer $L-1$ to the $j$-th neuron at layer $L$ is defined as $W^L_{jk}$.
Note that the input layer is without the W parameter. In deep neural networks, more hidden layers make the network more able to depict complex situations in the real world. Theoretically, the more parameters the higher the model complexity, the larger the "capacity", which means that it can accomplish more complex learning tasks. The final goal of the process of training the deep neural network, i.e., learning the weight matrix, is to obtain the weight matrix (the weight matrix formed by the vectors W of many layers) of all the layers of the deep neural network that is trained.
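A one-layer illustration of the expression above, $\vec{y} = \alpha(W \cdot \vec{x} + \vec{b})$, is sketched below; the shapes are illustrative and ReLU is chosen here as the activation function purely for illustration.

    import numpy as np

    W = np.random.randn(5, 3)          # weight matrix of one layer (5 outputs, 3 inputs)
    b = np.random.randn(5)             # offset vector
    x = np.random.randn(3)             # input vector
    y = np.maximum(W @ x + b, 0.0)     # y = alpha(W.x + b), with alpha taken as ReLU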
(3) Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
(4) Residual network
The residual network is a deep convolutional network proposed in 2015, which is easier to optimize than the conventional convolutional neural network and can improve accuracy by increasing depth. The core of the residual network is to solve the side effect (degradation problem) caused by increasing the depth, so that network performance can be improved simply by increasing the network depth. The residual network generally includes many sub-modules with the same structure, and the name of a residual network (ResNet) is usually followed by a number indicating how many times the sub-module is repeated; for example, ResNet50 indicates that there are 50 sub-modules in the residual network.
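A minimal sketch of one residual sub-module of the kind repeated inside a residual network such as ResNet50; the layer sizes are illustrative and not taken from the patent.

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + x)   # skip connection that eases optimization at depth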
(6) Classifier
Many neural network architectures have a classifier for classifying objects in the image. The classifier is generally composed of a fully connected layer and a softmax function (also called a normalized exponential function), and can output the probabilities of different classes according to its input.
(7) Loss function
In the process of training a deep neural network, because it is hoped that the output of the network is as close as possible to the value that is really desired to be predicted, the weight vector of each layer can be updated according to the difference between the current predicted value of the network and the really desired target value (of course, an initialization process is usually performed before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the purpose of loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
(8) Back propagation algorithm
The neural network can adopt a Back Propagation (BP) algorithm to correct the numerical values of the parameters in the initial neural network model in the training process, so that the reconstruction error loss of the neural network model is smaller and smaller. Specifically, the error loss is generated by transmitting the input signal in the forward direction until the output, and the parameters in the initial neural network model are updated by reversely propagating the error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion with error loss as a dominant factor, aiming at obtaining the optimal parameters of the neural network model, such as a weight matrix.
The system architecture of the embodiment of the present application is described in detail below with reference to fig. 1.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in FIG. 1, the system architecture 100 includes an execution device 110, a training device 120, a database 130, a client device 140, a data storage system 150, and a data collection system 160.
In addition, the execution device 110 includes a calculation module 111, an I/O interface 112, a preprocessing module 113, and a preprocessing module 114. Wherein, the calculation module 111 may include the target model/rule 101, and the pre-processing module 113 and the pre-processing module 114 are optional.
The data acquisition device 160 is used to acquire training data. For the training method of the pedestrian re-identification network in the embodiment of the application, the training data may include M training images and label data of the M training images. After the training data is collected, data collection device 160 stores the training data in database 130, and training device 120 trains target model/rule 101 based on the training data maintained in database 130.
The following describes how the training device 120 obtains the target model/rule 101 based on the training data: the training device 120 performs feature extraction on the input training images to obtain their feature vectors, and repeats this process until the function value of the loss function meets a preset requirement (for example, is less than or equal to a preset threshold), thereby completing the training of the target model/rule 101.
It should be appreciated that the training of the target model/rule 101 described above may be an unsupervised training.
The above target model/rule 101 can be used to implement the pedestrian re-identification method of the embodiment of the present application, that is, inputting a pedestrian image (the pedestrian image may be an image that needs to be identified as a pedestrian) into the target model/rule 101, so as to obtain feature vectors extracted from the pedestrian image, perform pedestrian identification based on the extracted feature vectors, and determine the identification result of the pedestrian. The target model/rule 101 in the embodiment of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 130 may not necessarily all come from the collection of the data collection device 160, and may also be received from other devices. It should be noted that, the training device 120 does not necessarily perform the training of the target model/rule 101 based on the training data maintained by the database 130, and may also obtain the training data from the cloud or other places for performing the model training.
The target model/rule 101 obtained by training according to the training device 120 may be applied to different systems or devices, for example, the execution device 110 shown in fig. 1, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR), a vehicle-mounted terminal, or a server or a cloud. In fig. 1, the execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include: pedestrian images input by the client device. The client device 140 here may specifically be a monitoring device.
The preprocessing module 113 and the preprocessing module 114 are used for preprocessing input data (such as pedestrian images) received by the I/O interface 112, and in the embodiment of the present application, there may be no preprocessing module 113 and the preprocessing module 114 or only one preprocessing module. When the preprocessing module 113 and the preprocessing module 114 are not present, the input data may be directly processed using the calculation module 111.
In the process that the execution device 110 preprocesses the input data or in the process that the calculation module 111 of the execution device 110 executes the calculation or other related processes, the execution device 110 may call the data, the code, and the like in the data storage system 150 for corresponding processes, and may store the data, the instruction, and the like obtained by corresponding processes in the data storage system 150.
Finally, the I/O interface 112 presents the processing result (specifically, a high-quality image obtained by pedestrian re-recognition), such as the recognition result of the image to be recognized obtained by performing pedestrian re-recognition processing on the pedestrian image by the target model/rule 101, to the client device 140, so as to provide the user with the result.
Specifically, the high-quality image obtained by re-identifying the pedestrian through the object model/rule 101 in the calculation module 111 may be processed (for example, image rendering processing) by the preprocessing module 113 (or by the preprocessing module 114), and then the processing result is sent to the I/O interface, and then the I/O interface sends the processing result to the client device 140 for display.
It should be understood that when the preprocessing module 113 and the preprocessing module 114 are not present in the system architecture 100, the computing module 111 can also transmit the high-quality image obtained through the pedestrian re-identification process to the I/O interface, and then the I/O interface sends the processing result to the client device 140 for display.
It is worth noting that the training device 120 may be configured to generate corresponding target models/rules 101 based on different training data for different targets or different tasks (for example, the training device may be configured to train for real high-quality images and approximate low-quality images in different scenes), and the corresponding target models/rules 101 may be configured to achieve the targets or complete the tasks, so as to provide the user with a desired result.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
As shown in fig. 1, the target model/rule 101, which may be a neural network (model), is trained according to the training device 120. Specifically, the neural network (model) may be CNN, Deep Convolutional Neural Networks (DCNN), and the like.
Since CNN is a very common neural network, the structure of CNN will be described in detail below with reference to fig. 2. As described in the introduction of the basic concept above, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
As shown in fig. 2, a Convolutional Neural Network (CNN)200 may include an input layer 210, a convolutional/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230. The relevant contents of these layers are described in detail below.
Convolutional layer/pooling layer 220:
and (3) rolling layers:
the convolutional layer/pooling layer 220 shown in fig. 2 may include layers such as example 221 and 226, for example: in one implementation, 221 is a convolutional layer, 222 is a pooling layer, 223 is a convolutional layer, 224 is a pooling layer, 225 is a convolutional layer, 226 is a pooling layer; in another implementation, 221, 222 are convolutional layers, 223 is a pooling layer, 224, 225 are convolutional layers, and 226 is a pooling layer. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
The inner working principle of a convolutional layer will be described below by taking convolutional layer 221 as an example.
Convolutional layer 221 may include many convolution operators, also called convolution kernels. In image processing, a convolution operator acts as a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride), so as to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends over the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, a plurality of weight matrices of the same size (rows x columns), i.e., a plurality of matrices of the same type, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the "plurality" described above. Different weight matrices may be used to extract different features of the image, e.g., one weight matrix extracts image edge information, another weight matrix extracts a particular color of the image, and yet another weight matrix blurs unwanted noise in the image. The plurality of weight matrices have the same size (rows x columns), so the convolution feature maps extracted by them also have the same size, and the extracted convolution feature maps of the same size are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct prediction.
When convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (e.g., 221) tends to extract more general features, which may also be referred to as low-level features; as the depth of convolutional neural network 200 increases, the features extracted by later convolutional layers (e.g., 226) become more and more complex, such as features with high-level semantics, and features with higher-level semantics are more suitable for the problem to be solved.
A pooling layer:
since it is often desirable to reduce the number of training parameters, it is often desirable to periodically introduce pooling layers after the convolutional layer, where the layers 221-226, as illustrated by 220 in fig. 2, may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to smaller sized images. The average pooling operator may calculate pixel values in the image over a certain range to produce an average as a result of the average pooling. The max pooling operator may take the pixel with the largest value in a particular range as the result of the max pooling. In addition, just as the size of the weighting matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after the processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents an average value or a maximum value of a corresponding sub-region of the image input to the pooling layer.
Fully connected layer 230:
after processing by convolutional layer/pooling layer 220, convolutional neural network 200 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (required class information or other relevant information), the convolutional neural network 200 needs to generate one or a set of the required number of classes of output using the fully-connected layer 230. Accordingly, a plurality of hidden layers (231, 232 to 23n shown in fig. 2) and an output layer 240 may be included in the fully-connected layer 230, and parameters included in the hidden layers may be pre-trained according to the related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the fully connected layer 230, the last layer of the whole convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to the categorical cross-entropy, which is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 200 is completed (propagation from 210 to 240 in fig. 2 is the forward propagation), the backward propagation (propagation from 240 to 210 in fig. 2 is the backward propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, i.e., the error between the result output by the network through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
It should be understood that the convolutional neural network (CNN) 200 shown in fig. 2 may be used to perform the pedestrian re-identification method in the embodiment of the present application. As shown in fig. 2, after the pedestrian image is processed by the input layer 210, the convolutional layer/pooling layer 220 and the fully connected layer 230, the image features of the image to be recognized may be obtained, and the recognition result of the image to be recognized may then be obtained according to those image features.
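A hedged sketch of a small convolutional network with the layer types described above (convolution, pooling, fully connected), used here as a feature extractor that maps a pedestrian image to a feature vector; the architecture and dimensions are illustrative only and do not represent the network 200 of fig. 2.

    import torch.nn as nn

    class SmallReIDNet(nn.Module):
        def __init__(self, feat_dim=128):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # convolution + pooling
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # convolution + pooling
                nn.AdaptiveAvgPool2d(1),                                      # global average pooling
            )
            self.fc = nn.Linear(64, feat_dim)                                 # fully connected layer

        def forward(self, x):
            return self.fc(self.features(x).flatten(1))   # feature vector of the input image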
Fig. 3 is a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor 50. The chip may be provided in the execution device 110 as shown in fig. 1 to complete the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 as shown in fig. 1 to complete the training work of the training apparatus 120 and output the target model/rule 101. The algorithms for the various layers in the convolutional neural network shown in fig. 2 can all be implemented in a chip as shown in fig. 3.
A neural-network processing unit (NPU) 50 is mounted as a coprocessor onto a host CPU, and the host CPU allocates tasks to it. The core portion of the NPU is the arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract data from a memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuit 503 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 503 fetches the data corresponding to the matrix B from the weight memory 502 and buffers it in each PE in the arithmetic circuit 503. The arithmetic circuit 503 takes the matrix a data from the input memory 501 and performs matrix operation with the matrix B, and partial or final results of the obtained matrix are stored in an accumulator (accumulator) 508.
The vector calculation unit 507 may further process the output of the operation circuit 503, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculation of non-convolution/non-FC layers in a neural network, such as pooling (Pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit 507 can store the processed output vector to the unified buffer 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503 (such as a vector of accumulated values) to generate activation values. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuit 503, for example for use in subsequent layers of the neural network.
The unified memory 506 is used to store input data as well as output data.
A direct memory access controller (DMAC) 505 is used to transfer input data in the external memory to the input memory 501 and/or the unified memory 506, to store weight data from the external memory into the weight memory 502, and to store data from the unified memory 506 into the external memory.
A Bus Interface Unit (BIU) 510, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 509 through a bus.
An instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504;
the controller 504 is configured to call the instruction cached in the instruction storage 509 to implement controlling the working process of the operation accelerator.
Generally, the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are on-chip memories, the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.
In addition, in the present application, the operations of the layers in the convolutional neural network shown in fig. 2 may be performed by the operation circuit 503 or the vector calculation unit 507.
As shown in fig. 4, the present embodiment provides a system architecture 300. The system architecture includes a local device 301, a local device 302, and an execution device 210 and a data storage system 250, wherein the local device 301 and the local device 302 are connected with the execution device 210 through a communication network.
The execution device 210 may be implemented by one or more servers. Optionally, the execution device 210 may be used with other computing devices, such as: data storage, routers, load balancers, and the like. The execution device 210 may be disposed on one physical site or distributed across multiple physical sites. The execution device 210 may use data in the data storage system 250 or call program code in the data storage system 250 to implement the pedestrian re-identification method of the embodiment of the present application.
The user may operate respective user devices (e.g., local device 301 and local device 302) to interact with the execution device 210. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car, media consumption device, wearable device, set-top box, gaming console, and so on.
The local device of each user may interact with the execution device 210 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, or any combination thereof.
In one implementation, the local device 301 and the local device 302 acquire the relevant parameters of the target neural network from the execution device 210, deploy the target neural network on the local device 301 and the local device 302, and perform pedestrian re-identification by using the target neural network.
In another implementation, the target neural network may be deployed directly on the execution device 210. The execution device 210 acquires pedestrian images from the local device 301 and the local device 302 (the local device 301 and the local device 302 may upload the pedestrian images to the execution device 210), performs pedestrian re-identification on the pedestrian images according to the target neural network, and transmits the recognition results obtained by pedestrian re-identification back to the local device 301 and the local device 302.
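By way of illustration only, the interaction in this second deployment mode could be sketched as follows in Python; the endpoint URL, field names and response format are assumptions made for this sketch and are not defined in this application.

```python
# Illustrative sketch only: the endpoint, field names and response format
# are assumptions, not part of this application.
import requests

def query_execution_device(image_path, server_url="http://execution-device.example.com/reid"):
    """Upload a pedestrian image to the execution device and return its answer."""
    with open(image_path, "rb") as f:
        # The execution device runs the target neural network and performs
        # the feature comparison on its side.
        response = requests.post(server_url, files={"image": f}, timeout=10)
    response.raise_for_status()
    return response.json()  # e.g. {"person_id": ..., "time": ..., "location": ...}

# result = query_execution_device("pedestrian.jpg")
```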
The execution device 210 may also be referred to as a cloud device, and in this case, the execution device 210 is generally deployed in the cloud.
Fig. 5 is a schematic diagram of a possible application scenario of an embodiment of the present application.
As shown in fig. 5, in the present application, a pedestrian re-identification network may be trained with single image capturing device annotation data to obtain a trained pedestrian re-identification network. The trained pedestrian re-identification network may process a pedestrian image to obtain a feature vector of the pedestrian image, and a person to be searched for may then be found by comparing the feature vector of the pedestrian image with the feature vectors in the image library. Specifically, the target pedestrian image whose feature vector is most similar to that of the pedestrian image can be found through feature comparison, and basic information such as the shooting time and position of the target pedestrian image is output.
It should be understood that the feature vectors of the respective pedestrian images and the related information of the pedestrians corresponding to the pedestrian images are stored in the image library.
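By way of illustration, the feature comparison against the image library described above can be sketched as a nearest-neighbour search; the array layout and the metadata fields below are assumptions made only for this example.

```python
import numpy as np

def search_image_library(query_feature, library_features, library_info):
    """Return the library entry whose feature vector is closest to the query.

    query_feature:    (D,) feature vector of the pedestrian image to search for
    library_features: (G, D) feature vectors stored in the image library
    library_info:     list of G dicts with, e.g., shooting time and position
    """
    # Euclidean distance between the query and every stored feature vector
    distances = np.linalg.norm(library_features - query_feature, axis=1)
    best = int(np.argmin(distances))        # most similar pedestrian image
    return library_info[best], float(distances[best])

# Hypothetical usage:
# info, dist = search_image_library(f_query, gallery_feats, gallery_meta)
# print(info["time"], info["position"])
```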
It should be understood that the single image capturing device annotation data herein can include a plurality of training images and annotation data of the plurality of training images. Single image capturing device annotation data means that the training images acquired by each image capturing device are annotated independently, without searching across different image capturing devices for the same pedestrians. This annotation mode does not need to consider the relationship among training images captured by different image capturing devices, which saves a large amount of annotation time and reduces annotation complexity. The plurality of training images and the annotation data of the plurality of training images may be collectively referred to as training data.
Fig. 6 is a general flowchart of a training method of a pedestrian re-identification network according to an embodiment of the present application.
As shown in fig. 6, single image capturing device annotation data can be obtained by performing data annotation individually on the video acquired by each image capturing device. The greatest advantage of single image capturing device annotation data is that it is easy to annotate and collect; in the present application, such data does not require the same pedestrian to appear under a plurality of image capturing devices.
In the single image capturing device annotation data, each pedestrian is assumed to appear under only one image capturing device (or one image capturing device group). Therefore, after pedestrian images are obtained from the video by pedestrian detection and tracking, only a small amount of manual work is needed to associate the multiple images of the same person in nearby frames to form the annotation. Moreover, the annotations of each image capturing device are independent of one another, and the pedestrian numbers under different image capturing devices do not overlap. By setting different acquisition time periods for different image capturing devices, the number of people appearing repeatedly in the videos acquired by the respective image capturing devices is small, which meets the requirement of single image capturing device annotation data.
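By way of illustration, the non-overlapping numbering of pedestrians under different image capturing devices can be made explicit by offsetting per-camera labels into a global numbering, as in the following sketch; the raw annotation format assumed here (per-camera integer IDs) is illustrative only.

```python
def merge_single_camera_annotations(per_camera_labels):
    """Merge per-camera pedestrian IDs into globally unique IDs.

    per_camera_labels: dict mapping camera_id -> list of (image_path, local_pid)
    Returns a list of (image_path, global_pid, camera_id) records.
    """
    records, offset = [], 0
    for camera_id, annotations in sorted(per_camera_labels.items()):
        local_ids = sorted({pid for _, pid in annotations})
        remap = {pid: offset + i for i, pid in enumerate(local_ids)}
        for image_path, pid in annotations:
            records.append((image_path, remap[pid], camera_id))
        offset += len(local_ids)  # IDs of the next camera start after this one
    return records
```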
In some relatively small scenes (e.g., an office park), most people have a small range of motion, and a considerable number of people appear under only one image capture device group, which naturally meets this requirement. Because the fields of view of the cameras within one image capture device group are close to each other or overlap, the lighting conditions are similar, and the group can be treated substantially as a single camera.
After the single image shooting equipment marking data are obtained, the single image shooting equipment marking data can be used for training a pedestrian re-identification network (model), and the pedestrian re-identification network obtained through training can be used for testing and deployment. Specifically, the trained pedestrian re-identification network can be used for executing the pedestrian re-identification method of the embodiment of the application.
Fig. 7 is a schematic flowchart of a training method of a pedestrian re-identification network according to an embodiment of the present application. The method shown in fig. 7 may be performed by a training device of a pedestrian re-identification network according to an embodiment of the present application (for example, may be performed by the devices shown in fig. 10 and 11), and the method shown in fig. 7 includes steps 1001 to 1008, which are described in detail below.
1001. And starting.
Step 1001 represents the start of the training process of the pedestrian re-identification network.
1002. Training data is acquired.
The training data in the step 1002 includes M training images (M is an integer greater than 1) and labeling data of the M training images, where in the M training images, each training image includes a pedestrian, the labeling data of each training image includes a bounding box where the pedestrian is located in each training image and pedestrian identification information, different pedestrians correspond to different pedestrian identification information, and in the M training images, the training images with the same pedestrian identification information are from the same image capturing device.
The image capturing device may specifically be a video camera, a still camera, or the like capable of acquiring an image of a pedestrian.
The pedestrian identification information in step 1002 may also be referred to as pedestrian identity information; it is information used to represent the identity of a pedestrian, and each pedestrian may correspond to unique pedestrian identification information. The pedestrian identification information may be represented in various manners as long as it can indicate the identity of the pedestrian; for example, the pedestrian identification information may specifically be a pedestrian identity (ID), that is, a unique ID may be assigned to each pedestrian.
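By way of illustration, one annotated training image as described in step 1002 could be represented in memory as follows; the field names are assumptions made only for this sketch.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TrainingSample:
    image_path: str                      # path of the training image
    bbox: Tuple[int, int, int, int]      # bounding box (x, y, width, height) of the pedestrian
    pedestrian_id: int                   # pedestrian identification information (unique per person)
    camera_id: int                       # image capturing device that captured this image

# Under single image capturing device annotation, all samples sharing a
# pedestrian_id also share the same camera_id.
```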
1003. And initializing the network parameters of the pedestrian re-identification network to obtain initial values of the network parameters of the pedestrian re-identification network.
In the step 1003, the network parameters of the pedestrian re-identification network may be randomly set to obtain initial values of the network parameters of the pedestrian re-identification network.
1004. And inputting a batch of training images in the M training images into a pedestrian re-identification network for feature extraction to obtain a feature vector of each training image in the batch of training images.
The batch of training images is part of the M training images. When the M training images are used to train the pedestrian re-identification network, the M training images may be divided into different batches, and the number of training images in each batch may be the same or different.
For example, there are 5000 training images, and 100 training images can be input in each batch to train the pedestrian re-identification network.
The batch of training images may include N anchor images, where the N anchor images are any N training images in the batch of training images, each anchor image in the N anchor images corresponds to a most difficult positive sample image, a first most difficult negative sample image and a second most difficult negative sample image, N is a positive integer, and N is less than M.
The most difficult positive sample image, the first most difficult negative sample image and the second most difficult negative sample image corresponding to each anchor point image are described below.
The most difficult positive sample image corresponding to each anchor point image: the training image in the batch that has the same pedestrian identification information as the anchor point image and whose feature vector is farthest from the feature vector of the anchor point image;
the first most difficult negative sample image corresponding to each anchor point image: the training image in the batch that comes from the same image capturing device as the anchor point image, has pedestrian identification information different from that of the anchor point image, and whose feature vector is closest to the feature vector of the anchor point image;
the second most difficult negative sample image corresponding to each anchor point image: the training image in the batch that comes from a different image capturing device than the anchor point image, has pedestrian identification information different from that of the anchor point image, and whose feature vector is closest to the feature vector of the anchor point image.
1005. And determining the function value of the loss function according to the characteristic vectors of the training images.
The function value of the loss function in the above step 1005 is obtained by averaging the function values of the N first loss functions.
The function value of each of the N first loss functions is calculated according to the first difference value and the second difference value corresponding to each of the N anchor point images.
Optionally, the function value of each first loss function is a sum of the first difference value and the second difference value corresponding to each anchor point image.
N is a positive integer, and N is less than M. When N is 1, there is only one function value of the first loss function, and the function value of the first loss function may be directly used as the function value of the loss function in step 1005.
For example, the function value of the first loss function may be as shown in equation (2).
L1=D1+D2 (2)
Where L1 represents the function value of the first loss function, D1 represents the first difference, and D2 represents the second difference.
Optionally, the function value of each first loss function is a sum of an absolute value of the first difference and an absolute value of the second difference corresponding to each anchor point image.
For example, the function value of the first loss function may be as shown in equation (3).
L1=|D1|+|D2| (3)
Where L1 represents a function value of the first loss function, | D1| represents an absolute value of the first difference, and | D2| represents an absolute value of the second difference.
Optionally, the function value of each first loss function is a sum of the first difference value, the second difference value and another constant term corresponding to each anchor point image.
For example, the function value of the first loss function may be as shown in equation (4).
L1=D1+D2+m (4)
Where L1 represents a function value of the first loss function, D1 represents the first difference, D2 represents the second difference, m represents a constant, and the magnitude of m can be set to an appropriate value empirically.
As another example, the function value of the first loss function may be as shown in equation (5).
L1=|m1+D1|+|m2+D2| (5)
Where L1 represents the function value of the first loss function, D1 represents the first difference, D2 represents the second difference, and m1 and m2 are constants that can be set to appropriate values empirically.
It should be understood that taking the absolute values of the terms involving D1 and D2 when calculating the function value of the first loss function is only one optional implementation. In practice, other operations may be performed on D1 and D2 when determining the function value of the first loss function; for example, the [X]+ operation (which may be referred to as taking the positive part) may be applied to the terms involving D1 and D2.
Here, [X]+ = X when X is greater than 0, and [X]+ = 0 when X is less than 0 (see, in particular, https://en.wikipedia.org/wiki/Positive_and_negative_parts).
The meaning of the first difference and the second difference will be explained below.
The first difference value corresponding to each anchor point image: the difference between the distance of the hardest positive sample corresponding to each anchor point image and the distance of the second hardest negative sample corresponding to each anchor point image;
the second difference value corresponding to each anchor point image: a difference between the second most difficult negative sample distance corresponding to each anchor image and the first most difficult negative sample distance corresponding to each anchor image.
The first difference and the second difference are obtained by subtracting different distances. The meaning of these distances is explained below.
The hardest positive sample distance corresponding to each anchor point image: the distance between the feature vector of the hardest positive sample image corresponding to each anchor point image and the feature vector of each anchor point image;
the second most difficult negative sample distance corresponding to each anchor image: the distance between the feature vector of the second most difficult negative sample image corresponding to each anchor point image and the feature vector of each anchor point image;
the first most difficult negative sample distance corresponding to each anchor image: and the distance between the feature vector of the first most difficult negative sample image corresponding to each anchor point image and the feature vector of each anchor point image.
Specifically, if the number of image capturing devices corresponding to one batch of images in the training process is C, the number of pedestrians under each image capturing device is P, and the number of images of each pedestrian is K, the number of images of one batch is C × P × K. Denote an anchor point image in the batch as a, denote f(x) as the feature (vector) output by the network model for an image x, and denote ||f1 − f2|| as the distance between two features f1 and f2. The above-mentioned hardest positive sample distance can be as shown in equation (6):
Dhp(a) = max ||f(a) − f(x)||, taken over the images x in the batch that have the same pedestrian identification information as a (6)
The second hardest negative sample distance may be as shown in equation (7):
Dhn2(a) = min ||f(a) − f(x)||, taken over the images x in the batch that come from a different image capturing device than a and have pedestrian identification information different from that of a (7)
The first hardest negative sample distance may be as shown in equation (8):
Dhn1(a) = min ||f(a) − f(x)||, taken over the images x in the batch that come from the same image capturing device as a and have pedestrian identification information different from that of a (8)
The first difference value may be a difference between formula (6) and formula (7), and the second difference value may be a difference between formula (7) and formula (8).
The loss function formed by the first difference and the second difference may be as shown in equation (9), where L represents the loss function and m1 and m2 are two constants whose specific values can be set according to experience, e.g. m1 = 0.1 and m2 = 0.1:
L = mean, over the anchor point images a in the batch, of ( [m1 + Dhp(a) − Dhn2(a)]+ + [m2 + Dhn2(a) − Dhn1(a)]+ ) (9)
In the above formula (9), [m1 + Dhp(a) − Dhn2(a)]+ denotes the [X]+ operation applied to m1 + Dhp(a) − Dhn2(a): when the value of m1 + Dhp(a) − Dhn2(a) is greater than or equal to 0, [m1 + Dhp(a) − Dhn2(a)]+ is equal to that value; when the value is less than 0, [m1 + Dhp(a) − Dhn2(a)]+ is 0.
Likewise, [m2 + Dhn2(a) − Dhn1(a)]+ denotes the [X]+ operation applied to m2 + Dhn2(a) − Dhn1(a): when the value of m2 + Dhn2(a) − Dhn1(a) is greater than or equal to 0, [m2 + Dhn2(a) − Dhn1(a)]+ is equal to that value; when the value is less than 0, [m2 + Dhn2(a) − Dhn1(a)]+ is 0.
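By way of illustration, a vectorized sketch of the loss in equation (9) for one batch follows; it treats every image in the batch as an anchor point image, uses the [X]+ operation described above, and assumes that, for every anchor, the batch contains at least one positive sample as well as negatives from both the same and a different image capturing device.

```python
import torch

def single_camera_triplet_loss(feats, pids, cams, m1=0.1, m2=0.1):
    """Sketch of equation (9): mean over anchors of
    [m1 + Dhp - Dhn2]+ + [m2 + Dhn2 - Dhn1]+.
    feats: (B, D) feature vectors; pids, cams: (B,) labels.
    """
    dist = torch.cdist(feats, feats)                              # (B, B) pairwise Euclidean distances
    same_id = pids.unsqueeze(0) == pids.unsqueeze(1)
    same_cam = cams.unsqueeze(0) == cams.unsqueeze(1)
    eye = torch.eye(len(pids), dtype=torch.bool, device=feats.device)

    # Hardest positive: same pedestrian (excluding the anchor itself), farthest feature
    d_hp = dist.masked_fill(~same_id | eye, float("-inf")).max(dim=1).values
    # First hardest negative: different pedestrian, same camera, closest feature
    d_hn1 = dist.masked_fill(same_id | ~same_cam, float("inf")).min(dim=1).values
    # Second hardest negative: different pedestrian, different camera, closest feature
    d_hn2 = dist.masked_fill(same_id | same_cam, float("inf")).min(dim=1).values

    first_term = torch.clamp(m1 + d_hp - d_hn2, min=0)    # [m1 + first difference]+
    second_term = torch.clamp(m2 + d_hn2 - d_hn1, min=0)  # [m2 + second difference]+
    return (first_term + second_term).mean()
```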
1006. And updating the network parameters of the pedestrian re-identification network according to the function values of the loss functions.
Specifically, the network parameters of the pedestrian re-identification network may be updated according to the function values of the loss functions shown in the above equation (9). And the function value of the loss function shown in the formula (9) is made smaller and smaller in the updating process.
1007. And determining whether the pedestrian re-identification network meets the preset requirement.
Optionally, the pedestrian re-identification network meets a preset requirement, including: the pedestrian re-identification network satisfies at least one of the following conditions:
(1) the pedestrian recognition performance of the pedestrian re-recognition network meets the preset performance requirement;
(2) the updating times of the network parameters of the pedestrian re-identification network are greater than or equal to the preset times;
(3) the function value of the loss function is less than or equal to a preset threshold.
In step 1007, when the pedestrian re-identification network meets at least one of the above conditions (1) to (3), it may be determined that the pedestrian re-identification network meets the preset requirement, step 1008 is executed, and the training process of the pedestrian re-identification network is ended; when the pedestrian re-identification network does not satisfy any of the above conditions (1) to (3), it indicates that the pedestrian re-identification network does not satisfy the preset requirement, and it is necessary to continue training the pedestrian re-identification network, that is, to re-execute steps 1004 to 1007 until the pedestrian re-identification network satisfying the preset requirement is obtained.
The preset threshold may be set empirically, and when the preset threshold is set too large, the pedestrian recognition effect of the pedestrian re-recognition network obtained by training may not be good enough, and when the preset threshold is set too small, the function value of the loss function may be difficult to converge during training.
Optionally, a value range of the preset threshold is [0, 0.01 ].
Specifically, the value of the preset threshold may be 0.01.
The function value of the loss function is less than or equal to a preset threshold, and specifically includes: the first difference is smaller than a first preset threshold value, and the second difference is smaller than a second preset threshold value.
The first preset threshold and the second preset threshold may also be determined empirically, where the pedestrian recognition effect of the pedestrian re-recognition network obtained by training may not be good enough when the first preset threshold and the second preset threshold are set too large, and the function value of the loss function may not be converged during training when the first preset threshold and the second preset threshold are set too small.
Optionally, a value range of the first preset threshold is [0, 0.4].
Optionally, a value range of the second preset threshold is [0, 0.4].
Specifically, both the first preset threshold and the second preset threshold may be 0.1.
1008. And finishing the training.
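By way of illustration, the overall procedure of steps 1003 to 1008 can be summarized in a training loop such as the following sketch; the model, data loader and hyper-parameter names are placeholders, and the stopping conditions mirror those listed in step 1007.

```python
import torch

def train_reid_network(model, loader, loss_fn, max_updates=20000, loss_threshold=0.01, lr=2e-4):
    """Sketch of steps 1003-1008: repeat feature extraction, loss computation and
    parameter updates until a preset requirement is met. loss_fn can be, e.g.,
    the single_camera_triplet_loss sketch above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # step 1003: parameters start from their random initialization
    updates = 0
    for images, pids, cams in loader:          # one batch of the M training images (step 1004)
        feats = model(images)                   # feature vector of each training image
        loss = loss_fn(feats, pids, cams)       # step 1005: function value of the loss function
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                        # step 1006: update the network parameters
        updates += 1
        # Step 1007: stop once the update count or the loss meets the preset requirement
        if updates >= max_updates or loss.item() <= loss_threshold:
            break
    return model
```

In practice the data loader would be iterated for multiple epochs; the single pass above is kept only to keep the sketch short.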
In addition, in the present application, that several training images are from the same image capturing apparatus means that the several training images are captured by the same image capturing apparatus.
In the method and the device, the most difficult negative sample images from different image shooting devices and the same image shooting device are considered in the process of constructing the loss function, and the first difference value and the second difference value are reduced as much as possible in the training process, so that the interference of the information of the image shooting devices to the image information can be eliminated as much as possible, and the trained pedestrian re-identification network can extract the features from the images more accurately.
Specifically, in the training process of the pedestrian re-identification network, the network parameters are optimized so that the first difference value and the second difference value become as small as possible, that is, the difference between the hardest positive sample distance and the second hardest negative sample distance and the difference between the second hardest negative sample distance and the first hardest negative sample distance are made as small as possible. In this way, the pedestrian re-identification network is driven to distinguish, as much as possible, the features of the hardest positive sample image from those of the second hardest negative sample image, and the features of the second hardest negative sample image from those of the first hardest negative sample image, so that the trained pedestrian re-identification network can extract image features better and more accurately.
The process of determining the function value of the loss function from a batch of training images in steps 1004 and 1005 described above is described in detail below with reference to fig. 8.
As shown in fig. 8, after a batch of training images is input to the pedestrian re-recognition network, feature vectors of the batch of training images can be obtained. Next, a plurality of anchor images may be selected from the batch of training images, and a corresponding most difficult positive sample image, a first most difficult negative sample image, and a second most difficult negative sample image may be determined for each anchor image of the plurality of anchor images.
Thus, a plurality of training image groups consisting of four training images (anchor images, the hardest positive sample image corresponding to the anchor image, the first hardest negative sample image corresponding to the anchor image and the second hardest negative sample image corresponding to the anchor image) can be obtained, and then a first loss function can be determined according to the distance relationship among the feature vectors of the training images in each training image group.
As shown in fig. 8, there are N training image groups in total, N first loss functions can be determined from the N training image groups in total, and then the function values of the N first loss functions are averaged to obtain the function value of the loss function in step 1005.
It should be understood that the N training image groups contain N anchor point images, which are different from one another; that is, each training image group corresponds to a unique anchor point image. However, the other training images (other than the anchor point image) contained in different training image groups may be the same. For example, the hardest positive sample image in the first training image group may be the same as the hardest positive sample image in the second training image group.
For another example, assuming that the batch contains 100 training images, 10 anchor point images (or another number; 10 is used only for illustration) may be selected from the 100 training images, and the corresponding hardest positive sample image, first hardest negative sample image and second hardest negative sample image are then selected for each anchor point image from the 100 training images. This yields 10 training image groups, from which the function values of 10 first loss functions are obtained; the function value of the loss function in step 1005 is then obtained by averaging the function values of these 10 first loss functions.
The design and training process of the pedestrian re-identification network will be described in detail below.
The pedestrian re-identification network in the present application may use an existing residual network (for example, ResNet50) as the network body, remove the last fully connected layer, and add a global average pooling (global average pooling) layer after the last residual block (ResBlock), so that the network model outputs a feature vector of 2048 dimensions (or another size).
In each batch of training images, 4 pedestrians may be sampled for each camera and 8 images may be sampled for each pedestrian; if a pedestrian has fewer than 8 images, images are sampled repeatedly to fill the 8.
When the pedestrian re-identification network is trained, the above equation (9) can be used as the loss function. When testing, the number of cameras may differ between data sets; for example, the DukeMTMC-reID data set has 8 cameras, in which case C in equation (9) is 8, while the Market-1501 data set has 6 cameras, in which case C in equation (9) is 6.
Two parameters in the loss function shown in the above equation (9) may be set to m1 = 0.1 and m2 = 0.1. The input training images can be scaled to 256 × 128 pixels, the network parameters can be trained using an adaptive moment estimation (Adam) optimizer, and the learning rate can be set to 2 × 10^-4. After 100 rounds of training, the learning rate decays exponentially, reaching 2 × 10^-7 after 200 rounds, at which point training may be stopped.
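By way of illustration, the backbone and training hyper-parameters described above could be set up roughly as follows; the exponential decay schedule is an approximation of the description, and the torchvision API usage (recent versions) is an assumption of this sketch.

```python
import torch
import torch.nn as nn
from torchvision import models

class ReIDNetwork(nn.Module):
    """ResNet50 body, last fully connected layer removed, global average pooling output."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)                     # torchvision >= 0.13 API assumed
        self.body = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool and fc
        self.pool = nn.AdaptiveAvgPool2d(1)                          # global average pooling

    def forward(self, x):                                            # x: (B, 3, 256, 128)
        return self.pool(self.body(x)).flatten(1)                    # (B, 2048) feature vector

model = ReIDNetwork()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# Approximate exponential decay from 2e-4 to 2e-7 over rounds 100 to 200;
# scheduler.step() would only be called from round 100 onward.
gamma = (2e-7 / 2e-4) ** (1 / 100)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
```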
The pedestrian re-identification network obtained by training according to the training method of the pedestrian re-identification network in the embodiment of the present application can be used for executing the pedestrian re-identification method in the embodiment of the present application, and the pedestrian re-identification method in the embodiment of the present application is described below with reference to the accompanying drawings.
Fig. 9 is a schematic flowchart of a pedestrian re-identification method according to an embodiment of the present application. The pedestrian re-identification method shown in fig. 9 may be performed by the pedestrian re-identification apparatus of the embodiment of the present application (for example, may be performed by the apparatuses shown in fig. 12 and 13), and the pedestrian re-identification method shown in fig. 9 includes steps 2001 to 2003, which are described in detail below with respect to steps 2001 to 2003.
2001. And acquiring an image to be identified.
2002. And processing the image to be recognized by utilizing the pedestrian re-recognition network to obtain the characteristic vector of the image to be recognized.
The pedestrian re-identification network adopted in step 2002 may be trained according to a training method of the pedestrian re-identification network according to the embodiment of the present application, and specifically, the pedestrian re-identification network in step 2002 may be trained by the method shown in fig. 7.
2003. And comparing the feature vector of the image to be recognized with the feature vector of the existing pedestrian image to obtain the recognition result of the image to be recognized.
In the application, the pedestrian re-recognition network obtained by training through the training method of the pedestrian re-recognition network can better extract the features, so that the pedestrian re-recognition network is used for processing the image to be recognized, and a better pedestrian recognition result can be obtained.
Optionally, the step 2003 specifically includes: comparing the feature vector of the image to be identified with the feature vectors of the existing pedestrian images to determine a target pedestrian image to output; and outputting the target pedestrian image and attribute information of the target pedestrian image.
The target pedestrian image may be a pedestrian image in which a feature vector in an existing pedestrian image is most similar to a feature vector of an image to be recognized, and the attribute information of the target pedestrian image includes shooting time and a shooting position of the target pedestrian image. In addition, the attribute information of the target pedestrian image may further include identity information of a pedestrian and the like.
The following describes the effect of pedestrian recognition by the pedestrian re-recognition network according to the embodiment of the present application with reference to specific test results.
TABLE 1
[Table 1 appears as an image in the original document; it lists the Rank-1 and mAP results of existing schemes 1 to 6 and of the scheme of the present application on data set 1 and data set 2.]
Table 1 shows the results of tests performed by different schemes on different data sets. The test results include Rank-1 and mean average precision (mAP), where Rank-1 represents the probability that the existing image whose feature vector is closest to the feature vector of the image to be identified belongs to the same pedestrian as the image to be identified.
In Table 1 above, data set 1 is Duke-SCT and data set 2 is Market-SCT.
Duke-SCT is a subset of the DukeMTMC-reID data set, and Market-SCT is a subset of the Market-1501 data set. When constructing Duke-SCT and Market-SCT, the training data of the original data sets (DukeMTMC-reID and Market-1501) are processed as follows: for each pedestrian, only the images captured under one randomly selected camera are retained (different pedestrians may correspond to different cameras), thereby obtaining the new data sets Duke-SCT and Market-SCT. The test sets remain unchanged.
The existing scheme 1: a deep discriminative feature learning method (a discriminative feature learning approach for deep face recognition), published at the European Conference on Computer Vision (ECCV) in 2016;
existing scheme 2: deep hypersphere embedding for face recognition, published at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2017;
existing scheme 3: additive angular margin loss for deep face recognition, published at CVPR in 2019;
existing scheme 4: person retrieval with refined part pooling, published at ECCV in 2018;
existing scheme 5: part-aligned bilinear representations for person re-identification, published at ECCV in 2018;
existing scheme 6: learning discriminative features with multiple granularities for person re-identification, published at the ACM International Conference on Multimedia (ACM MM) in 2018.
As can be seen from Table 1, the Rank-1 and mAP of the scheme of the present application are superior to those of the existing schemes on both data set 1 and data set 2, showing a better recognition effect.
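By way of illustration, the two metrics reported in Table 1 can be computed from a query set and a gallery set as in the following simplified sketch, which omits the same-camera filtering commonly applied on these benchmarks.

```python
import numpy as np

def rank1_and_map(query_feats, query_pids, gallery_feats, gallery_pids):
    """Simplified Rank-1 and mean average precision (mAP) over a query set."""
    rank1_hits, average_precisions = [], []
    for feat, pid in zip(query_feats, query_pids):
        dist = np.linalg.norm(gallery_feats - feat, axis=1)
        order = np.argsort(dist)                      # gallery sorted from closest to farthest
        matches = gallery_pids[order] == pid
        rank1_hits.append(float(matches[0]))          # Rank-1: closest gallery image is the same pedestrian
        if matches.any():
            hits = np.cumsum(matches)                 # number of correct matches up to each rank
            precision_at_hit = hits[matches] / (np.flatnonzero(matches) + 1)
            average_precisions.append(precision_at_hit.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))
```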
Fig. 10 is a schematic block diagram of a training device of a pedestrian re-recognition network according to an embodiment of the present application. The training device 8000 of the pedestrian re-recognition network shown in fig. 10 includes an acquisition unit 8001 and a training unit 8002.
The acquisition unit 8001 and the training unit 8002 may be configured to execute a training method of the pedestrian re-recognition network according to the embodiment of the present application.
Specifically, the acquisition unit 8001 may perform the above steps 1001 and 1002, and the training unit 8002 may perform the above steps 1003 to 1008.
The acquisition unit 8001 in the apparatus 8000 shown in fig. 10 may correspond to the communication interface 9003 in the apparatus 9000 shown in fig. 11, and the corresponding training image may be acquired through the communication interface 9003, or the acquisition unit 8001 may correspond to the processor 9002, and at this time, the training image may be acquired from the memory 9001 by the processor 9002, or the training image may be acquired from the outside through the communication interface 9003. Further, the training unit 8002 in the apparatus 8000 may correspond to the processor 9002 in the apparatus 9000.
Fig. 11 is a hardware configuration diagram of a training device of a pedestrian re-identification network according to an embodiment of the present application. A training apparatus 9000 of a pedestrian re-identification network shown in fig. 11 (the apparatus 9000 may specifically be a computer device) includes a memory 9001, a processor 9002, a communication interface 9003, and a bus 9004. The memory 9001, the processor 9002 and the communication interface 9003 are communicatively connected to each other via a bus 9004.
The memory 9001 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 9001 may store a program, and the processor 9002 is configured to perform the steps of the training method of the pedestrian re-recognition network of the embodiment of the present application when the program stored in the memory 9001 is executed by the processor 9002.
The processor 9002 may be a general Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the training method for the pedestrian re-identification network according to the embodiment of the present disclosure.
The processor 9002 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the training method for the pedestrian re-identification network of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 9002.
The processor 9002 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable read-only memory, or a register. The storage medium is located in the memory 9001, and the processor 9002 reads the information in the memory 9001 and, in combination with its hardware, performs the functions required to be performed by the units included in the training apparatus for the pedestrian re-identification network, or performs the training method of the pedestrian re-identification network according to the embodiment of the present application.
Communication interface 9003 enables communication between apparatus 9000 and other devices or communication networks using transceiver means, such as, but not limited to, a transceiver. For example, the image to be identified may be acquired through the communication interface 9003.
The bus 9004 can include a pathway to transfer information between various components of the apparatus 9000 (e.g., memory 9001, processor 9002, communication interface 9003).
Fig. 12 is a schematic block diagram of a pedestrian re-identification apparatus of an embodiment of the present application. The pedestrian re-identification apparatus 10000 shown in fig. 12 includes an acquisition unit 10001 and an identification unit 10002.
The obtaining unit 10001 and the identifying unit 10002 may be configured to execute the pedestrian re-identification method according to the embodiment of the present application.
Specifically, the obtaining unit 10001 may perform the above step 2001, and the identifying unit 10002 may perform the above steps 2002 and 2003.
The acquiring unit 10001 in the apparatus 10000 shown in fig. 12 may correspond to the communication interface 11003 in the apparatus 11000 shown in fig. 13, and the image to be recognized may be acquired through the communication interface 11003, or the acquiring unit 10001 may also correspond to the processor 11002, and at this time, the image to be recognized may be acquired from the memory 11001 by the processor 11002, or the image to be recognized may be acquired from the outside through the communication interface 11003.
The identification means 10002 in the device 10000 shown in fig. 12 corresponds to the processor 11002 in the device 11000 shown in fig. 13.
Fig. 13 is a hardware configuration diagram of the pedestrian re-identification apparatus according to the embodiment of the present application. Similar to the above-described apparatus 10000, the pedestrian re-identification apparatus 11000 shown in fig. 13 includes a memory 11001, a processor 11002, a communication interface 11003, and a bus 11004. The memory 11001, the processor 11002, and the communication interface 11003 are communicatively connected to each other via a bus 11004.
The memory 11001 may be a ROM, a static storage device, or a RAM. The memory 11001 may store a program, and when the program stored in the memory 11001 is executed by the processor 11002, the processor 11002 and the communication interface 11003 are used to perform the steps of the pedestrian re-identification method of the embodiment of the present application.
The processor 11002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to implement the functions required to be performed by the units in the pedestrian re-identification apparatus of the embodiment of the present application, or to perform the pedestrian re-identification method of the embodiment of the present application.
The processor 11002 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the pedestrian re-identification method according to the embodiment of the present application may be implemented by hardware integrated logic circuits or instructions in software in the processor 11002.
The processor 11002 may also be a general purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable read-only memory, or a register. The storage medium is located in the memory 11001, and the processor 11002 reads the information in the memory 11001 and, in combination with its hardware, performs the functions required to be performed by the units included in the pedestrian re-identification apparatus of the embodiment of the present application, or performs the pedestrian re-identification method of the embodiment of the present application.
The communication interface 11003 enables communication between the apparatus 11000 and other devices or communication networks using transceiver means such as, but not limited to, a transceiver. For example, the image to be recognized may be acquired through the communication interface 11003.
The bus 11004 may include a pathway to transfer information between various components of the device 11000 (e.g., memory 11001, processor 11002, communication interface 11003).
It should be noted that although the apparatus 9000 and the apparatus 11000 described above show only memories, processors, and communication interfaces, in particular implementations, those skilled in the art will appreciate that the apparatus 9000 and the apparatus 11000 may also include other devices necessary for normal operation. Also, the apparatus 9000 and the apparatus 11000 may comprise hardware components to perform other additional functions, as may be appreciated by those skilled in the art according to particular needs. Further, those skilled in the art will appreciate that the apparatus 9000 and the apparatus 11000 may also comprise only the devices necessary to implement the embodiments of the present application, and not necessarily all of the devices shown in fig. 11 and 13.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A training method of a pedestrian re-identification network is characterized by comprising the following steps:
step 1: acquiring M training images and marking data of the M training images, wherein each training image in the M training images comprises a pedestrian, the marking data of each training image comprises an enclosure where the pedestrian is located and pedestrian identification information, different pedestrians correspond to different pedestrian identification information, in the M training images, the training images with the same pedestrian identification information come from the same image shooting device, and M is an integer greater than 1;
step 2: initializing the network parameters of the pedestrian re-identification network to obtain initial values of the network parameters of the pedestrian re-identification network;
and step 3: inputting a batch of training images in the M training images into the pedestrian re-identification network for feature extraction to obtain a feature vector of each training image in the batch of training images;
the method comprises the following steps that a batch of training images are obtained, wherein the batch of training images comprise N anchor point images, the N anchor point images are any N training images in the batch of training images, each anchor point image in the N anchor point images corresponds to a most difficult positive sample image, a first most difficult negative sample image and a second most difficult negative sample image, and N is a positive integer;
the hardest sample image corresponding to each anchor image is the same as the pedestrian identification information of each anchor image in the training image batch, and the training image with the farthest distance to the feature vector of each anchor point image, wherein the first hardest negative sample image corresponding to each anchor point image is from the same image shooting device as each anchor point image in the batch of training images, and a training image different from the pedestrian identification information of said each anchor image and closest to the distance between the feature vectors of said each anchor image, the second most difficult negative sample image corresponding to each anchor image is from a different image capture device than each anchor image in the batch of training images, the training image is different from the pedestrian identification information of each anchor point image and is closest to the distance between the characteristic vectors of each anchor point image;
and 4, step 4: determining a function value of a loss function according to the feature vectors of the batch of training images, wherein the function value of the loss function is obtained by averaging the function values of the N first loss functions;
wherein a function value of each first loss function in the N first loss functions is calculated according to a first difference value and a second difference value corresponding to each anchor point image in the N anchor point images; the first difference value corresponding to each anchor point image is a difference between a hardest positive sample distance corresponding to each anchor point image and a second hardest negative sample distance corresponding to each anchor point image; the second difference value corresponding to each anchor point image is a difference between the second hardest negative sample distance corresponding to each anchor point image and a first hardest negative sample distance corresponding to each anchor point image; the hardest positive sample distance corresponding to each anchor point image is a distance between a feature vector of the most difficult positive sample image corresponding to each anchor point image and a feature vector of each anchor point image; the second hardest negative sample distance corresponding to each anchor point image is a distance between a feature vector of the second most difficult negative sample image corresponding to each anchor point image and the feature vector of each anchor point image; and the first hardest negative sample distance corresponding to each anchor point image is a distance between a feature vector of the first most difficult negative sample image corresponding to each anchor point image and the feature vector of each anchor point image;
and 5: updating the network parameters of the pedestrian re-identification network according to the function values of the loss functions;
and repeating the steps 3 to 5 until the pedestrian re-identification network meets the preset requirement.
2. The training method of claim 1, wherein the pedestrian re-identification network meets a preset requirement, comprising:
the pedestrian re-identification network satisfies a preset requirement when at least one of the following conditions is satisfied:
the training times of the pedestrian re-identification network are greater than or equal to the preset times;
the function value of the loss function is smaller than or equal to a preset threshold value;
the identification performance of the pedestrian re-identification network meets the preset requirement.
3. The training method of claim 2, wherein the function value of the loss function is less than or equal to a preset threshold, comprising:
the first difference is smaller than a first preset threshold, and the second difference is smaller than a second preset threshold.
4. A training method as claimed in any one of claims 1 to 3, characterized in that the M training images are training images from a plurality of image capturing devices, wherein the label data of the training images from different image capturing devices are individually labeled.
5. A pedestrian re-identification method, comprising:
acquiring an image to be identified;
processing an image to be recognized by utilizing a pedestrian re-recognition network to obtain a feature vector of the image to be recognized, wherein the pedestrian re-recognition network is obtained by training according to the training method of any one of claims 1-4;
and comparing the feature vector of the image to be recognized with the feature vector of the existing pedestrian image to obtain the recognition result of the image to be recognized.
6. A training device for a pedestrian re-recognition network, comprising:
an acquisition unit for performing step 1;
step 1: acquiring M training images and marking data of the M training images, wherein each training image in the M training images comprises a pedestrian, the marking data of each training image comprises an enclosure where the pedestrian is located and pedestrian identification information, different pedestrians correspond to different pedestrian identification information, in the M training images, the training images with the same pedestrian identification information come from the same image shooting device, and M is an integer greater than 1;
a training unit for performing step 2;
step 2: initializing the network parameters of the pedestrian re-identification network to obtain initial values of the network parameters of the pedestrian re-identification network;
the training unit is also used for repeatedly executing the steps 3 to 5 until the pedestrian re-identification network meets the preset requirement;
and step 3: inputting a batch of training images in the M training images into the pedestrian re-identification network for feature extraction to obtain a feature vector of each training image in the batch of training images;
the method comprises the following steps that a batch of training images are obtained, wherein the batch of training images comprise N anchor point images, the N anchor point images are any N training images in the batch of training images, each anchor point image in the N anchor point images corresponds to a most difficult positive sample image, a first most difficult negative sample image and a second most difficult negative sample image, and N is a positive integer;
the hardest sample image corresponding to each anchor image is the same as the pedestrian identification information of each anchor image in the training image batch, and the training image with the farthest distance to the feature vector of each anchor point image, wherein the first hardest negative sample image corresponding to each anchor point image is from the same image shooting device as each anchor point image in the batch of training images, and a training image different from the pedestrian identification information of said each anchor image and closest to the distance between the feature vectors of said each anchor image, the second most difficult negative sample image corresponding to each anchor image is from a different image capture device than each anchor image in the batch of training images, the training image is different from the pedestrian identification information of each anchor point image and is closest to the distance between the characteristic vectors of each anchor point image;
and 4, step 4: determining a function value of a loss function according to the feature vectors of the batch of training images, wherein the function value of the loss function is obtained by averaging the function values of the N first loss functions;
wherein a function value of each first loss function in the N first loss functions is calculated according to a first difference value and a second difference value corresponding to each anchor point image in the N anchor point images; the first difference value corresponding to each anchor point image is a difference between a hardest positive sample distance corresponding to each anchor point image and a second hardest negative sample distance corresponding to each anchor point image; the second difference value corresponding to each anchor point image is a difference between the second hardest negative sample distance corresponding to each anchor point image and a first hardest negative sample distance corresponding to each anchor point image; the hardest positive sample distance corresponding to each anchor point image is a distance between a feature vector of the most difficult positive sample image corresponding to each anchor point image and a feature vector of each anchor point image; the second hardest negative sample distance corresponding to each anchor point image is a distance between a feature vector of the second most difficult negative sample image corresponding to each anchor point image and the feature vector of each anchor point image; and the first hardest negative sample distance corresponding to each anchor point image is a distance between a feature vector of the first most difficult negative sample image corresponding to each anchor point image and the feature vector of each anchor point image;
step 5: updating the network parameters of the pedestrian re-identification network according to the function value of the loss function. (A hedged code sketch illustrating steps 3-5 follows the claims below.)
7. The training device of claim 6, wherein the pedestrian re-identification network meeting the preset requirement comprises:
the pedestrian re-identification network satisfies a preset requirement when at least one of the following conditions is satisfied:
the number of times the pedestrian re-identification network has been trained is greater than or equal to a preset number of times;
the function value of the loss function is smaller than or equal to a preset threshold value;
the identification performance of the pedestrian re-identification network meets the preset requirement.
8. The training device of claim 7, wherein the function value of the loss function being less than or equal to the preset threshold comprises:
the first difference value is less than a first preset threshold, and the second difference value is less than a second preset threshold.
9. The training device according to any one of claims 6-8, wherein the M training images are training images from a plurality of image capture devices, and the label data of the training images from different image capture devices are labeled separately.
10. A pedestrian re-identification apparatus, comprising:
an acquisition unit, configured to acquire an image to be identified;
an identification unit, configured to process the image to be identified by using a pedestrian re-identification network to obtain a feature vector of the image to be identified, wherein the pedestrian re-identification network is trained according to the training method of any one of claims 1-4;
wherein the identification unit is further configured to compare the feature vector of the image to be identified with feature vectors of existing pedestrian images to obtain an identification result of the image to be identified (see the second sketch after the claims below).
11. A computer-readable storage medium, characterized in that the computer-readable medium stores program code for execution by a device, the program code comprising instructions for performing the training method according to any one of claims 1-4.
12. A computer-readable storage medium, characterized in that the computer-readable medium stores program code for execution by a device, the program code comprising instructions for performing the pedestrian re-identification method according to claim 5.
13. A chip, characterized in that the chip comprises a processor and a data interface, wherein the processor reads, through the data interface, instructions stored in a memory to execute the training method according to any one of claims 1-4.
14. A chip, characterized in that the chip comprises a processor and a data interface, wherein the processor reads, through the data interface, instructions stored in a memory to execute the pedestrian re-identification method according to claim 5.
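
Illustrative sketch of steps 3-5 of claim 6. The claim only states that each first loss function is computed from the first and second difference values and that the N per-anchor values are averaged; the hinge form, the margins margin1/margin2, and the function name camera_aware_batch_hard_loss below are assumptions introduced for illustration, not the patented implementation.

```python
import torch

def camera_aware_batch_hard_loss(features, person_ids, camera_ids,
                                 margin1=0.3, margin2=0.3):
    """Hedged sketch of steps 3-5 of claim 6 (not the patented implementation).

    For every anchor image in the batch it selects:
      * hardest positive        - same person id, largest feature distance
      * first hardest negative  - different person id, SAME camera, smallest distance
      * second hardest negative - different person id, DIFFERENT camera, smallest distance
    and turns the claim's first/second difference values into two hinge terms.
    """
    dist = torch.cdist(features, features)                      # (B, B) Euclidean distances
    same_id = person_ids.unsqueeze(0) == person_ids.unsqueeze(1)
    same_cam = camera_ids.unsqueeze(0) == camera_ids.unsqueeze(1)
    eye = torch.eye(len(features), dtype=torch.bool, device=features.device)
    big = dist.max().detach() + 1.0                             # masks out invalid candidates

    # hardest positive distance: same identity (excluding the anchor itself), farthest
    d_pos = torch.where(same_id & ~eye, dist, torch.zeros_like(dist)).max(dim=1).values

    # first hardest negative distance: different identity, same camera, closest
    d_neg1 = torch.where(~same_id & same_cam, dist, big).min(dim=1).values

    # second hardest negative distance: different identity, different camera, closest
    d_neg2 = torch.where(~same_id & ~same_cam, dist, big).min(dim=1).values

    # first difference  = d_pos  - d_neg2   (claim 6)
    # second difference = d_neg2 - d_neg1   (claim 6)
    per_anchor = (torch.relu(d_pos - d_neg2 + margin1) +
                  torch.relu(d_neg2 - d_neg1 + margin2))
    return per_anchor.mean()                                    # average over the N anchors
```

Under the common (here assumed) PK-sampling scheme, each batch contains P identities with K images per identity drawn from several cameras, so every anchor has a positive and usually both kinds of negatives; person_ids and camera_ids are then integer tensors of batch length, and the returned scalar drives the parameter update of step 5.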
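
The comparison step of claim 10, matching the feature vector of the image to be identified against feature vectors of existing pedestrian images, can likewise be sketched. The claim does not fix a comparison metric or output format; Euclidean-distance ranking and the name rank_gallery below are assumptions for illustration only.

```python
import torch

def rank_gallery(query_feature, gallery_features, gallery_person_ids, top_k=5):
    """Illustrative sketch of the comparison in claim 10 (assumed metric: Euclidean).

    query_feature:      (D,)   feature vector of the image to be identified
    gallery_features:   (G, D) feature vectors of existing pedestrian images
    gallery_person_ids: (G,)   pedestrian identities of the gallery images
    Returns identities and distances of the top_k closest gallery images, which
    here stand in for the claim's "identification result".
    """
    dists = torch.cdist(query_feature.unsqueeze(0), gallery_features).squeeze(0)  # (G,)
    order = torch.argsort(dists)[:top_k]
    return [(int(gallery_person_ids[i]), float(dists[i])) for i in order]
```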
CN201910839017.9A 2019-09-05 2019-09-05 Training method of pedestrian re-identification network, and pedestrian re-identification method and device Pending CN112446270A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910839017.9A CN112446270A (en) 2019-09-05 2019-09-05 Training method of pedestrian re-identification network, and pedestrian re-identification method and device
PCT/CN2020/113041 WO2021043168A1 (en) 2019-09-05 2020-09-02 Person re-identification network training method and person re-identification method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910839017.9A CN112446270A (en) 2019-09-05 2019-09-05 Training method of pedestrian re-identification network, and pedestrian re-identification method and device

Publications (1)

Publication Number Publication Date
CN112446270A true CN112446270A (en) 2021-03-05

Family

ID=74733092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910839017.9A Pending CN112446270A (en) 2019-09-05 2019-09-05 Training method of pedestrian re-identification network, and pedestrian re-identification method and device

Country Status (2)

Country Link
CN (1) CN112446270A (en)
WO (1) WO2021043168A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095174A (en) * 2021-03-29 2021-07-09 深圳力维智联技术有限公司 Re-recognition model training method, device, equipment and readable storage medium
CN113096080B (en) * 2021-03-30 2024-01-16 四川大学华西第二医院 Image analysis method and system
CN112861825B (en) * 2021-04-07 2023-07-04 北京百度网讯科技有限公司 Model training method, pedestrian re-recognition method, device and electronic equipment
CN113177469B (en) * 2021-04-27 2024-04-12 北京百度网讯科技有限公司 Training method and device of human attribute detection model, electronic equipment and medium
CN113536891B (en) * 2021-05-10 2023-01-03 新疆爱华盈通信息技术有限公司 Pedestrian traffic statistical method, storage medium and electronic equipment
CN113449601B (en) * 2021-05-28 2023-05-16 国家计算机网络与信息安全管理中心 Pedestrian re-recognition model training and recognition method and device based on progressive smooth loss
CN113449966B (en) * 2021-06-03 2023-04-07 湖北北新建材有限公司 Gypsum board equipment inspection method and system
CN113255604B (en) 2021-06-29 2021-10-15 苏州浪潮智能科技有限公司 Pedestrian re-identification method, device, equipment and medium based on deep learning network
CN113408492B (en) * 2021-07-23 2022-06-14 四川大学 Pedestrian re-identification method based on global-local feature dynamic alignment
CN113762153B (en) * 2021-09-07 2024-04-02 北京工商大学 Novel tailing pond detection method and system based on remote sensing data
CN114494930B (en) * 2021-09-09 2023-09-22 马上消费金融股份有限公司 Training method and device for voice and image synchronism measurement model
CN114240997B (en) * 2021-11-16 2023-07-28 南京云牛智能科技有限公司 Intelligent building online trans-camera multi-target tracking method
CN114359665B (en) * 2021-12-27 2024-03-26 北京奕斯伟计算技术股份有限公司 Training method and device of full-task face recognition model and face recognition method
CN115952731B (en) * 2022-12-20 2024-01-16 哈尔滨工业大学 Active vibration control method, device and equipment for wind turbine blade
CN115641559B (en) * 2022-12-23 2023-06-02 深圳佑驾创新科技有限公司 Target matching method, device and storage medium for looking-around camera group

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395385B2 (en) * 2017-06-27 2019-08-27 Qualcomm Incorporated Using object re-identification in video surveillance
CN108108754B (en) * 2017-12-15 2022-07-22 北京迈格威科技有限公司 Training and re-recognition method, device and system for re-recognition network
CN109344787B (en) * 2018-10-15 2021-06-08 浙江工业大学 Specific target tracking method based on face recognition and pedestrian re-recognition

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943909A (en) * 2021-03-31 2022-08-26 华为技术有限公司 Method, device, equipment and system for identifying motion area
CN114943909B (en) * 2021-03-31 2023-04-18 华为技术有限公司 Method, device, equipment and system for identifying motion area
CN115147871A (en) * 2022-07-19 2022-10-04 北京龙智数科科技服务有限公司 Pedestrian re-identification method under shielding environment

Also Published As

Publication number Publication date
WO2021043168A1 (en) 2021-03-11

Similar Documents

Publication Publication Date Title
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
CN110532871B (en) Image processing method and device
CN110378381B (en) Object detection method, device and computer storage medium
CN110033003B (en) Image segmentation method and image processing device
CN110188795B (en) Image classification method, data processing method and device
CN110298262B (en) Object identification method and device
CN112446380A (en) Image processing method and device
CN110222717B (en) Image processing method and device
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN109993707B (en) Image denoising method and device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN111667399A (en) Method for training style migration model, method and device for video style migration
CN112446398A (en) Image classification method and device
CN111914997B (en) Method for training neural network, image processing method and device
CN112446834A (en) Image enhancement method and device
CN113284054A (en) Image enhancement method and image enhancement device
CN112287954A (en) Image classification method, training method of image classification model and device thereof
CN110222718B (en) Image processing method and device
US20220157046A1 (en) Image Classification Method And Apparatus
CN113191489B (en) Training method of binary neural network model, image processing method and device
CN112307826A (en) Pedestrian detection method, device, computer-readable storage medium and chip
CN113065645B (en) Twin attention network, image processing method and device
CN111797882A (en) Image classification method and device
CN112529904A (en) Image semantic segmentation method and device, computer readable storage medium and chip
CN112581379A (en) Image enhancement method and device

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
TA01: Transfer of patent application right
  Effective date of registration: 20220214
  Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province
  Applicant after: Huawei Cloud Computing Technology Co.,Ltd.
  Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen
  Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.