CN110942012A - Image feature extraction method, pedestrian re-identification method, device and computer equipment

Info

Publication number
CN110942012A
CN110942012A
Authority
CN
China
Prior art keywords
image
feature
feature extraction
retrieved
pedestrian
Prior art date
Legal status
Pending
Application number
CN201911156432.0A
Other languages
Chinese (zh)
Inventor
周康明
戚风亮
Current Assignee
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd
Priority to CN201911156432.0A
Publication of CN110942012A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

The application relates to an image feature extraction method, a pedestrian re-identification method, an apparatus, and computer equipment. The image feature extraction method comprises: constructing a basic module that adaptively selects the convolution kernel shape, and obtaining a deep learning feature extraction network model from a plurality of such basic modules connected in series; training the deep learning feature extraction network model with a forward propagation algorithm on a pedestrian sample image set until the model parameters converge; and scanning an input image with the trained model, so that the input image passes sequentially through the serially connected basic modules and, after feature mapping, yields a highly discriminative feature vector of the input image.

Description

Image feature extraction method, pedestrian re-identification method, device and computer equipment
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image feature extraction method, a pedestrian re-identification method, an apparatus, and a computer device.
Background
With the development of artificial intelligence technology, more and more tedious work is being taken over by machines. The Re-id (pedestrian re-identification) task, an important branch of computer vision, is widely applied in fields such as smart cities and smart traffic. In such tasks, a moving target appears successively in different cameras; the target may be a person or a vehicle, and computer vision methods are usually used to determine whether two targets seen by different cameras are the same target.
Deep learning is an important branch of artificial intelligence and has achieved great success in fields such as image and speech recognition. Neural networks, as important tools of deep learning, have been widely adopted in universities and enterprises. To date, neural networks mainly comprise two types: convolutional neural networks, used mainly for image recognition, and recurrent neural networks, used mainly in the speech domain.
Applying deep learning to the Re-id task is therefore currently the mainstream approach. The Re-id task is generally implemented by extracting discriminative features from the picture, and the conventional approaches mainly include: (1) feature extraction through a conventional neural network structure, such as ResNet or GoogLeNet; (2) equally dividing the input picture and independently extracting discriminative features from each part; (3) designing a new network structure, including a full-scale neural network, to realize multi-scale feature output under single-layer convolution. However, most of these methods adopt a conventional network structure and ignore the characteristics of the picture when designing the network, so the extracted features are not sufficiently discriminative, which affects the recognition accuracy of Re-id.
Disclosure of Invention
In view of the above, it is necessary to provide an image feature extraction method, a pedestrian re-identification method, an apparatus, and a computer device capable of improving feature discriminability, so as to solve the problem that features extracted by the conventional methods above are insufficiently discriminative.
In order to achieve the above object, in one aspect, an embodiment of the present application provides an image feature extraction method, where the method includes:
constructing a basic module for adaptively selecting a convolution kernel shape, and obtaining a deep learning feature extraction network model based on a plurality of basic modules connected in series;
training a deep learning feature extraction network model by adopting a forward propagation algorithm based on a pedestrian sample image set until model parameters are converged;
and scanning the input image by adopting the trained deep learning feature extraction network model, wherein the input image sequentially passes through a plurality of basic modules which are connected in series, and obtaining a feature vector of the input image after feature mapping.
In a second aspect, an embodiment of the present application provides a pedestrian re-identification method, where the method includes:
acquiring a target image and an image set to be retrieved, wherein the target image contains a target pedestrian;
extracting a first feature vector of the target image and a second feature vector of each image to be retrieved in the image set to be retrieved according to the image feature extraction method described above;
calculating the Euclidean distance between the first feature vector of the target image and the second feature vector of each image to be retrieved in the image set to be retrieved;
and generating a ranking result of the images to be retrieved in the image set to be retrieved based on the Euclidean distances, and taking the ranking result as the pedestrian re-identification result.
In a third aspect, an embodiment of the present application provides an image feature extraction apparatus, including:
the model construction module is used for constructing a basic module for adaptively selecting the convolution kernel shape and obtaining a deep learning feature extraction network model based on a plurality of basic modules which are connected in series;
the model training module is used for training a deep learning feature extraction network model by adopting a forward propagation algorithm based on a pedestrian sample image set until model parameters are converged;
and the feature extraction module is used for scanning the input image by adopting the trained deep learning feature extraction network model, wherein the input image sequentially passes through the plurality of basic modules which are connected in series, and the feature vector of the input image is obtained after feature mapping.
In a fourth aspect, an embodiment of the present application provides a pedestrian re-identification apparatus, including:
the image acquisition module is used for acquiring a target image and an image set to be retrieved, wherein the target image contains a target pedestrian;
the feature vector extraction module is used for extracting a first feature vector of the target image and a second feature vector of each image to be retrieved in the image set to be retrieved by the image feature extraction method;
the distance calculation module is used for calculating the Euclidean distance between the first feature vector of the target image and the second feature vector of each image to be retrieved in the image set to be retrieved;
and the recognition result generation module is used for generating a ranking result of the images to be retrieved in the image set to be retrieved based on the Euclidean distances, and taking the ranking result as the pedestrian re-identification result.
In a fifth aspect, the present application provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method when executing the computer program.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method as described above.
According to the image feature extraction method, the pedestrian re-identification method, the apparatus, and the computer equipment, a basic module that adaptively selects the convolution kernel shape is constructed, and a deep learning feature extraction network model is obtained from a plurality of such basic modules connected in series; the model is trained with a forward propagation algorithm on a pedestrian sample image set until the model parameters converge; and the trained model scans an input image, which passes sequentially through the serially connected basic modules and, after feature mapping, yields a highly discriminative feature vector of the input image.
Drawings
FIG. 1 is an application environment diagram of an image feature extraction method in one embodiment;
FIG. 2 is a schematic flow chart diagram of a method for image feature extraction in one embodiment;
FIG. 3 is a schematic flow chart of the steps of building basic modules in one embodiment;
FIG. 4 is a schematic diagram of the basic module of one embodiment;
FIG. 5 is a diagram illustrating a structure of a feature extraction network model in one embodiment;
FIG. 6 is a schematic flow chart diagram illustrating the model training steps in one embodiment;
FIG. 7 is a flow diagram illustrating a method for pedestrian re-identification in one embodiment;
FIG. 8 is a block diagram showing the structure of an image feature extraction device according to an embodiment;
FIG. 9 is a block diagram showing the construction of a pedestrian re-identification apparatus in one embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The image feature extraction method provided by the application can be applied to the application environment shown in fig. 1. In this embodiment, the terminal 102 may be any of various devices with image capturing and storage functions, such as but not limited to smart phones, tablet computers, cameras, and portable image capturing devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers. Specifically, the terminal 102 is configured to acquire an input image for feature extraction and a pedestrian sample image set for model training, and send them to the server 104 through a network; of course, the input image and the pedestrian sample image set may also be stored in the server 104 in advance. The server 104 constructs a basic module that adaptively selects the convolution kernel shape and obtains a deep learning feature extraction network model from a plurality of such basic modules connected in series; trains the model with a forward propagation algorithm on the pedestrian sample image set until the model parameters converge; and scans the input image with the trained model, so that the input image passes sequentially through the serially connected basic modules and, after feature mapping, yields a highly discriminative feature vector of the input image.
In one embodiment, as shown in fig. 2, an image feature extraction method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 202, constructing a basic module for adaptively selecting the convolution kernel shape, and obtaining a deep learning feature extraction network model based on a plurality of basic modules connected in series.
The basic module for adaptively selecting the convolution kernel shape builds on prior knowledge of how pedestrian features are distributed in an image; by integrating this prior knowledge into the module design, the module can adaptively select a convolution kernel shape suited to itself for feature extraction.
In the pedestrian Re-id field, the pedestrian in an image is generally upright, which means that the local features in the image follow three broad distributions: vertical distribution, such as a pedestrian's legs or arms; horizontal distribution, such as a horizontal stripe on the pedestrian's clothing; and uniform distribution, such as the head or feet. According to this prior knowledge, image features can be extracted with convolution kernels of different aspect ratios, making the extracted features more discriminative. Basic modules containing convolution kernels of different aspect ratios can therefore be designed to convolve the picture, and the weights of the differently shaped kernels can be learned adaptively during convolution, achieving more discriminative feature extraction. Meanwhile, to extract high-level features, multiple basic modules can be stacked in series, yielding a deep learning feature extraction network model whose final extracted features are even more discriminative.
And step 204, training a deep learning feature extraction network model by adopting a forward propagation algorithm based on the pedestrian sample image set until the model parameters are converged.
The pedestrian sample image set is sample data that has been preprocessed; preprocessing includes labeling (for example, annotating each pedestrian sample image with the identity of the target pedestrian, such as the pedestrian's number) and image processing (for example, resizing and mean removal). Specifically, after the deep learning feature extraction network model is obtained through the above steps, it is trained with the preprocessed sample data and a forward propagation algorithm until the loss no longer decreases or the number of iterations reaches a preset limit, so that the model learns discriminative pedestrian features from the sample data.
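As an illustration, the preprocessing described above could look like the following minimal torchvision sketch; the 256 × 128 target size and the normalization statistics are assumptions for illustration, since the embodiment only specifies that resizing and mean removal take place:

```python
from torchvision import transforms

# A minimal preprocessing sketch: resize to a preset size and remove the
# channel mean. The 256x128 size and the mean/std values are assumptions;
# the embodiment only states that resizing and mean removal are performed.
preprocess = transforms.Compose([
    transforms.Resize((256, 128)),                     # adjust to a preset size
    transforms.ToTensor(),                             # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # mean removal
                         std=[0.229, 0.224, 0.225]),
])
```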
And step 206, scanning the input image by adopting the trained deep learning feature extraction network model to obtain the feature vector of the input image.
The input image is a preprocessed image awaiting feature extraction; its preprocessing includes, but is not limited to, resizing the image to a preset size and removing the per-channel mean. In this embodiment, since the deep learning feature extraction network model is formed by connecting a plurality of basic modules in series, the input image passes sequentially through these modules and, after feature mapping, yields a highly discriminative feature vector of the input image. The input of the first basic module is the input image; from the second basic module onward, the input of each module is the output of the previous one and its output is the input of the next; and the output of the last basic module, after a pooling operation, is the feature vector of the input image.
According to this image feature extraction method, a basic module that adaptively selects the convolution kernel shape is constructed, and a deep learning feature extraction network model is obtained from a plurality of such modules connected in series; the model is trained with a forward propagation algorithm on a pedestrian sample image set until the model parameters converge; and the trained model scans an input image, which passes sequentially through the serially connected basic modules and, after feature mapping, yields a highly discriminative feature vector of the input image.
In one embodiment, as shown in fig. 3, the building of the basic module for adaptively selecting the convolution kernel shape may specifically include the following steps:
step 302, a first convolution neural network is established according to a plurality of preset convolution kernel shapes.
The preset convolution kernel shapes can be convolution kernels with different aspect ratios set in advance according to the prior knowledge. Specifically, in the Re-id field, pedestrians in an image generally stand upright, so the local features in the image follow three broad distributions: vertical, horizontal, and uniform. A pedestrian's legs are vertically distributed in the upright state, so a kernel elongated along the longitudinal direction, for example one with an aspect ratio of 1:9, can be designed to extract them; the head or feet are uniformly distributed, so their features can be extracted with a 3:3 kernel; and horizontally distributed stripes on clothing can be extracted with a 9:1 kernel. This makes the extracted features more discriminative.
Therefore, in the present embodiment, a plurality of convolution kernels with different aspect ratios are set to establish the first convolutional neural network, which extracts from the input data first feature maps of a plurality of branches corresponding respectively to the preset kernel shapes. As shown in fig. 4, each aspect ratio corresponds to one branch; for example, the branch kernels may have aspect ratios of 1:9, 3:3, and 9:1, which can be adjusted, or further branches added, according to actual needs, and this application does not limit this. A Batch Normalization (BN) layer and an activation layer A follow the convolution kernel of each branch, giving the first convolutional neural network 01 and increasing the nonlinear expressive power and training stability of the model. The input data passes through this network to produce the three first feature maps F11, F12, and F13 of the corresponding branches.
Step 304, accessing the first fusion layer after the first convolutional neural network.
Specifically, to let the network adaptively learn the weights of the differently shaped convolution kernels during convolution, a set of weights needs to be generated for each spatial position in the first feature maps F11, F12, and F13; to obtain the weights for all spatial positions, the first feature maps must first be spliced. In the present embodiment, as shown in fig. 4, the first feature maps F11, F12, and F13 are concatenated along the channel direction by the first fusion layer C, yielding the spliced second feature map F2.
Step 306, accessing a second convolutional neural network after the first fusion layer.
The second convolutional neural network may employ a convolution kernel of size 1 × 1 and is used to generate branch weights for each spatial position in the second feature map F2. Specifically, each spatial position in F2 is scanned by the 1 × 1 kernel to generate branch weights corresponding respectively to the branches at that position, where the branch weights of all branches at the same spatial position sum to 1. In this embodiment, taking fig. 4 as an example, since the first convolutional neural network 01 has 3 output branches, the branch weights comprise the weights of these 3 branches, the weight of a branch being the probability that a spatial position suits that branch's kernel. The 1 × 1 kernel is followed in turn by the classification network A and the normalization layer S (softmax normalization), giving the second convolutional neural network 02. After every spatial position in F2 has been scanned and classified by the 1 × 1 kernel, a feature map with 3 channels (one per branch) is output, and a softmax normalization along the channel direction yields the final weight feature map F3.
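In other words, if z_k(h, w) denotes the logit that the 1 × 1 convolution outputs for branch k at spatial position (h, w), the softmax normalization along the channel direction yields branch weights that sum to 1 (notation introduced here for clarity):

```latex
w_k(h, w) = \frac{\exp\bigl(z_k(h, w)\bigr)}{\sum_{j=1}^{3} \exp\bigl(z_j(h, w)\bigr)},
\qquad \sum_{k=1}^{3} w_k(h, w) = 1 .
```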
And step 308, accessing the branch splitting layer after the second convolutional neural network.
The branch splitting layer D is used to split the weight feature map by branch. Specifically, the weight feature map F3 obtained in the above step is split, yielding branch weight values in one-to-one correspondence with the branches.
Step 310, accessing the weighting calculation layer after the branch splitting layer.
The weighting calculation layer is configured to compute the weighted third feature maps F41, F42, and F43 from the branch weight values, in one-to-one correspondence with the branches, and the first feature maps of those branches. Specifically, the weight value of each branch is multiplied by the first feature map F11, F12, or F13 of the corresponding branch, yielding the correspondingly weighted third feature maps F41, F42, and F43.
And step 312, sequentially accessing the second fusion layer, the residual connecting layer and the activation layer after the weighting calculation layer to obtain a basic module for adaptively selecting the convolution kernel shape.
Specifically, the third feature maps F41, F42, and F43 are fused by the second fusion layer to obtain the fourth feature map F5; the residual connection layer then connects F5 with the input data by residual connection to obtain the output Output1 of the basic module; and after the activation layer A, the semantic features of the input data are obtained, which are the output of the basic module.
In constructing the basic module for adaptively selecting the convolution kernel shape, the first convolutional neural network is built from a plurality of preset kernel shapes to perform multi-branch feature extraction on the input data, and the second convolutional neural network generates the branch weights of each spatial position in the spliced second feature map F2, so that the network adaptively learns the weights of the differently shaped kernels during convolution. The module can thus adaptively select a suitably shaped kernel for feature extraction, improving the discriminability of the extracted features.
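Put together, steps 302 to 312 can be sketched as the following PyTorch module. This is a minimal illustration rather than the patented implementation: the 1:9 / 3:3 / 9:1 kernel shapes follow the embodiment, while the element-wise summation in the second fusion layer, the ReLU activations, and the 1 × 1 residual projection for mismatched channel counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveKernelBlock(nn.Module):
    """Sketch of the basic module of steps 302-312: multi-branch convolution
    with differently shaped kernels, adaptively weighted per spatial position."""

    def __init__(self, in_ch, out_ch, shapes=((1, 9), (3, 3), (9, 1))):
        super().__init__()
        # First convolutional neural network 01: one branch per preset
        # kernel shape, each followed by BN and an activation layer.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, (kh, kw),
                          padding=(kh // 2, kw // 2), bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
            for kh, kw in shapes])
        # Second convolutional neural network 02: a 1x1 convolution that
        # emits one weight logit per branch at every spatial position.
        self.weight_conv = nn.Conv2d(out_ch * len(shapes), len(shapes), 1)
        # Residual projection for mismatched channel counts (an assumption;
        # the embodiment does not spell this case out).
        self.proj = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]     # F11, F12, F13
        f2 = torch.cat(feats, dim=1)                        # first fusion layer C
        f3 = F.softmax(self.weight_conv(f2), dim=1)         # weight feature map F3
        weights = torch.split(f3, 1, dim=1)                 # branch splitting layer D
        weighted = [w * f for w, f in zip(weights, feats)]  # F41, F42, F43
        f5 = torch.stack(weighted).sum(dim=0)               # second fusion (sum assumed)
        return F.relu(f5 + self.proj(x))                    # residual connection + activation
```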
In one embodiment, since the deep learning feature extraction network model consists of a plurality of basic modules connected in series, the feature vector of the input image is obtained after the input image passes sequentially through these modules for feature mapping. As shown in fig. 5, the input of the first basic module is the input image; from the second basic module onward, the input data of each module is the output of the previous one, the semantic features output by a module being the input data of the next; and the semantic features output by the last module, after a pooling operation, form the feature vector of the input image. It should be noted that the number of serially connected basic modules can be chosen according to factors such as the actual business scenario and data volume; in general, the more modules are stacked, the more abstract the final feature vector and the more discriminative the extracted features.
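Reusing the imports and the AdaptiveKernelBlock sketched above, a serial stack with a final pooling could look as follows; the depth, channel widths, and the use of global average pooling are assumptions, since the embodiment leaves the number of modules to the business scenario:

```python
class FeatureExtractor(nn.Module):
    """Sketch of the deep learning feature extraction network model:
    basic modules connected in series, then a pooling operation that
    turns the last module's semantic features into a feature vector."""

    def __init__(self, widths=(3, 64, 128, 256, 512)):
        super().__init__()
        self.blocks = nn.Sequential(*[
            AdaptiveKernelBlock(c_in, c_out)   # basic module sketched above
            for c_in, c_out in zip(widths[:-1], widths[1:])])
        self.pool = nn.AdaptiveAvgPool2d(1)    # pooling choice assumed

    def forward(self, x):
        x = self.blocks(x)               # each module feeds the next in series
        return self.pool(x).flatten(1)   # per-image feature vector
```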
In an embodiment, as shown in fig. 6, training a deep-learning feature extraction network model by using a forward propagation algorithm based on a pedestrian sample image set may specifically include the following steps:
step 602, inputting the pedestrian sample image set into a deep learning feature extraction network model for forward propagation.
The pedestrian sample image set carries sample labels annotated with target pedestrian identities. Specifically, it is sample data that has been preprocessed; the preprocessing includes annotating sample labels, for example labeling each pedestrian sample image with the identity of the target pedestrian, such as the pedestrian's number, as well as resizing and mean removal of the sample images. For example, a sample image in the set may be represented as a tuple (x_i, p_i), where i indexes the ith training sample, x_i is the sample image itself, and p_i is the identity number of the target object in the image.
And step 604, training the deep learning feature extraction network model by using the cross entropy loss function.
And step 606, optimizing the cross entropy loss function by gradient descent until it reaches its minimum, at which point converged model parameters are obtained.
Specifically, in this embodiment, the pedestrian sample image set is input to the deep learning feature extraction network model to obtain the feature vectors of the sample images, the loss value is computed from the cross entropy loss function via forward propagation, and the loss is then optimized by gradient descent; that is, the above process is iterated over multiple batches of the sample set while the model parameters are updated. Training terminates when the loss no longer decreases or the number of iterations reaches a preset limit; the model parameters have then converged and are saved, yielding the trained deep learning feature extraction network model.
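Continuing with the sketches above, a training loop matching this description might look as follows; the identity-classification head, the SGD hyperparameters, and the fixed epoch count are assumptions beyond what the embodiment specifies (forward propagation, cross entropy loss, gradient descent):

```python
def train(model, loader, num_ids, epochs=60, device="cuda"):
    """Sketch of steps 602-606: forward propagation over (x_i, p_i) batches,
    cross entropy loss, and gradient descent until convergence."""
    head = nn.Linear(512, num_ids).to(device)   # ID classification head (assumed)
    model.to(device)
    params = list(model.parameters()) + list(head.parameters())
    opt = torch.optim.SGD(params, lr=0.05, momentum=0.9)  # hyperparameters assumed
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):          # or stop early once the loss plateaus
        for x, pid in loader:        # pid is the labeled pedestrian number p_i
            x, pid = x.to(device), pid.to(device)
            loss = criterion(head(model(x)), pid)
            opt.zero_grad()
            loss.backward()          # gradient descent step
            opt.step()
    return model, head
```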
In one embodiment, the present application further provides a pedestrian re-identification method, as shown in fig. 7, including the following steps:
step 702, a target image and an image set to be retrieved are obtained.
The target image is an image containing the target object; in this embodiment, the target object may be a target pedestrian. The image set to be retrieved is the set over which the Re-id task is performed, to find whether it contains images of the same target object as the target image. Specifically, when the Re-id task is executed, the target image and the image set to be retrieved are acquired first.
Step 704, extracting the first feature vector of the target image and the second feature vector of each image to be retrieved in the image set to be retrieved according to the image feature extraction method.
Specifically, to obtain more discriminative features and improve the recognition accuracy of Re-id, in this embodiment the first feature vector of the target image and the second feature vector of each image to be retrieved in the image set to be retrieved are extracted by the image feature extraction method above, and Re-id recognition is then performed from the extracted first and second feature vectors.
Step 706, calculating the Euclidean distance between the first feature vector of the target image and the second feature vector of each image to be retrieved in the image set to be retrieved.
The similarity between two images can be measured by the similarity between their feature vectors, which in turn can be measured by a distance such as the Euclidean distance. In this embodiment, the Euclidean distance therefore measures the degree of similarity between the target image and each image to be retrieved: the smaller the distance, the more similar the images and the greater the probability that they show the same person. Specifically, the similarity is obtained by computing the Euclidean distance between the first feature vector of the target image and the second feature vector of each image to be retrieved in the image set to be retrieved.
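For reference, the Euclidean distance between a first feature vector u and a second feature vector v of dimension n is:

```latex
d(\mathbf{u}, \mathbf{v}) = \lVert \mathbf{u} - \mathbf{v} \rVert_2
= \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}
```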
And step 708, generating a ranking result of the images to be retrieved in the image set to be retrieved based on the Euclidean distances, and taking the ranking result as the pedestrian re-identification result.
Specifically, the Euclidean distances calculated above between the first feature vector of the target image and the second feature vector of each image to be retrieved give the degree of similarity between the target image and each image to be retrieved, and the images to be retrieved in the image set to be retrieved are then ranked by that degree of similarity.
According to this pedestrian re-identification method, the first feature vector of the target image and the second feature vector of each image to be retrieved are extracted with the image feature extraction method above, the Euclidean distance between the first feature vector and each second feature vector is calculated, and a ranking of the images to be retrieved is generated from these distances and taken as the pedestrian re-identification result, which improves the discriminability of the extracted features and thereby the recognition accuracy of Re-id.
In one embodiment, suppose there is a query picture set q and a queried (gallery) picture set g. For each picture in q, a first feature vector is extracted by the image feature extraction method above, while second feature vectors are extracted for all pictures in g. From all these feature vectors, the Euclidean distances between every first feature vector in q and every second feature vector in g are calculated, and the pictures are sorted by distance: for a given picture q1 in q, the more similar a picture in g, the earlier it ranks, i.e. the greater the probability that it shows the same person.
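As an illustrative sketch (function and variable names are ours), the distance computation and ranking over the sets q and g could be written as:

```python
import torch

def rank_gallery(query_feats, gallery_feats):
    """Sketch of steps 704-708: pairwise Euclidean distances between the
    first feature vectors of q and the second feature vectors of g,
    sorted ascending so the most similar gallery pictures rank first."""
    dists = torch.cdist(query_feats, gallery_feats, p=2)  # shape [|q|, |g|]
    order = dists.argsort(dim=1)   # per-query ranking, closest first
    return dists, order

# Usage with features from the trained extractor, e.g.:
# q_feats = model(q_batch); g_feats = model(g_batch)
# _, ranking = rank_gallery(q_feats, g_feats)
```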
It should be understood that although the steps in the flowcharts of figs. 1-7 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in figs. 1-7 may comprise multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different times; their order of execution is likewise not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided an image feature extraction device including: a model building module 801, a model training module 802, and a feature extraction module 803, wherein:
the model construction module 801 is used for constructing a basic module for adaptively selecting a convolution kernel shape, and obtaining a deep learning feature extraction network model based on a plurality of basic modules connected in series;
the model training module 802 is configured to train a deep learning feature extraction network model by using a forward propagation algorithm based on a pedestrian sample image set until model parameters converge;
and the feature extraction module 803 is configured to scan an input image by using the trained deep learning feature extraction network model, where the input image sequentially passes through a plurality of basic modules connected in series, and performs feature mapping to obtain a feature vector of the input image.
In one embodiment, model building module 801 is specifically configured for: establishing a first convolutional neural network according to a plurality of preset convolution kernel shapes, the first convolutional neural network being used to extract from input data first feature maps of a plurality of branches corresponding respectively to the preset convolution kernel shapes; accessing a first fusion layer after the first convolutional neural network, and fusing the first feature maps of the branches through the first fusion layer to obtain a second feature map; accessing a second convolutional neural network after the first fusion layer, and generating branch weights for each spatial position in the second feature map with the second convolutional neural network to obtain a weight feature map; accessing a branch splitting layer after the second convolutional neural network, and splitting the weight feature map by branch to obtain branch weight values in one-to-one correspondence with the branches; accessing a weighting calculation layer after the branch splitting layer, and computing a plurality of weighted third feature maps from the branch weight values and the first feature maps of the corresponding branches; and sequentially accessing a second fusion layer, a residual connection layer, and an activation layer after the weighting calculation layer to obtain the basic module for adaptively selecting the convolution kernel shape, fusing the third feature maps through the second fusion layer to obtain a fourth feature map, residual-connecting the fourth feature map with the input data through the residual connection layer, and obtaining, after the activation layer, the semantic features of the input data, which are the output of the basic module.
In one embodiment, the first convolutional neural network includes a plurality of preset convolution kernels with different aspect ratios, for example 1:9, 3:3, and 9:1 respectively.
In one embodiment, the second convolutional neural network employs a convolution kernel of size 1 × 1; generating a branch weight for each spatial position in the second feature map using the second convolutional neural network comprises: scanning each spatial position in the second feature map with the 1 × 1 convolution kernel to generate branch weights corresponding respectively to the branches at that position, where the branch weights of the branches at the same spatial position sum to 1.
In one embodiment, obtaining the feature vector of the input image after the input image is feature-mapped sequentially through the plurality of serially connected basic modules comprises: the input of the first basic module among the serially connected basic modules is the input image; from the second basic module onward, the input data of each basic module is the output of the previous one, the semantic features output by a module being the input data of the next; and the semantic features output by the last basic module, after a pooling operation, form the feature vector of the input image.
In one embodiment, model training module 802 is specifically configured to: inputting a pedestrian sample image set into a deep learning feature extraction network model for forward propagation, wherein the pedestrian sample image set is provided with a sample label marked with a target pedestrian identifier; training a deep learning feature extraction network model by using a cross entropy loss function; and optimizing the cross entropy loss function by a gradient descent method until the cross entropy loss function reaches the minimum value to obtain a convergent model parameter.
For specific definition of the image feature extraction device, reference may be made to the above definition of the image feature extraction method, which is not described herein again. The modules in the image feature extraction device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, as shown in fig. 9, there is provided a pedestrian re-recognition apparatus including: an image obtaining module 901, a feature vector extracting module 902, a distance calculating module 903 and a recognition result generating module 904, wherein:
an image obtaining module 901, configured to obtain a target image and an image set to be retrieved, where the target image includes a target pedestrian;
a feature vector extraction module 902, configured to extract a first feature vector of a target image and a second feature vector of each image to be retrieved in an image set to be retrieved according to the above-mentioned image feature extraction method;
a distance calculating module 903, configured to calculate a euclidean distance between a first feature vector of a target image and a second feature vector of each image to be retrieved in an image set to be retrieved;
and the identification result generating module 904 is configured to generate a ranking result of the images to be retrieved in the image set to be retrieved based on the Euclidean distances, and use the ranking result as the pedestrian re-identification result.
For specific definition of the pedestrian re-identification device, reference may be made to the above definition of the pedestrian re-identification method, and details are not repeated here. The modules in the pedestrian re-identification device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing a pedestrian sample image set for model training and image data to be subjected to feature extraction or Re-id identification. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an image feature extraction method or a pedestrian re-identification method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
constructing a basic module for adaptively selecting a convolution kernel shape, and obtaining a deep learning feature extraction network model based on a plurality of basic modules connected in series;
training a deep learning feature extraction network model by adopting a forward propagation algorithm based on a pedestrian sample image set until model parameters are converged;
and scanning the input image by adopting the trained deep learning feature extraction network model, wherein the input image sequentially passes through a plurality of basic modules which are connected in series, and obtaining a feature vector of the input image after feature mapping.
In one embodiment, the processor, when executing the computer program, further implements the following steps: establishing a first convolutional neural network according to a plurality of preset convolution kernel shapes, the first convolutional neural network being used to extract from input data first feature maps of a plurality of branches corresponding respectively to the preset convolution kernel shapes; accessing a first fusion layer after the first convolutional neural network, and fusing the first feature maps of the branches through the first fusion layer to obtain a second feature map; accessing a second convolutional neural network after the first fusion layer, and generating branch weights for each spatial position in the second feature map with the second convolutional neural network to obtain a weight feature map; accessing a branch splitting layer after the second convolutional neural network, and splitting the weight feature map by branch to obtain branch weight values in one-to-one correspondence with the branches; accessing a weighting calculation layer after the branch splitting layer, and computing a plurality of weighted third feature maps from the branch weight values and the first feature maps of the corresponding branches; and sequentially accessing a second fusion layer, a residual connection layer, and an activation layer after the weighting calculation layer to obtain the basic module for adaptively selecting the convolution kernel shape, fusing the third feature maps through the second fusion layer to obtain a fourth feature map, residual-connecting the fourth feature map with the input data through the residual connection layer, and obtaining, after the activation layer, the semantic features of the input data, which are the output of the basic module.
In one embodiment, the second convolutional neural network employs a convolution kernel of size 1 × 1, and the processor, when executing the computer program, further implements the following steps: scanning each spatial position in the second feature map with the 1 × 1 convolution kernel to generate branch weights corresponding respectively to the branches at that position, where the branch weights of the branches at the same spatial position sum to 1.
In one embodiment, the processor, when executing the computer program, further implements the following steps: the input of the first basic module among the serially connected basic modules is the input image; from the second basic module onward, the input data of each basic module is the output of the previous one, the semantic features output by a module being the input data of the next; and the semantic features output by the last basic module, after a pooling operation, form the feature vector of the input image.
In one embodiment, the processor, when executing the computer program, further performs the steps of: inputting a pedestrian sample image set into a deep learning feature extraction network model for forward propagation, wherein the pedestrian sample image set is provided with a sample label marked with a target pedestrian identifier; training a deep learning feature extraction network model by using a cross entropy loss function; and optimizing the cross entropy loss function by a gradient descent method until the cross entropy loss function reaches the minimum value to obtain a convergent model parameter.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a target image and an image set to be retrieved, wherein the target image contains a target pedestrian;
extracting a first feature vector of the target image and a second feature vector of each image to be retrieved in the image set to be retrieved according to the image feature extraction method described above;
calculating the Euclidean distance between the first feature vector of the target image and the second feature vector of each image to be retrieved in the image set to be retrieved;
and generating a ranking result of the images to be retrieved in the image set to be retrieved based on the Euclidean distances, and taking the ranking result as the pedestrian re-identification result.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
constructing a basic module for adaptively selecting a convolution kernel shape, and obtaining a deep learning feature extraction network model based on a plurality of basic modules connected in series;
training a deep learning feature extraction network model by adopting a forward propagation algorithm based on a pedestrian sample image set until model parameters are converged;
and scanning the input image by adopting the trained deep learning feature extraction network model, wherein the input image sequentially passes through a plurality of basic modules which are connected in series, and obtaining a feature vector of the input image after feature mapping.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: establishing a first convolutional neural network according to a plurality of preset convolution kernel shapes, the first convolutional neural network being used to extract from input data first feature maps of a plurality of branches corresponding respectively to the preset convolution kernel shapes; accessing a first fusion layer after the first convolutional neural network, and fusing the first feature maps of the branches through the first fusion layer to obtain a second feature map; accessing a second convolutional neural network after the first fusion layer, and generating branch weights for each spatial position in the second feature map with the second convolutional neural network to obtain a weight feature map; accessing a branch splitting layer after the second convolutional neural network, and splitting the weight feature map by branch to obtain branch weight values in one-to-one correspondence with the branches; accessing a weighting calculation layer after the branch splitting layer, and computing a plurality of weighted third feature maps from the branch weight values and the first feature maps of the corresponding branches; and sequentially accessing a second fusion layer, a residual connection layer, and an activation layer after the weighting calculation layer to obtain the basic module for adaptively selecting the convolution kernel shape, fusing the third feature maps through the second fusion layer to obtain a fourth feature map, residual-connecting the fourth feature map with the input data through the residual connection layer, and obtaining, after the activation layer, the semantic features of the input data, which are the output of the basic module.
In one embodiment, the second convolutional neural network employs a convolution kernel of size 1 × 1, and the computer program, when executed by the processor, further implements the following steps: scanning each spatial position in the second feature map with the 1 × 1 convolution kernel to generate branch weights corresponding respectively to the branches at that position, where the branch weights of the branches at the same spatial position sum to 1.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: the input of the first basic module among the serially connected basic modules is the input image; from the second basic module onward, the input data of each basic module is the output of the previous one, the semantic features output by a module being the input data of the next; and the semantic features output by the last basic module, after a pooling operation, form the feature vector of the input image.
In one embodiment, the computer program when executed by the processor further performs the steps of: inputting a pedestrian sample image set into a deep learning feature extraction network model for forward propagation, wherein the pedestrian sample image set is provided with a sample label marked with a target pedestrian identifier; training a deep learning feature extraction network model by using a cross entropy loss function; and optimizing the cross entropy loss function by a gradient descent method until the cross entropy loss function reaches the minimum value to obtain a convergent model parameter.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a target image and an image set to be retrieved, wherein the target image contains a target pedestrian;
extracting a first feature vector of the target image and a second feature vector of each image to be retrieved in the image set to be retrieved according to the image feature extraction method described above;
calculating the Euclidean distance between the first feature vector of the target image and the second feature vector of each image to be retrieved in the image set to be retrieved;
and generating a ranking result of the images to be retrieved in the image set to be retrieved based on the Euclidean distances, and taking the ranking result as the pedestrian re-identification result.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and although their description is comparatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (11)

1. An image feature extraction method, characterized in that the method comprises:
constructing a basic module for adaptively selecting a convolution kernel shape, and obtaining a deep learning feature extraction network model based on a plurality of basic modules connected in series;
training the deep learning feature extraction network model by adopting a forward propagation algorithm based on a pedestrian sample image set until the model parameters converge;
scanning an input image with the trained deep learning feature extraction network model, and performing feature mapping on the input image sequentially through the plurality of serially connected basic modules to obtain a feature vector of the input image.
2. The image feature extraction method according to claim 1, wherein the constructing of the basic module for adaptively selecting a convolution kernel shape comprises:
establishing a first convolutional neural network according to a plurality of preset convolution kernel shapes, wherein the first convolutional neural network is used for extracting, from input data, first feature maps of a plurality of branches respectively corresponding to the plurality of preset convolution kernel shapes;
connecting a first fusion layer after the first convolutional neural network, and performing feature fusion on the first feature maps of the plurality of branches through the first fusion layer to obtain a second feature map;
connecting a second convolutional neural network after the first fusion layer, and generating, with the second convolutional neural network, a branch weight for each spatial position in the second feature map to obtain a weight feature map;
connecting a branch splitting layer after the second convolutional neural network, and splitting the weight feature map according to the plurality of branches to obtain branch weight values in one-to-one correspondence with the plurality of branches;
connecting a weighting calculation layer after the branch splitting layer, and calculating a plurality of weighted third feature maps from the branch weight values in one-to-one correspondence with the plurality of branches and the first feature maps of the plurality of branches;
and sequentially connecting a second fusion layer, a residual connection layer and an activation layer after the weighting calculation layer to obtain the basic module for adaptively selecting a convolution kernel shape, performing feature fusion on the plurality of third feature maps through the second fusion layer to obtain a fourth feature map, performing residual connection on the fourth feature map and the input data through the residual connection layer, and obtaining, after the activation layer, the semantic features of the input data that are output by the basic module.
3. The image feature extraction method according to claim 2, wherein the first convolutional neural network comprises convolution kernels with a plurality of different preset aspect ratios.
4. The image feature extraction method according to claim 2, wherein the second convolutional neural network employs a convolution kernel of size 1 x 1; and the generating, with the second convolutional neural network, a branch weight for each spatial position in the second feature map comprises:
scanning each spatial position in the second feature map with the 1 x 1 convolution kernel to generate, for each spatial position, branch weights respectively corresponding to the plurality of branches, wherein the branch weights of the plurality of branches sum to 1.
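Purely as an informal illustration and not as part of the claims, a module consistent with claims 2 to 4 can be sketched as follows, assuming PyTorch; the rectangular kernel shapes, the sum-based fusion layers, and the softmax normalization are assumptions compatible with, but not required by, the claim language:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaptiveKernelBlock(nn.Module):
        def __init__(self, channels, kernel_shapes=((1, 3), (3, 1), (3, 3))):
            super().__init__()
            # first convolutional neural network: one branch per preset kernel shape
            self.branches = nn.ModuleList(
                nn.Conv2d(channels, channels, k, padding=(k[0] // 2, k[1] // 2))
                for k in kernel_shapes)
            # second convolutional neural network: 1 x 1 kernel, one weight per branch
            self.weight_conv = nn.Conv2d(channels, len(kernel_shapes), 1)
            self.act = nn.ReLU()

        def forward(self, x):
            firsts = [b(x) for b in self.branches]                # first feature maps
            second = torch.stack(firsts).sum(0)                   # first fusion layer (sum)
            weights = F.softmax(self.weight_conv(second), dim=1)  # weight feature map, sums to 1
            splits = torch.chunk(weights, len(firsts), dim=1)     # branch splitting layer
            thirds = [w * f for w, f in zip(splits, firsts)]      # weighting calculation layer
            fourth = torch.stack(thirds).sum(0)                   # second fusion layer
            return self.act(fourth + x)                           # residual connection + activation

    block = AdaptiveKernelBlock(64)
    out = block(torch.randn(2, 64, 32, 16))   # semantic features, same shape as input

The effect is in the spirit of selective-kernel designs: at every spatial position the network learns how much to trust each kernel shape, which is what adaptively selecting a convolution kernel shape amounts to here.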
5. The image feature extraction method according to claim 2, wherein the obtaining of the feature vector of the input image after the input image is subjected to feature mapping sequentially through the plurality of serially connected basic modules comprises:
the input of the first of the plurality of serially connected basic modules is the input image; from the second basic module onward, the input data of each basic module is the output of the previous basic module, and the semantic features output by each basic module are the input data of the next basic module; and the semantic features output by the last basic module, after a pooling operation, form the feature vector of the input image.
6. The image feature extraction method according to any one of claims 1 to 5, wherein the training of the deep learning feature extraction network model by adopting a forward propagation algorithm based on the pedestrian sample image set comprises:
inputting the pedestrian sample image set into the deep learning feature extraction network model for forward propagation, wherein the pedestrian sample image set carries sample labels marking pedestrian identifiers;
training the deep learning feature extraction network model by using a cross entropy loss function;
and optimizing the cross entropy loss function by gradient descent until the cross entropy loss function reaches a minimum, thereby obtaining converged model parameters.
7. A pedestrian re-identification method, the method comprising:
acquiring a target image and an image set to be retrieved, wherein the target image contains a target pedestrian;
extracting, according to the image feature extraction method of any one of claims 1 to 6, a first feature vector of the target image and a second feature vector of each image to be retrieved in the image set to be retrieved;
calculating the Euclidean distance between the first feature vector of the target image and the second feature vector of each image to be retrieved in the image set to be retrieved;
and generating, based on the Euclidean distances, a ranking result of the images to be retrieved in the image set to be retrieved, the ranking result being the pedestrian re-identification result.
8. An image feature extraction device characterized by comprising:
the model construction module is used for constructing a basic module for adaptively selecting the convolution kernel shape and obtaining a deep learning feature extraction network model based on a plurality of basic modules which are connected in series;
the model training module is used for training the deep learning feature extraction network model by adopting a forward propagation algorithm based on a pedestrian sample image set until the model parameters converge;
and the feature extraction module is used for scanning an input image with the trained deep learning feature extraction network model, the input image sequentially passing through the plurality of serially connected basic modules for feature mapping to obtain a feature vector of the input image.
9. A pedestrian re-identification apparatus, the apparatus comprising:
the image acquisition module is used for acquiring a target image and an image set to be retrieved, wherein the target image contains a target pedestrian;
the feature vector extraction module is used for extracting, according to the image feature extraction method of any one of claims 1 to 6, a first feature vector of the target image and a second feature vector of each image to be retrieved in the image set to be retrieved;
the distance calculation module is used for calculating the Euclidean distance between the first feature vector of the target image and the second feature vector of each image to be retrieved in the image set to be retrieved;
and the recognition result generation module is used for generating, based on the Euclidean distances, a ranking result of the images to be retrieved in the image set to be retrieved, and taking the ranking result as the pedestrian re-identification result.
10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN201911156432.0A 2019-11-22 2019-11-22 Image feature extraction method, pedestrian re-identification method, device and computer equipment Pending CN110942012A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911156432.0A CN110942012A (en) 2019-11-22 2019-11-22 Image feature extraction method, pedestrian re-identification method, device and computer equipment

Publications (1)

Publication Number Publication Date
CN110942012A true CN110942012A (en) 2020-03-31

Family

ID=69907374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911156432.0A Pending CN110942012A (en) 2019-11-22 2019-11-22 Image feature extraction method, pedestrian re-identification method, device and computer equipment

Country Status (1)

Country Link
CN (1) CN110942012A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103824054A (en) * 2014-02-17 2014-05-28 北京旷视科技有限公司 Cascaded depth neural network-based face attribute recognition method
CN106845406A (en) * 2017-01-20 2017-06-13 深圳英飞拓科技股份有限公司 Head and shoulder detection method and device based on multitask concatenated convolutional neutral net
CN108491864A (en) * 2018-02-27 2018-09-04 西北工业大学 Based on the classification hyperspectral imagery for automatically determining convolution kernel size convolutional neural networks
US10140544B1 (en) * 2018-04-02 2018-11-27 12 Sigma Technologies Enhanced convolutional neural network for image segmentation
CN109784182A (en) * 2018-12-17 2019-05-21 北京飞搜科技有限公司 Pedestrian recognition methods and device again

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582094A (en) * 2020-04-27 2020-08-25 西安交通大学 Method for identifying pedestrian by parallel selecting hyper-parameter design multi-branch convolutional neural network
CN111582094B (en) * 2020-04-27 2022-04-22 西安交通大学 Method for identifying pedestrian by parallel selecting hyper-parameter design multi-branch convolutional neural network
CN111860582A (en) * 2020-06-11 2020-10-30 北京市威富安防科技有限公司 Image classification model construction method and device, computer equipment and storage medium
CN111898619A (en) * 2020-07-13 2020-11-06 上海眼控科技股份有限公司 Picture feature extraction method and device, computer equipment and readable storage medium
CN111860374A (en) * 2020-07-24 2020-10-30 苏州浪潮智能科技有限公司 Pedestrian re-identification method, device, equipment and storage medium
CN112508021A (en) * 2020-12-23 2021-03-16 河南应用技术职业学院 Feature extraction method and device based on artificial intelligence image recognition
CN113095506A (en) * 2021-03-25 2021-07-09 北京大学 Machine learning method, system and medium based on end, edge and cloud cooperation
CN113096080A (en) * 2021-03-30 2021-07-09 四川大学华西第二医院 Image analysis method and system
CN113096080B (en) * 2021-03-30 2024-01-16 四川大学华西第二医院 Image analysis method and system
CN115091445A (en) * 2022-02-22 2022-09-23 湖南中科助英智能科技研究院有限公司 Object texture recognition method, device and equipment for mechanical arm grabbing
CN115091445B (en) * 2022-02-22 2024-09-13 湖南中科助英智能科技研究院有限公司 Object texture recognition method, device and equipment for manipulator grabbing
CN115858846A (en) * 2023-02-16 2023-03-28 云南派动科技有限公司 Deep learning-based skier image retrieval method and system
CN115858846B (en) * 2023-02-16 2023-04-21 云南派动科技有限公司 Skier image retrieval method and system based on deep learning
CN116884045A (en) * 2023-09-07 2023-10-13 腾讯科技(深圳)有限公司 Identity recognition method, identity recognition device, computer equipment and storage medium
CN116884045B (en) * 2023-09-07 2024-01-02 腾讯科技(深圳)有限公司 Identity recognition method, identity recognition device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110942012A (en) Image feature extraction method, pedestrian re-identification method, device and computer equipment
CN109241903B (en) Sample data cleaning method, device, computer equipment and storage medium
CN108304882B (en) Image classification method and device, server, user terminal and storage medium
CN112232293B (en) Image processing model training method, image processing method and related equipment
CN109063742B (en) Butterfly identification network construction method and device, computer equipment and storage medium
CN107122375B (en) Image subject identification method based on image features
CN109871821B (en) Pedestrian re-identification method, device, equipment and storage medium of self-adaptive network
CN109711366B (en) Pedestrian re-identification method based on group information loss function
CN109993102B (en) Similar face retrieval method, device and storage medium
CN109614935A (en) Car damage identification method and device, storage medium and electronic equipment
CN109033989B (en) Target identification method and device based on three-dimensional point cloud and storage medium
CN111179419A (en) Three-dimensional key point prediction and deep learning model training method, device and equipment
CN110598638A (en) Model training method, face gender prediction method, device and storage medium
CN110633751A (en) Training method of car logo classification model, car logo identification method, device and equipment
CN109117854A (en) Key point matching process, device, electronic equipment and storage medium
CN111783779A (en) Image processing method, apparatus and computer-readable storage medium
CN110147833A (en) Facial image processing method, apparatus, system and readable storage medium storing program for executing
CN113537180B (en) Tree obstacle identification method and device, computer equipment and storage medium
CN115546519B (en) Matching method of image and millimeter wave radar target for extracting pseudo-image features
CN109345604A (en) Image processing method, computer equipment and storage medium
CN116091596A (en) Multi-person 2D human body posture estimation method and device from bottom to top
CN111652245B (en) Vehicle contour detection method, device, computer equipment and storage medium
CN115018039A (en) Neural network distillation method, target detection method and device
CN114462628A (en) Data enhancement method, device, computing equipment and computer readable storage medium
CN110472092B (en) Geographical positioning method and system of street view picture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20230901