CN109165563B - Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product


Info

Publication number
CN109165563B
CN109165563B (application CN201810848366.2A)
Authority
CN
China
Prior art keywords
feature
image
features
sample
language
Prior art date
Legal status
Active
Application number
CN201810848366.2A
Other languages
Chinese (zh)
Other versions
CN109165563A (en)
Inventor
陈大鹏
李鸿升
刘希慧
邵静
王晓刚
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201810848366.2A
Publication of CN109165563A
Application granted
Publication of CN109165563B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a pedestrian re-identification method and apparatus, an electronic device, a storage medium, and a program product. The method includes: acquiring an image to be recognized and a candidate image set; performing feature extraction on the image to be recognized and each candidate image in the candidate image set by using a feature extraction network to obtain an intermediate feature to be recognized corresponding to the image to be recognized and a candidate intermediate feature corresponding to each candidate image, wherein the feature extraction network is obtained through cross-modal training on image features and language descriptions; and obtaining, from the candidate image set, a recognition result corresponding to the image to be recognized based on the intermediate feature to be recognized and the candidate intermediate features. By exploiting the natural correspondence between an image and the language describing it, and by further mining the correlation between local image regions and noun phrases through phrase reconstruction, the constraint on image feature learning is strengthened, the quality of the visual features used for pedestrian re-identification is improved, and the accuracy of pedestrian re-identification is thereby improved.

Description

Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product
Technical Field
The present application relates to computer vision technologies, and in particular, to a pedestrian re-identification method and apparatus, an electronic device, a storage medium, and a program product.
Background
Pedestrian re-identification is a key technology in intelligent video surveillance systems. It aims to find a target sample among a large number of candidate samples by measuring the similarity between a given target sample and each candidate sample. With the application of deep neural networks, the visual features used for pedestrian re-identification have been strengthened. To further improve the discriminative power of the features, some methods use auxiliary data; however, the following problems remain: relying on additional equipment or models increases the running cost and time of the algorithm, or a complex labeling format is defined for the auxiliary data, which increases the labor cost of data labeling.
Disclosure of Invention
The embodiment of the application provides a pedestrian re-identification technology.
According to an aspect of an embodiment of the present application, there is provided a pedestrian re-identification method, including:
acquiring an image to be identified and a candidate image set;
performing feature extraction on an image to be recognized and each candidate image in a candidate image set by using a feature extraction network to obtain an intermediate feature to be recognized corresponding to the image to be recognized and a candidate intermediate feature corresponding to the candidate image, wherein the feature extraction network is obtained by image feature and language description cross-modal training;
and obtaining a recognition result corresponding to the image to be recognized from the candidate image set based on the intermediate feature to be recognized and the candidate intermediate feature, wherein the recognition result comprises at least one candidate image.
Optionally, the obtaining, from the candidate image set, a recognition result corresponding to the image to be recognized based on the intermediate feature to be recognized and the candidate intermediate feature includes:
the intermediate features to be identified and the candidate intermediate features respectively pass through an average pooling layer and a full connection layer to obtain the features to be identified and the candidate features;
and obtaining a recognition result corresponding to the image to be recognized from the candidate image set based on the feature to be recognized and the candidate feature.
Optionally, the method further comprises: extracting the features of the description characters related to the image to be recognized based on a language recognition network to obtain language features;
and screening the recognition result based on the language features to obtain an updated recognition result corresponding to the image to be recognized, wherein the updated recognition result comprises at least one candidate image.
Optionally, the screening the recognition result based on the language feature to obtain an updated recognition result corresponding to the image to be recognized includes:
screening based on the distance between the language feature and at least one candidate intermediate feature corresponding to the recognition result;
and obtaining at least one candidate intermediate feature whose distance is smaller than or equal to a preset value, and taking the candidate image corresponding to the obtained candidate intermediate feature as the updated recognition result.
Optionally, the method further comprises:
performing feature extraction on at least one description word related to the image to be recognized based on the language recognition network to obtain word features, wherein each description word corresponds to at least one part in the image to be recognized;
and screening the recognition result or the updated recognition result based on the word characteristics to obtain a target recognition result corresponding to the image to be recognized, wherein the target recognition result comprises at least one candidate image.
Optionally, the screening the recognition result or the updated recognition result based on the word feature to obtain the target recognition result corresponding to the image to be recognized includes:
screening based on the distance between the word feature and at least one candidate intermediate feature corresponding to the recognition result or the updated recognition result;
and obtaining at least one candidate feature with the distance smaller than or equal to a preset value, and taking a candidate image corresponding to the obtained candidate intermediate feature as the target identification result.
Optionally, the obtaining a feature extraction network through cross-modality training of image features and language description includes:
inputting a sample image into the feature extraction network to obtain sample image features, wherein the sample image comprises a text description label;
performing feature extraction on the character description label based on a language identification network to obtain a sample language feature;
training the feature extraction network based on the sample language features and the sample image features.
Optionally, the training the feature extraction network based on the sample language features and the sample image features includes:
obtaining a global correlation probability based on the sample language features and the sample image features;
obtaining a global loss by using a binary cross-entropy loss based on the global correlation probability and the correlation between the sample image and the text description label;
training the feature extraction network based on the global loss.
Optionally, the obtaining a global correlation probability based on the sample language feature and the sample image feature includes:
pooling the sample image features and subtracting the sample language features to obtain difference features;
calculating the square value of the difference value characteristic element by element to obtain a combined characteristic;
and performing normalization processing on the combined features to obtain global correlation probability for expressing global correlation.
Optionally, before the extracting features of the text description label based on the language identification network to obtain the sample language features, the method further includes:
pre-training the language identification network based on sample words, wherein the sample words comprise annotated language features.
Optionally, the pre-training the language recognition network based on the sample word includes:
inputting the sample characters into the language identification network to obtain a first prediction sample characteristic;
adjusting parameters of the language identification network based on the first predicted sample features and the annotated language features.
Optionally, the method further comprises: performing feature extraction on at least one phrase label in the character description label based on the language identification network to obtain at least one local feature, wherein each phrase label is used for describing at least one region in the sample image;
obtaining a local loss based on the local features and the sample image features;
the training the feature extraction network based on the global loss comprises:
training the feature extraction network based on the global loss and the local loss.
Optionally, before performing feature extraction on at least one phrase label in the word description label based on the language identification network to obtain at least one local feature, the method further includes:
segmenting the word description labels to obtain at least one phrase label, wherein each phrase label comprises at least one noun, the obtained phrase labels correspond to label probability, and each probability value represents the probability that the phrase labels correspond to the sample images.
Optionally, the segmenting the word description label to obtain at least one phrase label includes:
performing part-of-speech recognition on each word in the character description label to obtain the part-of-speech corresponding to each word;
and dividing the word description label into at least one phrase label based on the part of speech and a preset phrase chunking condition.
Optionally, the determining a local loss based on the local feature and the sample image feature comprises:
performing pooling operation on the sample image characteristics to obtain a global characteristic map;
obtaining a saliency weight based on the global feature map and the local features;
determining a prediction probability corresponding to each phrase label based on the significance weight and the sample image features;
and obtaining the local loss based on the prediction probability and the labeling probability corresponding to the phrase labeling.
Optionally, the obtaining a saliency weight based on the global feature map and the local feature includes:
respectively subtracting the characteristic value of each position in the global characteristic diagram from the local characteristic to obtain a local difference characteristic;
calculating a square value of each element in the local difference characteristic to obtain a local joint characteristic;
based on the local joint features, a saliency weight is obtained.
Optionally, the obtaining a significance weight based on the local union feature includes:
processing the local combined features based on a full-connection network to obtain a matching value expressing the matching degree of the phrase labels and the sample images;
and normalizing the vector formed by the matching value of each position in the global feature map corresponding to each phrase label to obtain the significance weight corresponding to each phrase label.
Optionally, the determining a prediction probability corresponding to each phrase label based on the saliency weight and the sample image feature includes:
multiplying the feature value of each position in the sample image features by the significance weight to obtain a weighted feature vector set corresponding to each phrase label;
adding the vectors in the weighted feature vector set to obtain local visual features in the sample image corresponding to the phrase labels;
obtaining a prediction probability of each word in the phrase label based on the local visual features;
and determining the prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label.
Optionally, the obtaining a prediction probability of each word in the phrase label based on the local visual features includes:
decomposing the phrase label into a word sequence, inputting the local visual features into a long short-term memory (LSTM) network, and determining at least one hidden variable, wherein each word corresponds to a feature vector;
at each time step, the hidden variable of the previous time step and the feature vector corresponding to the current word are processed by the LSTM network to obtain the hidden variable of the next time step;
performing linear mapping on the basis of the at least one hidden variable to obtain a prediction vector of each word;
and obtaining the prediction probability of each word in the phrase label based on the prediction vector.
Optionally, determining a prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label includes:
and taking the product of the prediction probabilities of the words in the phrase labels as the prediction probability of the phrase labels.
Optionally, the training the feature extraction network based on the global loss and the local loss includes:
summing the global loss and the local loss to obtain a sum loss;
adjusting a parameter of the feature extraction network based on the sum loss.
Optionally, the method further comprises:
inputting an identity sample image into the feature extraction network to obtain a sample prediction feature, wherein the identity sample image comprises an annotation identification feature;
processing the sample prediction characteristics through a pooling layer and a full-connection layer to obtain prediction identification characteristics;
adjusting parameters of the feature extraction network, the pooling layer, and the fully-connected layer based on the annotation identification features and the predicted identification features.
According to another aspect of the embodiments of the present application, there is provided a pedestrian re-recognition apparatus including:
the image acquisition unit is used for acquiring an image to be identified and a candidate image set;
the feature extraction unit is used for extracting features of the image to be recognized and each candidate image in the candidate image set by using a feature extraction network to obtain an intermediate feature to be recognized corresponding to the image to be recognized and a candidate intermediate feature corresponding to the candidate image, and the feature extraction network is obtained through image feature and language description cross-modal training;
and the result identification unit is used for obtaining an identification result corresponding to the image to be identified from the candidate image set based on the intermediate feature to be identified and the candidate intermediate feature, and the identification result comprises at least one candidate image.
Optionally, the result identification unit is configured to obtain the to-be-identified feature and the candidate feature by respectively passing through an average pooling layer and a full connection layer by the to-be-identified intermediate feature and the candidate intermediate feature; and obtaining a recognition result corresponding to the image to be recognized from the candidate image set based on the feature to be recognized and the candidate feature.
Optionally, the method further comprises:
the language screening unit is used for extracting the characteristics of the description characters related to the image to be recognized based on a language recognition network to obtain language characteristics; and screening the recognition result based on the language features to obtain an updated recognition result corresponding to the image to be recognized, wherein the updated recognition result comprises at least one candidate image.
Optionally, the language screening unit is configured to, when the recognition result is screened based on the language feature and an updated recognition result corresponding to the image to be recognized is obtained, screen based on a distance between the language feature and at least one candidate intermediate feature corresponding to the recognition result; and obtain at least one candidate intermediate feature whose distance is smaller than or equal to a preset value, and take a candidate image corresponding to the obtained candidate intermediate feature as the updated recognition result.
Optionally, the method further comprises:
the word screening unit is used for extracting the characteristics of at least one description word related to the image to be recognized based on the language recognition network to obtain word characteristics, and each description word corresponds to at least one part in the image to be recognized; and screening the recognition result or the updated recognition result based on the word characteristics to obtain a target recognition result corresponding to the image to be recognized, wherein the target recognition result comprises at least one candidate image.
Optionally, the word screening unit is configured to screen the recognition result or the updated recognition result based on the word feature, and when a target recognition result corresponding to the image to be recognized is obtained, is configured to screen based on a distance between the word feature and at least one candidate intermediate feature corresponding to the recognition result or the updated recognition result; and obtaining at least one candidate feature of which the distance is smaller than or equal to a preset value, and taking a candidate image corresponding to the obtained candidate intermediate feature as the target identification result.
Optionally, the apparatus further comprises:
the sample feature extraction unit is used for inputting a sample image into the feature extraction network to obtain sample image features, wherein the sample image comprises a text description label;
the language feature extraction unit is used for extracting features of the character description labels based on a language identification network to obtain sample language features;
and the network training unit is used for training the feature extraction network based on the sample language features and the sample image features.
Optionally, the network training unit includes:
a global probability module for obtaining a global correlation probability based on the sample language feature and the sample image feature;
the global loss module is used for obtaining a global loss by using a binary cross-entropy loss based on the global correlation probability and the correlation between the sample image and the text description label;
a loss training module to train the feature extraction network based on the global loss.
Optionally, the global probability module is specifically configured to pool the sample image features and subtract the sample language features to obtain difference features; calculating the square value of the difference value characteristic element by element to obtain a combined characteristic; and performing normalization processing on the combined features to obtain global correlation probability for expressing global correlation.
Optionally, the apparatus further comprises:
and the pre-training unit is used for pre-training the language identification network based on sample words, wherein the sample words comprise annotated language features.
Optionally, the pre-training unit is specifically configured to input the sample words into the language identification network to obtain a first predicted sample feature; and adjust parameters of the language identification network based on the first predicted sample feature and the annotated language features.
Optionally, the network training unit further includes:
the local feature extraction module is used for performing feature extraction on at least one phrase label in the character description labels based on the language identification network to obtain at least one local feature, wherein each phrase label is used for describing at least one region in the sample image;
a local loss module for obtaining a local loss based on the local feature and the sample image feature;
the loss training module is specifically configured to train the feature extraction network based on the global loss and the local loss.
Optionally, the network training unit further includes:
the phrase segmentation module is used for segmenting the word description labels to obtain at least one phrase label, each phrase label comprises at least one noun, the obtained phrase labels correspond to a label probability, and each probability value represents the probability that the phrase labels correspond to the sample images.
Optionally, the phrase segmentation module is specifically configured to perform part-of-speech recognition on each word in the text description label to obtain the part of speech corresponding to each word; and divide the word description label into at least one phrase label based on the part of speech and a preset phrase chunking condition.
Optionally, the local loss module includes:
the pooling module is used for pooling the sample image characteristics to obtain a global characteristic map;
a weighting module for obtaining a saliency weight based on the global feature map and the local features;
a probability prediction module for determining a prediction probability corresponding to each phrase label based on the significance weight and the sample image feature;
and the local loss obtaining module is used for obtaining the local loss based on the prediction probability and the labeling probability corresponding to the phrase labeling.
Optionally, the weighting module is configured to subtract the feature value of each position in the global feature map from the local feature to obtain a local difference feature; calculating a square value of each element in the local difference characteristic to obtain a local joint characteristic; based on the local joint features, a saliency weight is obtained.
Optionally, the weighting module is configured to, when obtaining the significance weight based on the local joint feature, process the local joint feature based on a fully-connected network to obtain a matching value expressing a matching degree between the phrase label and the sample image; and normalizing the vector formed by the matching value of each position in the global feature map corresponding to each phrase label to obtain the significance weight corresponding to each phrase label.
Optionally, the probability prediction module is configured to multiply a feature value of each position in the sample image features by the significance weight to obtain a weighted feature vector set corresponding to each phrase label; adding the vectors in the weighted feature vector set to obtain local visual features in the sample image corresponding to the phrase labels; obtaining a prediction probability of each word in the phrase label based on the local visual features; and determining the prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label.
Optionally, when the probability prediction module obtains the prediction probability of each word in the phrase label based on the local visual features, the probability prediction module is configured to decompose the phrase label into a word sequence, input the local visual features into a long short-term memory (LSTM) network, and determine at least one hidden variable, where each word corresponds to one feature vector; at each time step, the hidden variable of the previous time step and the feature vector corresponding to the current word are processed by the LSTM network to obtain the hidden variable of the next time step; perform linear mapping on the basis of the at least one hidden variable to obtain a prediction vector of each word; and obtain the prediction probability of each word in the phrase label based on the prediction vector.
Optionally, when the probability prediction module determines the prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label, the probability prediction module is configured to take a product of the prediction probabilities of the words in the phrase label as the prediction probability of the phrase label.
Optionally, the loss training module is specifically configured to sum the global loss and the local loss to obtain a sum loss; adjusting a parameter of the feature extraction network based on the sum loss.
Optionally, the apparatus further comprises:
the identity sample unit is used for inputting an identity sample image into the feature extraction network to obtain sample prediction features, wherein the identity sample image comprises mark identification features;
the preset identification unit is used for processing the sample prediction characteristics through a pooling layer and a full-connection layer to obtain prediction identification characteristics;
and the parameter adjusting unit is used for adjusting the parameters of the feature extraction network, the pooling layer and the full-connection layer based on the labeled identification feature and the predicted identification feature.
According to a further aspect of the embodiments of the present application, there is provided an electronic device including a processor, the processor including the pedestrian re-identification apparatus as described in any one of the above.
According to still another aspect of an embodiment of the present application, there is provided an electronic device including: a memory for storing executable instructions;
and a processor in communication with the memory for executing the executable instructions to perform the operations of the pedestrian re-identification method as in any one of the above.
According to still another aspect of the embodiments of the present application, there is provided a computer-readable storage medium storing computer-readable instructions that, when executed, perform the operations of the pedestrian re-identification method according to any one of the above.
According to yet another aspect of embodiments of the present application, there is provided a computer program product comprising computer readable code which, when run on a device, executes instructions for implementing a pedestrian re-identification method as described in any one of the above.
Based on the pedestrian re-identification method and apparatus, the electronic device, the storage medium, and the program product provided by the above embodiments of the present application, an image to be recognized and a candidate image set are acquired; feature extraction is performed on the image to be recognized and each candidate image in the candidate image set by using a feature extraction network to obtain an intermediate feature to be recognized corresponding to the image to be recognized and a candidate intermediate feature corresponding to each candidate image, the feature extraction network being obtained through cross-modal training on image features and language descriptions; and a recognition result corresponding to the image to be recognized, comprising at least one candidate image, is obtained from the candidate image set based on the intermediate feature to be recognized and the candidate intermediate features. By performing pedestrian re-identification with a feature extraction network obtained through such cross-modal training, the natural correspondence between an image and the language describing it is exploited, the correlation between local image regions and noun phrases is further mined through phrase reconstruction, the constraint on image feature learning is strengthened, the quality of the visual features used for pedestrian re-identification is improved, and the accuracy of pedestrian re-identification is thereby improved.
The technical solution of the present application is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
The present application may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
fig. 1 is a flowchart of an embodiment of a pedestrian re-identification method according to the present application.
FIG. 2 is a flowchart of step 130 in an embodiment of the pedestrian re-identification method of the present application.
Fig. 3 is a flowchart of another embodiment of the pedestrian re-identification method of the present application.
FIG. 4 is a flowchart of step 350 in another embodiment of the pedestrian re-identification method of the present application.
FIG. 5 is a flowchart of training the feature extraction network according to an embodiment of the present application.
FIG. 6 is a flowchart illustrating an example of noun phrase extraction in an embodiment of the present application.
FIG. 7 is a structural diagram illustrating an example of an association between a reconstructed phrase label and an image region according to the present disclosure.
Fig. 8 is a schematic structural diagram of a pedestrian re-identification apparatus according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of an electronic device suitable for implementing the terminal device or the server according to the embodiment of the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
To further improve the discriminative power of features, some methods use auxiliary data, such as camera numbers, body poses, pedestrian attributes, infrared or depth images, and the like. These methods either rely on additional equipment or models during testing, such as infrared or depth cameras and pose estimation models, which increases the running cost and time of the algorithm; or they define a complex labeling format for the auxiliary data, for example requiring dozens of attributes of a person to be labeled one by one, which increases the labor cost of data labeling. In view of the above problems, the embodiments of the present disclosure use natural language as auxiliary training data to improve the discriminability and interpretability of image features.
Fig. 1 is a flowchart of an embodiment of a pedestrian re-identification method according to the present application. As shown in fig. 1, the method of this embodiment includes:
and step 110, acquiring an image to be identified and a candidate image set.
The image to be recognized may be a pedestrian image that needs to be re-recognized, the candidate image set may include at least one candidate image, and this embodiment needs to acquire at least one candidate image that matches the image to be recognized from the candidate image set.
And 120, performing feature extraction on the image to be recognized and each candidate image in the candidate image set by using a feature extraction network to obtain an intermediate feature to be recognized corresponding to the image to be recognized and a candidate intermediate feature corresponding to the candidate image, wherein the feature extraction network is obtained by image feature and language description cross-modal training.
With the application of deep neural networks, auxiliary data may be used to further improve the discriminative power of features. In this embodiment, natural language is used as auxiliary training data to improve the discriminability and interpretability of image features. Optionally, feature extraction is performed on the image to be recognized and each candidate image in the candidate image set through a feature extraction network obtained by cross-modal training on image features and language, which improves the quality of the image feature encoding produced by the trained feature extraction network.
And step 130, obtaining a recognition result corresponding to the image to be recognized from the candidate image set based on the intermediate feature to be recognized and the candidate intermediate feature.
Wherein the recognition result comprises at least one candidate image.
Based on the pedestrian re-identification method provided by the embodiment of the application, an image to be recognized and a candidate image set are acquired; feature extraction is performed on the image to be recognized and each candidate image in the candidate image set by using a feature extraction network to obtain an intermediate feature to be recognized corresponding to the image to be recognized and a candidate intermediate feature corresponding to each candidate image, the feature extraction network being obtained through cross-modal training on image features and language descriptions; and a recognition result corresponding to the image to be recognized, comprising at least one candidate image, is obtained from the candidate image set based on the intermediate feature to be recognized and the candidate intermediate features. By performing pedestrian re-identification with a feature extraction network obtained through such cross-modal training, the natural correspondence between an image and the language describing it is exploited, the correlation between local image regions and noun phrases is further mined through phrase reconstruction, the constraint on image feature learning is strengthened, the quality of the visual features used for pedestrian re-identification is improved, and the accuracy of pedestrian re-identification is thereby improved.
FIG. 2 is a flowchart of step 130 in an embodiment of the pedestrian re-identification method of the present application. As shown in fig. 2, in one or more alternative embodiments, step 130 may include:
in step 1302, the intermediate feature to be identified and the candidate intermediate feature are respectively processed by the average pooling layer and the full connection layer to obtain the feature to be identified and the candidate feature.
In this embodiment, the intermediate features obtained through the feature extraction network may be further subjected to average pooling and full-connection processing to obtain visual features (to-be-identified features and candidate features) describing the to-be-identified image and the candidate image set.
And 1304, obtaining a recognition result corresponding to the image to be recognized from the candidate image set based on the feature to be recognized and the candidate feature.
In this embodiment, the similarity between the feature to be recognized and each candidate feature is calculated, so that the recognition result of the image to be recognized can be determined based on the similarity, realizing pedestrian re-identification. For example, the distance (e.g., cosine distance, Euclidean distance) between the feature to be recognized and a candidate feature is calculated, and this distance is used as the similarity between the image to be recognized and the corresponding candidate image.
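As a minimal illustration of this retrieval step, the following sketch ranks candidate images by cosine similarity to the feature of the image to be recognized; it assumes PyTorch tensors and that the features have already been produced by the feature extraction network followed by the pooling and fully connected layers. The function and variable names are illustrative, not from the patent.

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_feat: torch.Tensor, cand_feats: torch.Tensor, top_k: int = 10):
    """Rank candidate images by cosine similarity to the query (to-be-recognized) feature.

    query_feat: (D,) feature of the image to be recognized.
    cand_feats: (N, D) features of the N candidate images.
    Returns the indices of the top_k most similar candidates and their similarity scores.
    """
    sims = F.cosine_similarity(query_feat.unsqueeze(0), cand_feats, dim=1)  # (N,)
    scores, idx = sims.topk(min(top_k, cand_feats.size(0)))
    return idx, scores
```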
Fig. 3 is a flowchart of another embodiment of the pedestrian re-identification method of the present application. As shown in fig. 3, in one or more alternative embodiments, the method further includes:
and 340, extracting the features of the description characters related to the image to be recognized based on the language recognition network to obtain the language features.
In practical applications, when searching for a person (for example, a lost child), the provided image may be accompanied by a language description. Candidate results that do not match the description can be quickly screened out through the language description, improving the efficiency of pedestrian re-identification. The language description may be an overall description of the image or a description of at least one local part of the image.
And 350, screening the recognition result based on the language features to obtain an updated recognition result corresponding to the image to be recognized, wherein the updated recognition result comprises at least one candidate image.
FIG. 4 is a flowchart of step 350 in another embodiment of the pedestrian re-identification method of the present application. As shown in fig. 4, in one or more alternative embodiments, step 350 may include:
step 3502, a screening is performed based on the distance between the language feature and at least one candidate intermediate feature corresponding to the recognition result.
The language description and the image are two different forms of expression, and processing is needed to screen images based on a language description. In this embodiment, the corresponding language feature and candidate intermediate features are obtained through the language recognition network and the feature extraction network, respectively; the similarity between the language description and an image is determined by the distance between the features (e.g., Euclidean distance, cosine distance), and screening of the images based on the language description is thereby achieved.
Step 3504, at least one candidate intermediate feature with a distance less than or equal to a preset value is obtained, and the candidate image corresponding to the obtained candidate intermediate feature is used as the updated recognition result.
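A hedged sketch of this screening step, assuming Euclidean distance as the metric and assuming the candidate intermediate features have been pooled to the same dimension as the language feature; the threshold and all names are illustrative.

```python
import torch

def filter_by_language(lang_feat: torch.Tensor, cand_feats: torch.Tensor,
                       cand_indices: torch.Tensor, threshold: float):
    """Keep only candidates whose feature lies within a preset distance of the language feature.

    lang_feat: (D,) feature of the description text from the language recognition network.
    cand_feats: (M, D) (pooled) candidate intermediate features of the current recognition result.
    cand_indices: (M,) original indices of these candidates in the candidate image set.
    """
    dists = torch.norm(cand_feats - lang_feat.unsqueeze(0), dim=1)  # per-candidate Euclidean distance
    keep = dists <= threshold
    return cand_indices[keep], dists[keep]
```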
In one or more optional embodiments, further comprising:
performing feature extraction on at least one description word related to the image to be recognized based on a language recognition network to obtain word features, wherein each description word corresponds to at least one part in the image to be recognized;
when the pedestrian is re-identified, the whole description words of the image may not be obtained, and only the description words of the local part in the image may be performed, for example: for a pedestrian, the clothing condition of the pedestrian is described, at this time, at least one word feature corresponding to the description word needs to be obtained through the language identification network in the embodiment, and the efficiency of pedestrian re-identification can be improved by screening the identification result based on the word feature.
And screening the recognition result or the updated recognition result based on the word characteristics to obtain a target recognition result corresponding to the image to be recognized, wherein the target recognition result comprises at least one candidate image.
The recognition result, or the updated recognition result, can be screened through the word features. Screening by word features allows the images to be filtered based on a description of part of the image content, making language-based screening of images more convenient.
Optionally, the screening the recognition result or the updated recognition result based on the word feature to obtain the target recognition result corresponding to the image to be recognized includes:
screening based on the distance between the word characteristics and at least one candidate intermediate characteristic corresponding to the recognition result or the updated recognition result;
optionally, a smaller distance (e.g., euclidean distance, cosine distance, etc.) between two candidate intermediate features indicates a greater degree of association between words or images corresponding to the two features, and therefore, the recognition results or the updated recognition results are screened according to the distance between the candidate intermediate features.
And obtaining at least one candidate feature with the distance smaller than or equal to a preset value, and taking a candidate image corresponding to the obtained candidate intermediate feature as a target recognition result.
The description words corresponding to the image to be recognized can comprise at least one, so that the obtained word features also comprise at least one, and the speed of re-recognition of the pedestrian can be increased by screening the candidate intermediate features and each word feature through the distance.
FIG. 5 is a flowchart of training the feature extraction network according to an embodiment of the present application. As shown in fig. 5, obtaining the feature extraction network through cross-modal training on image features and language descriptions in this embodiment includes:
step 510, inputting the sample image into a feature extraction network to obtain the sample image features.
Wherein the sample image includes textual description annotations.
And 520, extracting the characteristics of the character description labels based on the language identification network to obtain the sample language characteristics.
In one or more optional embodiments, before step 520 is executed, the language recognition network may be pre-trained based on sample words, where the sample words include annotated language features. Pre-training improves the ability of the language recognition network to extract word features, so that the features it extracts express the words more accurately and provide more accurate supervision information for training the feature extraction network.
Optionally, the pre-training process may include: inputting the sample characters into a language identification network to obtain a first prediction sample characteristic;
parameters of the language identification network are adjusted based on the first predicted sample features and the markup language features.
The language identification network used in this embodiment may be any existing neural network capable of extracting features from text; its specific structure is not limited in this embodiment. Training the language identification network is similar to training a general neural network and may include: obtaining a loss based on the predicted sample features and the annotated language features, and adjusting parameters of the language identification network by backpropagation of gradients based on the loss.
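A minimal pre-training loop consistent with the description above, under the assumption that the language identification network is an arbitrary text encoder and that the annotated language features are target vectors; mean-squared error is used here purely as an illustrative regression loss, since the embodiment does not fix the loss form.

```python
import torch

def pretrain_language_network(lang_net: torch.nn.Module, loader, epochs: int = 1, lr: float = 1e-4):
    """Pre-train a text encoder so its predicted features match the annotated language features."""
    optimizer = torch.optim.Adam(lang_net.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()  # illustrative regression loss between predicted and annotated features
    for _ in range(epochs):
        for tokens, target_feat in loader:      # sample words and their annotated language features
            pred_feat = lang_net(tokens)        # first predicted sample feature
            loss = loss_fn(pred_feat, target_feat)
            optimizer.zero_grad()
            loss.backward()                     # backpropagation of gradients
            optimizer.step()
```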
Step 530, training a feature extraction network based on the sample image features and the sample language features.
In the embodiment of the application, the feature extraction network is trained in combination with the descriptive text, which provides richer annotation information for the sample image and improves the accuracy of feature extraction by the feature extraction network.
Optionally, step 530 may include: obtaining a global correlation probability based on the sample language features and the sample image features;
obtaining a global loss by using a binary cross-entropy loss based on the global correlation probability and the correlation between the sample image and the text description label;
and training the feature extraction network based on the global loss.
The obtained global correlation is supervised with a binary cross-entropy loss, so that the joint feature of a related image-language pair is pushed toward 1 and the joint feature of an unrelated image-language pair is pushed toward 0.
Optionally, obtaining the global correlation probability based on the sample language feature and the sample image feature may include:
pooling the sample image features and subtracting them from the sample language features to obtain a difference feature;
calculating a square value element by element based on the difference characteristic to obtain a joint characteristic;
and performing normalization processing on the joint features to obtain global correlation probability for expressing global correlation.
When the sample image feature Ψ(I) and the sample language feature θ_g(T) describe the same object, for example the same pedestrian, a correlation can be established between them, and a discriminative method is used to model the correlation between Ψ(I) and θ_g(T). The supervised learning process may be as follows:
Ψ(I) and θ_g(T) are represented jointly. First, Ψ(I) is average-pooled to obtain a vector Ψ̄(I). The difference of the two vectors θ_g(T) and Ψ̄(I) gives a difference vector, and an element-wise square operation is then performed on each dimension of the difference vector to obtain a joint representation vector (joint feature) φ_g(I_n, T), which can be obtained based on the following formula (1):
φ_g(I_n, T) = (θ_g(T) - Ψ̄(I_n)) ⊙ (θ_g(T) - Ψ̄(I_n))    (1)
where ⊙ denotes element-wise multiplication; multiplying two identical vectors element-wise yields the element-wise square of the vector. The purpose of φ_g(I_n, T) is to express the correlation of the two vectors, for further predicting whether they are correlated.
The joint representation vector (joint feature) φ_g(I_n, T) is then linearly mapped into the range (0, 1), giving the global relevance of Ψ(I) and θ_g(T).
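The following sketch illustrates formula (1) and the subsequent mapping into (0, 1), assuming the pooled image feature and the language feature share the same dimension and realizing the linear mapping as a fully connected layer followed by a sigmoid; the binary cross-entropy supervision is indicated in the trailing comment. Names are illustrative.

```python
import torch
import torch.nn as nn

class GlobalRelevance(nn.Module):
    """Joint feature of formula (1) mapped to a global correlation probability in (0, 1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, 1)   # linear mapping of the joint feature to a scalar

    def forward(self, img_feat_map: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # img_feat_map: (B, C, H, W) intermediate feature Psi(I); lang_feat: (B, C) language feature theta_g(T)
        pooled = img_feat_map.mean(dim=(2, 3))           # average pooling of Psi(I)
        diff = lang_feat - pooled                        # difference vector
        joint = diff * diff                              # element-wise square, formula (1)
        return torch.sigmoid(self.fc(joint)).squeeze(1)  # global correlation probability

# Global loss: binary cross-entropy between the predicted probability and the 0/1 relatedness label, e.g.
# loss = nn.BCELoss()(model(images, texts), related_labels.float())
```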
In one or more optional embodiments, further comprising:
performing feature extraction on at least one phrase label in the character description label based on a language identification network to obtain at least one local feature, wherein each phrase label is used for describing at least one region in a sample image;
the adopted language identification network can be a language identification network for processing the character description label to obtain a sample language feature shared parameter, or different language identification networks, the phrase label is respectively subjected to feature extraction based on the language identification network, so that local features corresponding to the phrase label can be obtained, and each local feature corresponds to one area in the sample image.
Obtaining a local loss based on the local features and the sample image features;
in the network training process, the local features obtained by the language identification network are used, where each local feature corresponds to a region in the sample image; optionally, a binary cross-entropy loss is used to obtain the local loss.
Training a feature extraction network based on global losses, comprising:
and training the feature extraction network based on the global loss and the local loss.
Extraction of the text content may include the following step: preprocessing an original piece of text associated with an image. In practical applications, the original text used for training can be collected from the Internet; in research, a public dataset obtained from the Internet is used.
Optionally, before performing feature extraction on the text description label based on the language identification network to obtain a local language feature, the method further includes:
and segmenting the word description labels to obtain at least one phrase label, wherein each phrase label comprises at least one noun.
The obtained phrase labels each correspond to a label probability, and each probability value represents the probability that the phrase label corresponds to the sample image.
For a whole passage of text describing a picture, a Natural Language Toolkit (NLTK) tool can be used to separate the text into individual sentences, perform part-of-speech tagging on each word in each sentence, and use phrase chunking to extract noun phrases with adjectives and phrases containing multiple nouns connected by prepositions.
Optionally, segmenting the textual description label to obtain at least one phrase label, including:
performing part-of-speech recognition on each word in the character description label to obtain the part-of-speech corresponding to each word;
and dividing the word description label into at least one phrase label based on the part of speech and the preset phrase blocking condition.
FIG. 6 is a flowchart illustrating an example of noun phrase extraction in an embodiment of the present application. As shown in fig. 6, part-of-speech tagging is performed on the word description labels (nouns, adjectives, prepositions, etc.), the tagged words are segmented into at least two phrase labels based on preset rules, and the processed language content is then encoded: the global language description and the phrase labels are each encoded with an LSTM and mapped into feature vectors of a specific length, denoted θ_g(T) and θ_l(P), respectively.
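A sketch of this noun-phrase extraction using NLTK, under an assumed chunking grammar (adjective-plus-noun phrases, and noun phrases joined by a preposition); the actual chunking conditions of the embodiment may differ.

```python
import nltk

# Requires the NLTK tokenizer and POS-tagger data packages (e.g. 'punkt', 'averaged_perceptron_tagger').
GRAMMAR = r"""
  NP:  {<JJ.*>*<NN.*>+}     # optional adjectives followed by one or more nouns
  PNP: {<NP><IN><NP>}       # two noun phrases connected by a preposition
"""
chunker = nltk.RegexpParser(GRAMMAR)

def extract_noun_phrases(text: str):
    phrases = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))   # part-of-speech tagging of each word
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(lambda t: t.label() in ("NP", "PNP")):
            phrases.append(" ".join(word for word, _ in subtree.leaves()))
    return phrases

# extract_noun_phrases("The woman wears a long red coat and black shoes.")
# -> e.g. ["woman", "long red coat", "black shoes"]
```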
Optionally, determining the local loss based on the local language features and the sample image features comprises:
performing pooling operation on the sample image characteristics to obtain a global characteristic map;
obtaining a saliency weight based on the global feature map and the local language features;
determining a prediction probability corresponding to each phrase label based on the significance weight and the sample image characteristics;
and obtaining the local loss based on the prediction probability and the labeling probability corresponding to the phrase labeling.
Specifically, obtaining the saliency weight based on the global feature map and the local language features may include: respectively subtracting the characteristic value of each position in the global characteristic diagram from the local characteristic to obtain a local difference characteristic;
calculating a square value of each element in the local difference value characteristics to obtain local joint characteristics; based on the local union features, a saliency weight is obtained.
Noun phrases generally correspond to regions in a picture. FIG. 7 is a structural diagram illustrating an example of the association between a reconstructed phrase label and an image region according to the present disclosure. As shown in fig. 7, a bidirectional mapping relationship between phrase labels and image regions is established by means of reconstruction, and the process is divided into the following steps:
generate significance weights: the intermediate layer characteristic Ψ (I) reduces the complexity of object localization by pooling. Feature ψ for each location in post-posing CNN feature mapk(In) (the red-labeled region in the gray graph), using the noun phrase feature θl(P) reacting therewith.
Optionally, obtaining the significance weight based on the local union feature comprises:
processing the local joint features based on a full-connection network to obtain a matching value expressing the matching degree of the phrase labels and the sample images;
and normalizing the vector formed by the matching value of each position in the global feature map corresponding to each phrase label to obtain the significance weight corresponding to each phrase label.
Specifically, this may include the following steps: (1) subtract the two vectors to obtain a difference vector; (2) square each element of the difference vector to obtain a new vector; (3) pass this vector through a fully connected network to obtain a scalar expressing the matching degree between the sample image and the phrase label; (4) apply a softmax normalization over the scalars generated for all positions so that they sum to one, producing one value per position. This value is the saliency weight, between 0 and 1. Note that the intermediate layer feature contains one feature vector per position, and the saliency weight of each position is the saliency weight of that position's feature vector.
Optionally, determining a prediction probability corresponding to each phrase label based on the significance weight and the sample image features includes:
multiplying the characteristic value of each position in the sample image characteristic by the significance weight to obtain a weighted characteristic vector set corresponding to each phrase label;
adding vectors in the weighted feature vector set to obtain local visual features in the sample image corresponding to the phrase labels;
obtaining the prediction probability of each word in the phrase label based on the local visual features;
and determining the prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label.
Obtaining visual features related to noun phrases: the feature vector at each position of the intermediate-layer feature Ψ(I) is multiplied by its saliency weight, and the weighted vectors of all positions are summed to obtain the visual feature of the region related to the given noun phrase, denoted φ(P, I_n) (the weights of relevant regions are high). The visual feature is computed as in formula (2):

φ(P, I_n) = Σ_k r_k · ψ_k(I_n)    (2)

where r_k is the relevance weight, k is the index of the position, P is the given noun phrase, I is the picture, and n is the index of the picture.
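A short sketch of formula (2) follows; the tensor shapes and the helper name phrase_visual_feature are assumptions for illustration.

```python
import torch

def phrase_visual_feature(feat_map, weights):
    # feat_map: (K, dim) position features psi_k(I_n); weights: (K,) saliency weights r_k
    # Weighted sum over positions gives the visual feature related to phrase P.
    return (weights.unsqueeze(-1) * feat_map).sum(dim=0)   # shape (dim,)
```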
Optionally, obtaining a predicted probability of each word in the phrase label based on the local visual features comprises:
decomposing the phrase label into a word sequence, inputting the local visual features into a long short-term memory (LSTM) network, and determining at least one hidden variable, wherein each word corresponds to a feature vector;
at each moment, the hidden variable of the previous moment and the feature vector corresponding to the current word are passed through the LSTM network to obtain the hidden variable of the next moment;
performing linear mapping based on at least one hidden variable to obtain a prediction vector of each word;
and obtaining the prediction probability of each word in the phrase label based on the prediction vector.
Reconstructing noun phrases by using the obtained visual features: the phrase reconstruction model is built from a long short-term memory (LSTM) network and a linear mapping. The relevant visual feature φ(P, I_n) is input first; then the probability of the next word in the phrase is predicted from the previously input word, where the word probabilities are obtained by linearly mapping the hidden variable of the LSTM onto a given vocabulary and applying softmax normalization. The first and last input words are special symbols marking the start and end of the phrase, and at each step the hidden variable of the previous moment and the feature vector of the current word are passed through the LSTM to obtain the hidden variable of the next moment.
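A hedged PyTorch sketch of such a phrase reconstruction model follows; the class name, the use of an LSTM cell seeded with the visual feature, and the vocabulary handling are illustrative assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class PhraseReconstructor(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # feature vector for each word
        self.cell = nn.LSTMCell(dim, dim)
        self.proj = nn.Linear(dim, vocab_size)       # linear mapping onto the vocabulary

    def forward(self, visual_feat, word_ids):
        # visual_feat: (dim,) phrase-related visual feature; word_ids: (T,) phrase words
        # including the start/end symbols among the inputs.
        h, c = self.cell(visual_feat.unsqueeze(0))               # first input: visual feature
        log_probs = []
        for t in range(word_ids.size(0)):
            x = self.embed(word_ids[t]).unsqueeze(0)             # current word feature
            h, c = self.cell(x, (h, c))                          # hidden variable for next moment
            log_probs.append(torch.log_softmax(self.proj(h), dim=-1))
        # The phrase probability is the product of per-word probabilities, i.e. the
        # sum of the selected per-word log-probabilities.
        return torch.stack(log_probs, dim=0).squeeze(1)          # (T, vocab_size)
```

In this reading, the local loss can be taken as the negative log-likelihood of the annotated phrase under these predicted word probabilities.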
Optionally, determining a prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label includes:
and taking the product of the prediction probabilities of the words in the phrase labels as the prediction probability of the phrase labels.
In one or more alternative embodiments, training the feature extraction network based on global loss and local loss includes:
and summing the global loss and the local loss to obtain a sum loss.
Since the text description label yields two losses through the language recognition network, namely the global loss corresponding to the whole labeled sentence and the local loss corresponding to the phrase labels, which describe the sample image globally and locally respectively, training the feature extraction network on the sum loss can accelerate training.
Parameters of the feature extraction network are adjusted based on the sum loss.
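As a sketch of this training step, assuming the two loss tensors and an optimizer over the feature extraction network already exist:

```python
def train_step(optimizer, global_loss, local_loss):
    total_loss = global_loss + local_loss   # sum loss
    optimizer.zero_grad()
    total_loss.backward()                   # gradients w.r.t. feature extraction network parameters
    optimizer.step()                        # adjust parameters based on the sum loss
    return total_loss.item()
```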
Training the feature extraction network based on the global loss and the local loss solves the problem of establishing, given an image and its corresponding description, the correspondence between noun phrases in the description and local regions in the image. The established local correspondence is then used to further constrain the encoding of image features at particular positions.
In one or more optional embodiments, further comprising:
inputting the identity sample image into a feature extraction network to obtain a sample prediction feature, wherein the identity sample image comprises an annotation identification feature;
In this embodiment, both the global feature of the target and the spatial information of the target feature need to be utilized in order to mine local image content.
Processing the sample prediction characteristics through the pooling layer and the full-connection layer to obtain prediction identification characteristics;
Optionally, the extraction is performed with a classical CNN, which not only has proven strong target classification capability but also preserves part of the spatial information during feature extraction; for example, the features of clothes and of trousers are encoded into feature vectors at different positions, and since this spatial information corresponds to the actual position of an object, it can provide cues for identification. Taking ResNet-50 as an example, for a picture of size 224x224, the 8x4 feature map before average pooling is taken as the intermediate feature, denoted Ψ(I), for interacting with the language features. This feature encodes not only high-level semantic information but also spatial position information.
And adjusting parameters of the feature extraction network, the pooling layer and the full-connection layer based on the labeling identification feature and the prediction identification feature.
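The identity branch described above might look like the following hedged sketch, using torchvision's ResNet-50 as the feature extraction network; the class name IdentityBranch and the parameter num_ids are assumptions, and the exact spatial size of the intermediate map depends on the input crop.

```python
import torch.nn as nn
import torchvision.models as models

class IdentityBranch(nn.Module):
    def __init__(self, num_ids):
        super().__init__()
        resnet = models.resnet50(weights=None)                        # torchvision >= 0.13 API
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keep the map before average pooling
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_ids)                            # identity prediction head

    def forward(self, images):
        feat_map = self.backbone(images)          # intermediate feature Ψ(I) with spatial positions
        pooled = self.pool(feat_map).flatten(1)   # pooled global feature
        return feat_map, self.fc(pooled)          # spatial map for language interaction + identity logits
```

Training would then compare the identity logits with the annotated identity labels (for example with a cross-entropy loss, as one plausible choice) and adjust the backbone, pooling, and fully connected layers accordingly.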
The above-described embodiments of the present disclosure aim to enhance the quality of image feature encoding using auxiliary language description data. The core inventive point is a mechanism that associates the language description with the image features, so that language information can guide the learning of image features and focus the visual features on encoding image appearance with clear discriminative significance. A discriminative global image and language association strategy is provided based on individual pedestrian category information, distinguishing image-language association representation features belonging to the same individual from those belonging to different individuals. Meanwhile, this embodiment further exploits the natural correspondence between an image and the language describing it, further mines the correlation between local picture regions and noun phrases by way of phrase reconstruction, and strengthens the constraints on image feature learning. The proposed technology can not only improve the quality of the visual features for pedestrian re-identification, but can also potentially be used for tasks such as cross-modal image and language retrieval and detecting image regions according to noun phrases.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 8 is a schematic structural diagram of a pedestrian re-identification apparatus according to an embodiment of the present application. The apparatus of this embodiment may be used to implement the method embodiments described above in this application. As shown in fig. 8, the apparatus of this embodiment includes:
an image obtaining unit 81 is used for obtaining the image to be identified and the candidate image set.
The feature extraction unit 82 is configured to perform feature extraction on the image to be recognized and each candidate image in the candidate image set by using a feature extraction network, to obtain an intermediate feature to be recognized corresponding to the image to be recognized and a candidate intermediate feature corresponding to the candidate image, where the feature extraction network is obtained through image feature and language description cross-modal training.
And a result identifying unit 83, configured to obtain an identification result corresponding to the image to be identified from the candidate image set based on the intermediate feature to be identified and the candidate intermediate feature.
The recognition result includes at least one candidate image.
Based on the pedestrian re-identification device provided by the embodiment of the application, the natural corresponding relation between the image and the language describing the image is utilized, the correlation between the local image area and the noun phrase is further mined in a phrase reconstruction mode, the constraint on the image feature learning is enhanced, the quality of the pedestrian re-identification visual feature is improved, and the accuracy of the pedestrian re-identification is further improved.
In one or more optional embodiments, the result identifying unit 83 is configured to pass the intermediate feature to be identified and the candidate intermediate feature through an average pooling layer and a fully connected layer to obtain the feature to be identified and the candidate feature, respectively; and to obtain a recognition result corresponding to the image to be recognized from the candidate image set based on the feature to be recognized and the candidate feature.
In one or more optional embodiments, the apparatus of this embodiment may further include:
the language screening unit is used for extracting the characteristics of the description characters related to the image to be recognized based on a language recognition network to obtain language characteristics; and screening the recognition result based on the language features to obtain an updated recognition result corresponding to the image to be recognized, wherein the updated recognition result comprises at least one candidate image.
Optionally, the language screening unit is configured to screen the recognition result based on the language feature, and when an updated recognition result corresponding to the image to be recognized is obtained, screen based on a distance between the language feature and at least one candidate intermediate feature corresponding to the recognition result; and obtaining at least one candidate intermediate feature with the distance smaller than or equal to a preset value, and taking a candidate image corresponding to the obtained candidate intermediate feature as an updating identification result.
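A minimal sketch of this distance-based screening follows; the use of Euclidean distance and the threshold value are assumptions, the embodiment only requiring a distance not exceeding a preset value.

```python
import torch

def screen_by_language(candidate_feats, language_feat, threshold):
    # candidate_feats: (N, dim) candidate intermediate features; language_feat: (dim,)
    dists = torch.norm(candidate_feats - language_feat, dim=1)   # distance per candidate
    keep = (dists <= threshold).nonzero(as_tuple=True)[0]        # indices within the preset value
    return keep   # candidate images at these indices form the updated recognition result
```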
In one or more optional embodiments, the apparatus of this embodiment may further include:
the word screening unit is used for extracting the characteristics of at least one description word related to the image to be recognized based on the language recognition network to obtain word characteristics, wherein each description word corresponds to at least one part in the image to be recognized; and screening the recognition result or the updated recognition result based on the word characteristics to obtain a target recognition result corresponding to the image to be recognized, wherein the target recognition result comprises at least one candidate image.
Optionally, the word screening unit screens the recognition result or the updated recognition result based on the word feature, and is configured to screen based on a distance between the word feature and at least one candidate intermediate feature corresponding to the recognition result or the updated recognition result when the target recognition result corresponding to the image to be recognized is obtained; and obtaining at least one candidate feature with the distance smaller than or equal to a preset value, and taking a candidate image corresponding to the obtained candidate intermediate feature as a target recognition result.
In one or more optional embodiments, the apparatus of this embodiment further includes:
the sample feature extraction unit is used for inputting the sample image into a feature extraction network to obtain sample image features, and the sample image comprises character description labels;
the language feature extraction unit is used for extracting features of the character description labels based on a language identification network to obtain sample language features;
and the network training unit is used for training the feature extraction network based on the sample language features and the sample image features.
Optionally, the network training unit includes:
the global probability module is used for obtaining global relevant probability based on the sample language features and the sample image features;
the global loss module is used for obtaining a global loss by using a binary cross-entropy loss based on the global correlation probability and the correlation between the sample image and the text description label;
and the loss training module is used for extracting the network based on the global loss training characteristics.
Optionally, the global probability module is specifically configured to pool the sample image features and subtract the sample language features to obtain a difference feature; calculate the square of the difference feature element by element to obtain a joint feature; and normalize the joint feature to obtain a global correlation probability expressing the global correlation.
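One plausible reading of the global probability module is sketched below; mapping the joint feature to a scalar with a fully connected layer followed by a sigmoid is an assumption, the description above only specifying a normalization step.

```python
import torch
import torch.nn as nn

class GlobalAssociation(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 1)     # assumed scalar head

    def forward(self, image_feat_map, sentence_feat):
        # image_feat_map: (dim, H, W) sample image features; sentence_feat: (dim,)
        pooled = image_feat_map.mean(dim=(-2, -1))           # pooling
        joint = (pooled - sentence_feat) ** 2                # difference feature, squared element-wise
        return torch.sigmoid(self.fc(joint)).squeeze(-1)     # global correlation probability in (0, 1)
```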
In one or more optional embodiments, the apparatus of this embodiment further includes:
and the pre-training unit is used for pre-training the language identification network based on sample characters, and the sample characters comprise the marked language features.
Optionally, the pre-training unit is specifically configured to input the sample text into the language identification network to obtain a first predicted sample feature; parameters of the language identification network are adjusted based on the first predicted sample features and the markup language features.
Optionally, the network training unit further includes:
the local feature extraction module is used for performing feature extraction on at least one phrase label in the character description label based on a language identification network to obtain at least one local feature, wherein each phrase label is used for describing at least one region in the sample image;
the local loss module is used for obtaining local loss based on the local characteristics and the sample image characteristics;
and the loss training module is specifically used for extracting the network based on the global loss and local loss training characteristics.
Optionally, the network training unit further includes:
the phrase segmentation module is used for segmenting the word description labels to obtain at least one phrase label, each phrase label comprises at least one noun, the obtained phrase labels correspond to label probability, and each probability value represents the probability of the sample image corresponding to the phrase labels.
Optionally, the phrase segmentation module is specifically configured to perform part-of-speech recognition on each word in the text description label to obtain a part-of-speech corresponding to each word; and dividing the word description label into at least one phrase label based on the part of speech and the preset phrase blocking condition.
Optionally, a local loss module, comprising:
the pooling module is used for pooling the sample image characteristics to obtain a global characteristic map;
the weighting module is used for obtaining significance weighting based on the global feature map and the local features;
the probability prediction module is used for determining the prediction probability corresponding to each phrase label based on the significance weight and the sample image characteristics;
and the local loss acquisition module is used for acquiring local loss based on the prediction probability and the labeling probability corresponding to the phrase labeling.
Optionally, the weighting module is configured to subtract the feature value of each position in the global feature map from the local feature to obtain a local difference feature; calculating a square value of each element in the local difference value characteristics to obtain local joint characteristics; based on the local union features, a saliency weight is obtained.
Optionally, the weighting module is configured to, when obtaining the significance weight based on the local union feature, process the local union feature based on a full-connection network to obtain a matching value expressing a matching degree between the phrase label and the sample image; and normalizing the vector formed by the matching value of each position in the global feature map corresponding to each phrase label to obtain the significance weight corresponding to each phrase label.
In one or more optional embodiments, the probability prediction module is configured to multiply a feature value of each position in the sample image features by the significance weight to obtain a weighted feature vector set corresponding to each phrase label; adding vectors in the weighted feature vector set to obtain local visual features in the sample image corresponding to the phrase labels; obtaining the prediction probability of each word in the phrase label based on the local visual features; and determining the prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label.
Optionally, the probability prediction module is configured to, when obtaining the prediction probability of each word in the phrase label based on the local visual features, decompose the phrase label into a word sequence, input the local visual features into a long short-term memory (LSTM) network, and determine at least one hidden variable, where each word corresponds to one feature vector; at each moment, the hidden variable of the previous moment and the feature vector corresponding to the current word are passed through the LSTM network to obtain the hidden variable of the next moment; linear mapping is performed based on the at least one hidden variable to obtain a prediction vector of each word; and the prediction probability of each word in the phrase label is obtained based on the prediction vector.
Optionally, the probability prediction module is configured to, when determining the prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label, take a product of the prediction probabilities of the words in the phrase label as the prediction probability of the phrase label.
In one or more optional embodiments, the loss training module is specifically configured to sum the global loss and the local loss to obtain a sum loss; parameters of the network are extracted based on the sum loss adjustment features.
In one or more optional embodiments, the apparatus of this embodiment further includes:
the identity sample unit is used for inputting the identity sample image into the feature extraction network to obtain sample prediction features, wherein the identity sample image comprises mark identification features;
the preset identification unit is used for processing the sample prediction characteristics through the pooling layer and the full-connection layer to obtain prediction identification characteristics;
and the parameter adjusting unit is used for adjusting the parameters of the feature extraction network, the pooling layer and the full connection layer based on the labeling identification feature and the prediction identification feature.
According to another aspect of the embodiments of the present application, there is provided an electronic device including a processor including the pedestrian re-identification apparatus as described in any one of the above.
According to another aspect of the embodiments of the present application, there is provided an electronic device including: a memory for storing executable instructions;
and a processor in communication with the memory for executing the executable instructions to perform the operations of the pedestrian re-identification method as described in any one of the above.
The embodiment of the invention also provides an electronic device, which may be a mobile terminal, a personal computer (PC), a tablet computer, a server, or the like. Referring now to FIG. 9, a schematic diagram of an electronic device 900 suitable for implementing a terminal device or a server according to an embodiment of the present application is shown. As shown in FIG. 9, the electronic device 900 includes one or more processors, a communication section, and the like, for example: one or more central processing units (CPUs) 901 and/or one or more graphics processing units (GPUs) 913, which can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 902 or loaded from a storage section 908 into a random access memory (RAM) 903. The communication section 912 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card.
the processor may communicate with the read-only memory 902 and/or the random access memory 903 to execute executable instructions, connect with the communication part 912 through the bus 904, and communicate with other target devices through the communication part 912, so as to complete operations corresponding to any method provided by the embodiments of the present application, for example, obtaining an image to be identified and a candidate image set; performing feature extraction on the image to be recognized and each candidate image in the candidate image set by using a feature extraction network to obtain an intermediate feature to be recognized corresponding to the image to be recognized and a candidate intermediate feature corresponding to the candidate image, wherein the feature extraction network is obtained through image feature and language description cross-modal training; and obtaining a recognition result corresponding to the image to be recognized from the candidate image set based on the intermediate feature to be recognized and the candidate intermediate feature.
In addition, the RAM 903 can also store various programs and data necessary for the operation of the device. The CPU 901, the ROM 902, and the RAM 903 are connected to each other via the bus 904. When the RAM 903 is present, the ROM 902 is an optional module. The RAM 903 stores executable instructions, or executable instructions are written into the ROM 902 at runtime, and the executable instructions cause the processor 901 to perform operations corresponding to the above-described method. An input/output (I/O) interface 905 is also connected to the bus 904. The communication section 912 may be integrated, or may be provided with a plurality of sub-modules (e.g., a plurality of IB network cards) connected to the bus link.
The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
It should be noted that the architecture shown in fig. 9 is only an optional implementation manner, and in a specific practical process, the number and types of the components in fig. 9 may be selected, deleted, added or replaced according to actual needs; in different functional component settings, separate settings or integrated settings may also be used, for example, GPU913 and CPU901 may be separately provided or GPU913 may be integrated on CPU901, the communication part may be separately provided, or CPU901 or GPU913 may be integrated, and so on. These alternative embodiments are all within the scope of the present disclosure.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flowchart, the program code may include instructions corresponding to performing the method steps provided by embodiments of the present disclosure, e.g., obtaining an image to be identified and a set of candidate images; performing feature extraction on the image to be recognized and each candidate image in the candidate image set by using a feature extraction network to obtain an intermediate feature to be recognized corresponding to the image to be recognized and a candidate intermediate feature corresponding to the candidate image, wherein the feature extraction network is obtained through image feature and language description cross-modal training; and obtaining a recognition result corresponding to the image to be recognized from the candidate image set based on the intermediate feature to be recognized and the candidate intermediate feature. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the method of the present application are executed when the computer program is executed by a Central Processing Unit (CPU) 901.
According to another aspect of the embodiments of the present application, there is provided a computer-readable storage medium for storing computer-readable instructions which, when executed, perform the operations of the pedestrian re-identification method as described in any one of the above.
According to another aspect of embodiments herein, there is provided a computer program product comprising computer readable code which, when run on an apparatus, causes a processor in the apparatus to execute instructions for implementing a pedestrian re-identification method as described in any one of the above.
The methods and apparatus of the present application may be implemented in a number of ways. For example, the methods and apparatus of the present application may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present application are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present application may also be embodied as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present application. Thus, the present application also covers a recording medium storing a program for executing the method according to the present application.
The description of the present application has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the application in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the application and the practical application, and to enable others of ordinary skill in the art to understand the application for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (45)

1. A pedestrian re-identification method, comprising:
acquiring an image to be identified and a candidate image set;
performing feature extraction on the image to be recognized and each candidate image in the candidate image set by using a feature extraction network to obtain an intermediate feature to be recognized corresponding to the image to be recognized and a candidate intermediate feature corresponding to the candidate image, wherein the feature extraction network is obtained by image feature and language description cross-modal training;
obtaining a recognition result corresponding to the image to be recognized from the candidate image set based on the intermediate feature to be recognized and the candidate intermediate feature, wherein the recognition result comprises at least one candidate image;
the obtaining of the feature extraction network through cross-modal training of image features and language description comprises:
inputting a sample image into the feature extraction network to obtain sample image features, wherein the sample image comprises a text description label;
performing feature extraction on the character description label based on a language identification network to obtain a sample language feature;
training the feature extraction network based on the sample language features and the sample image features.
2. The method according to claim 1, wherein the obtaining a corresponding recognition result of the image to be recognized from the candidate image set based on the intermediate feature to be recognized and the candidate intermediate feature comprises:
the intermediate features to be identified and the candidate intermediate features respectively pass through an average pooling layer and a full connection layer to obtain the features to be identified and the candidate features;
and obtaining a recognition result corresponding to the image to be recognized from the candidate image set based on the feature to be recognized and the candidate feature.
3. The method of claim 2, further comprising: extracting the features of the description characters related to the image to be recognized based on a language recognition network to obtain language features;
and screening the recognition result based on the language features to obtain an updated recognition result corresponding to the image to be recognized, wherein the updated recognition result comprises at least one candidate image.
4. The method according to claim 3, wherein the screening the recognition result based on the language feature to obtain an updated recognition result corresponding to the image to be recognized comprises:
screening based on the distance between the language feature and at least one candidate intermediate feature corresponding to the recognition result;
and obtaining at least one candidate intermediate feature with the distance smaller than or equal to a preset value, and taking the candidate image corresponding to the obtained candidate intermediate feature as the updating and identifying result.
5. The method of claim 4, further comprising:
performing feature extraction on at least one description word related to the image to be recognized based on the language recognition network to obtain word features, wherein each description word corresponds to at least one part in the image to be recognized;
and screening the recognition result or the updated recognition result based on the word characteristics to obtain a target recognition result corresponding to the image to be recognized, wherein the target recognition result comprises at least one candidate image.
6. The method according to claim 5, wherein the screening the recognition result or the updated recognition result based on the word feature to obtain a target recognition result corresponding to the image to be recognized comprises:
screening based on the distance between the word feature and at least one candidate intermediate feature corresponding to the recognition result or the updated recognition result;
and obtaining at least one candidate feature with the distance smaller than or equal to a preset value, and taking a candidate image corresponding to the obtained candidate intermediate feature as the target recognition result.
7. The method of claim 1, wherein training the feature extraction network based on the sample language features and the sample image features comprises:
obtaining a global correlation probability based on the sample language features and the sample image features;
obtaining a global loss by utilizing a binary cross-entropy loss based on the global correlation probability and the correlation between the sample image and the text description label;
training the feature extraction network based on the global loss.
8. The method of claim 7, wherein obtaining a global correlation probability based on the sample language features and the sample image features comprises:
pooling the sample image features and subtracting the sample language features to obtain difference features;
calculating the square value of the difference value characteristic element by element to obtain a combined characteristic;
and performing normalization processing on the combined features to obtain global correlation probability for expressing global correlation.
9. The method of claim 1, wherein before the extracting the feature of the word description label based on the language identification network to obtain the sample language feature, the method further comprises:
pre-training the language identification network based on sample words, wherein the sample words comprise markup language features.
10. The method of claim 9, wherein the pre-training the language recognition network based on sample words comprises:
inputting the sample characters into the language identification network to obtain a first prediction sample characteristic;
adjusting parameters of the language identification network based on the first predicted sample features and the markup language features.
11. The method of any of claims 7-8, further comprising: performing feature extraction on at least one phrase label in the character description label based on the language identification network to obtain at least one local feature, wherein each phrase label is used for describing at least one region in the sample image;
obtaining a local loss based on the local features and the sample image features;
the training the feature extraction network based on the global loss comprises:
training the feature extraction network based on the global loss and the local loss.
12. The method of claim 11, wherein before performing feature extraction on at least one phrase label in the textual description labels based on the language identification network to obtain at least one local feature, the method further comprises:
and segmenting the word description labels to obtain at least one phrase label, wherein each phrase label comprises at least one noun, the phrase label corresponds to a label probability, and each probability value represents the probability that the phrase label corresponds to the sample image.
13. The method of claim 12, wherein said segmenting said textual description label into at least one phrase label comprises:
performing part-of-speech recognition on each word in the character description label to obtain the part-of-speech corresponding to each word;
and dividing the word description label into at least one phrase label based on the part of speech and a preset phrase blocking condition.
14. The method of claim 11, wherein deriving the local loss based on the local feature and the sample image feature comprises:
performing pooling operation on the sample image characteristics to obtain a global characteristic map;
obtaining a saliency weight based on the global feature map and the local features;
determining a prediction probability corresponding to each phrase label based on the significance weight and the sample image features;
and obtaining the local loss based on the prediction probability and the labeling probability corresponding to the phrase labeling.
15. The method of claim 14, wherein obtaining a saliency weight based on the global feature map and the local features comprises:
respectively subtracting the characteristic value of each position in the global characteristic diagram from the local characteristic to obtain a local difference characteristic;
calculating a square value of each element in the local difference characteristic to obtain a local joint characteristic;
based on the local joint features, a saliency weight is obtained.
16. The method of claim 15, wherein obtaining a significance weight based on the local joint feature comprises:
processing the local combined features based on a full-connection network to obtain a matching value expressing the matching degree of the phrase labels and the sample images;
and normalizing the vector formed by the matching value of each position in the global feature map corresponding to each phrase label to obtain the significance weight corresponding to each phrase label.
17. The method according to any one of claims 14-16, wherein the determining the prediction probability corresponding to each phrase label based on the saliency weight and the sample image features comprises:
multiplying the feature value of each position in the sample image features by the significance weight to obtain a weighted feature vector set corresponding to each phrase label;
adding the vectors in the weighted feature vector set to obtain local visual features in the sample image corresponding to the phrase labels;
obtaining a prediction probability of each word in the phrase label based on the local visual features;
and determining the prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label.
18. The method of claim 17, wherein obtaining a prediction probability for each word in the phrase label based on the local visual features comprises:
decomposing the phrase labels into word sequences, inputting the local visual features into a long short-term memory (LSTM) network, and determining at least one hidden variable, wherein each word corresponds to a feature vector;
at each moment, the hidden variable of the previous moment and the feature vector corresponding to the current word are passed through the long short-term memory network to obtain the hidden variable of the next moment;
performing linear mapping on the basis of the at least one hidden variable to obtain a prediction vector of each word;
and obtaining the prediction probability of each word in the phrase label based on the prediction vector.
19. The method of claim 18, wherein determining the prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label comprises:
and taking the product of the prediction probabilities of the words in the phrase labels as the prediction probability of the phrase labels.
20. The method of claim 11, wherein training the feature extraction network based on the global loss and the local loss comprises:
summing the global loss and the local loss to obtain a sum loss;
adjusting a parameter of the feature extraction network based on the sum loss.
21. The method of claim 1, further comprising:
inputting an identity sample image into the feature extraction network to obtain a sample prediction feature, wherein the identity sample image comprises an annotation identification feature;
processing the sample prediction characteristics through a pooling layer and a full-connection layer to obtain prediction identification characteristics;
adjusting parameters of the feature extraction network, the pooling layer, and the fully-connected layer based on the annotation identification features and the predicted identification features.
22. A pedestrian re-recognition apparatus, comprising:
the image acquisition unit is used for acquiring an image to be identified and a candidate image set;
the feature extraction unit is used for extracting features of the image to be recognized and each candidate image in the candidate image set by using a feature extraction network to obtain an intermediate feature to be recognized corresponding to the image to be recognized and a candidate intermediate feature corresponding to the candidate image, and the feature extraction network is obtained through image feature and language description cross-modal training;
a result identification unit, configured to obtain an identification result corresponding to the image to be identified from the candidate image set based on the intermediate feature to be identified and the candidate intermediate feature, where the identification result includes at least one candidate image;
the device further comprises:
the sample feature extraction unit is used for inputting a sample image into the feature extraction network to obtain sample image features, wherein the sample image comprises a text description label;
the language feature extraction unit is used for extracting features of the character description labels based on a language identification network to obtain sample language features;
and the network training unit is used for training the feature extraction network based on the sample language features and the sample image features.
23. The apparatus of claim 22, wherein the result identification unit is configured to obtain the to-be-identified feature and the candidate feature through an average pooling layer and a full connection layer, respectively, for the to-be-identified intermediate feature and the candidate intermediate feature; and obtaining a recognition result corresponding to the image to be recognized from the candidate image set based on the feature to be recognized and the candidate feature.
24. The apparatus of claim 23, further comprising:
the language screening unit is used for extracting the characteristics of the description characters related to the image to be recognized based on a language recognition network to obtain language characteristics; and screening the recognition result based on the language features to obtain an updated recognition result corresponding to the image to be recognized, wherein the updated recognition result comprises at least one candidate image.
25. The apparatus according to claim 24, wherein the language screening unit is configured to, when the recognition result is screened based on the language feature to obtain an updated recognition result corresponding to the image to be recognized, screen based on a distance between the language feature and at least one candidate intermediate feature corresponding to the recognition result; and obtaining at least one candidate intermediate feature of which the distance is smaller than or equal to a preset value, and taking a candidate image corresponding to the obtained candidate intermediate feature as the updating and identifying result.
26. The apparatus of claim 25, further comprising:
the word screening unit is used for extracting the characteristics of at least one description word related to the image to be recognized based on the language recognition network to obtain word characteristics, and each description word corresponds to at least one part in the image to be recognized; and screening the recognition result or the updated recognition result based on the word characteristics to obtain a target recognition result corresponding to the image to be recognized, wherein the target recognition result comprises at least one candidate image.
27. The apparatus according to claim 26, wherein the word filtering unit filters the recognition result or the updated recognition result based on the word feature, and is configured to filter based on a distance between the word feature and at least one candidate intermediate feature corresponding to the recognition result or the updated recognition result when obtaining the target recognition result corresponding to the image to be recognized; and obtaining at least one candidate feature of which the distance is smaller than or equal to a preset value, and taking a candidate image corresponding to the obtained candidate intermediate feature as the target identification result.
28. The apparatus of claim 22, wherein the network training unit comprises:
a global probability module for obtaining a global correlation probability based on the sample language feature and the sample image feature;
the global loss module is used for obtaining a global loss by using a binary cross-entropy loss based on the global correlation probability and the correlation between the sample image and the text description label;
a loss training module to train the feature extraction network based on the global loss.
29. The apparatus of claim 28, wherein the global probability module is specifically configured to pool the sample image features and subtract the sample language features to obtain difference features; calculating the square value of the difference value characteristic element by element to obtain a combined characteristic; and performing normalization processing on the combined features to obtain global correlation probability for expressing global correlation.
30. The apparatus of claim 22, further comprising:
and the pre-training unit is used for pre-training the language identification network based on sample characters, and the sample characters comprise marked language features.
31. The apparatus of claim 30, wherein the pre-training unit is specifically configured to input the sample text into the speech recognition network to obtain a first predicted sample feature; adjusting parameters of the language identification network based on the first predicted sample features and the markup language features.
32. The apparatus according to any of claims 28-29, wherein the network training unit further comprises:
the local feature extraction module is used for performing feature extraction on at least one phrase label in the character description labels based on the language identification network to obtain at least one local feature, wherein each phrase label is used for describing at least one region in the sample image;
a local loss module for obtaining a local loss based on the local feature and the sample image feature;
the loss training module is specifically configured to train the feature extraction network based on the global loss and the local loss.
33. The apparatus of claim 32, wherein the network training unit further comprises:
the phrase segmentation module is used for segmenting the word description labels to obtain at least one phrase label, each phrase label comprises at least one noun, the phrase label corresponds to a label probability, and each probability value represents the probability that the phrase label corresponds to the sample image.
34. The apparatus of claim 33, wherein the phrase segmentation module is specifically configured to perform part-of-speech recognition on each word in the textual description label to obtain a part-of-speech corresponding to each word; and dividing the word description label into at least one phrase label based on the part of speech and a preset phrase blocking condition.
35. The apparatus of claim 32, wherein the local loss module comprises:
the pooling module is used for pooling the sample image characteristics to obtain a global characteristic map;
a weighting module for obtaining a saliency weight based on the global feature map and the local features;
a probability prediction module for determining a prediction probability corresponding to each phrase label based on the saliency weight and the sample image features;
and the local loss obtaining module is used for obtaining the local loss based on the prediction probability and the labeling probability corresponding to the phrase labeling.
36. The apparatus of claim 35, wherein the weighting module is configured to subtract the feature value at each position in the global feature map from the local feature to obtain a local difference feature; calculating a square value of each element in the local difference characteristic to obtain a local joint characteristic; based on the local joint features, a saliency weight is obtained.
37. The apparatus of claim 36, wherein the weighting module, when obtaining the significance weight based on the local joint feature, is configured to process the local joint feature based on a fully-connected network to obtain a matching value expressing a degree of matching between the phrase label and the sample image; and normalizing the vector formed by the matching value of each position in the global feature map corresponding to each phrase label to obtain the significance weight corresponding to each phrase label.
38. The apparatus according to any one of claims 35 to 37, wherein the probability prediction module is configured to multiply the feature value of each position in the sample image feature by the significance weight to obtain a weighted feature vector set corresponding to each phrase label; adding the vectors in the weighted feature vector set to obtain local visual features in the sample image corresponding to the phrase labels; obtaining a prediction probability of each word in the phrase label based on the local visual features; and determining the prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label.
39. The apparatus of claim 38, wherein the probability prediction module, when obtaining the prediction probability of each word in the phrase label based on the local visual features, is configured to decompose the phrase label into word sequences, input the local visual features into a long short-term memory (LSTM) network, and determine at least one hidden variable, where each word corresponds to a feature vector; at each moment, the hidden variable of the previous moment and the feature vector corresponding to the current word are passed through the long short-term memory network to obtain the hidden variable of the next moment; linear mapping is performed based on the at least one hidden variable to obtain a prediction vector of each word; and the prediction probability of each word in the phrase label is obtained based on the prediction vector.
40. The apparatus of claim 39, wherein the probability prediction module is configured to take a product of the prediction probabilities of the words in the phrase label as the prediction probability of the phrase label when determining the prediction probability corresponding to the phrase label based on the prediction probability of each word in the phrase label.
41. The apparatus according to claim 32, wherein the loss training module is specifically configured to sum the global loss and the local loss to obtain a sum loss; adjusting a parameter of the feature extraction network based on the sum loss.
42. The apparatus of claim 22, further comprising:
the identity sample unit is used for inputting an identity sample image into the feature extraction network to obtain sample prediction features, wherein the identity sample image comprises mark identification features;
the preset identification unit is used for processing the sample prediction characteristics through a pooling layer and a full-connection layer to obtain prediction identification characteristics;
and the parameter adjusting unit is used for adjusting the parameters of the feature extraction network, the pooling layer and the full-connection layer based on the labeled identification feature and the predicted identification feature.
43. An electronic device, comprising a processor including the pedestrian re-identification apparatus of any one of claims 22 to 42.
44. An electronic device, comprising: a memory for storing executable instructions;
and a processor in communication with the memory for executing the executable instructions to perform the operations of the pedestrian re-identification method of any one of claims 1 to 21.
45. A computer readable storage medium storing computer readable instructions which, when executed, perform the operations of the pedestrian re-identification method of any one of claims 1 to 21.
CN201810848366.2A 2018-07-27 2018-07-27 Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product Active CN109165563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810848366.2A CN109165563B (en) 2018-07-27 2018-07-27 Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810848366.2A CN109165563B (en) 2018-07-27 2018-07-27 Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product

Publications (2)

Publication Number Publication Date
CN109165563A CN109165563A (en) 2019-01-08
CN109165563B true CN109165563B (en) 2021-03-23

Family

ID=64898549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810848366.2A Active CN109165563B (en) 2018-07-27 2018-07-27 Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN109165563B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829430B (en) * 2019-01-31 2021-02-19 中科人工智能创新技术研究院(青岛)有限公司 Cross-modal pedestrian re-identification method and system based on heterogeneous hierarchical attention mechanism
CN110222686B (en) * 2019-05-27 2021-05-07 腾讯科技(深圳)有限公司 Object detection method, object detection device, computer equipment and storage medium
CN112214626B (en) * 2019-07-09 2024-03-19 北京地平线机器人技术研发有限公司 Image recognition method and device, readable storage medium and electronic equipment
CN110807361B (en) * 2019-09-19 2023-08-08 腾讯科技(深圳)有限公司 Human body identification method, device, computer equipment and storage medium
CN110807139B (en) * 2019-10-23 2023-09-01 腾讯科技(深圳)有限公司 Picture identification method, device, computer readable storage medium and computer equipment
CN111259786B (en) * 2020-01-14 2022-05-03 浙江大学 Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111860100A (en) * 2020-04-22 2020-10-30 北京嘀嘀无限科技发展有限公司 Pedestrian number determination method and device, electronic equipment and readable storage medium
CN111738186B (en) * 2020-06-28 2024-02-02 香港中文大学(深圳) Target positioning method, target positioning device, electronic equipment and readable storage medium
CN112052722A (en) * 2020-07-21 2020-12-08 北京大学 Pedestrian identity re-identification method and storage medium
CN114494297B (en) * 2022-01-28 2022-12-06 杭州电子科技大学 Adaptive video target segmentation method for processing multiple priori knowledge

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480178A (en) * 2017-07-01 2017-12-15 广州深域信息科技有限公司 Pedestrian re-identification method based on cross-modal comparison of images and video
CN107766791A (en) * 2017-09-06 2018-03-06 北京大学 Pedestrian re-identification method and apparatus based on global features and coarse-grained local features
CN107908685A (en) * 2017-10-31 2018-04-13 西安交通大学 Multi-view commodity image retrieval and recognition method based on transfer learning
CN108228757A (en) * 2017-12-21 2018-06-29 北京市商汤科技开发有限公司 Image retrieval method and apparatus, electronic device, storage medium, and program

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9396412B2 (en) * 2012-06-21 2016-07-19 Siemens Aktiengesellschaft Machine-learnt person re-identification
US10140508B2 (en) * 2016-08-26 2018-11-27 Huawei Technologies Co. Ltd. Method and apparatus for annotating a video stream comprising a sequence of frames

Also Published As

Publication number Publication date
CN109165563A (en) 2019-01-08

Similar Documents

Publication Publication Date Title
CN109165563B (en) Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN109117777B (en) Method and device for generating information
WO2020232861A1 (en) Named entity recognition method, electronic device and storage medium
CN113283551B (en) Training method and training device of multi-mode pre-training model and electronic equipment
CN112182166B (en) Text matching method and device, electronic equipment and storage medium
CN109255289B (en) Cross-aging face recognition method based on unified generation model
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN111666588A (en) Emotion difference privacy protection method based on generation countermeasure network
CN115761757A (en) Multi-mode text page classification method based on decoupling feature guidance
CN108268629B (en) Image description method and device based on keywords, equipment and medium
CN115408488A (en) Segmentation method and system for novel scene text
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
CN113032601A (en) Zero sample sketch retrieval method based on discriminant improvement
CN113255557A (en) Video crowd emotion analysis method and system based on deep learning
CN115392254A (en) Interpretable cognitive prediction and discrimination method and system based on target task
CN112015903B (en) Question duplication judging method and device, storage medium and computer equipment
CN113761377A (en) Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium
CN113836929A (en) Named entity recognition method, device, equipment and storage medium
CN117012370A (en) Multi-mode disease auxiliary reasoning system, method, terminal and storage medium
CN116935411A (en) Radical-level ancient character recognition method based on character decomposition and reconstruction
CN114861601B (en) Event joint extraction method based on rotary coding and storage medium
CN111460834B (en) French semantic annotation method and device based on LSTM network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant