NL2028092B1 - Cross-modality person re-identification method based on dual-attribute information - Google Patents
- Publication number
- NL2028092B1, NL2028092A
- Authority
- NL
- Netherlands
- Prior art keywords
- attribute
- feature
- image
- text
- person
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06V40/25—Recognition of walking or running movements, e.g. gait recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/10—Interfaces, programming languages or software development kits, e.g. for simulating neural networks
- G06N3/105—Shells for specifying net layout
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Image Analysis (AREA)
Abstract
The present disclosure provides a cross-modality person re-identification method based on dual-attribute information, which extracts rich semantic information by making full use of data of two modalities, and provides a space construction and attribute fusion algorithm based on text and image attributes. An end-to-end cross-modality person re-identification network based on hidden space and attribute space is constructed to improve the semantic expressiveness of the feature extracted by the model. To resolve the problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person identification network based on the hidden space and the attribute space is proposed, greatly improving the semantic expressiveness of the extracted feature and making full use of the attribute information of a person.
Description
CROSS-MODALITY PERSON RE-IDENTIFICATION METHOD BASED ON DUAL-ATTRIBUTE INFORMATION
TECHNICAL FIELD
The present disclosure relates to the fields of computer vision and deep learning, and specifically, to a cross-modality person re-identification method based on dual-attribute information.
BACKGROUND
In the information age, video surveillance plays an invaluable role in maintaining public safety. Person re-identification is a crucial subtask in a video surveillance scenario, and is intended to find photos of a same person from image data generated by different surveillance cameras. Public safety monitoring facilities are increasingly widely applied, resulting in massive image data collection. How to quickly and accurately find a target person in the massive image data is a research hotspot in the field of computer vision. However, in some specific emergency scenarios, an image matching a to-be-found person cannot be provided in time as a basis for retrieval, and only an oral description can be provided. Therefore, cross-modality person re-identification based on a text description emerges. Cross-modality person re-identification is to find, in an image library based on a natural language description of a person, the image most conforming to the text description information. With the development of deep learning technologies and their superior performance in different tasks, researchers have proposed some deep learning-related cross-modality person re-identification algorithms. These algorithms can be roughly classified into: 1) a semantic intimacy value calculation method, which is used to calculate an intimacy value of a semantic association between an image and text, to improve intimacy between an image and text that belong to a same class, and reduce intimacy between an image and text that belong to different classes; and 2) a subspace method, which is intended to establish shared feature expression space for images and text, and uses a metric learning strategy in the shared feature expression space to decrease a distance between image and text features belonging to a same person identity (ID) and to increase a distance between image and text features belonging to different person IDs. However, the semantic expressiveness of features extracted by using these methods still needs to be improved. These methods
ignore or do not fully consider the effectiveness of using attribute information of persons to express semantic concepts.
SUMMARY
To overcome the disadvantages of the above technology, the present disclosure provides a cross-modality person re-identification method using a space construction and attribute fusion algorithm based on text and image attributes. The technical solution used in the present disclosure to resolve the technical problem thereof is as follows: A cross-modality person re-identification method based on dual-attribute information includes the following steps: a) extracting a text description feature $T$ and an image feature $I$ of a person from content obtained by a surveillance camera;
b) extracting a text attribute feature $C_T$ from the extracted text description of the person, and extracting an image attribute feature $C_I$ from the extracted image; c) inputting the text description feature and the image feature of the person in the step a) into shared subspace, calculating a triplet loss function of a hard sample, and calculating a classification loss of a feature in the shared subspace by using a Softmax loss function; d) fusing the text description feature $T$ and the image feature $I$ of the person with the text attribute feature $C_T$ and the image attribute feature $C_I$; e) constructing feature attribute space based on attribute information; and f) retrieving and matching the extracted image feature and text description feature of the person.
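For orientation only, the following minimal PyTorch sketch shows how steps a) through e) could be wired together in one network. It is an illustrative assumption, not the claimed implementation: the module names (e.g. `DualAttributeNet`, `fuse`), the encoder sizes, and the reuse of the same projections for the fusion gate and the fused value are all simplifications introduced here.

```python
# Minimal end-to-end sketch of the dual-space pipeline, assuming PyTorch and
# torchvision are available.  Sizes and module names are illustrative only.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DualAttributeNet(nn.Module):
    def __init__(self, vocab_size=5000, txt_attr_dim=400, img_attr_dim=26,
                 shared_dim=512, attr_dim=512):
        super().__init__()
        # step a): modality encoders
        self.embed = nn.Embedding(vocab_size, 300)
        self.bilstm = nn.LSTM(300, 256, bidirectional=True, batch_first=True)
        backbone = resnet50(weights=None)          # pre-trained weights in practice
        backbone.fc = nn.Identity()
        self.cnn = backbone
        # step c): projections into the shared (hidden) subspace
        self.img_to_shared = nn.Linear(2048, shared_dim)
        self.txt_to_shared = nn.Linear(512, shared_dim)
        # step d): projections used by the gated attribute fusion
        self.u_img_attr = nn.Linear(img_attr_dim, attr_dim)
        self.u_txt_attr = nn.Linear(txt_attr_dim, attr_dim)
        self.u_img = nn.Linear(shared_dim, attr_dim)
        self.u_txt = nn.Linear(shared_dim, attr_dim)

    def fuse(self, attr_proj, feat_proj):
        # weight t in [0, 1] decides how much the attribute branch contributes;
        # for brevity the same projections serve the gate and the fused value
        t = torch.sigmoid(attr_proj + feat_proj)
        return t * attr_proj + (1 - t) * feat_proj

    def forward(self, images, token_ids, img_attrs, txt_attrs):
        img_feat = self.img_to_shared(self.cnn(images))            # I
        txt_hidden, _ = self.bilstm(self.embed(token_ids))
        txt_feat = self.txt_to_shared(txt_hidden.mean(dim=1))      # T
        # step e): attribute-space features fused with dual-attribute information
        img_attr_feat = self.fuse(self.u_img_attr(img_attrs), self.u_img(img_feat))
        txt_attr_feat = self.fuse(self.u_txt_attr(txt_attrs), self.u_txt(txt_feat))
        return img_feat, txt_feat, img_attr_feat, txt_attr_feat
```

The losses of steps c) to f) would then be computed on the four returned feature sets.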
Further, the extracting a text description feature of a person in the step a) includes the following steps: a-1.1) segmenting words in a description statement of the content obtained by the surveillance camera, and establishing a word frequency table; a-1.2) filtering out low-frequency words in the word frequency table; a-1.3) performing one-hot encoding to encode the words in the word frequency table; and a-1.4) performing feature extraction on the text description of the person by using a bidirectional long short-term memory (LSTM) model.
Further, the extracting an image feature in the step a) includes the following steps: a-2.1) performing feature extraction on the image by using a ResNet that has been pre-trained on the ImageNet data set; and a-2.2) performing semantic segmentation on the extracted image, and performing, by using the ResNet in the step a-2.1), feature extraction on the image obtained after semantic segmentation.
Further, the step b) includes the following steps: b-1) preprocessing data of the text description of the person by using the natural language toolkit (NLTK) library, and extracting noun phrases constituted by an adjective plus a noun and noun phrases constituted by a plurality of superposed nouns; b-2) sorting the extracted noun phrases based on word frequency, discarding low-frequency phrases, and constructing an attribute table by using the first 400 noun phrases, to obtain the text attribute feature $C_T$; and b-3) training on the PA-100K data set to obtain 26 prediction values for the image, and marking an image attribute with a prediction value greater than 0 as 1 and an image attribute with a prediction value less than 0 as 0, to obtain the image attribute feature $C_I$.
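As an illustration of the step b), the sketch below extracts the two kinds of noun phrases with NLTK, keeps the most frequent ones as the text attribute table, and binarises image-attribute predictions at zero. The chunk grammar, the frequency handling, and the `predict_pa100k` helper are assumptions introduced here, not the exact tooling of the disclosure.

```python
# Sketch of dual-attribute extraction (step b), assuming NLTK is installed
# (requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger'))
# and `predict_pa100k` is a hypothetical model returning 26 raw attribute scores.
from collections import Counter
import nltk

GRAMMAR = r"""
  NP: {<JJ>+<NN.*>+}
      {<NN.*><NN.*>+}
"""                                   # adjective+noun phrases, or stacked nouns
chunker = nltk.RegexpParser(GRAMMAR)

def noun_phrases(sentence):
    tree = chunker.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        yield " ".join(word for word, _ in subtree.leaves())

def build_text_attribute_table(descriptions, top_k=400):
    # b-2): keep the top_k most frequent noun phrases as text attributes
    counts = Counter(p for d in descriptions for p in noun_phrases(d))
    return [phrase for phrase, _ in counts.most_common(top_k)]

def text_attribute_vector(description, attribute_table):
    phrases = set(noun_phrases(description))
    return [1 if a in phrases else 0 for a in attribute_table]

def image_attribute_vector(image, predict_pa100k):
    # b-3): binarise the 26 PA-100K attribute predictions at zero
    return [1 if score > 0 else 0 for score in predict_pa100k(image)]
```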
Further, the step c) includes the following steps:
c-1) calculating the triplet loss $L_{trip}(I,T)$ of the hard sample according to a formula
$$L_{trip}(I,T)=\sum_{I_k\in I}\max\big(\rho_1+S(I_k,T_k^{n})-S(I_k,T_k^{p}),\,0\big)+\sum_{T_k\in T}\max\big(\rho_1+S(T_k,I_k^{n})-S(T_k,I_k^{p}),\,0\big)$$
where $I_k$ represents a feature of the $k$-th image and is used as an anchor, $T_k^{n}$ represents the feature, closest to the anchor $I_k$, of a heterogeneous text sample, $T_k^{p}$ represents the feature, farthest from the anchor $I_k$, of a congeneric text sample, $T_k$ represents a feature of the $k$-th text description of the person and is used as an anchor, $I_k^{n}$ represents the feature, closest to the anchor $T_k$, of a heterogeneous image sample, $I_k^{p}$ represents the feature, farthest from the anchor $T_k$, of a congeneric image sample, $\rho_1$ represents a boundary (margin) of the triplet loss, and $S(\cdot,\cdot)$ represents cosine similarity calculation;
c-2) calculating a cosine similarity between $I_k$ and $T_k$ according to a formula
$$S(I_k,T_k)=\frac{I_k^{\top}T_k}{\lVert I_k\rVert\,\lVert T_k\rVert}$$
where $I_k$ represents a feature of the $k$-th image in the shared subspace, and $T_k$ represents a feature of the $k$-th text description of the person in the shared subspace;
c-3) calculating a classification loss $L_{cls}(I_k)$ of the image feature $I_k$ in the shared subspace according to a formula
$$L_{cls}(I_k)=-\log\left(\frac{\exp\big(I_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(I_k^{\top}W_j+b_j\big)}\right)$$
where $I_k^{\top}$ represents the transposed image feature in the shared subspace, $W$ represents a classifier, $W\in\mathbb{R}^{d_1\times C}$, $d_1$ represents the feature dimension of the shared subspace, $C$ represents the quantity of ID information classes of the person, $y_k$ represents the ID information of $I_k$, $b$ represents a bias vector, $W_j$ represents the classification vector of the $j$-th class, $b_j$ represents the bias value of the $j$-th class, $W_{y_k}$ represents the classification vector corresponding to the $y_k$-th class, and $b_{y_k}$ represents the bias value of the $y_k$-th class; and calculating a classification loss $L_{cls}(T_k)$ of the text description feature of the person in the shared subspace according to a formula
$$L_{cls}(T_k)=-\log\left(\frac{\exp\big(T_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(T_k^{\top}W_j+b_j\big)}\right)$$
where $T_k^{\top}$ represents the transposed text feature in the shared subspace; and
c-4) calculating a loss function $L_{latent}(I,T)$ of the shared subspace according to a formula
$$L_{latent}(I,T)=\frac{1}{n}L_{trip}(I,T)+\frac{1}{n}\sum_{k}\big(L_{cls}(I_k)+L_{cls}(T_k)\big)$$
where $n$ represents the quantity of samples in one batch.
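A minimal PyTorch sketch of the step c) objective follows, assuming the image and text features arrive with one matching person ID per row and that features are L2-normalised so a matrix product yields cosine similarities. The margin value and function names are illustrative; `F.cross_entropy`'s mean reduction supplies the $1/n$ factor of c-4).

```python
# Sketch of the hard-sample triplet loss (c-1) and Softmax classification
# loss (c-3/c-4), assuming PyTorch.  Names and the margin are illustrative.
import torch
import torch.nn.functional as F

def hard_triplet_loss(img, txt, labels, margin=0.2):
    img, txt = F.normalize(img, dim=1), F.normalize(txt, dim=1)
    sim = img @ txt.t()                                  # S(I_k, T_j) for all pairs
    same = labels.unsqueeze(1) == labels.unsqueeze(0)    # same person ID?
    # image anchors: closest heterogeneous text, farthest congeneric text
    hard_neg_i = sim.masked_fill(same, float("-inf")).max(dim=1).values
    hard_pos_i = sim.masked_fill(~same, float("inf")).min(dim=1).values
    # text anchors: symmetric terms taken over the columns
    hard_neg_t = sim.masked_fill(same, float("-inf")).max(dim=0).values
    hard_pos_t = sim.masked_fill(~same, float("inf")).min(dim=0).values
    loss_i = F.relu(margin + hard_neg_i - hard_pos_i).sum()
    loss_t = F.relu(margin + hard_neg_t - hard_pos_t).sum()
    return loss_i + loss_t

def latent_loss(img, txt, labels, classifier, margin=0.2):
    # c-4): triplet term divided by batch size plus the mean ID classification term
    n = img.size(0)
    cls = F.cross_entropy(classifier(img), labels) + \
          F.cross_entropy(classifier(txt), labels)
    return hard_triplet_loss(img, txt, labels, margin) / n + cls
```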
Further, the step d) includes the following steps:
d-1) calculating a loss function $L_{coral}(I,T)$ according to a formula
$$L_{coral}(I,T)=\frac{1}{4d_1^{2}}\big\lVert C_I-C_T\big\rVert_F^{2}$$
where the image feature matrix $I$ is constituted by the features $I_k$ in a batch, the text description feature matrix $T$ of the person is constituted by the features $T_k$, $C_I$ and $C_T$ represent the covariance matrices of $I$ and $T$ respectively, $d_1$ represents the dimension of $I_k$ and $T_k$, and $\lVert\cdot\rVert_F$ represents the Frobenius norm;
d-2) calculating, according to a formula
$$t=\mathrm{sigmoid}\big(C\times U_c+F\times U_f\big)$$
the weights of the attribute feature and the image or text feature during feature fusion, where $C$ represents the to-be-fused attribute feature, $F$ represents the to-be-fused image or text feature, $U_c$ and $U_f$ are projection matrices, $t$ represents the weight, during feature fusion, obtained by adding up the projection results and processing the obtained result by using a sigmoid function, $U_c\in\mathbb{R}^{s\times d_a}$, $U_f\in\mathbb{R}^{d_a\times d_a}$, $s$ represents the quantity of image attribute classes or text attribute classes, and $d_a$ represents the feature dimension of the attribute space; and
d-3) calculating a fused feature $A$ according to a formula
$$A=t\times\big[C\times W_c\big]+(1-t)\times\big[F\times W_f\big]$$
where $W_c\in\mathbb{R}^{s\times d_a}$ and $W_f\in\mathbb{R}^{d_a\times d_a}$ represent projection matrices.
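The sketch below illustrates the step d) under the same assumptions: a CORAL-style loss that matches the covariance matrices of the two modalities (d-1), and a sigmoid-gated fusion of an attribute feature with an image or text feature (d-2, d-3). Layer sizes are placeholders, not the disclosed values.

```python
# Sketch of the CORAL distribution-alignment loss (d-1) and the gated
# attribute fusion (d-2/d-3), assuming PyTorch.  The linear layers stand in
# for the projection matrices U_c, U_f, W_c, W_f named in the text.
import torch
import torch.nn as nn

def coral_loss(img, txt):
    # d-1): squared Frobenius distance between the two covariance matrices
    d = img.size(1)
    def cov(x):
        x = x - x.mean(dim=0, keepdim=True)
        return x.t() @ x / (x.size(0) - 1)
    return ((cov(img) - cov(txt)) ** 2).sum() / (4 * d * d)

class GatedAttributeFusion(nn.Module):
    def __init__(self, attr_classes, feat_dim, attr_dim):
        super().__init__()
        self.u_c = nn.Linear(attr_classes, attr_dim, bias=False)   # U_c
        self.u_f = nn.Linear(feat_dim, attr_dim, bias=False)       # U_f
        self.w_c = nn.Linear(attr_classes, attr_dim, bias=False)   # W_c
        self.w_f = nn.Linear(feat_dim, attr_dim, bias=False)       # W_f

    def forward(self, attr, feat):
        t = torch.sigmoid(self.u_c(attr) + self.u_f(feat))    # d-2): fusion weight
        return t * self.w_c(attr) + (1 - t) * self.w_f(feat)  # d-3): fused feature A
```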
Further, the step e) includes the following steps:
e-1) calculating a triplet loss $L_{atrip}(I,T)$ of the attribute space according to a formula
$$L_{atrip}(I,T)=\sum_{I_k^{s}\in I}\max\big(\rho_2+S_a(I_k^{s},T_k^{sn})-S_a(I_k^{s},T_k^{sp}),\,0\big)+\sum_{T_k^{s}\in T}\max\big(\rho_2+S_a(T_k^{s},I_k^{sn})-S_a(T_k^{s},I_k^{sp}),\,0\big)$$
where $\rho_2$ represents a boundary (margin) of the triplet loss, $S_a(\cdot,\cdot)$ represents cosine similarity calculation, $I_k^{s}$ represents a feature of the $k$-th image in the attribute space and is used as an anchor, $T_k^{sn}$ represents the feature, closest to the anchor $I_k^{s}$, of a heterogeneous text sample, $T_k^{sp}$ represents the feature, farthest from the anchor $I_k^{s}$, of a congeneric text sample, $T_k^{s}$ represents a feature of the $k$-th text description of the person in the attribute space and is used as an anchor, $I_k^{sn}$ represents the feature, closest to the anchor $T_k^{s}$, of a heterogeneous image sample, and $I_k^{sp}$ represents the feature, farthest from the anchor $T_k^{s}$, of a congeneric image sample;
e-2) calculating a cosine similarity between $a_{I_k}$ and $a_{T_k}$ according to a formula
$$S_a(I_k,T_k)=\frac{a_{I_k}^{\top}a_{T_k}}{\lVert a_{I_k}\rVert\,\lVert a_{T_k}\rVert}$$
where $a_{I_k}$ and $a_{T_k}$ respectively represent an image feature with semantic information and a text feature with semantic information that are obtained after attribute information fusion in the attribute space; and
e-3) calculating a loss function $L_{attr}(I,T)$ of the attribute space according to a formula
$$L_{attr}(I,T)=\frac{L_{atrip}(I,T)}{n}+L_{coral}(I,T)$$
Further, the step f) includes the following steps:
f-1) calculating a loss function $L(I,T)$ of the dual-attribute network according to a formula
$$L(I,T)=L_{latent}(I,T)+L_{attr}(I,T)$$
f-2) calculating a similarity $A(I_k,T_k)$ between the dual attributes according to a formula
$$A(I_k,T_k)=A_l(I_k,T_k)+A_a(a_{I_k},a_{T_k})$$
where $A_l$ represents the similarity calculated between the features $I_k$ and $T_k$ learned from the shared subspace, and $A_a$ represents the similarity calculated between the features $a_{I_k}$ and $a_{T_k}$ learned from the attribute space; and
f-3) calculating cross-modality matching accuracy based on the similarity $A(I_k,T_k)$.
The present disclosure has the following beneficial effects: the cross-modality person re-identification method based on dual-attribute information extracts rich semantic information by making full use of data of two modalities.
A space construction and attribute fusion algorithm based on text and image attributes is provided.
An end-to-end cross-modality person re-identification network based on the hidden space and the attribute space is constructed to improve the semantic expressiveness of the feature extracted by using the model.
To resolve the problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person identification network based on the hidden space and the attribute space is proposed, greatly improving the semantic expressiveness of the extracted feature and making full use of the attribute information of the person.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flowchart of the present disclosure; FIG. 2 shows the changes of the loss functions in the model training process according to the present disclosure; and FIG. 3 compares the method in the present disclosure with existing methods in terms of Top-k accuracy on the CUHK-PEDES data set.
DETAILED DESCRIPTION OF THE EMBODIMENTS
The present disclosure is further described with reference to FIG. 1, FIG. 2, and FIG. 3. As shown in FIG. 1, a cross-modality person re-identification method based on dual-attribute information includes the following steps.
a) Extract a text description feature $T$ and an image feature $I$ of a person from content obtained by a surveillance camera. The present disclosure is intended to establish a semantic association between an image captured by the surveillance camera for a person in a real scenario and a corresponding text description of the person. Feature representations of the data of the two modalities need to be extracted separately. The image feature is extracted by using the currently popular convolutional neural network ResNet, and the text feature is extracted by using a bidirectional LSTM, so that text context information can be fully obtained.
b) Extract a text attribute feature $C_T$ from the extracted text description of the person, and extract an image attribute feature $C_I$ from the extracted image. To resolve the problem that the semantic expressiveness of a feature is poor because an existing method does not make full use of attribute information, the present disclosure is designed to use attribute information of the person as auxiliary information to improve the semantic expressiveness of the image and text features. An image attribute of the person is extracted by using an existing stable person-specific image attribute extraction model. A text attribute of the person comes from statistical information in a data set, and a noun phrase with a relatively high word frequency in the data set is used as the text attribute of the person in the present disclosure.
c) Input the text description feature and the image feature of the person in the step a) into shared subspace, calculate a triplet loss function of a hard sample, and calculate a classification loss of a feature in the shared subspace by using a Softmax loss function. Projection to shared vector space is a frequently used method for resolving a cross-modality retrieval problem. In the shared vector space, an association between data of two modalities can be established. The present disclosure projects the extracted image and text features to the shared vector subspace, and adopts metric learning to decrease a distance between image and text features with the same person information and increase a distance between image and text features belonging to different persons. The present disclosure uses a triplet loss of the hard sample to achieve the above purpose. That is, in a batch of data, a heterogeneous sample that is of the other modality and closest to the anchor data, and a congeneric sample that is of the other modality and farthest from the anchor data, need to be found.
d) Fuse the text description feature $T$ and the image feature $I$ of the person with the text attribute feature $C_T$ and the image attribute feature $C_I$. The existing method does not make full use of the auxiliary function of the attribute information or uses only attribute information of one modality, resulting in poor semantic expressiveness of the feature that can be extracted by using the model. To resolve this problem, the present disclosure uses the extracted dual-attribute information, namely, the image and text attributes. Considering that different attributes play different roles in image and text matching of the person, the present disclosure uses a weight mechanism to enable semantic information to play a more important role in feature fusion. The present disclosure uses a strategy of matrix projection to project the to-be-fused image and text features and attribute features to the same dimensional space, and then weights the two types of features to obtain image and text features fused with the semantic information. Before feature fusion, to avoid a large difference between the distributions of features of the two modalities, the present disclosure uses the frequently used loss function coral to minimize the difference between the distributions of data of the two modalities.
e) Construct feature attribute space based on the attribute information, which is referred to as attribute space in the present disclosure. The image and text features fused with the semantic information are also sent to the shared subspace. In the present disclosure, the image and text features with the same person information have a same semantic meaning by default. In the attribute space, the present disclosure still uses the triplet loss of the hard sample to establish a semantic association between image and text features that are of the person and are of different modalities.
f) Retrieve and match the extracted image feature and text description feature of the person. The finally extracted image and text features in the present disclosure include features extracted from hidden space and features extracted from the attribute space. When the extracted model features are retrieved and matched, a cosine distance is used to calculate a distance between two model features in feature space, to measure their similarity. To make ID information, of the person, learned from the hidden space and the semantic information, of the person, learned from the attribute space complementary, the present disclosure adds up similarity matrices of the two types of features before sorting.
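For the step f), retrieval might be sketched as follows: the cosine-similarity matrices computed in the hidden space and in the attribute space are added before sorting, and Rank-k (Top-k) accuracy is read off the sorted rows. Variable names and tensor shapes below are illustrative assumptions.

```python
# Sketch of dual-space retrieval (step f), assuming PyTorch tensors of query
# text features and gallery image features from both spaces, with one
# ground-truth person ID per row.  Names and shapes are illustrative.
import torch
import torch.nn.functional as F

def cosine_matrix(a, b):
    return F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()

def rank_k_accuracy(txt_latent, img_latent, txt_attr, img_attr,
                    txt_ids, img_ids, k=10):
    # add the similarity matrices of the two spaces before sorting
    sim = cosine_matrix(txt_latent, img_latent) + cosine_matrix(txt_attr, img_attr)
    top = sim.topk(k, dim=1).indices                   # best k gallery images per query
    hits = (img_ids[top] == txt_ids.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()                  # Top-k matching accuracy
```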
To resolve a problem that the existing cross-modality person re-identification method cannot effectively use the attribute information of the person as the auxiliary information to improve the semantic expressiveness of the image and text features, the present disclosure provides an efficient cross-modality person re-identification method based on dual-attribute information, to extract rich semantic information by making full use of data of two modalities, and provides a space construction and attribute fusion algorithm based on text and image attributes. An end-to-end cross-modality person re-identification network based on the hidden space and the attribute space is constructed to improve the semantic expressiveness of the feature extracted by using the model. To resolve a problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person identification network based on the hidden space and the attribute space is proposed, to greatly improve the semantic expressiveness of the extracted feature and make full use of the attribute information of the person.
Embodiment 1
The extracting a text description feature of a person in the step a) includes the following steps: a-1.1) Preprocess the text information when performing feature extraction on the text of the person; in other words, segment the words in a description statement of the content obtained by the surveillance camera, and establish a word frequency table.
a-1.2) Filter out a low-frequency word in the word frequency table.
a-1.3) Perform one-hot encoding to encode a word in the word frequency table.
a-1.4) Perform feature extraction on the text description of the person by using a bidirectional LSTM model. The bidirectional LSTM model can fully consider a context of each word, so that richer text features are learned.
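A compact sketch of the steps a-1.1) to a-1.4) follows: build the word frequency table, drop rare words, map the remaining vocabulary to indices (standing in for the one-hot encoding), and encode each description with a bidirectional LSTM. The frequency threshold and layer sizes are assumptions, not the values used in the disclosure.

```python
# Sketch of the text branch (a-1.1 to a-1.4), assuming PyTorch.
from collections import Counter
import torch
import torch.nn as nn

def build_vocab(descriptions, min_freq=2):
    # a-1.1): word segmentation and word frequency table
    freq = Counter(w for d in descriptions for w in d.lower().split())
    words = [w for w, c in freq.items() if c >= min_freq]   # a-1.2): drop rare words
    return {w: i + 1 for i, w in enumerate(words)}          # index 0 reserved for padding

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden=256):
        super().__init__()
        # a-1.3): an embedding lookup plays the role of the one-hot encoding
        self.embed = nn.Embedding(vocab_size + 1, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden, bidirectional=True,
                              batch_first=True)

    def forward(self, token_ids):
        out, _ = self.bilstm(self.embed(token_ids))   # a-1.4): context in both directions
        return out.mean(dim=1)                        # one pooled feature per description
```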
The extracting an image feature in the step a) includes the following steps: a-2.1) Perform feature extraction on the image by using a ResNet that has been pre-trained on an ImageNet data set.
a-2.2) Perform semantic segmentation on the extracted image, and perform, by using the ResNet in the step a-2.1), feature extraction on the image obtained after semantic segmentation.
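The image branch of the step a) might look like the sketch below, where an ImageNet-pre-trained ResNet-50 encodes both the full image and a person-masked copy obtained from semantic segmentation. Here `segment_person` is a hypothetical placeholder for any off-the-shelf segmenter, and concatenating the two features is an assumption rather than the disclosed design.

```python
# Sketch of the image branch (a-2.1/a-2.2), assuming PyTorch and torchvision.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)  # a-2.1): ImageNet pre-training
backbone.fc = nn.Identity()                                  # keep the 2048-d pooled feature
backbone.eval()

@torch.no_grad()
def image_features(images, segment_person):
    global_feat = backbone(images)        # feature of the full image
    masks = segment_person(images)        # (B,1,H,W) person masks, hypothetical helper
    part_feat = backbone(images * masks)  # a-2.2): feature of the segmented image
    return torch.cat([global_feat, part_feat], dim=1)
```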
Embodiment 2
Many efforts have been made for person-specific image attribute identification, and good effects have been achieved. The present disclosure uses a stable person-specific attribute identification model to extract the attributes contained in an image of a person and the possible attribute values in the data set. The step b) includes the following steps: b-1) Preprocess data of the text description of the person by using the NLTK tool library, and extract noun phrases constituted by an adjective plus a noun and noun phrases constituted by a plurality of superposed nouns.
b-2) Sort the extracted noun phrases based on word frequency, discard low-frequency phrases, and construct an attribute table by using the first 400 noun phrases, to obtain the text attribute feature $C_T$. b-3) Train on the PA-100K data set to obtain 26 prediction values for the image, and mark an image attribute with a prediction value greater than 0 as 1 and an image attribute with a prediction value less than 0 as 0, to obtain the image attribute feature $C_I$.
Embodiment 3
The present disclosure uses a frequently used shared subspace method in the field of cross-modality person re-identification to establish an association between the feature vectors of the two modalities.
The hidden space is set to enable both the image feature and the text feature of the person to have separability of a person ID, and to enable the image and text features to have a basic semantic association.
Considering that, in cross-modality person-specific image and text retrieval, a same person ID corresponds to a plurality of images and a plurality of corresponding text descriptions, the present disclosure designs the loss function to decrease a distance between an image and a text description that belong to a same person ID, and increase a distance between an image and text that belong to different person IDs.
Specifically, data of one modality is used as an anchor.
Data that is of another modality and belongs to a type the same as that of the anchor is used as a positive sample, and data belonging to a type different from that of the anchor is used as a negative sample.
This not only realizes classification, but also establishes, to a certain extent, a correspondence between an image and a text description that have same semantics but are of different modalities.
In an experiment, an image and a text description of a same person have same semantic information by default.
The step c) includes the following steps:
c-1) Calculate the triplet loss $L_{trip}(I,T)$ of the hard sample according to a formula
$$L_{trip}(I,T)=\sum_{I_k\in I}\max\big(\rho_1+S(I_k,T_k^{n})-S(I_k,T_k^{p}),\,0\big)+\sum_{T_k\in T}\max\big(\rho_1+S(T_k,I_k^{n})-S(T_k,I_k^{p}),\,0\big)$$
where $I_k$ represents a feature of the $k$-th image and is used as an anchor, $T_k^{n}$ represents the feature, closest to the anchor $I_k$, of a heterogeneous text sample, $T_k^{p}$ represents the feature, farthest from the anchor $I_k$, of a congeneric text sample, $T_k$ represents a feature of the $k$-th text description of the person and is used as an anchor, $I_k^{n}$ represents the feature, closest to the anchor $T_k$, of a heterogeneous image sample, $I_k^{p}$ represents the feature, farthest from the anchor $T_k$, of a congeneric image sample, $\rho_1$ represents a boundary (margin) of the triplet loss, and $S(\cdot,\cdot)$ represents cosine similarity calculation.
c-2) Calculate a cosine similarity between $I_k$ and $T_k$ according to a formula
$$S(I_k,T_k)=\frac{I_k^{\top}T_k}{\lVert I_k\rVert\,\lVert T_k\rVert}$$
where $I_k$ represents a feature of the $k$-th image in the shared subspace, and $T_k$ represents a feature of the $k$-th text description of the person in the shared subspace.
c-3) Calculate a classification loss $L_{cls}(I_k)$ of the image feature $I_k$ in the shared subspace according to a formula
$$L_{cls}(I_k)=-\log\left(\frac{\exp\big(I_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(I_k^{\top}W_j+b_j\big)}\right)$$
where $I_k^{\top}$ represents the transposed image feature in the shared subspace, $W$ represents a classifier, $W\in\mathbb{R}^{d_1\times C}$, $d_1$ represents the feature dimension of the shared subspace, $C$ represents the quantity of ID information classes of the person, $y_k$ represents the ID information of $I_k$, $b$ represents a bias vector, $W_j$ represents the classification vector of the $j$-th class, $b_j$ represents the bias value of the $j$-th class, $W_{y_k}$ represents the classification vector corresponding to the $y_k$-th class, and $b_{y_k}$ represents the bias value of the $y_k$-th class; and calculate a classification loss $L_{cls}(T_k)$ of the text description feature of the person in the shared subspace according to a formula
$$L_{cls}(T_k)=-\log\left(\frac{\exp\big(T_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(T_k^{\top}W_j+b_j\big)}\right)$$
where $T_k^{\top}$ represents the transposed text feature in the shared subspace.
c-4) Calculate a loss function $L_{latent}(I,T)$ of the shared subspace according to a formula
$$L_{latent}(I,T)=\frac{1}{n}L_{trip}(I,T)+\frac{1}{n}\sum_{k}\big(L_{cls}(I_k)+L_{cls}(T_k)\big)$$
where $n$ represents the quantity of samples in one batch.
Embodiment 4
Before the image and text features are fused with the attribute features, to avoid an excessive difference between the distributions of data of the two modalities, the present disclosure uses the coral function in transfer learning to decrease the distance between the data of the two modalities.
Specifically, the step d) includes the following steps:
d-1) Calculate a loss function $L_{coral}(I,T)$ according to a formula
$$L_{coral}(I,T)=\frac{1}{4d_1^{2}}\big\lVert C_I-C_T\big\rVert_F^{2}$$
where the image feature matrix $I$ is constituted by the features $I_k$ in a batch, the text description feature matrix $T$ of the person is constituted by the features $T_k$, $C_I$ and $C_T$ represent the covariance matrices of $I$ and $T$ respectively, $d_1$ represents the dimension of $I_k$ and $T_k$, and $\lVert\cdot\rVert_F$ represents the Frobenius norm.
d-2) Calculate, according to a formula
$$t=\mathrm{sigmoid}\big(C\times U_c+F\times U_f\big)$$
the weights of the attribute feature and the image or text feature during feature fusion, where $C$ represents the to-be-fused attribute feature, $F$ represents the to-be-fused image or text feature, $U_c$ and $U_f$ are projection matrices, $t$ represents the weight, during feature fusion, obtained by adding up the projection results and processing the obtained result by using a sigmoid function, $U_c\in\mathbb{R}^{s\times d_a}$, $U_f\in\mathbb{R}^{d_a\times d_a}$, $s$ represents the quantity of image attribute classes or text attribute classes, and $d_a$ represents the feature dimension of the attribute space.
d-3) Calculate a fused feature $A$ according to a formula
$$A=t\times\big[C\times W_c\big]+(1-t)\times\big[F\times W_f\big]$$
where $W_c\in\mathbb{R}^{s\times d_a}$ and $W_f\in\mathbb{R}^{d_a\times d_a}$ represent projection matrices.
Embodiment 5
In the hidden space, the triplet loss is used to establish the association between the image feature and the text feature. In the attribute space, the triplet loss of the hard sample is used to establish a semantic association between features of different modalities.
Therefore, the step e) includes the following steps:
e-1) Calculate a triplet loss $L_{atrip}(I,T)$ of the attribute space according to a formula
$$L_{atrip}(I,T)=\sum_{I_k^{s}\in I}\max\big(\rho_2+S_a(I_k^{s},T_k^{sn})-S_a(I_k^{s},T_k^{sp}),\,0\big)+\sum_{T_k^{s}\in T}\max\big(\rho_2+S_a(T_k^{s},I_k^{sn})-S_a(T_k^{s},I_k^{sp}),\,0\big)$$
where $\rho_2$ represents a boundary (margin) of the triplet loss, $S_a(\cdot,\cdot)$ represents cosine similarity calculation, $I_k^{s}$ represents a feature of the $k$-th image in the attribute space and is used as an anchor, $T_k^{sn}$ represents the feature, closest to the anchor $I_k^{s}$, of a heterogeneous text sample, $T_k^{sp}$ represents the feature, farthest from the anchor $I_k^{s}$, of a congeneric text sample, $T_k^{s}$ represents a feature of the $k$-th text description of the person in the attribute space and is used as an anchor, $I_k^{sn}$ represents the feature, closest to the anchor $T_k^{s}$, of a heterogeneous image sample, and $I_k^{sp}$ represents the feature, farthest from the anchor $T_k^{s}$, of a congeneric image sample.
e-2) Calculate a cosine similarity between $a_{I_k}$ and $a_{T_k}$ according to a formula
$$S_a(I_k,T_k)=\frac{a_{I_k}^{\top}a_{T_k}}{\lVert a_{I_k}\rVert\,\lVert a_{T_k}\rVert}$$
where $a_{I_k}$ and $a_{T_k}$ respectively represent an image feature with semantic information and a text feature with semantic information that are obtained after attribute information fusion in the attribute space.
e-3) Calculate a loss function $L_{attr}(I,T)$ of the attribute space according to a formula
$$L_{attr}(I,T)=\frac{L_{atrip}(I,T)}{n}+L_{coral}(I,T)$$
Embodiment 6
In a process of model learning, the hidden space and the attribute space are trained at the same time. The step f) includes the following steps:
f-1) Calculate a loss function $L(I,T)$ of the dual-attribute network according to a formula
$$L(I,T)=L_{latent}(I,T)+L_{attr}(I,T)$$
As shown in FIG. 2, the change curves of the three loss functions in the training process are roughly consistent, which proves the applicability and rationality of the present disclosure.
f-2) To make the ID information of the person learned from the hidden space and the semantic information of the person learned from the attribute space complementary in a test process, calculate a similarity $A(I_k,T_k)$ between the dual attributes according to a formula
$$A(I_k,T_k)=A_l(I_k,T_k)+A_a(a_{I_k},a_{T_k})$$
where $A_l$ represents the similarity calculated between the features $I_k$ and $T_k$ learned from the shared subspace, and $A_a$ represents the similarity calculated between the features $a_{I_k}$ and $a_{T_k}$ learned from the attribute space.
f-3) Calculate the cross-modality matching accuracy based on the finally obtained similarity $A(I_k,T_k)$. As shown in FIG. 3, the performance of the method in the present disclosure is significantly improved compared with the performance of the five existing methods listed in the table.
The above embodiments are only used for describing the technical solutions of the present disclosure and are not intended to limit the present disclosure. Although the present disclosure is described in detail with reference to the embodiments, those of ordinary skill in the art should understand that various modifications or equivalent substitutions may be made to the technical solutions of the present disclosure without departing from the spirit and scope of the technical solutions of the present disclosure, and such modifications or equivalent substitutions should be encompassed within the scope of the claims of the present disclosure.
Claims (8)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010805183.XA CN112001279B (en) | 2020-08-12 | 2020-08-12 | Cross-modal pedestrian re-identification method based on dual attribute information |
Publications (2)
Publication Number | Publication Date |
---|---|
NL2028092A NL2028092A (en) | 2021-07-28 |
NL2028092B1 true NL2028092B1 (en) | 2022-04-06 |
Family
ID=73464076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
NL2028092A NL2028092B1 (en) | 2020-08-12 | 2021-04-29 | Cross-modality person re-identification method based on dual-attribute information |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112001279B (en) |
NL (1) | NL2028092B1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507853B (en) * | 2020-12-02 | 2024-05-14 | 西北工业大学 | Cross-modal pedestrian re-recognition method based on mutual attention mechanism |
CN114612927B (en) * | 2020-12-09 | 2023-05-09 | 四川大学 | Pedestrian re-recognition method based on image text double-channel combination |
CN113627151B (en) * | 2021-10-14 | 2022-02-22 | 北京中科闻歌科技股份有限公司 | Cross-modal data matching method, device, equipment and medium |
CN114036336A (en) * | 2021-11-15 | 2022-02-11 | 上海交通大学 | Semantic division-based pedestrian image searching method based on visual text attribute alignment |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9400925B2 (en) * | 2013-11-15 | 2016-07-26 | Facebook, Inc. | Pose-aligned networks for deep attribute modeling |
GB201703602D0 (en) * | 2017-03-07 | 2017-04-19 | Selerio Ltd | Multi-Modal image search |
CN107562812B (en) * | 2017-08-11 | 2021-01-15 | 北京大学 | Cross-modal similarity learning method based on specific modal semantic space modeling |
CN109344266B (en) * | 2018-06-29 | 2021-08-06 | 北京大学深圳研究生院 | Dual-semantic-space-based antagonistic cross-media retrieval method |
US11138469B2 (en) * | 2019-01-15 | 2021-10-05 | Naver Corporation | Training and using a convolutional neural network for person re-identification |
CN109829430B (en) * | 2019-01-31 | 2021-02-19 | 中科人工智能创新技术研究院(青岛)有限公司 | Cross-modal pedestrian re-identification method and system based on heterogeneous hierarchical attention mechanism |
CN110021051B (en) * | 2019-04-01 | 2020-12-15 | 浙江大学 | Human image generation method based on generation of confrontation network through text guidance |
CN110321813B (en) * | 2019-06-18 | 2023-06-20 | 南京信息工程大学 | Cross-domain pedestrian re-identification method based on pedestrian segmentation |
CN110909605B (en) * | 2019-10-24 | 2022-04-26 | 西北工业大学 | Cross-modal pedestrian re-identification method based on contrast correlation |
- 2020-08-12: CN application CN202010805183.XA (patent CN112001279B), status: active
- 2021-04-29: NL application NL2028092A (patent NL2028092B1), status: active
Also Published As
Publication number | Publication date |
---|---|
CN112001279A (en) | 2020-11-27 |
NL2028092A (en) | 2021-07-28 |
CN112001279B (en) | 2022-02-01 |
Similar Documents
Publication | Title |
---|---|
NL2028092B1 (en) | Cross-modality person re-identification method based on dual-attribute information | |
Srihari | Automatic indexing and content-based retrieval of captioned images | |
CN104063683B (en) | Expression input method and device based on face identification | |
US8024343B2 (en) | Identifying unique objects in multiple image collections | |
CN110826337A (en) | Short text semantic training model obtaining method and similarity matching algorithm | |
US20070286497A1 (en) | System and Method for Comparing Images using an Edit Distance | |
CN114743020B (en) | Food identification method combining label semantic embedding and attention fusion | |
EP2005366A2 (en) | Forming connections between image collections | |
Carneiro et al. | A database centric view of semantic image annotation and retrieval | |
CN113688894A (en) | Fine-grained image classification method fusing multi-grained features | |
CN111046732A (en) | Pedestrian re-identification method based on multi-granularity semantic analysis and storage medium | |
CN114036336A (en) | Semantic division-based pedestrian image searching method based on visual text attribute alignment | |
CN114611672B (en) | Model training method, face recognition method and device | |
CN112990120B (en) | Cross-domain pedestrian re-identification method using camera style separation domain information | |
WO2006122164A2 (en) | System and method for enabling the use of captured images through recognition | |
CN113177612A (en) | Agricultural pest image identification method based on CNN few samples | |
CN112347223A (en) | Document retrieval method, document retrieval equipment and computer-readable storage medium | |
CN111783903A (en) | Text processing method, text model processing method and device and computer equipment | |
CN116152870A (en) | Face recognition method, device, electronic equipment and computer readable storage medium | |
CN112463922A (en) | Risk user identification method and storage medium | |
CN113158777A (en) | Quality scoring method, quality scoring model training method and related device | |
CN107273859B (en) | Automatic photo marking method and system | |
CN111260114A (en) | Low-frequency confusable criminal name prediction method for integrating case auxiliary sentence | |
CN113157974B (en) | Pedestrian retrieval method based on text expression | |
Sahbi et al. | From coarse to fine skin and face detection |