NL2028092B1 - Cross-modality person re-identification method based on dual-attribute information - Google Patents
- Publication number
- NL2028092B1, NL2028092A
- Authority
- NL
- Netherlands
- Prior art keywords
- attribute
- feature
- image
- text
- person
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06V40/25—Recognition of walking or running movements, e.g. gait recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/10—Interfaces, programming languages or software development kits, e.g. for simulating neural networks
- G06N3/105—Shells for specifying net layout
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Image Analysis (AREA)
Abstract
The present disclosure provides a cross-modality person re-identification method based on dual-attribute information, which extracts rich semantic information by making full use of data of two modalities, and provides a space construction and attribute fusion algorithm based on text and image attributes. An end-to-end cross-modality person re-identification network based on hidden space and attribute space is constructed to improve the semantic expressiveness of the feature extracted by the model. To resolve the problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person identification network based on the hidden space and the attribute space is proposed, greatly improving the semantic expressiveness of the extracted feature and making full use of the attribute information of a person.
Description
CROSS-MODALITY PERSON RE-IDENTIFICATION METHOD BASED ON DUAL-ATTRIBUTE INFORMATION
TECHNICAL FIELD
The present disclosure relates to the fields of computer vision and deep learning, and specifically, to a cross-modality person re-identification method based on dual-attribute information.
BACKGROUND
In the information age, video surveillance plays an invaluable role in maintaining public safety. Person re-identification is a crucial subtask in a video surveillance scenario, and is intended to find photos of a same person from image data generated by different surveillance cameras. Public safety monitoring facilities are increasingly widely applied, resulting in massive image data collection. How to quickly and accurately find a target person in the massive image data is a research hotspot in the field of computer vision. However, in some specific emergency scenarios, an image matching a to-be-found person cannot be provided in time as a basis for retrieval, and only an oral description can be provided. Therefore, cross-modality person re-identification based on a text description emerges. Cross-modality person re-identification is to find, in an image library based on a natural language description of a person, the image most conforming to the text description information. With the development of deep learning technologies and their superior performance in different tasks, researchers have proposed some deep learning-related cross-modality person re-identification algorithms. These algorithms can be roughly classified into: 1) a semantic intimacy value calculation method, which is used to calculate an intimacy value of a semantic association between an image and text, to improve intimacy between an image and text that belong to a same class, and reduce intimacy between an image and text that belong to different classes; and 2) a subspace method, which is intended to establish shared feature expression space for images and text, and uses a metric learning strategy in the shared feature expression space to decrease a distance between image and text features belonging to a same person identity (ID) and to increase a distance between image and text features belonging to different person IDs. However, the semantic expressiveness of features extracted by using these methods still needs to be improved. These methods
ignore or do not fully consider the effectiveness of using attribute information of persons to express semantic concepts.
SUMMARY
To overcome the disadvantages of the above technology, the present disclosure provides a cross-modality person re-identification method using a space construction and attribute fusion algorithm based on text and image attributes. The technical solution used in the present disclosure to resolve the technical problem thereof is as follows: A cross-modality person re-identification method based on dual-attribute information includes the following steps: a) extracting a text description feature $T$ and an image feature $I$ of a person from content obtained by a surveillance camera;
b) extracting a text attribute feature $C_T$ from the extracted text description of the person, and extracting an image attribute feature $C_I$ from the extracted image; c) inputting the text description feature and the image feature of the person in the step a) into shared subspace, calculating a triplet loss function of a hard sample, and calculating a classification loss of a feature in the shared subspace by using a Softmax loss function; d) fusing the text description feature $T$ and the image feature $I$ of the person with the text attribute feature $C_T$ and the image attribute feature $C_I$; e) constructing feature attribute space based on attribute information; and f) retrieving and matching the extracted image feature and text description feature of the person.
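For orientation only, the following minimal PyTorch sketch shows how steps a) through e) could be wired together in one network. It is an illustrative assumption, not the claimed implementation: the module names (e.g. `DualAttributeNet`, `fuse`), the encoder sizes, and the reuse of the same projections for the fusion gate and the fused value are all simplifications introduced here.

```python
# Minimal end-to-end sketch of the dual-space pipeline, assuming PyTorch and
# torchvision are available.  Sizes and module names are illustrative only.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DualAttributeNet(nn.Module):
    def __init__(self, vocab_size=5000, txt_attr_dim=400, img_attr_dim=26,
                 shared_dim=512, attr_dim=512):
        super().__init__()
        # step a): modality encoders
        self.embed = nn.Embedding(vocab_size, 300)
        self.bilstm = nn.LSTM(300, 256, bidirectional=True, batch_first=True)
        backbone = resnet50(weights=None)          # pre-trained weights in practice
        backbone.fc = nn.Identity()
        self.cnn = backbone
        # step c): projections into the shared (hidden) subspace
        self.img_to_shared = nn.Linear(2048, shared_dim)
        self.txt_to_shared = nn.Linear(512, shared_dim)
        # step d): projections used by the gated attribute fusion
        self.u_img_attr = nn.Linear(img_attr_dim, attr_dim)
        self.u_txt_attr = nn.Linear(txt_attr_dim, attr_dim)
        self.u_img = nn.Linear(shared_dim, attr_dim)
        self.u_txt = nn.Linear(shared_dim, attr_dim)

    def fuse(self, attr_proj, feat_proj):
        # weight t in [0, 1] decides how much the attribute branch contributes;
        # for brevity the same projections serve the gate and the fused value
        t = torch.sigmoid(attr_proj + feat_proj)
        return t * attr_proj + (1 - t) * feat_proj

    def forward(self, images, token_ids, img_attrs, txt_attrs):
        img_feat = self.img_to_shared(self.cnn(images))            # I
        txt_hidden, _ = self.bilstm(self.embed(token_ids))
        txt_feat = self.txt_to_shared(txt_hidden.mean(dim=1))      # T
        # step e): attribute-space features fused with dual-attribute information
        img_attr_feat = self.fuse(self.u_img_attr(img_attrs), self.u_img(img_feat))
        txt_attr_feat = self.fuse(self.u_txt_attr(txt_attrs), self.u_txt(txt_feat))
        return img_feat, txt_feat, img_attr_feat, txt_attr_feat
```

The losses of steps c) to f) would then be computed on the four returned feature sets.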
Further, the extracting a text description feature of a person in the step a) includes the following steps: a-1.1) segmenting words in a description statement of the content obtained by the surveillance camera, and establishing a word frequency table; a-1.2) filtering out low-frequency words in the word frequency table; a-1.3) performing one-hot encoding to encode the words in the word frequency table; and a-1.4) performing feature extraction on the text description of the person by using a bidirectional long short-term memory (LSTM) model.
Further, the extracting an image feature in the step a) includes the following steps: a-2.1) performing feature extraction on the image by using a ResNet that has been pre-trained on the ImageNet data set; and a-2.2) performing semantic segmentation on the extracted image, and performing, by using the ResNet in the step a-2.1), feature extraction on the image obtained after semantic segmentation.
Further, the step b) includes the following steps: b-1) preprocessing data of the text description of the person by using the natural language toolkit (NLTK) library, and extracting noun phrases constituted by an adjective plus a noun and noun phrases constituted by a plurality of superposed nouns; b-2) sorting the extracted noun phrases based on word frequency, discarding low-frequency phrases, and constructing an attribute table by using the first 400 noun phrases, to obtain the text attribute feature $C_T$; and b-3) training on the PA-100K data set to obtain 26 prediction values for the image, and marking an image attribute with a prediction value greater than 0 as 1 and an image attribute with a prediction value less than 0 as 0, to obtain the image attribute feature $C_I$.
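As an illustration of the step b), the sketch below extracts the two kinds of noun phrases with NLTK, keeps the most frequent ones as the text attribute table, and binarises image-attribute predictions at zero. The chunk grammar, the frequency handling, and the `predict_pa100k` helper are assumptions introduced here, not the exact tooling of the disclosure.

```python
# Sketch of dual-attribute extraction (step b), assuming NLTK is installed
# (requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger'))
# and `predict_pa100k` is a hypothetical model returning 26 raw attribute scores.
from collections import Counter
import nltk

GRAMMAR = r"""
  NP: {<JJ>+<NN.*>+}
      {<NN.*><NN.*>+}
"""                                   # adjective+noun phrases, or stacked nouns
chunker = nltk.RegexpParser(GRAMMAR)

def noun_phrases(sentence):
    tree = chunker.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        yield " ".join(word for word, _ in subtree.leaves())

def build_text_attribute_table(descriptions, top_k=400):
    # b-2): keep the top_k most frequent noun phrases as text attributes
    counts = Counter(p for d in descriptions for p in noun_phrases(d))
    return [phrase for phrase, _ in counts.most_common(top_k)]

def text_attribute_vector(description, attribute_table):
    phrases = set(noun_phrases(description))
    return [1 if a in phrases else 0 for a in attribute_table]

def image_attribute_vector(image, predict_pa100k):
    # b-3): binarise the 26 PA-100K attribute predictions at zero
    return [1 if score > 0 else 0 for score in predict_pa100k(image)]
```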
Further, the step c) includes the following steps:
c-1) calculating the triplet loss $L_{trip}(I,T)$ of the hard sample according to a formula
$$L_{trip}(I,T)=\sum_{I_k\in I}\max\big(\rho_1+S(I_k,T_k^{n})-S(I_k,T_k^{p}),\,0\big)+\sum_{T_k\in T}\max\big(\rho_1+S(T_k,I_k^{n})-S(T_k,I_k^{p}),\,0\big)$$
where $I_k$ represents a feature of the $k$-th image and is used as an anchor, $T_k^{n}$ represents the feature, closest to the anchor $I_k$, of a heterogeneous text sample, $T_k^{p}$ represents the feature, farthest from the anchor $I_k$, of a congeneric text sample, $T_k$ represents a feature of the $k$-th text description of the person and is used as an anchor, $I_k^{n}$ represents the feature, closest to the anchor $T_k$, of a heterogeneous image sample, $I_k^{p}$ represents the feature, farthest from the anchor $T_k$, of a congeneric image sample, $\rho_1$ represents a boundary (margin) of the triplet loss, and $S(\cdot,\cdot)$ represents cosine similarity calculation;
c-2) calculating a cosine similarity between $I_k$ and $T_k$ according to a formula
$$S(I_k,T_k)=\frac{I_k^{\top}T_k}{\lVert I_k\rVert\,\lVert T_k\rVert}$$
where $I_k$ represents a feature of the $k$-th image in the shared subspace, and $T_k$ represents a feature of the $k$-th text description of the person in the shared subspace;
c-3) calculating a classification loss $L_{cls}(I_k)$ of the image feature $I_k$ in the shared subspace according to a formula
$$L_{cls}(I_k)=-\log\left(\frac{\exp\big(I_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(I_k^{\top}W_j+b_j\big)}\right)$$
where $I_k^{\top}$ represents the transposed image feature in the shared subspace, $W$ represents a classifier, $W\in\mathbb{R}^{d_1\times C}$, $d_1$ represents the feature dimension of the shared subspace, $C$ represents the quantity of ID information classes of the person, $y_k$ represents the ID information of $I_k$, $b$ represents a bias vector, $W_j$ represents the classification vector of the $j$-th class, $b_j$ represents the bias value of the $j$-th class, $W_{y_k}$ represents the classification vector corresponding to the $y_k$-th class, and $b_{y_k}$ represents the bias value of the $y_k$-th class; and calculating a classification loss $L_{cls}(T_k)$ of the text description feature of the person in the shared subspace according to a formula
$$L_{cls}(T_k)=-\log\left(\frac{\exp\big(T_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(T_k^{\top}W_j+b_j\big)}\right)$$
where $T_k^{\top}$ represents the transposed text feature in the shared subspace; and
c-4) calculating a loss function $L_{latent}(I,T)$ of the shared subspace according to a formula
$$L_{latent}(I,T)=\frac{1}{n}L_{trip}(I,T)+\frac{1}{n}\sum_{k}\big(L_{cls}(I_k)+L_{cls}(T_k)\big)$$
where $n$ represents the quantity of samples in one batch.
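A minimal PyTorch sketch of the step c) objective follows, assuming the image and text features arrive with one matching person ID per row and that features are L2-normalised so a matrix product yields cosine similarities. The margin value and function names are illustrative; `F.cross_entropy`'s mean reduction supplies the $1/n$ factor of c-4).

```python
# Sketch of the hard-sample triplet loss (c-1) and Softmax classification
# loss (c-3/c-4), assuming PyTorch.  Names and the margin are illustrative.
import torch
import torch.nn.functional as F

def hard_triplet_loss(img, txt, labels, margin=0.2):
    img, txt = F.normalize(img, dim=1), F.normalize(txt, dim=1)
    sim = img @ txt.t()                                  # S(I_k, T_j) for all pairs
    same = labels.unsqueeze(1) == labels.unsqueeze(0)    # same person ID?
    # image anchors: closest heterogeneous text, farthest congeneric text
    hard_neg_i = sim.masked_fill(same, float("-inf")).max(dim=1).values
    hard_pos_i = sim.masked_fill(~same, float("inf")).min(dim=1).values
    # text anchors: symmetric terms taken over the columns
    hard_neg_t = sim.masked_fill(same, float("-inf")).max(dim=0).values
    hard_pos_t = sim.masked_fill(~same, float("inf")).min(dim=0).values
    loss_i = F.relu(margin + hard_neg_i - hard_pos_i).sum()
    loss_t = F.relu(margin + hard_neg_t - hard_pos_t).sum()
    return loss_i + loss_t

def latent_loss(img, txt, labels, classifier, margin=0.2):
    # c-4): triplet term divided by batch size plus the mean ID classification term
    n = img.size(0)
    cls = F.cross_entropy(classifier(img), labels) + \
          F.cross_entropy(classifier(txt), labels)
    return hard_triplet_loss(img, txt, labels, margin) / n + cls
```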
Further, the step d) includes the following steps:
d-1) calculating a loss function $L_{coral}(I,T)$ according to a formula
$$L_{coral}(I,T)=\frac{1}{4d_1^{2}}\big\lVert C_I-C_T\big\rVert_F^{2}$$
where the image feature matrix $I$ is constituted by the features $I_k$ in a batch, the text description feature matrix $T$ of the person is constituted by the features $T_k$, $C_I$ and $C_T$ represent the covariance matrices of $I$ and $T$ respectively, $d_1$ represents the dimension of $I_k$ and $T_k$, and $\lVert\cdot\rVert_F$ represents the Frobenius norm;
d-2) calculating, according to a formula
$$t=\mathrm{sigmoid}\big(C\times U_c+F\times U_f\big)$$
the weights of the attribute feature and the image or text feature during feature fusion, where $C$ represents the to-be-fused attribute feature, $F$ represents the to-be-fused image or text feature, $U_c$ and $U_f$ are projection matrices, $t$ represents the weight, during feature fusion, obtained by adding up the projection results and processing the obtained result by using a sigmoid function, $U_c\in\mathbb{R}^{s\times d_a}$, $U_f\in\mathbb{R}^{d_a\times d_a}$, $s$ represents the quantity of image attribute classes or text attribute classes, and $d_a$ represents the feature dimension of the attribute space; and
d-3) calculating a fused feature $A$ according to a formula
$$A=t\times\big[C\times W_c\big]+(1-t)\times\big[F\times W_f\big]$$
where $W_c\in\mathbb{R}^{s\times d_a}$ and $W_f\in\mathbb{R}^{d_a\times d_a}$ represent projection matrices.
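The sketch below illustrates the step d) under the same assumptions: a CORAL-style loss that matches the covariance matrices of the two modalities (d-1), and a sigmoid-gated fusion of an attribute feature with an image or text feature (d-2, d-3). Layer sizes are placeholders, not the disclosed values.

```python
# Sketch of the CORAL distribution-alignment loss (d-1) and the gated
# attribute fusion (d-2/d-3), assuming PyTorch.  The linear layers stand in
# for the projection matrices U_c, U_f, W_c, W_f named in the text.
import torch
import torch.nn as nn

def coral_loss(img, txt):
    # d-1): squared Frobenius distance between the two covariance matrices
    d = img.size(1)
    def cov(x):
        x = x - x.mean(dim=0, keepdim=True)
        return x.t() @ x / (x.size(0) - 1)
    return ((cov(img) - cov(txt)) ** 2).sum() / (4 * d * d)

class GatedAttributeFusion(nn.Module):
    def __init__(self, attr_classes, feat_dim, attr_dim):
        super().__init__()
        self.u_c = nn.Linear(attr_classes, attr_dim, bias=False)   # U_c
        self.u_f = nn.Linear(feat_dim, attr_dim, bias=False)       # U_f
        self.w_c = nn.Linear(attr_classes, attr_dim, bias=False)   # W_c
        self.w_f = nn.Linear(feat_dim, attr_dim, bias=False)       # W_f

    def forward(self, attr, feat):
        t = torch.sigmoid(self.u_c(attr) + self.u_f(feat))    # d-2): fusion weight
        return t * self.w_c(attr) + (1 - t) * self.w_f(feat)  # d-3): fused feature A
```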
Further, the step e) includes the following steps:
e-1) calculating a triplet loss $L_{atrip}(I,T)$ of the attribute space according to a formula
$$L_{atrip}(I,T)=\sum_{I_k^{s}\in I}\max\big(\rho_2+S_a(I_k^{s},T_k^{sn})-S_a(I_k^{s},T_k^{sp}),\,0\big)+\sum_{T_k^{s}\in T}\max\big(\rho_2+S_a(T_k^{s},I_k^{sn})-S_a(T_k^{s},I_k^{sp}),\,0\big)$$
where $\rho_2$ represents a boundary (margin) of the triplet loss, $S_a(\cdot,\cdot)$ represents cosine similarity calculation, $I_k^{s}$ represents a feature of the $k$-th image in the attribute space and is used as an anchor, $T_k^{sn}$ represents the feature, closest to the anchor $I_k^{s}$, of a heterogeneous text sample, $T_k^{sp}$ represents the feature, farthest from the anchor $I_k^{s}$, of a congeneric text sample, $T_k^{s}$ represents a feature of the $k$-th text description of the person in the attribute space and is used as an anchor, $I_k^{sn}$ represents the feature, closest to the anchor $T_k^{s}$, of a heterogeneous image sample, and $I_k^{sp}$ represents the feature, farthest from the anchor $T_k^{s}$, of a congeneric image sample;
e-2) calculating a cosine similarity between $a_{I_k}$ and $a_{T_k}$ according to a formula
$$S_a(I_k,T_k)=\frac{a_{I_k}^{\top}a_{T_k}}{\lVert a_{I_k}\rVert\,\lVert a_{T_k}\rVert}$$
where $a_{I_k}$ and $a_{T_k}$ respectively represent an image feature with semantic information and a text feature with semantic information that are obtained after attribute information fusion in the attribute space; and
e-3) calculating a loss function $L_{attr}(I,T)$ of the attribute space according to a formula
$$L_{attr}(I,T)=\frac{L_{atrip}(I,T)}{n}+L_{coral}(I,T)$$
Further, the step f) includes the following steps:
f-1) calculating a loss function $L(I,T)$ of the dual-attribute network according to a formula
$$L(I,T)=L_{latent}(I,T)+L_{attr}(I,T)$$
f-2) calculating a similarity $A(I_k,T_k)$ between the dual attributes according to a formula
$$A(I_k,T_k)=A_l(I_k,T_k)+A_a(a_{I_k},a_{T_k})$$
where $A_l$ represents the similarity calculated between the features $I_k$ and $T_k$ learned from the shared subspace, and $A_a$ represents the similarity calculated between the features $a_{I_k}$ and $a_{T_k}$ learned from the attribute space; and
f-3) calculating cross-modality matching accuracy based on the similarity $A(I_k,T_k)$.
The present disclosure has the following beneficial effects: the cross-modality person re-identification method based on dual-attribute information extracts rich semantic information by making full use of data of two modalities.
A space construction and attribute fusion algorithm based on text and image attributes is provided.
An end-to-end cross-modality person re-identification network based on the hidden space and the attribute space is constructed to improve the semantic expressiveness of the feature extracted by using the model.
To resolve the problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person identification network based on the hidden space and the attribute space is proposed, greatly improving the semantic expressiveness of the extracted feature and making full use of the attribute information of the person.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flowchart of the present disclosure; FIG. 2 shows the changes of the loss functions in the model training process according to the present disclosure; and FIG. 3 compares the method in the present disclosure with existing methods in terms of Top-k accuracy on the CUHK-PEDES data set.
DETAILED DESCRIPTION OF THE EMBODIMENTS
The present disclosure is further described with reference to FIG. 1, FIG. 2, and FIG. 3. As shown in FIG. 1, a cross-modality person re-identification method based on dual-attribute information includes the following steps.
a) Extract a text description feature $T$ and an image feature $I$ of a person from content obtained by a surveillance camera. The present disclosure is intended to establish a semantic association between an image captured by the surveillance camera for a person in a real scenario and a corresponding text description of the person. Feature representations of the data of the two modalities need to be extracted separately. The image feature is extracted by using the currently popular convolutional neural network ResNet, and the text feature is extracted by using a bidirectional LSTM, so that text context information can be fully obtained.
b) Extract a text attribute feature $C_T$ from the extracted text description of the person, and extract an image attribute feature $C_I$ from the extracted image. To resolve the problem that the semantic expressiveness of a feature is poor because an existing method does not make full use of attribute information, the present disclosure is designed to use attribute information of the person as auxiliary information to improve the semantic expressiveness of the image and text features. An image attribute of the person is extracted by using an existing stable person-specific image attribute extraction model. A text attribute of the person comes from statistical information in a data set, and a noun phrase with a relatively high word frequency in the data set is used as the text attribute of the person in the present disclosure.
c) Input the text description feature and the image feature of the person in the step a) into shared subspace, calculate a triplet loss function of a hard sample, and calculate a classification loss of a feature in the shared subspace by using a Softmax loss function. Projection to shared vector space is a frequently used method for resolving a cross-modality retrieval problem. In the shared vector space, an association between data of two modalities can be established. The present disclosure projects the extracted image and text features to the shared vector subspace, and adopts metric learning to decrease a distance between image and text features with the same person information and increase a distance between image and text features belonging to different persons. The present disclosure uses a triplet loss of the hard sample to achieve the above purpose. That is, in a batch of data, a heterogeneous sample that is of the other modality and closest to the anchor data, and a congeneric sample that is of the other modality and farthest from the anchor data, need to be found.
d) Fuse the text description feature $T$ and the image feature $I$ of the person with the text attribute feature $C_T$ and the image attribute feature $C_I$. The existing method does not make full use of the auxiliary function of the attribute information or uses only attribute information of one modality, resulting in poor semantic expressiveness of the feature that can be extracted by using the model. To resolve this problem, the present disclosure uses the extracted dual-attribute information, namely, the image and text attributes. Considering that different attributes play different roles in image and text matching of the person, the present disclosure uses a weight mechanism to enable semantic information to play a more important role in feature fusion. The present disclosure uses a strategy of matrix projection to project the to-be-fused image and text features and attribute features to the same dimensional space, and then weights the two types of features to obtain image and text features fused with the semantic information. Before feature fusion, to avoid a large difference between the distributions of features of the two modalities, the present disclosure uses the frequently used loss function coral to minimize the difference between the distributions of data of the two modalities.
e) Construct feature attribute space based on the attribute information, which is referred to as attribute space in the present disclosure. The image and text features fused with the semantic information are also sent to the shared subspace. In the present disclosure, the image and text features with the same person information have a same semantic meaning by default. In the attribute space, the present disclosure still uses the triplet loss of the hard sample to establish a semantic association between image and text features that are of the person and are of different modalities.
f) Retrieve and match the extracted image feature and text description feature of the person. The finally extracted image and text features in the present disclosure include features extracted from hidden space and features extracted from the attribute space. When the extracted model features are retrieved and matched, a cosine distance is used to calculate a distance between two model features in feature space, to measure their similarity. To make ID information, of the person, learned from the hidden space and the semantic information, of the person, learned from the attribute space complementary, the present disclosure adds up similarity matrices of the two types of features before sorting.
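For the step f), retrieval might be sketched as follows: the cosine-similarity matrices computed in the hidden space and in the attribute space are added before sorting, and Rank-k (Top-k) accuracy is read off the sorted rows. Variable names and tensor shapes below are illustrative assumptions.

```python
# Sketch of dual-space retrieval (step f), assuming PyTorch tensors of query
# text features and gallery image features from both spaces, with one
# ground-truth person ID per row.  Names and shapes are illustrative.
import torch
import torch.nn.functional as F

def cosine_matrix(a, b):
    return F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()

def rank_k_accuracy(txt_latent, img_latent, txt_attr, img_attr,
                    txt_ids, img_ids, k=10):
    # add the similarity matrices of the two spaces before sorting
    sim = cosine_matrix(txt_latent, img_latent) + cosine_matrix(txt_attr, img_attr)
    top = sim.topk(k, dim=1).indices                   # best k gallery images per query
    hits = (img_ids[top] == txt_ids.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()                  # Top-k matching accuracy
```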
To resolve a problem that the existing cross-modality person re-identification method cannot effectively use the attribute information of the person as the auxiliary information to improve the semantic expressiveness of the image and text features, the present disclosure provides an efficient cross-modality person re-identification method based on dual-attribute information, to extract rich semantic information by making full use of data of two modalities, and provides a space construction and attribute fusion algorithm based on text and image attributes. An end-to-end cross-modality person re-identification network based on the hidden space and the attribute space is constructed to improve the semantic expressiveness of the feature extracted by using the model. To resolve a problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person identification network based on the hidden space and the attribute space is proposed, to greatly improve the semantic expressiveness of the extracted feature and make full use of the attribute information of the person.
Embodiment 1
The extracting a text description feature of a person in the step a) includes the following steps: a-1.1) Preprocess the text information when performing feature extraction on the text of the person; in other words, segment the words in a description statement of the content obtained by the surveillance camera, and establish a word frequency table.
a-1.2) Filter out a low-frequency word in the word frequency table.
a-1.3) Perform one-hot encoding to encode a word in the word frequency table.
a-1.4) Perform feature extraction on the text description of the person by using a bidirectional LSTM model. The bidirectional LSTM model can fully consider a context of each word, so that richer text features are learned.
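A compact sketch of the steps a-1.1) to a-1.4) follows: build the word frequency table, drop rare words, map the remaining vocabulary to indices (standing in for the one-hot encoding), and encode each description with a bidirectional LSTM. The frequency threshold and layer sizes are assumptions, not the values used in the disclosure.

```python
# Sketch of the text branch (a-1.1 to a-1.4), assuming PyTorch.
from collections import Counter
import torch
import torch.nn as nn

def build_vocab(descriptions, min_freq=2):
    # a-1.1): word segmentation and word frequency table
    freq = Counter(w for d in descriptions for w in d.lower().split())
    words = [w for w, c in freq.items() if c >= min_freq]   # a-1.2): drop rare words
    return {w: i + 1 for i, w in enumerate(words)}          # index 0 reserved for padding

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden=256):
        super().__init__()
        # a-1.3): an embedding lookup plays the role of the one-hot encoding
        self.embed = nn.Embedding(vocab_size + 1, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden, bidirectional=True,
                              batch_first=True)

    def forward(self, token_ids):
        out, _ = self.bilstm(self.embed(token_ids))   # a-1.4): context in both directions
        return out.mean(dim=1)                        # one pooled feature per description
```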
The extracting an image feature in the step a) includes the following steps: a-2.1) Perform feature extraction on the image by using a ResNet that has been pre-trained on an ImageNet data set.
a-2.2) Perform semantic segmentation on the extracted image, and perform, by using the ResNet in the step a-2.1), feature extraction on the image obtained after semantic segmentation.
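The image branch of the step a) might look like the sketch below, where an ImageNet-pre-trained ResNet-50 encodes both the full image and a person-masked copy obtained from semantic segmentation. Here `segment_person` is a hypothetical placeholder for any off-the-shelf segmenter, and concatenating the two features is an assumption rather than the disclosed design.

```python
# Sketch of the image branch (a-2.1/a-2.2), assuming PyTorch and torchvision.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)  # a-2.1): ImageNet pre-training
backbone.fc = nn.Identity()                                  # keep the 2048-d pooled feature
backbone.eval()

@torch.no_grad()
def image_features(images, segment_person):
    global_feat = backbone(images)        # feature of the full image
    masks = segment_person(images)        # (B,1,H,W) person masks, hypothetical helper
    part_feat = backbone(images * masks)  # a-2.2): feature of the segmented image
    return torch.cat([global_feat, part_feat], dim=1)
```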
Embodiment 2
Many efforts have been made for person-specific image attribute identification, and good effects have been achieved. The present disclosure uses a stable person-specific attribute identification model to extract the attributes contained in an image of a person and the possible attribute values in the data set. The step b) includes the following steps: b-1) Preprocess data of the text description of the person by using the NLTK tool library, and extract noun phrases constituted by an adjective plus a noun and noun phrases constituted by a plurality of superposed nouns.
b-2) Sort the extracted noun phrases based on word frequency, discard low-frequency phrases, and construct an attribute table by using the first 400 noun phrases, to obtain the text attribute feature $C_T$. b-3) Train on the PA-100K data set to obtain 26 prediction values for the image, and mark an image attribute with a prediction value greater than 0 as 1 and an image attribute with a prediction value less than 0 as 0, to obtain the image attribute feature $C_I$.
Embodiment 3
The present disclosure uses a frequently used shared subspace method in the field of cross-modality person re-identification to establish an association between the feature vectors of the two modalities.
The hidden space is set to enable both the image feature and the text feature of the person to have separability of a person ID, and to enable the image and text features to have a basic semantic association.
Considering that, in cross-modality person-specific image and text retrieval, a same person ID corresponds to a plurality of images and a plurality of corresponding text descriptions, the present disclosure designs the loss function to decrease a distance between an image and a text description that belong to a same person ID, and increase a distance between an image and text that belong to different person IDs.
Specifically, data of one modality is used as an anchor.
Data that is of another modality and belongs to a type the same as that of the anchor is used as a positive sample, and data belonging to a type different from that of the anchor is used as a negative sample.
This not only realizes classification, but also establishes, to a certain extent, a correspondence between an image and a text description that have same semantics but are of different modalities.
In an experiment, an image and a text description of a same person have same semantic information by default.
The step c) includes the following steps:
c-1) Calculate the triplet loss $L_{trip}(I,T)$ of the hard sample according to a formula
$$L_{trip}(I,T)=\sum_{I_k\in I}\max\big(\rho_1+S(I_k,T_k^{n})-S(I_k,T_k^{p}),\,0\big)+\sum_{T_k\in T}\max\big(\rho_1+S(T_k,I_k^{n})-S(T_k,I_k^{p}),\,0\big)$$
where $I_k$ represents a feature of the $k$-th image and is used as an anchor, $T_k^{n}$ represents the feature, closest to the anchor $I_k$, of a heterogeneous text sample, $T_k^{p}$ represents the feature, farthest from the anchor $I_k$, of a congeneric text sample, $T_k$ represents a feature of the $k$-th text description of the person and is used as an anchor, $I_k^{n}$ represents the feature, closest to the anchor $T_k$, of a heterogeneous image sample, $I_k^{p}$ represents the feature, farthest from the anchor $T_k$, of a congeneric image sample, $\rho_1$ represents a boundary (margin) of the triplet loss, and $S(\cdot,\cdot)$ represents cosine similarity calculation.
c-2) Calculate a cosine similarity between $I_k$ and $T_k$ according to a formula
$$S(I_k,T_k)=\frac{I_k^{\top}T_k}{\lVert I_k\rVert\,\lVert T_k\rVert}$$
where $I_k$ represents a feature of the $k$-th image in the shared subspace, and $T_k$ represents a feature of the $k$-th text description of the person in the shared subspace.
c-3) Calculate a classification loss $L_{cls}(I_k)$ of the image feature $I_k$ in the shared subspace according to a formula
$$L_{cls}(I_k)=-\log\left(\frac{\exp\big(I_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(I_k^{\top}W_j+b_j\big)}\right)$$
where $I_k^{\top}$ represents the transposed image feature in the shared subspace, $W$ represents a classifier, $W\in\mathbb{R}^{d_1\times C}$, $d_1$ represents the feature dimension of the shared subspace, $C$ represents the quantity of ID information classes of the person, $y_k$ represents the ID information of $I_k$, $b$ represents a bias vector, $W_j$ represents the classification vector of the $j$-th class, $b_j$ represents the bias value of the $j$-th class, $W_{y_k}$ represents the classification vector corresponding to the $y_k$-th class, and $b_{y_k}$ represents the bias value of the $y_k$-th class; and calculate a classification loss $L_{cls}(T_k)$ of the text description feature of the person in the shared subspace according to a formula
$$L_{cls}(T_k)=-\log\left(\frac{\exp\big(T_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(T_k^{\top}W_j+b_j\big)}\right)$$
where $T_k^{\top}$ represents the transposed text feature in the shared subspace.
c-4) Calculate a loss function $L_{latent}(I,T)$ of the shared subspace according to a formula
$$L_{latent}(I,T)=\frac{1}{n}L_{trip}(I,T)+\frac{1}{n}\sum_{k}\big(L_{cls}(I_k)+L_{cls}(T_k)\big)$$
where $n$ represents the quantity of samples in one batch.
Embodiment 4
Before the image and text features are fused with the attribute features, to avoid an excessive difference between the distributions of data of the two modalities, the present disclosure uses the coral function in transfer learning to decrease the distance between the data of the two modalities.
Specifically, the step d) includes the following steps:
d-1) Calculate a loss function $L_{coral}(I,T)$ according to a formula
$$L_{coral}(I,T)=\frac{1}{4d_1^{2}}\big\lVert C_I-C_T\big\rVert_F^{2}$$
where the image feature matrix $I$ is constituted by the features $I_k$ in a batch, the text description feature matrix $T$ of the person is constituted by the features $T_k$, $C_I$ and $C_T$ represent the covariance matrices of $I$ and $T$ respectively, $d_1$ represents the dimension of $I_k$ and $T_k$, and $\lVert\cdot\rVert_F$ represents the Frobenius norm.
d-2) Calculate, according to a formula
$$t=\mathrm{sigmoid}\big(C\times U_c+F\times U_f\big)$$
the weights of the attribute feature and the image or text feature during feature fusion, where $C$ represents the to-be-fused attribute feature, $F$ represents the to-be-fused image or text feature, $U_c$ and $U_f$ are projection matrices, $t$ represents the weight, during feature fusion, obtained by adding up the projection results and processing the obtained result by using a sigmoid function, $U_c\in\mathbb{R}^{s\times d_a}$, $U_f\in\mathbb{R}^{d_a\times d_a}$, $s$ represents the quantity of image attribute classes or text attribute classes, and $d_a$ represents the feature dimension of the attribute space.
d-3) Calculate a fused feature $A$ according to a formula
$$A=t\times\big[C\times W_c\big]+(1-t)\times\big[F\times W_f\big]$$
where $W_c\in\mathbb{R}^{s\times d_a}$ and $W_f\in\mathbb{R}^{d_a\times d_a}$ represent projection matrices.
Embodiment 5
In the hidden space, the triplet loss is used to establish the association between the image feature and the text feature. In the attribute space, the triplet loss of the hard sample is used to establish a semantic association between features of different modalities.
Therefore, the step e) includes the following steps:
e-1) Calculate a triplet loss $L_{atrip}(I,T)$ of the attribute space according to a formula
$$L_{atrip}(I,T)=\sum_{I_k^{s}\in I}\max\big(\rho_2+S_a(I_k^{s},T_k^{sn})-S_a(I_k^{s},T_k^{sp}),\,0\big)+\sum_{T_k^{s}\in T}\max\big(\rho_2+S_a(T_k^{s},I_k^{sn})-S_a(T_k^{s},I_k^{sp}),\,0\big)$$
where $\rho_2$ represents a boundary (margin) of the triplet loss, $S_a(\cdot,\cdot)$ represents cosine similarity calculation, $I_k^{s}$ represents a feature of the $k$-th image in the attribute space and is used as an anchor, $T_k^{sn}$ represents the feature, closest to the anchor $I_k^{s}$, of a heterogeneous text sample, $T_k^{sp}$ represents the feature, farthest from the anchor $I_k^{s}$, of a congeneric text sample, $T_k^{s}$ represents a feature of the $k$-th text description of the person in the attribute space and is used as an anchor, $I_k^{sn}$ represents the feature, closest to the anchor $T_k^{s}$, of a heterogeneous image sample, and $I_k^{sp}$ represents the feature, farthest from the anchor $T_k^{s}$, of a congeneric image sample.
e-2) Calculate a cosine similarity between $a_{I_k}$ and $a_{T_k}$ according to a formula
$$S_a(I_k,T_k)=\frac{a_{I_k}^{\top}a_{T_k}}{\lVert a_{I_k}\rVert\,\lVert a_{T_k}\rVert}$$
where $a_{I_k}$ and $a_{T_k}$ respectively represent an image feature with semantic information and a text feature with semantic information that are obtained after attribute information fusion in the attribute space.
e-3) Calculate a loss function $L_{attr}(I,T)$ of the attribute space according to a formula
$$L_{attr}(I,T)=\frac{L_{atrip}(I,T)}{n}+L_{coral}(I,T)$$
Embodiment 6
In a process of model learning, the hidden space and the attribute space are trained at the same time. The step f) includes the following steps:
f-1) Calculate a loss function $L(I,T)$ of the dual-attribute network according to a formula
$$L(I,T)=L_{latent}(I,T)+L_{attr}(I,T)$$
As shown in FIG. 2, the change curves of the three loss functions in the training process are roughly consistent, which proves the applicability and rationality of the present disclosure.
f-2) To make the ID information of the person learned from the hidden space and the semantic information of the person learned from the attribute space complementary in a test process, calculate a similarity $A(I_k,T_k)$ between the dual attributes according to a formula
$$A(I_k,T_k)=A_l(I_k,T_k)+A_a(a_{I_k},a_{T_k})$$
where $A_l$ represents the similarity calculated between the features $I_k$ and $T_k$ learned from the shared subspace, and $A_a$ represents the similarity calculated between the features $a_{I_k}$ and $a_{T_k}$ learned from the attribute space.
f-3) Calculate the cross-modality matching accuracy based on the finally obtained similarity $A(I_k,T_k)$. As shown in FIG. 3, the performance of the method in the present disclosure is significantly improved compared with the performance of the five existing methods listed in the table.
The above embodiments are only used for describing the technical solutions of the present disclosure and are not intended to limit the present disclosure. Although the present disclosure is described in detail with reference to the embodiments, those of ordinary skill in the art should understand that various modifications or equivalent substitutions may be made to the technical solutions of the present disclosure without departing from the spirit and scope of the technical solutions of the present disclosure, and such modifications or equivalent substitutions should be encompassed within the scope of the claims of the present disclosure.
Claims (8)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010805183.XA CN112001279B (en) | 2020-08-12 | 2020-08-12 | Cross-modal pedestrian re-identification method based on dual attribute information |
Publications (2)
Publication Number | Publication Date |
---|---|
NL2028092A NL2028092A (en) | 2021-07-28 |
NL2028092B1 true NL2028092B1 (en) | 2022-04-06 |
Family
ID=73464076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
NL2028092A NL2028092B1 (en) | 2020-08-12 | 2021-04-29 | Cross-modality person re-identification method based on dual-attribute information |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112001279B (en) |
NL (1) | NL2028092B1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507853B (en) * | 2020-12-02 | 2024-05-14 | 西北工业大学 | Cross-modal pedestrian re-recognition method based on mutual attention mechanism |
CN114612927B (en) * | 2020-12-09 | 2023-05-09 | 四川大学 | Pedestrian re-recognition method based on image text double-channel combination |
CN113627151B (en) * | 2021-10-14 | 2022-02-22 | 北京中科闻歌科技股份有限公司 | Cross-modal data matching method, device, equipment and medium |
CN114036336A (en) * | 2021-11-15 | 2022-02-11 | 上海交通大学 | Semantic division-based pedestrian image searching method based on visual text attribute alignment |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9400925B2 (en) * | 2013-11-15 | 2016-07-26 | Facebook, Inc. | Pose-aligned networks for deep attribute modeling |
GB201703602D0 (en) * | 2017-03-07 | 2017-04-19 | Selerio Ltd | Multi-Modal image search |
CN107562812B (en) * | 2017-08-11 | 2021-01-15 | 北京大学 | Cross-modal similarity learning method based on specific modal semantic space modeling |
CN109344266B (en) * | 2018-06-29 | 2021-08-06 | 北京大学深圳研究生院 | Dual-semantic-space-based antagonistic cross-media retrieval method |
US11138469B2 (en) * | 2019-01-15 | 2021-10-05 | Naver Corporation | Training and using a convolutional neural network for person re-identification |
CN109829430B (en) * | 2019-01-31 | 2021-02-19 | 中科人工智能创新技术研究院(青岛)有限公司 | Cross-modal pedestrian re-identification method and system based on heterogeneous hierarchical attention mechanism |
CN110021051B (en) * | 2019-04-01 | 2020-12-15 | 浙江大学 | Human image generation method based on generation of confrontation network through text guidance |
CN110321813B (en) * | 2019-06-18 | 2023-06-20 | 南京信息工程大学 | Cross-domain pedestrian re-identification method based on pedestrian segmentation |
CN110909605B (en) * | 2019-10-24 | 2022-04-26 | 西北工业大学 | Cross-modal pedestrian re-identification method based on contrast correlation |
- 2020-08-12: CN application CN202010805183.XA (patent CN112001279B), status: active
- 2021-04-29: NL application NL2028092A (patent NL2028092B1), status: active
Also Published As
Publication number | Publication date |
---|---|
CN112001279A (en) | 2020-11-27 |
NL2028092A (en) | 2021-07-28 |
CN112001279B (en) | 2022-02-01 |
Similar Documents
Publication | Title |
---|---|
NL2028092B1 (en) | Cross-modality person re-identification method based on dual-attribute information | |
Srihari | Automatic indexing and content-based retrieval of captioned images | |
CN104063683B (en) | Expression input method and device based on face identification | |
US8024343B2 (en) | Identifying unique objects in multiple image collections | |
CN110826337A (en) | Short text semantic training model obtaining method and similarity matching algorithm | |
US20070286497A1 (en) | System and Method for Comparing Images using an Edit Distance | |
CN114743020B (en) | Food identification method combining label semantic embedding and attention fusion | |
EP2005366A2 (en) | Forming connections between image collections | |
Carneiro et al. | A database centric view of semantic image annotation and retrieval | |
CN113688894A (en) | Fine-grained image classification method fusing multi-grained features | |
CN111046732A (en) | Pedestrian re-identification method based on multi-granularity semantic analysis and storage medium | |
CN114036336A (en) | Semantic division-based pedestrian image searching method based on visual text attribute alignment | |
CN114611672B (en) | Model training method, face recognition method and device | |
CN112990120B (en) | Cross-domain pedestrian re-identification method using camera style separation domain information | |
WO2006122164A2 (en) | System and method for enabling the use of captured images through recognition | |
CN113177612A (en) | Agricultural pest image identification method based on CNN few samples | |
CN112347223A (en) | Document retrieval method, document retrieval equipment and computer-readable storage medium | |
CN111783903A (en) | Text processing method, text model processing method and device and computer equipment | |
CN116152870A (en) | Face recognition method, device, electronic equipment and computer readable storage medium | |
CN112463922A (en) | Risk user identification method and storage medium | |
CN113158777A (en) | Quality scoring method, quality scoring model training method and related device | |
CN107273859B (en) | Automatic photo marking method and system | |
CN111260114A (en) | Low-frequency confusable criminal name prediction method for integrating case auxiliary sentence | |
CN113157974B (en) | Pedestrian retrieval method based on text expression | |
Sahbi et al. | From coarse to fine skin and face detection |