NL2028092B1 - Cross-modality person re-identification method based on dual-attribute information - Google Patents

Cross-modality person re-identification method based on dual-attribute information Download PDF

Info

Publication number
NL2028092B1
NL2028092B1 NL2028092A NL2028092A NL2028092B1 NL 2028092 B1 NL2028092 B1 NL 2028092B1 NL 2028092 A NL2028092 A NL 2028092A NL 2028092 A NL2028092 A NL 2028092A NL 2028092 B1 NL2028092 B1 NL 2028092B1
Authority
NL
Netherlands
Prior art keywords
attribute
feature
image
text
person
Prior art date
Application number
NL2028092A
Other languages
Dutch (nl)
Other versions
NL2028092A (en)
Inventor
Wang Yinglong
Song Xuemeng
Gao Zan
Nie Liqiang
Chen Lin
Original Assignee
Shandong Artificial Intelligence Inst
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Artificial Intelligence Inst filed Critical Shandong Artificial Intelligence Inst
Publication of NL2028092A publication Critical patent/NL2028092A/en
Application granted granted Critical
Publication of NL2028092B1 publication Critical patent/NL2028092B1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • G06V40/25Recognition of walking or running movements, e.g. gait recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06N3/105Shells for specifying net layout

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a cross-modality person re-identification method based on dual-attribute information, which extracts rich semantic information by making full use of data of two modalities, and provides a space construction and attribute fusion algorithm based on text and image attributes. An end-to-end cross-modality person re-identification network based on hidden space and attribute space is constructed to improve the semantic expressiveness of the features extracted by the model. To resolve the problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person re-identification network based on the hidden space and the attribute space is proposed, greatly improving the semantic expressiveness of the extracted features and making full use of the attribute information of a person.

Description

CROSS-MODALITY PERSON RE-IDENTIFICATION METHOD BASED ON DUAL-ATTRIBUTE INFORMATION
TECHNICAL FIELD The present disclosure relates to the fields of computer vision and deep learning, and specifically, to a cross-modality person re-identification method based on dual-attribute information.
BACKGROUND In the information age, video surveillance plays an invaluable role in maintaining public safety. Person re-identification is a crucial subtask in a video surveillance scenario, and is intended to find photos of the same person from image data generated by different surveillance cameras. Public safety monitoring facilities are increasingly widely deployed, resulting in massive image data collection. How to quickly and accurately find a target person in the massive image data is a research hotspot in the field of computer vision. However, in some specific emergency scenarios, an image matching a to-be-found person cannot be provided in time as a basis for retrieval, and only an oral description can be provided. Therefore, cross-modality person re-identification based on a text description emerges. Cross-modality person re-identification is to find, in an image library and based on a natural language description of a person, the image most conforming to the text description. With the development of deep learning technologies and their superior performance in different tasks, researchers have proposed some deep learning-based cross-modality person re-identification algorithms. These algorithms can be roughly classified into: 1) semantic intimacy value calculation methods, which calculate an intimacy value of the semantic association between an image and text, to improve intimacy between an image and text that belong to the same class, and reduce intimacy between an image and text that belong to different classes; and 2) subspace methods, which establish a shared feature expression space for images and text, and use a metric learning strategy in the shared feature expression space to decrease the distance between image and text features belonging to the same person identity (ID) and to increase the distance between image and text features belonging to different person IDs. However, the semantic expressiveness of the features extracted by these methods still needs to be improved: they ignore or do not fully consider the effectiveness of using attribute information of persons to express semantic concepts.
SUMMARY To overcome the disadvantages of the above technology, the present disclosure provides a cross-modality person re-identification method using a space construction and attribute fusion algorithm based on text and image attributes. The technical solution used in the present disclosure to resolve the technical problem thereof is as follows: A cross-modality person re-identification method based on dual-attribute information includes the following steps: a) extracting a text description feature $T$ and an image feature $I$ of a person from content obtained by a surveillance camera; b) extracting a text attribute feature $C_T$ from the extracted text description of the person, and extracting an image attribute feature $C_I$ from the extracted image; c) inputting the text description feature and the image feature of the person in the step a) into a shared subspace, calculating a triplet loss function of hard samples, and calculating a classification loss of the features in the shared subspace by using a Softmax loss function; d) fusing the text description feature $T$ and the image feature $I$ of the person with the text attribute feature $C_T$ and the image attribute feature $C_I$; e) constructing a feature attribute space based on attribute information; and f) retrieving and matching the extracted image feature and text description feature of the person. Further, the extracting a text description feature of a person in the step a) includes the following steps:
a-1.1) segmenting words in a description statement of the content obtained by the surveillance camera, and establishing a word frequency table; a-1.2) filtering out low-frequency words from the word frequency table; a-1.3) performing one-hot encoding on each word in the word frequency table; and a-1.4) performing feature extraction on the text description of the person by using a bidirectional long short-term memory (LSTM) model.
Further, the extracting an image feature in the step a) includes the following steps: a-2.1) performing feature extraction on the image by using a ResNet that has been pre-trained on the ImageNet data set; and a-2.2) performing semantic segmentation on the extracted image, and performing, by using the ResNet in the step a-2.1), feature extraction on the image obtained after semantic segmentation.
Further, the step b) includes the following steps: b-1) preprocessing the data of the text description of the person by using the natural language toolkit (NLTK) tool library, and extracting noun phrases constituted by an adjective plus a noun and noun phrases constituted by a plurality of superposed nouns; b-2) sorting the extracted noun phrases based on word frequency, discarding low-frequency phrases, and constructing an attribute table from the first 400 noun phrases, to obtain the text attribute feature $C_T$; and b-3) training on the PA-100K data set to obtain 26 prediction values for the image, and marking an image attribute with a prediction value greater than 0 as 1 and an image attribute with a prediction value less than 0 as 0, to obtain the image attribute feature $C_I$. Further, the step c) includes the following steps:
c-1) calculating the triplet loss of hard samples according to the formula
$$L_{trip}(I,T)=\sum_{I_k\in I}\max\left(\rho_1+S(I_k,T_k^{n})-S(I_k,T_k^{p}),\,0\right)+\sum_{T_k\in T}\max\left(\rho_1+S(T_k,I_k^{n})-S(T_k,I_k^{p}),\,0\right),$$
where $I_k$ represents the feature of the $k$-th image and is used as an anchor, $T_k^{n}$ represents the feature, closest to the anchor $I_k$, of a heterogeneous text sample, $T_k^{p}$ represents the feature, farthest from the anchor $I_k$, of a congeneric text sample, $T_k$ represents the feature of the $k$-th text description of the person and is used as an anchor, $I_k^{n}$ represents the feature, closest to the anchor $T_k$, of a heterogeneous image sample, $I_k^{p}$ represents the feature, farthest from the anchor $T_k$, of a congeneric image sample, $\rho_1$ represents the margin of the triplet loss, and $S(\cdot)$ represents cosine similarity; c-2) calculating the cosine similarity between $l_{I_k}$ and $l_{T_k}$ according to the formula
$$S(I_k,T_k)=\frac{l_{I_k}\cdot l_{T_k}}{\|l_{I_k}\|\,\|l_{T_k}\|},$$
where $l_{I_k}$ represents the feature of the $k$-th image in the shared subspace, and $l_{T_k}$ represents the feature of the $k$-th text description of the person in the shared subspace; c-3) calculating the classification loss $L_{cls}(I_k)$ of the image feature $I_k$ in the shared subspace according to the formula
$$L_{cls}(I_k)=-\log\!\left(\frac{\exp\!\left(l_{I_k}^{\top}W_{y_k}+b_{y_k}\right)}{\sum_{j=1}^{C}\exp\!\left(l_{I_k}^{\top}W_{j}+b_{j}\right)}\right),$$
where $l_{I_k}^{\top}$ represents the transposed image feature in the shared subspace, $W\in\mathbb{R}^{dl\times C}$ represents the classifier, $dl$ represents the feature dimension of the shared subspace, $C$ represents the number of person ID classes, $y_k$ represents the ID information of $I_k$, $b$ represents the bias vector, $W_{j}$ represents the classification vector of the $j$-th class, $b_{j}$ represents the bias value of the $j$-th class, $W_{y_k}$ represents the classification vector of the $y_k$-th class, and $b_{y_k}$ represents the bias value of the $y_k$-th class; and calculating the classification loss $L_{cls}(T_k)$ of the text description feature $T_k$ of the person in the shared subspace according to the formula
$$L_{cls}(T_k)=-\log\!\left(\frac{\exp\!\left(l_{T_k}^{\top}W_{y_k}+b_{y_k}\right)}{\sum_{j=1}^{C}\exp\!\left(l_{T_k}^{\top}W_{j}+b_{j}\right)}\right),$$
where $l_{T_k}^{\top}$ represents the transposed text feature in the shared subspace; and c-4) calculating the loss function $L_{latent}(I,T)$ of the shared subspace according to the formula
$$L_{latent}(I,T)=\frac{1}{n}L_{trip}(I,T)+\frac{1}{n}\sum_{k}\left(L_{cls}(I_k)+L_{cls}(T_k)\right),$$
where $n$ represents the number of samples in one batch.
Further, the step d) includes the following steps: d-1) calculating the loss function $L_{coral}(I,T)$ according to the formula
$$L_{coral}(I,T)=\frac{1}{4v^{2}}\left\|C_{I}-C_{T}\right\|_{F}^{2},$$
where $C_{I}$ and $C_{T}$ denote the covariance matrices of the image features and the text features, the image feature $I$ is constituted by the $I_k$, the text description feature $T$ of the person is constituted by the $T_k$, $v$ represents the dimension of $I_k$ and $T_k$, and $\|\cdot\|_{F}$ represents the Frobenius norm; d-2) calculating, according to the formula $t=\mathrm{sigmoid}(C\times U_{c}+F\times U_{f})$, the weights of the attribute feature and the image or text feature during feature fusion, where $C$ represents the to-be-fused attribute feature, $F$ represents the to-be-fused image or text feature, $U_{c}\in\mathbb{R}^{s\times da}$ and $U_{f}\in\mathbb{R}^{da\times da}$ are projection matrices, $t$ represents the weight, during feature fusion, obtained by adding up the projection results and processing the result with a sigmoid function, $s$ represents the number of image attribute classes or text attribute classes, and $da$ represents the feature dimension of the attribute space; and d-3) calculating the fused feature $A$ according to the formula
$$A=t\times\left[C\times W_{c}\right]+(1-t)\times\left[F\times W_{f}\right],$$
where $W_{c}\in\mathbb{R}^{s\times da}$ and $W_{f}\in\mathbb{R}^{da\times da}$ represent projection matrices. Further, the step e) includes the following steps:
e-1) calculating the triplet loss $L_{atrip}(I,T)$ of the attribute space according to the formula
$$L_{atrip}(I,T)=\sum_{I_k^{s}\in I^{s}}\max\left(\rho_2+S_a\!\left(I_k^{s},T_k^{sn}\right)-S_a\!\left(I_k^{s},T_k^{sp}\right),\,0\right)+\sum_{T_k^{s}\in T^{s}}\max\left(\rho_2+S_a\!\left(T_k^{s},I_k^{sn}\right)-S_a\!\left(T_k^{s},I_k^{sp}\right),\,0\right),$$
where $\rho_2$ represents the margin of the triplet loss, $S_a(\cdot)$ represents cosine similarity, $I_k^{s}$ represents the feature of the $k$-th image in the attribute space and is used as an anchor, $T_k^{sn}$ represents the feature, closest to the anchor $I_k^{s}$, of a heterogeneous text sample, $T_k^{sp}$ represents the feature, farthest from the anchor $I_k^{s}$, of a congeneric text sample, $T_k^{s}$ represents the feature of the $k$-th text description of the person in the attribute space and is used as an anchor, $I_k^{sn}$ represents the feature, closest to the anchor $T_k^{s}$, of a heterogeneous image sample, and $I_k^{sp}$ represents the feature, farthest from the anchor $T_k^{s}$, of a congeneric image sample; e-2) calculating the cosine similarity between $a_{I_k}$ and $a_{T_k}$ according to the formula
$$S_a(I_k,T_k)=\frac{a_{I_k}\cdot a_{T_k}}{\|a_{I_k}\|\,\|a_{T_k}\|},$$
where $a_{I_k}$ and $a_{T_k}$ respectively represent the image feature with semantic information and the text feature with semantic information that are obtained after attribute information fusion in the attribute space; and e-3) calculating the loss function $L_{attr}(I,T)$ of the attribute space according to the formula
$$L_{attr}(I,T)=\frac{L_{atrip}(I,T)+L_{coral}(I,T)}{n}.$$
Further, the step f) includes the following steps: f-1) calculating the loss function $L(I,T)$ of the dual-attribute network according to the formula $L(I,T)=L_{latent}(I,T)+L_{attr}(I,T)$; f-2) calculating the similarity $A(I_k,T_k)$ between the dual attributes according to the formula $A(I_k,T_k)=A_l(l_{I_k},l_{T_k})+A_c(a_{I_k},a_{T_k})$, where $A_l$ represents the calculated similarity between the features $l_{I_k}$ and $l_{T_k}$ learned from the shared subspace, and $A_c$ represents the calculated similarity between the features $a_{I_k}$ and $a_{T_k}$ learned from the attribute space; and f-3) calculating the cross-modality matching accuracy based on the similarity $A(I_k,T_k)$. The present disclosure has the following beneficial effects: the cross-modality person re-identification method based on dual-attribute information extracts rich semantic information by making full use of data of two modalities.
A space construction and attribute fusion algorithm based on text and image attributes is provided.
An end-to-end cross-modality person re-identification network based on the hidden space and the attribute space is constructed to improve the semantic expressiveness of the features extracted by the model. To resolve the problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person re-identification network based on the hidden space and the attribute space is proposed, greatly improving the semantic expressiveness of the extracted features and making full use of the attribute information of the person.
BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a flowchart of the present disclosure; FIG. 2 shows the changes of the loss functions during model training according to the present disclosure; and FIG. 3 compares the method in the present disclosure with existing methods in terms of Top-k accuracy on the CUHK-PEDES data set.
DETAILED DESCRIPTION OF THE EMBODIMENTS The present disclosure is further described with reference to FIG. 1, FIG. 2, and FIG. 3. As shown in FIG. 1, a cross-modality person re-identification method based on dual-attribute information includes the following steps. a) Extract a text description feature $T$ and an image feature $I$ of a person from content obtained by a surveillance camera. The present disclosure is intended to establish a semantic association between an image captured by the surveillance camera for a person in a real scenario and the corresponding text description of the person. Feature representations of the data of the two modalities need to be extracted separately. The image feature is extracted by using the widely used convolutional neural network ResNet, and the text feature is extracted by using a bidirectional LSTM, so that text context information can be fully captured. b) Extract a text attribute feature $C_T$ from the extracted text description of the person, and extract an image attribute feature $C_I$ from the extracted image. To resolve the problem that the semantic expressiveness of a feature is poor because existing methods do not make full use of attribute information, the present disclosure uses the attribute information of the person as auxiliary information to improve the semantic expressiveness of the image and text features. The image attribute of the person is extracted by using an existing stable person-specific image attribute extraction model. The text attribute of the person comes from statistical information in the data set: noun phrases with a relatively high word frequency in the data set are used as the text attributes of the person in the present disclosure.
c) Input the text description feature and the image feature of the person in the step a) into the shared subspace, calculate the triplet loss function of hard samples, and calculate the classification loss of the features in the shared subspace by using a Softmax loss function. Projection into a shared vector space is a frequently used method for resolving cross-modality retrieval problems. In the shared vector space, an association between the data of the two modalities can be established. The present disclosure projects the extracted image and text features into the shared vector subspace, and adopts metric learning to decrease the distance between image and text features with the same person information and increase the distance between image and text features belonging to different persons. The present disclosure uses the triplet loss of hard samples to achieve this purpose; that is, in a batch of data, the heterogeneous sample of the other modality that is closest to the anchor data, and the congeneric sample of the other modality that is farthest from the anchor data, need to be found.
d) Fuse the text description feature $T$ and the image feature $I$ of the person with the text attribute feature $C_T$ and the image attribute feature $C_I$. The existing methods do not make full use of the auxiliary function of the attribute information, or use only the attribute information of one modality, resulting in poor semantic expressiveness of the features that can be extracted by the model. To resolve this problem, the present disclosure uses the extracted dual-attribute information, namely, the image and text attributes. Considering that different attributes play different roles in image and text matching of the person, the present disclosure uses a weight mechanism to enable semantic information to play a more important role in feature fusion. The present disclosure uses a matrix projection strategy to project the to-be-fused image and text features and attribute features into the same dimensional space, and then weights the two types of features to obtain image and text features fused with the semantic information. Before feature fusion, to avoid a large difference between the distributions of the features of the two modalities, the present disclosure uses the frequently used loss function coral to minimize the difference between the distributions of the data of the two modalities.
e) Construct a feature attribute space based on the attribute information, which is referred to as the attribute space in the present disclosure. The image and text features fused with the semantic information are also sent to a shared subspace. In the present disclosure, the image and text features with the same person information have the same semantic meaning by default. In the attribute space, the present disclosure still uses the triplet loss of hard samples to establish a semantic association between image and text features of the person that are of different modalities.
f) Retrieve and match the extracted image feature and text description feature of the person. The finally extracted image and text features in the present disclosure include the features extracted from the hidden space and the features extracted from the attribute space. When the extracted features are retrieved and matched, the cosine distance between two features in the feature space is used to measure their similarity. To make the person ID information learned from the hidden space and the person semantic information learned from the attribute space complementary, the present disclosure adds up the similarity matrices of the two types of features before sorting.
To resolve the problem that the existing cross-modality person re-identification methods cannot effectively use the attribute information of the person as auxiliary information to improve the semantic expressiveness of the image and text features, the present disclosure provides an efficient cross-modality person re-identification method based on dual-attribute information, which extracts rich semantic information by making full use of data of two modalities, and provides a space construction and attribute fusion algorithm based on text and image attributes. An end-to-end cross-modality person re-identification network based on the hidden space and the attribute space is constructed to improve the semantic expressiveness of the features extracted by the model. To resolve the problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person re-identification network based on the hidden space and the attribute space is proposed, to greatly improve the semantic expressiveness of the extracted features and make full use of the attribute information of the person.
Embodiment 1 The extracting a text description feature of a person in the step a) includes the following steps: a-1.1) Preprocess the text information when performing feature extraction on the text of the person; in other words, segment words in a description statement of the content obtained by the surveillance camera, and establish a word frequency table.
a-1.2) Filter out a low-frequency word in the word frequency table.
a-1.3) Perform one-hot encoding to encode a word in the word frequency table.
a-1.4) Perform feature extraction on the text description of the person by using a bidirectional LSTM model. The bidirectional LSTM model can fully consider the context of each word, so that richer text features are learned.
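The following is a minimal PyTorch sketch of a bidirectional LSTM text encoder as described in steps a-1.1) to a-1.4); the vocabulary size, embedding dimension, hidden size, and mean pooling over time are illustrative assumptions rather than values or choices prescribed by the present disclosure.

```python
# Minimal sketch (assumption: PyTorch). Vocabulary size, embedding and hidden
# dimensions, and the pooling strategy are illustrative.
import torch
import torch.nn as nn

class BiLSTMTextEncoder(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=512):
        super().__init__()
        # One-hot words are mapped through an embedding layer before the LSTM.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer indices from the word frequency table
        embedded = self.embedding(token_ids)
        outputs, _ = self.lstm(embedded)          # (batch, seq_len, 2 * hidden_dim)
        # Mean-pool over time to obtain one text description feature per sentence.
        return outputs.mean(dim=1)                # (batch, 2 * hidden_dim)

# Usage: a batch of two descriptions, each padded to length 20.
encoder = BiLSTMTextEncoder()
dummy_ids = torch.randint(1, 5000, (2, 20))
text_feature = encoder(dummy_ids)                # shape (2, 1024)
```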
The extracting an image feature in the step a) includes the following steps: a-2.1) Perform feature extraction on the image by using a ResNet that has been pre-trained on an ImageNet data set.
a-2.2) Perform semantic segmentation on the extracted image, and perform, by using the ResNet in the step a-2.1), feature extraction on the image obtained after semantic segmentation.
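A minimal sketch of step a-2.1), assuming a torchvision ResNet-50 backbone with global average pooling; the disclosure does not fix the ResNet depth, and the semantic segmentation of step a-2.2) is represented here only by a placeholder mask.

```python
# Minimal sketch (assumptions: PyTorch/torchvision, ResNet-50, global average
# pooling; downloading pretrained weights requires network access).
import torch
import torch.nn as nn
from torchvision import models

class ResNetImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Keep everything up to (and including) the global average pooling layer.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, images):
        # images: (batch, 3, H, W), ImageNet-normalized
        feats = self.features(images)             # (batch, 2048, 1, 1)
        return feats.flatten(1)                   # (batch, 2048)

encoder = ResNetImageEncoder().eval()
images = torch.randn(2, 3, 256, 128)              # typical person-crop resolution
person_mask = torch.ones_like(images)             # placeholder for a segmentation mask
with torch.no_grad():
    global_feat = encoder(images)                 # feature of the whole image
    masked_feat = encoder(images * person_mask)   # feature after semantic segmentation
```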
Embodiment 2 Many efforts have been made on person-specific image attribute identification, and good results have been achieved. The present disclosure uses a stable person-specific attribute identification model to extract the attributes contained in an image of a person and the possible attribute values in the data set. The step b) includes the following steps: b-1) Preprocess the data of the text description of the person by using the NLTK tool library, and extract noun phrases constituted by an adjective plus a noun and noun phrases constituted by a plurality of superposed nouns.
b-2) Sort the extracted noun phrases based on word frequency, discard low-frequency phrases, and construct an attribute table from the first 400 noun phrases, to obtain the text attribute feature $C_T$. b-3) Train on the PA-100K data set to obtain 26 prediction values for the image, and mark an image attribute with a prediction value greater than 0 as 1 and an image attribute with a prediction value less than 0 as 0, to obtain the image attribute feature $C_I$.
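For steps b-1) and b-2), the sketch below shows one possible way to mine "adjective + noun" and stacked-noun phrases with NLTK and keep the 400 most frequent phrases as the text attribute table; the chunk grammar, tokenizer, and required NLTK resources are assumptions, not the exact pipeline of the disclosure.

```python
# Minimal sketch (assumptions: NLTK with the 'punkt' and
# 'averaged_perceptron_tagger' resources downloaded; illustrative chunk grammar).
from collections import Counter
import nltk

# Noun phrases: one or more adjectives followed by nouns, or several stacked nouns.
GRAMMAR = r"""
NP: {<JJ>+<NN.*>+}
    {<NN.*><NN.*>+}
"""
chunker = nltk.RegexpParser(GRAMMAR)

def extract_noun_phrases(sentence):
    tokens = nltk.word_tokenize(sentence.lower())
    tagged = nltk.pos_tag(tokens)
    tree = chunker.parse(tagged)
    phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        phrases.append(" ".join(word for word, _ in subtree.leaves()))
    return phrases

def build_attribute_table(descriptions, top_k=400):
    counts = Counter()
    for desc in descriptions:
        counts.update(extract_noun_phrases(desc))
    # Keep the top_k most frequent phrases as the text attribute vocabulary.
    return [phrase for phrase, _ in counts.most_common(top_k)]

table = build_attribute_table(["the woman wears a red jacket and blue jeans"])
```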
Embodiment 3 The present disclosure uses the shared subspace method, frequently used in the field of cross-modality person re-identification, to establish an association between the feature vectors of the two modalities. The hidden space is set to enable both the image feature and the text feature of the person to have person-ID separability, and to enable the image and text features to have a basic semantic association.
Considering that, in cross-modality person-specific image and text retrieval, a same person ID corresponds to a plurality of images and a plurality of corresponding text descriptions, the present disclosure designs the loss function to decrease a distance between an image and a text description that belong to a same person ID, and increase a distance between an image and text that belong to different person IDs.
Specifically, data of one modality is used as an anchor.
Data that is of another modality and belongs to a type the same as that of the anchor is used as a positive sample, and data belonging to a type different from that of the anchor is used as a negative sample.
This not only realizes classification, but also establishes, to a certain extent, a correspondence between an image and a text description that have same semantics but are of different modalities.
In an experiment, an image and a text description of a same person have same semantic information by default.
The step c) includes the following steps:
c-1) Calculate the triplet loss of hard samples according to the formula
$$L_{trip}(I,T)=\sum_{I_k\in I}\max\left(\rho_1+S(I_k,T_k^{n})-S(I_k,T_k^{p}),\,0\right)+\sum_{T_k\in T}\max\left(\rho_1+S(T_k,I_k^{n})-S(T_k,I_k^{p}),\,0\right),$$
where $I_k$ represents the feature of the $k$-th image and is used as an anchor, $T_k^{n}$ represents the feature, closest to the anchor $I_k$, of a heterogeneous text sample, $T_k^{p}$ represents the feature, farthest from the anchor $I_k$, of a congeneric text sample, $T_k$ represents the feature of the $k$-th text description of the person and is used as an anchor, $I_k^{n}$ represents the feature, closest to the anchor $T_k$, of a heterogeneous image sample, $I_k^{p}$ represents the feature, farthest from the anchor $T_k$, of a congeneric image sample, $\rho_1$ represents the margin of the triplet loss, and $S(\cdot)$ represents cosine similarity.
c-2) Calculate the cosine similarity between $l_{I_k}$ and $l_{T_k}$ according to the formula
$$S(I_k,T_k)=\frac{l_{I_k}\cdot l_{T_k}}{\|l_{I_k}\|\,\|l_{T_k}\|},$$
where $l_{I_k}$ represents the feature of the $k$-th image in the shared subspace, and $l_{T_k}$ represents the feature of the $k$-th text description of the person in the shared subspace.
c-3) Calculate the classification loss $L_{cls}(I_k)$ of the image feature $I_k$ in the shared subspace according to the formula
$$L_{cls}(I_k)=-\log\!\left(\frac{\exp\!\left(l_{I_k}^{\top}W_{y_k}+b_{y_k}\right)}{\sum_{j=1}^{C}\exp\!\left(l_{I_k}^{\top}W_{j}+b_{j}\right)}\right),$$
where $l_{I_k}^{\top}$ represents the transposed image feature in the shared subspace, $W\in\mathbb{R}^{dl\times C}$ represents the classifier, $dl$ represents the feature dimension of the shared subspace, $C$ represents the number of person ID classes, $y_k$ represents the ID information of $I_k$, $b$ represents the bias vector, $W_{j}$ represents the classification vector of the $j$-th class, $b_{j}$ represents the bias value of the $j$-th class, $W_{y_k}$ represents the classification vector of the $y_k$-th class, and $b_{y_k}$ represents the bias value of the $y_k$-th class; and calculate the classification loss $L_{cls}(T_k)$ of the text description feature $T_k$ of the person in the shared subspace according to the formula
$$L_{cls}(T_k)=-\log\!\left(\frac{\exp\!\left(l_{T_k}^{\top}W_{y_k}+b_{y_k}\right)}{\sum_{j=1}^{C}\exp\!\left(l_{T_k}^{\top}W_{j}+b_{j}\right)}\right),$$
where $l_{T_k}^{\top}$ represents the transposed text feature in the shared subspace.
c-4) Calculate the loss function $L_{latent}(I,T)$ of the shared subspace according to the formula
$$L_{latent}(I,T)=\frac{1}{n}L_{trip}(I,T)+\frac{1}{n}\sum_{k}\left(L_{cls}(I_k)+L_{cls}(T_k)\right),$$
where $n$ represents the number of samples in one batch.
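A minimal PyTorch sketch of the shared-subspace objective of steps c-1) to c-4): cosine similarity, the bidirectional hard-sample triplet loss with margin $\rho_1$, and the Softmax ID classification loss. The margin value, feature dimension, and number of ID classes in the usage example are illustrative assumptions.

```python
# Minimal sketch (assumption: PyTorch). img_feat / txt_feat are the shared-subspace
# features of one batch; labels are the person IDs y_k.
import torch
import torch.nn.functional as F

def hard_triplet_loss(img_feat, txt_feat, labels, margin=0.2):
    """Bidirectional hard-sample triplet loss with cosine similarity (c-1, c-2)."""
    img = F.normalize(img_feat, dim=1)
    txt = F.normalize(txt_feat, dim=1)
    sim = img @ txt.t()                              # S(I_k, T_j), shape (n, n)
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)

    # Image as anchor: hardest negative text (max similarity over different IDs),
    # hardest positive text (min similarity over the same ID).
    neg_i2t = sim.masked_fill(same_id, float("-inf")).max(dim=1).values
    pos_i2t = sim.masked_fill(~same_id, float("inf")).min(dim=1).values
    loss_i2t = F.relu(margin + neg_i2t - pos_i2t)

    # Text as anchor: same construction on the transposed similarity matrix.
    neg_t2i = sim.t().masked_fill(same_id, float("-inf")).max(dim=1).values
    pos_t2i = sim.t().masked_fill(~same_id, float("inf")).min(dim=1).values
    loss_t2i = F.relu(margin + neg_t2i - pos_t2i)
    return (loss_i2t + loss_t2i).sum()

def latent_loss(img_feat, txt_feat, labels, classifier):
    """Shared-subspace loss of step c-4): (L_trip + summed ID losses) / n."""
    n = img_feat.size(0)
    cls = F.cross_entropy(classifier(img_feat), labels, reduction="sum") \
        + F.cross_entropy(classifier(txt_feat), labels, reduction="sum")
    return (hard_triplet_loss(img_feat, txt_feat, labels) + cls) / n

# Usage with a batch of 8 samples in a 512-d shared subspace and 100 person IDs.
classifier = torch.nn.Linear(512, 100)               # plays the role of W and b in c-3)
img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.randint(0, 100, (8,))
loss = latent_loss(img_feat, txt_feat, labels, classifier)
```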
Embodiment 4 Before the image and text features are fused with the attribute features, to avoid an excessive difference between the distributions of the data of the two modalities, the present disclosure uses the coral function from transfer learning to decrease the distance between the data of the two modalities. Specifically, the step d) includes the following steps:
d-1) Calculate the loss function $L_{coral}(I,T)$ according to the formula
$$L_{coral}(I,T)=\frac{1}{4v^{2}}\left\|C_{I}-C_{T}\right\|_{F}^{2},$$
where $C_{I}$ and $C_{T}$ denote the covariance matrices of the image features and the text features, the image feature $I$ is constituted by the $I_k$, the text description feature $T$ of the person is constituted by the $T_k$, $v$ represents the dimension of $I_k$ and $T_k$, and $\|\cdot\|_{F}$ represents the Frobenius norm.
d-2) Calculate, according to the formula $t=\mathrm{sigmoid}(C\times U_{c}+F\times U_{f})$, the weights of the attribute feature and the image or text feature during feature fusion, where $C$ represents the to-be-fused attribute feature, $F$ represents the to-be-fused image or text feature, $U_{c}\in\mathbb{R}^{s\times da}$ and $U_{f}\in\mathbb{R}^{da\times da}$ are projection matrices, $t$ represents the weight, during feature fusion, obtained by adding up the projection results and processing the result with a sigmoid function, $s$ represents the number of image attribute classes or text attribute classes, and $da$ represents the feature dimension of the attribute space.
d-3) Calculate the fused feature $A$ according to the formula
$$A=t\times\left[C\times W_{c}\right]+(1-t)\times\left[F\times W_{f}\right],$$
where $W_{c}\in\mathbb{R}^{s\times da}$ and $W_{f}\in\mathbb{R}^{da\times da}$ represent projection matrices.
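The sketch below illustrates steps d-1) to d-3): a CORAL-style covariance-alignment loss and the sigmoid-gated fusion $A=t\times(C\times W_c)+(1-t)\times(F\times W_f)$. All dimensions are illustrative assumptions, and, unlike the disclosure, the helper also projects the raw image or text feature into the attribute space.

```python
# Minimal sketch (assumption: PyTorch; covariance computed as in standard CORAL).
import torch
import torch.nn as nn

def coral_loss(img_feat, txt_feat):
    """d-1): squared Frobenius distance between feature covariances, scaled by 1/(4 v^2)."""
    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return x.t() @ x / (x.size(0) - 1)
    v = img_feat.size(1)                           # feature dimension
    diff = covariance(img_feat) - covariance(txt_feat)
    return (diff ** 2).sum() / (4 * v ** 2)

class GatedAttributeFusion(nn.Module):
    """d-2)/d-3): t = sigmoid(C U_c + F U_f); fused A = t*(C W_c) + (1-t)*(F W_f)."""
    def __init__(self, num_attrs, feat_dim, attr_space_dim):
        super().__init__()
        self.U_c = nn.Linear(num_attrs, attr_space_dim, bias=False)
        self.U_f = nn.Linear(feat_dim, attr_space_dim, bias=False)
        self.W_c = nn.Linear(num_attrs, attr_space_dim, bias=False)
        self.W_f = nn.Linear(feat_dim, attr_space_dim, bias=False)

    def forward(self, attr, feat):
        t = torch.sigmoid(self.U_c(attr) + self.U_f(feat))
        return t * self.W_c(attr) + (1 - t) * self.W_f(feat)

# Usage: fuse 26 image attributes with a 2048-d image feature into a 512-d attribute space.
fusion = GatedAttributeFusion(num_attrs=26, feat_dim=2048, attr_space_dim=512)
fused = fusion(torch.rand(8, 26), torch.randn(8, 2048))
```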
Embodiment 5 In the hidden space, the triplet loss is used to establish the association between the image feature and the text feature. In the attribute space, the triplet loss of hard samples is used to establish a semantic association between features of different modalities.
Therefore, the step e) includes the following steps:
e-1) Calculate the triplet loss $L_{atrip}(I,T)$ of the attribute space according to the formula
$$L_{atrip}(I,T)=\sum_{I_k^{s}\in I^{s}}\max\left(\rho_2+S_a\!\left(I_k^{s},T_k^{sn}\right)-S_a\!\left(I_k^{s},T_k^{sp}\right),\,0\right)+\sum_{T_k^{s}\in T^{s}}\max\left(\rho_2+S_a\!\left(T_k^{s},I_k^{sn}\right)-S_a\!\left(T_k^{s},I_k^{sp}\right),\,0\right),$$
where $\rho_2$ represents the margin of the triplet loss, $S_a(\cdot)$ represents cosine similarity, $I_k^{s}$ represents the feature of the $k$-th image in the attribute space and is used as an anchor, $T_k^{sn}$ represents the feature, closest to the anchor $I_k^{s}$, of a heterogeneous text sample, $T_k^{sp}$ represents the feature, farthest from the anchor $I_k^{s}$, of a congeneric text sample, $T_k^{s}$ represents the feature of the $k$-th text description of the person in the attribute space and is used as an anchor, $I_k^{sn}$ represents the feature, closest to the anchor $T_k^{s}$, of a heterogeneous image sample, and $I_k^{sp}$ represents the feature, farthest from the anchor $T_k^{s}$, of a congeneric image sample.
e-2) Calculate the cosine similarity between $a_{I_k}$ and $a_{T_k}$ according to the formula
$$S_a(I_k,T_k)=\frac{a_{I_k}\cdot a_{T_k}}{\|a_{I_k}\|\,\|a_{T_k}\|},$$
where $a_{I_k}$ and $a_{T_k}$ respectively represent the image feature with semantic information and the text feature with semantic information that are obtained after attribute information fusion in the attribute space.
e-3) Calculate the loss function $L_{attr}(I,T)$ of the attribute space according to the formula
$$L_{attr}(I,T)=\frac{L_{atrip}(I,T)+L_{coral}(I,T)}{n}.$$
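A compact sketch of the attribute-space loss of step e), computing $L_{attr}=(L_{atrip}+L_{coral})/n$ on the fused features; the helpers mirror the shared-subspace sketch above, and the margin value is an illustrative assumption.

```python
# Minimal sketch (assumption: PyTorch). a_img / a_txt are the fused attribute-space
# features of one batch; labels are the person IDs.
import torch
import torch.nn.functional as F

def attr_triplet_loss(a_img, a_txt, labels, margin=0.2):
    sim = F.normalize(a_img, dim=1) @ F.normalize(a_txt, dim=1).t()
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    loss = 0.0
    for s in (sim, sim.t()):                       # image anchors, then text anchors
        neg = s.masked_fill(same, float("-inf")).max(dim=1).values
        pos = s.masked_fill(~same, float("inf")).min(dim=1).values
        loss = loss + F.relu(margin + neg - pos).sum()
    return loss

def coral_loss(a_img, a_txt):
    cov = lambda x: (x - x.mean(0)).t() @ (x - x.mean(0)) / (x.size(0) - 1)
    return ((cov(a_img) - cov(a_txt)) ** 2).sum() / (4 * a_img.size(1) ** 2)

def attribute_space_loss(a_img, a_txt, labels):
    """e-3): L_attr = (L_atrip + L_coral) / n, with n the batch size."""
    return (attr_triplet_loss(a_img, a_txt, labels) + coral_loss(a_img, a_txt)) / a_img.size(0)
```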
Embodiment 6 In the process of model learning, the hidden space and the attribute space are trained at the same time. The step f) includes the following steps:
f-1) Calculate the loss function $L(I,T)$ of the dual-attribute network according to the formula
$$L(I,T)=L_{latent}(I,T)+L_{attr}(I,T).$$
As shown in FIG. 2, the change curves of the three loss functions during training are roughly consistent, which proves the applicability and rationality of the present disclosure.
f-2) To make the person ID information learned from the hidden space and the person semantic information learned from the attribute space complementary in the test process, calculate the similarity $A(I_k,T_k)$ between the dual attributes according to the formula
$$A(I_k,T_k)=A_l\!\left(l_{I_k},l_{T_k}\right)+A_c\!\left(a_{I_k},a_{T_k}\right),$$
where $A_l$ represents the calculated similarity between the features $l_{I_k}$ and $l_{T_k}$ learned from the shared subspace, and $A_c$ represents the calculated similarity between the features $a_{I_k}$ and $a_{T_k}$ learned from the attribute space.
f-3) Calculate the cross-modality matching accuracy based on the finally obtained similarity $A(I_k,T_k)$. As shown in FIG. 3, the performance of the method in the present disclosure is significantly improved compared with that of the five existing methods listed in the table.
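A minimal sketch of the retrieval stage of steps f-2) and f-3): the cosine-similarity matrices from the shared subspace and from the attribute space are added, and Top-k matching accuracy is computed over the summed similarity. All dimensions and the random data in the usage example are illustrative assumptions.

```python
# Minimal sketch (assumption: PyTorch). Rows are text queries, columns are gallery images.
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(a, b):
    return F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()

def top_k_accuracy(query_txt_l, gallery_img_l, query_txt_a, gallery_img_a,
                   query_ids, gallery_ids, k=10):
    # f-2): add the shared-subspace and attribute-space similarity matrices.
    sim = cosine_similarity_matrix(query_txt_l, gallery_img_l) \
        + cosine_similarity_matrix(query_txt_a, gallery_img_a)
    # f-3): a query is a hit if any of its k most similar gallery images shares its ID.
    top_k = sim.topk(k, dim=1).indices                 # (num_queries, k)
    hits = (gallery_ids[top_k] == query_ids.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Usage on random features: 100 text queries against 500 gallery images.
q_l, g_l = torch.randn(100, 512), torch.randn(500, 512)
q_a, g_a = torch.randn(100, 512), torch.randn(500, 512)
q_ids, g_ids = torch.randint(0, 50, (100,)), torch.randint(0, 50, (500,))
print(top_k_accuracy(q_l, g_l, q_a, g_a, q_ids, g_ids, k=10))
```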
The above embodiments are only used for describing the technical solutions of the present disclosure and are not intended to limit the present disclosure. Although the present disclosure is described in detail with reference to the embodiments, those of ordinary skill in the art should understand that various modifications or equivalent substitutions may be made to the technical solutions of the present disclosure without departing from the spirit and scope of the technical solutions of the present disclosure, and such modifications or equivalent substitutions should be encompassed within the scope of the claims of the present disclosure.

Claims (8)

1. A cross-modality person re-identification method based on dual-attribute information, comprising the following steps: a) extracting a text description feature $T$ and an image feature $I$ of a person from content obtained by a surveillance camera; b) extracting a text attribute feature $C_T$ from the extracted text description of the person, and extracting an image attribute feature $C_I$ from the extracted image; c) inputting the text description feature and the image feature of the person in the step a) into a shared subspace, calculating a triplet loss function of hard samples, and calculating a classification loss of the features in the shared subspace by using a Softmax loss function; d) fusing the text description feature $T$ and the image feature $I$ of the person with the text attribute feature $C_T$ and the image attribute feature $C_I$; e) constructing a feature attribute space based on attribute information; and f) retrieving and matching the extracted image feature and text description feature of the person.

2. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the extracting a text description feature of a person in the step a) comprises the following steps: a-1.1) segmenting words in a description statement of the content obtained by the surveillance camera, and establishing a word frequency table; a-1.2) filtering out low-frequency words from the word frequency table; a-1.3) performing one-hot encoding on each word in the word frequency table; and a-1.4) performing feature extraction on the text description of the person by using a bidirectional long short-term memory (LSTM) model.

3. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the extracting an image feature in the step a) comprises the following steps: a-2.1) performing feature extraction on the image by using a ResNet that has been pre-trained on the ImageNet data set; and a-2.2) performing semantic segmentation on the extracted image, and performing, by using the ResNet in the step a-2.1), feature extraction on the image obtained after semantic segmentation.

4. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the step b) comprises the following steps: b-1) preprocessing the data of the text description of the person by using the natural language toolkit (NLTK) tool library, and extracting noun phrases constituted by an adjective plus a noun and noun phrases constituted by a plurality of superposed nouns; b-2) sorting the extracted noun phrases based on word frequency, discarding low-frequency phrases, and constructing an attribute table from the first 400 noun phrases, to obtain the text attribute feature $C_T$; and b-3) training on the PA-100K data set to obtain 26 prediction values for the image, and marking an image attribute with a prediction value greater than 0 as 1 and an image attribute with a prediction value less than 0 as 0, to obtain the image attribute feature $C_I$.

5. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the step c) comprises the following steps: c-1) calculating the triplet loss of hard samples according to the formula
$$L_{trip}(I,T)=\sum_{I_k\in I}\max\left(\rho_1+S(I_k,T_k^{n})-S(I_k,T_k^{p}),\,0\right)+\sum_{T_k\in T}\max\left(\rho_1+S(T_k,I_k^{n})-S(T_k,I_k^{p}),\,0\right),$$
where $I_k$ represents the feature of the $k$-th image and is used as an anchor, $T_k^{n}$ represents the feature, closest to the anchor $I_k$, of a heterogeneous text sample, $T_k^{p}$ represents the feature, farthest from the anchor $I_k$, of a congeneric text sample, $T_k$ represents the feature of the $k$-th text description of the person and is used as an anchor, $I_k^{n}$ represents the feature, closest to the anchor $T_k$, of a heterogeneous image sample, $I_k^{p}$ represents the feature, farthest from the anchor $T_k$, of a congeneric image sample, $\rho_1$ represents the margin of the triplet loss, and $S(\cdot)$ represents cosine similarity; c-2) calculating the cosine similarity between $l_{I_k}$ and $l_{T_k}$ according to the formula
$$S(I_k,T_k)=\frac{l_{I_k}\cdot l_{T_k}}{\|l_{I_k}\|\,\|l_{T_k}\|},$$
where $l_{I_k}$ represents the feature of the $k$-th image in the shared subspace, and $l_{T_k}$ represents the feature of the $k$-th text description of the person in the shared subspace; c-3) calculating the classification loss $L_{cls}(I_k)$ of the image feature $I_k$ in the shared subspace according to the formula
$$L_{cls}(I_k)=-\log\!\left(\frac{\exp\!\left(l_{I_k}^{\top}W_{y_k}+b_{y_k}\right)}{\sum_{j=1}^{C}\exp\!\left(l_{I_k}^{\top}W_{j}+b_{j}\right)}\right),$$
where $l_{I_k}^{\top}$ represents the transposed image feature in the shared subspace, $W\in\mathbb{R}^{dl\times C}$ represents the classifier, $dl$ represents the feature dimension of the shared subspace, $C$ represents the number of person identity (ID) classes, $y_k$ represents the ID information of $I_k$, $b$ represents the bias vector, $W_{j}$ represents the classification vector of the $j$-th class, $b_{j}$ represents the bias value of the $j$-th class, $W_{y_k}$ represents the classification vector of the $y_k$-th class, and $b_{y_k}$ represents the bias value of the $y_k$-th class; and calculating the classification loss $L_{cls}(T_k)$ of the text description feature $T_k$ of the person in the shared subspace according to the formula
$$L_{cls}(T_k)=-\log\!\left(\frac{\exp\!\left(l_{T_k}^{\top}W_{y_k}+b_{y_k}\right)}{\sum_{j=1}^{C}\exp\!\left(l_{T_k}^{\top}W_{j}+b_{j}\right)}\right),$$
where $l_{T_k}^{\top}$ represents the transposed text feature in the shared subspace; and c-4) calculating the loss function $L_{latent}(I,T)$ of the shared subspace according to the formula
$$L_{latent}(I,T)=\frac{1}{n}L_{trip}(I,T)+\frac{1}{n}\sum_{k}\left(L_{cls}(I_k)+L_{cls}(T_k)\right),$$
where $n$ represents the number of samples in one batch.

6. The cross-modality person re-identification method based on dual-attribute information according to claim 5, wherein the step d) comprises the following steps: d-1) calculating the loss function $L_{coral}(I,T)$ according to the formula
$$L_{coral}(I,T)=\frac{1}{4v^{2}}\left\|C_{I}-C_{T}\right\|_{F}^{2},$$
where $C_{I}$ and $C_{T}$ denote the covariance matrices of the image features and the text features, the image feature $I$ is constituted by the $I_k$, the text description feature $T$ of the person is constituted by the $T_k$, $v$ represents the dimension of $I_k$ and $T_k$, and $\|\cdot\|_{F}$ represents the Frobenius norm; d-2) calculating, according to the formula $t=\mathrm{sigmoid}(C\times U_{c}+F\times U_{f})$, the weights of the attribute feature and the image or text feature during feature fusion, where $C$ represents the to-be-fused attribute feature, $F$ represents the to-be-fused image or text feature, $U_{c}\in\mathbb{R}^{s\times da}$ and $U_{f}\in\mathbb{R}^{da\times da}$ are projection matrices, $t$ represents the weight, during feature fusion, obtained by adding up the projection results and processing the result with a sigmoid function, $s$ represents the number of image attribute classes or text attribute classes, and $da$ represents the feature dimension of the attribute space; and d-3) calculating the fused feature $A$ according to the formula
$$A=t\times\left[C\times W_{c}\right]+(1-t)\times\left[F\times W_{f}\right],$$
where $W_{c}\in\mathbb{R}^{s\times da}$ and $W_{f}\in\mathbb{R}^{da\times da}$ represent projection matrices.

7. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the step e) comprises the following steps: e-1) calculating the triplet loss $L_{atrip}(I,T)$ of the attribute space according to the formula
$$L_{atrip}(I,T)=\sum_{I_k^{s}\in I^{s}}\max\left(\rho_2+S_a\!\left(I_k^{s},T_k^{sn}\right)-S_a\!\left(I_k^{s},T_k^{sp}\right),\,0\right)+\sum_{T_k^{s}\in T^{s}}\max\left(\rho_2+S_a\!\left(T_k^{s},I_k^{sn}\right)-S_a\!\left(T_k^{s},I_k^{sp}\right),\,0\right),$$
where $\rho_2$ represents the margin of the triplet loss, $S_a(\cdot)$ represents cosine similarity, $I_k^{s}$ represents the feature of the $k$-th image in the attribute space and is used as an anchor, $T_k^{sn}$ represents the feature, closest to the anchor $I_k^{s}$, of a heterogeneous text sample, $T_k^{sp}$ represents the feature, farthest from the anchor $I_k^{s}$, of a congeneric text sample, $T_k^{s}$ represents the feature of the $k$-th text description of the person in the attribute space and is used as an anchor, $I_k^{sn}$ represents the feature, closest to the anchor $T_k^{s}$, of a heterogeneous image sample, and $I_k^{sp}$ represents the feature, farthest from the anchor $T_k^{s}$, of a congeneric image sample; e-2) calculating the cosine similarity between $a_{I_k}$ and $a_{T_k}$ according to the formula
$$S_a(I_k,T_k)=\frac{a_{I_k}\cdot a_{T_k}}{\|a_{I_k}\|\,\|a_{T_k}\|},$$
where $a_{I_k}$ and $a_{T_k}$ respectively represent the image feature with semantic information and the text feature with semantic information that are obtained after attribute information fusion in the attribute space; and e-3) calculating the loss function $L_{attr}(I,T)$ of the attribute space according to the formula
$$L_{attr}(I,T)=\frac{L_{atrip}(I,T)+L_{coral}(I,T)}{n}.$$

8. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the step f) comprises the following steps: f-1) calculating the loss function $L(I,T)$ of the dual-attribute network according to the formula $L(I,T)=L_{latent}(I,T)+L_{attr}(I,T)$; f-2) calculating the similarity $A(I_k,T_k)$ between the dual attributes according to the formula $A(I_k,T_k)=A_l(l_{I_k},l_{T_k})+A_c(a_{I_k},a_{T_k})$, where $A_l$ represents the calculated similarity between the features $l_{I_k}$ and $l_{T_k}$ learned from the shared subspace, and $A_c$ represents the calculated similarity between the features $a_{I_k}$ and $a_{T_k}$ learned from the attribute space; and f-3) calculating the cross-modality matching accuracy based on the similarity $A(I_k,T_k)$.
NL2028092A 2020-08-12 2021-04-29 Cross-modality person re-identification method based on dual-attribute information NL2028092B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010805183.XA CN112001279B (en) 2020-08-12 2020-08-12 Cross-modal pedestrian re-identification method based on dual attribute information

Publications (2)

Publication Number Publication Date
NL2028092A NL2028092A (en) 2021-07-28
NL2028092B1 true NL2028092B1 (en) 2022-04-06

Family

ID=73464076

Family Applications (1)

Application Number Title Priority Date Filing Date
NL2028092A NL2028092B1 (en) 2020-08-12 2021-04-29 Cross-modality person re-identification method based on dual-attribute information

Country Status (2)

Country Link
CN (1) CN112001279B (en)
NL (1) NL2028092B1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507853B (en) * 2020-12-02 2024-05-14 西北工业大学 Cross-modal pedestrian re-recognition method based on mutual attention mechanism
CN114612927B (en) * 2020-12-09 2023-05-09 四川大学 Pedestrian re-recognition method based on image text double-channel combination
CN113627151B (en) * 2021-10-14 2022-02-22 北京中科闻歌科技股份有限公司 Cross-modal data matching method, device, equipment and medium
CN114036336A (en) * 2021-11-15 2022-02-11 上海交通大学 Semantic division-based pedestrian image searching method based on visual text attribute alignment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9400925B2 (en) * 2013-11-15 2016-07-26 Facebook, Inc. Pose-aligned networks for deep attribute modeling
GB201703602D0 (en) * 2017-03-07 2017-04-19 Selerio Ltd Multi-Modal image search
CN107562812B (en) * 2017-08-11 2021-01-15 北京大学 Cross-modal similarity learning method based on specific modal semantic space modeling
CN109344266B (en) * 2018-06-29 2021-08-06 北京大学深圳研究生院 Dual-semantic-space-based antagonistic cross-media retrieval method
US11138469B2 (en) * 2019-01-15 2021-10-05 Naver Corporation Training and using a convolutional neural network for person re-identification
CN109829430B (en) * 2019-01-31 2021-02-19 中科人工智能创新技术研究院(青岛)有限公司 Cross-modal pedestrian re-identification method and system based on heterogeneous hierarchical attention mechanism
CN110021051B (en) * 2019-04-01 2020-12-15 浙江大学 Human image generation method based on generation of confrontation network through text guidance
CN110321813B (en) * 2019-06-18 2023-06-20 南京信息工程大学 Cross-domain pedestrian re-identification method based on pedestrian segmentation
CN110909605B (en) * 2019-10-24 2022-04-26 西北工业大学 Cross-modal pedestrian re-identification method based on contrast correlation

Also Published As

Publication number Publication date
CN112001279A (en) 2020-11-27
NL2028092A (en) 2021-07-28
CN112001279B (en) 2022-02-01

Similar Documents

Publication Publication Date Title
NL2028092B1 (en) Cross-modality person re-identification method based on dual-attribute information
Srihari Automatic indexing and content-based retrieval of captioned images
CN104063683B (en) Expression input method and device based on face identification
US8024343B2 (en) Identifying unique objects in multiple image collections
CN110826337A (en) Short text semantic training model obtaining method and similarity matching algorithm
US20070286497A1 (en) System and Method for Comparing Images using an Edit Distance
CN114743020B (en) Food identification method combining label semantic embedding and attention fusion
EP2005366A2 (en) Forming connections between image collections
Carneiro et al. A database centric view of semantic image annotation and retrieval
CN113688894A (en) Fine-grained image classification method fusing multi-grained features
CN111046732A (en) Pedestrian re-identification method based on multi-granularity semantic analysis and storage medium
CN114036336A (en) Semantic division-based pedestrian image searching method based on visual text attribute alignment
CN114611672B (en) Model training method, face recognition method and device
CN112990120B (en) Cross-domain pedestrian re-identification method using camera style separation domain information
WO2006122164A2 (en) System and method for enabling the use of captured images through recognition
CN113177612A (en) Agricultural pest image identification method based on CNN few samples
CN112347223A (en) Document retrieval method, document retrieval equipment and computer-readable storage medium
CN111783903A (en) Text processing method, text model processing method and device and computer equipment
CN116152870A (en) Face recognition method, device, electronic equipment and computer readable storage medium
CN112463922A (en) Risk user identification method and storage medium
CN113158777A (en) Quality scoring method, quality scoring model training method and related device
CN107273859B (en) Automatic photo marking method and system
CN111260114A (en) Low-frequency confusable criminal name prediction method for integrating case auxiliary sentence
CN113157974B (en) Pedestrian retrieval method based on text expression
Sahbi et al. From coarse to fine skin and face detection