CN114036336A - Pedestrian image searching method based on semantic division and visual-text attribute alignment - Google Patents

Pedestrian image searching method based on semantic division and visual-text attribute alignment

Info

Publication number
CN114036336A
Authority
CN
China
Prior art keywords
global
image
text
local
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111344497.5A
Other languages
Chinese (zh)
Inventor
杨华
杨新新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202111344497.5A
Publication of CN114036336A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian image searching method based on semantic division and visual-text attribute alignment, which comprises the following steps: processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets; extracting features from the data sets with an image feature extraction network and a text feature extraction network, respectively, to obtain the global and local features within the image and text single modalities; converting the global and local single-modality features into embedded features of the corresponding modality with embedding networks; and training the model under the joint constraint of a plurality of loss functions. The method divides the local features at a finer granularity and fully exploits the correspondence between local features, assisting the backbone network in extracting better-aligned global cross-modal embedded features; training under the joint constraint of the overall loss drives the model to converge toward the optimum and improves the performance of pedestrian image search based on natural language description.

Description

Pedestrian image searching method based on semantic division and visual-text attribute alignment
Technical Field
The invention relates to the field of cross-modal alignment, and in particular to a pedestrian image searching method based on semantic division and visual-text attribute alignment.
Background
Person identification technology plays an extremely important role in intelligent video surveillance. Cross-modal image-text retrieval of pedestrians aims to find, in an image gallery, the pedestrian picture that best matches a given natural language description of the target pedestrian. Compared with pedestrian image search based on query pictures or predefined attributes, search based on natural language description can handle the common case in which no known image of the person to be found is available, and is more flexible, convenient and user-friendly in application, at the cost of a more complex and challenging model.
The task poses two main difficulties. First, the same image can be described by texts of very different forms, so the text encoding network must be able to handle highly diversified data. Second, the samples come from two modalities, image and text, between which there is a heterogeneity gap in data composition and a semantic gap in features. Only by overcoming these inter-modal differences can the similarity between feature vectors from the two modalities be effectively measured, ranked and compared.
At present there are two main ideas: joint embedding learning and similarity learning. The former learns a mapping that projects image and text features into a common subspace, where the embedded features of image and text are matched directly; the latter designs a dedicated similarity metric network.
Joint embedding learning is currently the mainstream approach. Its pipeline first extracts independent features within each of the two modalities, then maps them into a feature subspace shared by vision and text (the higher layers of the network), and maximizes the correlation between the different modality representations of the same sample in that common subspace. Existing pedestrian image searching methods based on natural language description fall into the following three categories:
One is based on global features (see Zhang Y, Lu H. Deep cross-modal projection learning for image-text matching [C]// Proceedings of the European Conference on Computer Vision (ECCV), 2018: 686-701), matching a single global feature per image and per text.
The second is based on local features (see Li S, Xiao T, Li H, et al. Person search with natural language description [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 1970-1979): the image is partitioned into a regular grid of regions and the text is divided into individual words or phrases, and an attention mechanism is added to the network to obtain the similarities between local and local, local and global, and global and global. The disadvantage of this category is that it only performs gridded partitioning of the image and the text and does not align meaningful semantic parts across the two modalities.
The third fully explores the correlation between visual and textual semantic components (see Niu K, Huang Y, Ouyang W, et al. Improving description-based person re-identification by multi-granularity image-text alignments [J]. IEEE Transactions on Image Processing, 2020, 29: 5542-5556; Jing Y, Si C, Wang J, et al. Pose-guided multi-granularity attention network for text-based person search [C]// Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020; Wang Z, et al. ViTAA: Visual-textual attributes alignment in person search by natural language [C]// Proceedings of the European Conference on Computer Vision (ECCV), 2020): the image is segmented according to human body parts and the text is divided correspondingly, which promotes alignment between the local features of image and text. However, in this category the division of local features is driven by the image only, and a finer-grained division of the text semantics is missing.
Disclosure of Invention
Aiming at the above defects in the prior art, the object of the invention is to provide a pedestrian image searching method based on semantic division and visual-text attribute alignment.
According to one aspect of the invention, a pedestrian image searching method based on semantic division and visual-text attribute alignment is provided, which comprises the following steps:
processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets in the corresponding modalities;
extracting features from the data sets with an image feature extraction network and a text feature extraction network, respectively, to obtain the corresponding global and local features within each single modality;
converting the global and local single-modality features into embedded features of the corresponding modality with embedding networks;
training the model under the joint constraint of its overall loss;
and searching pedestrian images with the trained model.
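As an illustration of the final retrieval step, the sketch below ranks gallery images by the similarity between their global embedded features and the global embedded feature of a query description. It is a minimal sketch assuming cosine similarity over pre-computed 256-dimensional embeddings; the function and variable names are illustrative and not part of the patent.

```python
import torch
import torch.nn.functional as F

def rank_gallery(text_embed: torch.Tensor, gallery_embeds: torch.Tensor, top_k: int = 10):
    """Rank gallery pedestrian images for one text query.

    text_embed:     (256,)   global embedded feature of the query description.
    gallery_embeds: (N, 256) global embedded features of the N gallery images.
    Returns the indices of the top_k most similar gallery images.
    """
    # Cosine similarity between the query and every gallery image (an assumption:
    # the patent only states that similarity between embedded features is compared).
    sims = F.cosine_similarity(text_embed.unsqueeze(0), gallery_embeds, dim=1)  # (N,)
    return torch.topk(sims, k=top_k).indices

# Usage with random placeholders standing in for real embeddings.
query_embed = torch.randn(256)
gallery_embeds = torch.randn(1000, 256)
top10 = rank_gallery(query_embed, gallery_embeds, top_k=10)
```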
Preferably, processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets comprises:
obtaining a global image representation I_global of the human body and a global text representation T_global;
dividing the original data of the image modality and the text modality according to human body parts and word parts of speech, which comprises:
generating an image segmentation mask based on five body parts of the human body with an existing human body segmentation network: the size-normalized image is input into the human body segmentation network to produce the segmentation mask I_local-label covering the five body parts;
obtaining the textual representation of each body part, namely the noun-based local text description T_noun and the adjective-based local text description T_adj, with a word-to-body-part correspondence table together with the existing natural language processing toolkit NLTK.
preferably, the human body part comprises: head, upper body, lower body, shoes and backpack;
the part of speech of the word comprises: nouns and adjectives;
the word-human body part corresponding table designed aiming at the invention is described;
preferably, the image feature extraction network is FvisualThe first three blocks of Resnet50, with the output dimension set to 1024;
the text feature extraction network is FtextualWhich is Bi-LSTM, the output dimension is set to 512 dimensions.
Preferably, the image feature extraction network F_visual and the text feature extraction network F_textual are shared between global and local features. The global image data and the global and local text data are fed into the feature extraction network of the corresponding modality to extract the corresponding single-modality features. The global image I_global passes through the image feature extraction network F_visual to form the global image feature v_global. The global text T_global, the noun-based local text description T_noun and the adjective-based local text description T_adj pass independently through F_textual to obtain the global text feature t_global, the noun-based local text feature t_noun and the adjective-based local text feature t_adj.
Preferably, converting the global and local single-modality features into embedded features of the corresponding modality with the embedding networks comprises:
constructing a feature embedding network, which comprises: the image global feature embedding network E_visual-global, the image local feature embedding network E_visual-noun corresponding to noun semantics, the image local feature embedding network E_visual-adj corresponding to adjective semantics, the text global feature embedding network E_textual-global, the text local feature embedding network E_textual-noun corresponding to noun semantics, and the text local feature embedding network E_textual-adj corresponding to adjective semantics;
acquiring the global and local embedded features, which comprises:
passing the global image feature v_global through the image global feature embedding network E_visual-global to obtain the global image embedded feature v_global-embed;
passing the global image feature v_global through the image local feature embedding network E_visual-noun corresponding to noun semantics to obtain the local image embedded feature v_noun-embed;
passing the global image feature v_global through the image local feature embedding network E_visual-adj corresponding to adjective semantics to obtain the local image embedded feature v_adj-embed;
passing the global text feature t_global through the text global feature embedding network E_textual-global to obtain the global text embedded feature t_global-embed;
passing the noun-based local text feature t_noun through the text local feature embedding network E_textual-noun corresponding to noun semantics to obtain the local text embedded feature t_noun-embed;
passing the adjective-based local text feature t_adj through the text local feature embedding network E_textual-adj corresponding to adjective semantics to obtain the local text embedded feature t_adj-embed;
and passing the local image embedded features v_noun-embed and v_adj-embed through a deconvolution operation to obtain the class predictions x_noun and x_adj of the image local features.
Preferably, for each modality the sub-embedding networks are independent of each other and do not share parameters; the first element of every image feature embedding network is a fourth large block of Resnet50 (block_4) with its own parameters. The global and local embedded features of the image are therefore obtained by passing the original image through a complete Resnet50, the difference being that the fourth blocks are independent of one another and produce their respective features under the constraint of the corresponding loss functions; the first three blocks are shared between global and local features, and this shared network lets the alignment of local features constrain the extraction of global features during back-propagation.
Preferably, the overall loss comprises: global alignment loss, local alignment loss, and segmentation loss;
preferably, the function of the overall loss is:
Figure BDA0003353490320000047
Figure BDA0003353490320000048
wherein L isglobal-alignFor global alignment loss, for constraining the similarity of global embedded features between modalities;
Figure BDA0003353490320000049
and
Figure BDA00033534903200000410
respectively constraining the similarity of local embedding characteristics corresponding to nouns and adjective meanings among the modes for local alignment loss;
Figure BDA00033534903200000411
and
Figure BDA00033534903200000412
for dividing losses, useThe image local embedding characteristics extracted in the guarantee correspond to five body parts of the human body; lambda [ alpha ]1And λ2Representing the weight of the corresponding loss component.
Compared with the prior art, the invention has the following beneficial effects:
according to the method, the local features are divided in a finer granularity, the corresponding relation among the local features is fully utilized, and global branches (composed of a global feature extraction network and a global feature embedding network) are assisted to extract more aligned global cross-modal embedding features; through training of a plurality of loss functions combined constraint networks, the model is promoted to converge towards the optimal direction, and the performance of pedestrian image search based on natural language description is improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flowchart of a pedestrian image searching method based on semantic division and visual text attribute alignment according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a principle of a pedestrian image searching method based on semantic division and visual text attribute alignment according to an embodiment of the present invention;
FIG. 3 is a graph of performance of different algorithms on the data set CUHK-PEDES according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
As shown in FIG. 1, the flowchart of a pedestrian image searching method based on semantic division and visual-text attribute alignment according to an embodiment of the present invention comprises:
S1, processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets in the corresponding modalities;
S2, constructing an image feature extraction network and a text feature extraction network;
S3, extracting features from the data sets with the image feature extraction network and the text feature extraction network to obtain the global and local features within the image and text single modalities;
S4, constructing the embedding networks;
S5, converting the global and local single-modality features into embedded features of the corresponding modality with the embedding networks;
S6, training the model under the joint constraint of the overall loss.
To better perform the data preprocessing, an embodiment of S1 is provided. In the data preprocessing of this embodiment, a global image representation I_global of the human body and a global text representation T_global are acquired first. Data division is then carried out according to human body parts and word parts of speech: an existing human body segmentation network is used to obtain the image segmentation mask I_local-label based on human body parts, and the manually designed word-to-body-part correspondence table Table-voc is combined with a text parsing model to obtain the local text representations T_noun and T_adj.
As a preferred embodiment, performing S1 comprises:
S101, normalizing the size of the whole image to obtain the global image representation I_global with size 384 x 128;
S102, dividing the human body into five parts, namely head, upper body, lower body, shoes and backpack, and generating the image segmentation masks of the five body parts, collectively denoted I_local-label, with the existing human body segmentation model;
S103, manually collecting the noun table Table-voc (see Table 1) that corresponds nouns to the five body parts of the human body;
S104, obtaining the part of speech and the position of each word in the original sentence with the NLTK library, looking each noun up in Table-voc, and determining the body part it corresponds to;
specifically, this embodiment assumes that all adjectives between the (i-1)-th noun and the i-th noun modify the i-th noun, so the body part corresponding to each adjective can also be determined. The adjectives and nouns that describe each body part in a text description are thus identified, yielding local text information organized by semantics and body part (a sketch of this parsing procedure is given after Table 1);
S105, one-hot encoding the whole and local texts to form the global and local text descriptions. The global text description is unified to length 100 and each local text description to length 15. The whole text is denoted T_global. The local division finally generated for each text description falls into two groups: one group is the local representation T_noun corresponding to the phrase head nouns, which denote the type of an object; the other group is the local representation T_adj corresponding to the adjectives, which describe the nature and state attributes of an object;
S106, converting the global and local texts into vector representations through word embedding.
TABLE 1 Table-voc (the word-to-body-part vocabulary, listing the nouns that refer to each of the five body parts: head, upper body, lower body, shoes and backpack)
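The sketch below illustrates the parsing of steps S103 to S105 under stated assumptions: it relies on NLTK's tokenizer and part-of-speech tagger (their resources must be downloaded), uses a small stand-in vocabulary in place of Table-voc (whose full content is given only as an image), and applies the heuristic that the adjectives preceding a noun modify that noun. All names and the example vocabulary are illustrative, not the patent's actual implementation.

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger resources are installed

# Small stand-in for Table-voc; the patent's actual table is only shown as an image.
TABLE_VOC = {
    "head": {"hair", "hat", "face", "glasses"},
    "upper": {"shirt", "jacket", "coat", "sweater", "top"},
    "lower": {"pants", "jeans", "shorts", "skirt", "trousers"},
    "shoes": {"shoes", "sneakers", "boots", "sandals"},
    "backpack": {"backpack", "bag", "purse", "handbag"},
}

def split_description(sentence: str):
    """Group the nouns and adjectives of a description by body part (S104)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))  # [(word, POS), ...]
    nouns = {part: [] for part in TABLE_VOC}
    adjs = {part: [] for part in TABLE_VOC}
    pending_adjs = []                        # adjectives seen since the last noun
    for word, pos in tagged:
        if pos.startswith("JJ"):             # adjective: attach it to the next noun
            pending_adjs.append(word)
        elif pos.startswith("NN"):           # noun: look it up in Table-voc
            for part, vocab in TABLE_VOC.items():
                if word in vocab:
                    nouns[part].append(word)
                    adjs[part].extend(pending_adjs)  # heuristic described in S104
                    break
            pending_adjs = []                # each adjective group modifies one noun only
    return nouns, adjs

nouns, adjs = split_description("A woman with long hair wearing a red jacket and blue jeans.")
# Typical grouping (exact tags depend on the NLTK tagger):
# nouns -> {'head': ['hair'], 'upper': ['jacket'], 'lower': ['jeans'], ...}
# adjs  -> {'head': ['long'], 'upper': ['red'],    'lower': ['blue'],  ...}
```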
Based on the above S1, S2 is executed to construct the image feature extraction network and the text feature extraction network. To better extract image and text features, a preferred embodiment is provided. In this embodiment, the selected image feature extraction network F_visual is the first three blocks of Resnet50, with its output dimension set to 1024; the text feature extraction network F_textual is a Bi-LSTM, with its output dimension set to 512. Resnet alleviates the vanishing-gradient problem of deep networks and is a mainstream backbone in current image processing. Bi-LSTM is a bidirectional long short-term memory model that can capture and model context information.
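A minimal sketch of these two encoders, assuming a torchvision ResNet-50 truncated after its third residual stage (1024-channel output) and a bidirectional LSTM whose two 256-dimensional directions are concatenated into a 512-dimensional feature. The vocabulary size, word-embedding size and the max-pooling over time are assumptions not specified by the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ImageEncoder(nn.Module):
    """F_visual: the first three residual stages of ResNet-50 (1024-channel output)."""
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)  # load pretrained weights in practice
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.blocks = nn.Sequential(r.layer1, r.layer2, r.layer3)  # stops before layer4

    def forward(self, images):                    # images: (B, 3, 384, 128)
        return self.blocks(self.stem(images))     # (B, 1024, 24, 8) feature map

class TextEncoder(nn.Module):
    """F_textual: a Bi-LSTM over word embeddings with a 512-dimensional output."""
    def __init__(self, vocab_size=12000, embed_dim=300):  # sizes are assumptions
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, 256, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                 # token_ids: (B, L) padded index sequences
        out, _ = self.lstm(self.embed(token_ids)) # (B, L, 512)
        return out.max(dim=1).values              # pool over time -> (B, 512)
```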
Based on the above S2, S3 is executed: the constructed image feature extraction network and text feature extraction network are used to generate the single-modality feature vectors, obtaining the global image feature v_global, the global text feature t_global, the noun-based local text feature t_noun and the adjective-based local text feature t_adj. All feature extraction networks are shared between global and local inputs. This stage produces only the global image feature and the global and local text features; that is, only a global feature is extracted in the image modality, and the global and local image embedded features are generated from it afterwards.
Specifically, as a preferred embodiment, performing S3 comprises:
S301, inputting the global image I_global processed in S1 into the image feature extraction network F_visual to generate the global image feature v_global:
v_global = F_visual(I_global)
S302, inputting the text data T_global, T_noun and T_adj obtained in S1 into the text feature extraction network F_textual to generate the global text feature t_global, the noun-based local text feature t_noun and the adjective-based local text feature t_adj. Specifically, the obtained outputs are:
t_global = F_textual(T_global)
t_noun = F_textual(T_noun)
t_adj = F_textual(T_adj)
based on the above S3, S4 is executed to construct a feature-embedded network. In a multi-modal task, the model performance depends largely on the quality of the embedded features. Since the present process is a crucial loop in the overall network, it is aimed at generating richer embedded tokens on the basis of the aforementioned extracted single-modal feature vectors.
The feature embedding network comprises: the image global feature embedding network E_visual-global, the image local feature embedding network E_visual-noun corresponding to noun semantics, the image local feature embedding network E_visual-adj corresponding to adjective semantics, the text global feature embedding network E_textual-global, the text local feature embedding network E_textual-noun corresponding to noun semantics, and the text local feature embedding network E_textual-adj corresponding to adjective semantics. The six sub-networks of the feature embedding network are independent and unconnected.
In a preferred embodiment, the global and local embedding networks are not shared: a global feature is fed only into the global feature embedding network of its modality, and a local feature only into the corresponding local feature embedding network of its modality. The first element of every image feature embedding network is a fourth large block of Resnet50 (block_4) with its own parameters. The global and local embedded features of the image are therefore obtained by passing the original image through a complete Resnet50, the difference being that the fourth blocks are independent of one another and produce their respective features under the constraint of the corresponding loss functions; the first three blocks are shared between global and local features, and this shared network lets the alignment of local features constrain the extraction of global features during back-propagation.
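A sketch of the six embedding sub-networks under stated assumptions: each image branch is an independent copy of ResNet-50's fourth stage followed by pooling and a linear layer down to 256 dimensions (average pooling for the global branch, max pooling for the local branches, matching the notation explained in S5), while each text branch is a linear projection of the 512-dimensional Bi-LSTM feature to 256 dimensions. The class names and the use of copy.deepcopy to create independent block_4 parameters are illustrative choices.

```python
import copy
import torch.nn as nn
from torchvision.models import resnet50

EMBED_DIM = 256

class ImageEmbedNet(nn.Module):
    """One image embedding branch: its own block_4, then pooling and a linear layer."""
    def __init__(self, block4: nn.Module, pool: str = "avg"):
        super().__init__()
        self.block4 = block4                          # independent copy, own parameters
        self.pool = nn.AdaptiveAvgPool2d(1) if pool == "avg" else nn.AdaptiveMaxPool2d(1)
        self.linear = nn.Linear(2048, EMBED_DIM)      # ResNet-50 layer4 outputs 2048 channels

    def forward(self, v_global):                      # (B, 1024, 24, 8) from F_visual
        x = self.pool(self.block4(v_global)).flatten(1)   # (B, 2048)
        return self.linear(x)                             # (B, 256)

class TextEmbedNet(nn.Module):
    """One text embedding branch: a linear projection of the 512-d text feature."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(512, EMBED_DIM)

    def forward(self, t_feat):                        # (B, 512)
        return self.linear(t_feat)                    # (B, 256)

# Six independent sub-networks; each image branch gets its own copy of block_4.
layer4 = resnet50(weights=None).layer4
E_visual_global = ImageEmbedNet(copy.deepcopy(layer4), pool="avg")
E_visual_noun   = ImageEmbedNet(copy.deepcopy(layer4), pool="max")
E_visual_adj    = ImageEmbedNet(copy.deepcopy(layer4), pool="max")
E_textual_global, E_textual_noun, E_textual_adj = TextEmbedNet(), TextEmbedNet(), TextEmbedNet()
```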
Based on the above S4, S5 is executed to generate the embedded features. Specifically, as a preferred embodiment, performing S5 comprises:
S501, inputting the global image feature obtained in S3 into the feature embedding networks E_visual-global, E_visual-noun and E_visual-adj of S4 to obtain the global image embedded feature v_global-embed and the local image embedded features v_noun-embed and v_adj-embed. The input-output relations are:
v_global-embed = E_visual-global(v_global) = Linear_visual-global(Avgpool(block_4-global(v_global)))
v_noun-embed = E_visual-noun(v_global) = Linear_visual-noun(Maxpool(block_4-noun(v_global)))
v_adj-embed = E_visual-adj(v_global) = Linear_visual-adj(Maxpool(block_4-adj(v_global)))
v_noun-embed and v_adj-embed are then passed through a deconvolution operation to obtain the class predictions x_noun and x_adj of the image local features; the segmentation losses L_noun-seg and L_adj-seg between x_noun, x_adj and the segmentation masks produced by the human body segmentation network are computed later (a sketch of this step is given at the end of S5).
S502, inputting the global and local text features obtained in S3 into the feature embedding networks E_textual-global, E_textual-noun and E_textual-adj of S4 to obtain the global text embedded feature t_global-embed and the local text embedded features t_noun-embed and t_adj-embed. The input-output relations are:
t_global-embed = E_textual-global(t_global) = Linear_textual-global(t_global)
t_noun-embed = E_textual-noun(t_noun) = Linear_textual-noun(t_noun)
t_adj-embed = E_textual-adj(t_adj) = Linear_textual-adj(t_adj)
In these input-output formulas, block_4 denotes the fourth group of large blocks of Resnet50, Avgpool denotes average pooling, Maxpool denotes maximum pooling, Linear denotes a fully connected layer, and partname denotes the corresponding human body part (head / upper / lower / shoes / backpack) that indexes the per-part components of the local features. The dimension of both the global and the local embedded features is 256.
This embodiment makes full use of the text information when dividing the local features. On top of the existing division of local features by human body parts, the local information is further divided, according to the different semantics carried by words of different parts of speech, into two groups corresponding to noun and adjective meanings, producing finer-grained local features based on both body parts and semantics. Since the underlying network of the global and local features is shared, the alignment of the local features can, through the back-propagation optimization iterations, assist the global branch in extracting better-aligned cross-modal global embedded features.
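The sketch below illustrates the deconvolution step of S501 under stated assumptions: a small transposed-convolution head upsamples the local block_4 feature map (taken before pooling, which is an assumption) into per-pixel body-part class scores standing in for x_noun or x_adj, and a pixel-wise cross-entropy against the body-part mask I_local-label stands in for the segmentation loss. The head's depth, channel widths, class count and the cross-entropy choice are all assumptions; the patent only states that a deconvolution produces the class predictions and that a segmentation loss against the body-part masks is computed.

```python
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 6  # five body parts plus background (the background class is an assumption)

class PartSegHead(nn.Module):
    """Deconvolution head: upsamples a local feature map to per-pixel part scores."""
    def __init__(self, in_channels=2048):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, NUM_CLASSES, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, feat_map):             # e.g. (B, 2048, 12, 4) from the local block_4
        return self.deconv(feat_map)         # (B, 6, 48, 16) class scores, i.e. x_noun / x_adj

def segmentation_loss(x_part, part_mask):
    """Assumed pixel-wise cross-entropy between class scores and the body-part mask."""
    # part_mask: (B, H, W) integer labels from the human body segmentation network,
    # resized here to the resolution of the prediction.
    mask = F.interpolate(part_mask.unsqueeze(1).float(), size=x_part.shape[-2:],
                         mode="nearest").squeeze(1).long()
    return F.cross_entropy(x_part, mask)
```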
Based on the above S5, S6 is executed, and training is performed with the overall loss L.
Specifically, as a preferred embodiment, executing S6 includes:
the overall loss function is:
Figure BDA0003353490320000097
Figure BDA0003353490320000098
wherein L isglobal-alignFor constraining the similarity of the global embedded features between modalities,
Figure BDA0003353490320000099
and
Figure BDA00033534903200000910
respectively restricting the similarity of local embedding characteristics corresponding to nouns and adjective meanings between modes,
Figure BDA00033534903200000911
and
Figure BDA00033534903200000912
the image local embedding characteristics used for guaranteeing extraction correspond to five body parts of a human body. Lambda [ alpha ]1And λ2Representing the weight of the corresponding loss component, in this embodiment λ1Take 0.80, λ2Take 0.65.
Each alignment loss component is computed over a training batch from the similarities of positive (matched) and negative (unmatched) image-text pairs. In these expressions, N denotes the batch size, set to 64 in this embodiment; τ_p and τ_n are temperature parameters that adjust the contribution of positive and negative samples to the gradient, and in this embodiment every τ_p is set to 10 and every τ_n to 40; s_i denotes the dot product between the embedded features of the two modalities at the corresponding level, with the superscript '+' marking positive sample pairs and the superscript '-' marking negative sample pairs:
s_i^global = v_global-embed · t_global-embed
s_i^noun = v_noun-embed · t_noun-embed
s_i^adj = v_adj-embed · t_adj-embed
through the joint constraint of the five loss functions, the constraint on the model is enhanced, and the network is promoted to converge towards the optimal direction.
The invention provides a specific application embodiment for pedestrian image search. FIG. 2 is a schematic diagram of the pedestrian image searching method based on semantic division and visual-text attribute alignment according to this embodiment. The CUHK-PEDES data set contains 40206 pictures (34054 for training, 3078 for validation and 3074 for testing) and 80440 natural language text descriptions (68126 for training, 6158 for validation and 6156 for testing); each picture corresponds to about two natural language descriptions on average.
Experiments show that, assisted by the fine-grained local features, the method extracts better-aligned global features and improves the accuracy of cross-modal retrieval. Table 2 and FIG. 3 compare the performance of the method of this embodiment with different algorithms on CUHK-PEDES. FIG. 3 is a cumulative match characteristic (CMC) curve; from top to bottom, the nine curves represent the method of this embodiment and the other methods.
TABLE 2 Retrieval accuracy of the method of this embodiment and of the compared algorithms on the CUHK-PEDES data set
It can be seen that the results obtained by this embodiment greatly improve the performance of cross-modal pedestrian retrieval.
In summary, the method of this embodiment makes full use of the semantic information in the text and generates finer-grained local features on top of the existing division of local features by human body parts. It fully exploits the correspondence between local features and assists the global branch (composed of the global feature extraction network and the global feature embedding network) in extracting better-aligned global cross-modal embedded features. Through the alignment constraint on the local features, the network is helped to learn better-aligned global features; through the joint constraint of the global alignment loss, the local alignment losses and the segmentation losses, the model converges toward the optimum, and the retrieval performance of pedestrian image search based on natural language description is improved.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The above-described preferred features may be used in any combination without conflict with each other.

Claims (10)

1. A pedestrian image searching method based on semantic division and visual-text attribute alignment, characterized by comprising the following steps:
processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets;
extracting features from the data sets with an image feature extraction network and a text feature extraction network, respectively, to obtain the corresponding global and local features within each single modality;
converting the global and local single-modality features into embedded features of the corresponding modality with embedding networks;
training the model under the joint constraint of its overall loss;
and searching pedestrian images with the trained model.
2. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 1, wherein processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets comprises:
obtaining a global image representation I_global of the human body and a global text representation T_global;
dividing the original data of the image modality and the text modality according to human body parts and word parts of speech, which comprises:
generating the image segmentation mask I_local-label based on human body parts with an existing human body segmentation network;
obtaining the textual representation of each body part, namely the noun-based local text description T_noun and the adjective-based local text description T_adj, with a word-to-body-part correspondence table together with the existing natural language processing toolkit NLTK.
3. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 2, wherein
the human body parts comprise: head, upper body, lower body, shoes and backpack;
and the word parts of speech comprise: nouns and adjectives.
4. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 1, wherein the image feature extraction network F_visual consists of the first three blocks of Resnet50, with its output dimension set to 1024;
the text feature extraction network F_textual is a Bi-LSTM, with its output dimension set to 512;
and the image feature extraction network F_visual and the text feature extraction network F_textual are shared between the global and local features.
5. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 4, wherein extracting features from the data sets with the image feature extraction network and the text feature extraction network, respectively, to obtain the corresponding global and local features within each single modality comprises:
feeding the global image data and the global and local text data into the feature extraction network of the corresponding modality to extract the corresponding single-modality features, which comprises:
passing the global image I_global through the image feature extraction network F_visual to form the global image feature v_global;
passing the global text T_global, the noun-based local text description T_noun and the adjective-based local text description T_adj independently through F_textual to obtain the global text feature t_global, the noun-based local text feature t_noun and the adjective-based local text feature t_adj.
6. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 2, wherein the feature embedding network comprises six sub-networks:
the image global feature embedding network E_visual-global, the image local feature embedding network E_visual-noun corresponding to noun semantics, the image local feature embedding network E_visual-adj corresponding to adjective semantics, the text global feature embedding network E_textual-global, the text local feature embedding network E_textual-noun corresponding to noun semantics, and the text local feature embedding network E_textual-adj corresponding to adjective semantics;
and wherein the global and local multi-modal embedded features comprise:
the global image embedded feature v_global-embed, the local image embedded features v_noun-embed and v_adj-embed, the global text embedded feature t_global-embed, and the local text embedded features t_noun-embed and t_adj-embed.
7. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 6, wherein converting the global and local single-modality features into embedded features of the corresponding modality with the embedding networks comprises:
passing the global image feature v_global through the image global feature embedding network E_visual-global to obtain the global image embedded feature v_global-embed;
passing the global image feature v_global through the image local feature embedding network E_visual-noun corresponding to noun semantics to obtain the local image embedded feature v_noun-embed;
passing the global image feature v_global through the image local feature embedding network E_visual-adj corresponding to adjective semantics to obtain the local image embedded feature v_adj-embed;
passing the global text feature t_global through the text global feature embedding network E_textual-global to obtain the global text embedded feature t_global-embed;
passing the noun-based local text feature t_noun through the text local feature embedding network E_textual-noun corresponding to noun semantics to obtain the local text embedded feature t_noun-embed;
passing the adjective-based local text feature t_adj through the text local feature embedding network E_textual-adj corresponding to adjective semantics to obtain the local text embedded feature t_adj-embed;
and passing the local image embedded features v_noun-embed and v_adj-embed through a deconvolution operation to obtain the class predictions x_noun and x_adj of the image local features.
8. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 6 or 7, wherein, for each modality, the six sub-networks of the feature embedding network are independent of each other and do not share parameters;
and the first element of every image feature embedding network is a fourth large block (block_4) of Resnet50 with its own parameters.
9. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 8, wherein
the overall loss comprises: a global alignment loss, local alignment losses and segmentation losses.
10. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 9, wherein
the overall loss function is:
L = L_global-align + λ1(L_noun-align + L_adj-align) + λ2(L_noun-seg + L_adj-seg)
wherein L_global-align is the global alignment loss, which constrains the similarity of the global embedded features between the modalities; L_noun-align and L_adj-align are the local alignment losses, which respectively constrain the similarity of the local embedded features corresponding to noun and adjective semantics between the modalities; L_noun-seg and L_adj-seg are the segmentation losses, which ensure that the extracted local image embedded features correspond to the five body parts of the human body; and λ1 and λ2 are the weights of the corresponding loss components.
CN202111344497.5A 2021-11-15 2021-11-15 Pedestrian image searching method based on semantic division and visual-text attribute alignment Pending CN114036336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111344497.5A CN114036336A (en) Pedestrian image searching method based on semantic division and visual-text attribute alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111344497.5A CN114036336A (en) Pedestrian image searching method based on semantic division and visual-text attribute alignment

Publications (1)

Publication Number Publication Date
CN114036336A true CN114036336A (en) 2022-02-11

Family

ID=80137602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111344497.5A Pending CN114036336A (en) Pedestrian image searching method based on semantic division and visual-text attribute alignment

Country Status (1)

Country Link
CN (1) CN114036336A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821770A (en) * 2022-04-11 2022-07-29 华南理工大学 Text-to-image cross-modal pedestrian re-identification method, system, medium, and apparatus
CN114821770B (en) * 2022-04-11 2024-03-26 华南理工大学 Cross-modal pedestrian re-identification method, system, medium and device from text to image
CN115292533A (en) * 2022-08-17 2022-11-04 苏州大学 Cross-modal pedestrian retrieval method driven by visual positioning
CN115292533B (en) * 2022-08-17 2023-06-27 苏州大学 Cross-modal pedestrian retrieval method driven by visual positioning
CN115761222A (en) * 2022-09-27 2023-03-07 阿里巴巴(中国)有限公司 Image segmentation method, remote sensing image segmentation method and device
CN115761222B (en) * 2022-09-27 2023-11-03 阿里巴巴(中国)有限公司 Image segmentation method, remote sensing image segmentation method and device
CN116228897A (en) * 2023-03-10 2023-06-06 北京百度网讯科技有限公司 Image processing method, image processing model and training method
CN116228897B (en) * 2023-03-10 2024-04-23 北京百度网讯科技有限公司 Image processing method, image processing model and training method
WO2024114185A1 (en) * 2023-07-24 2024-06-06 西北工业大学 Pedestrian attribute cross-modal alignment method based on complete attribute identification enhancement
CN117391092A (en) * 2023-12-12 2024-01-12 中南大学 Electronic medical record multi-mode medical semantic alignment method based on contrast learning
CN117391092B (en) * 2023-12-12 2024-03-08 中南大学 Electronic medical record multi-mode medical semantic alignment method based on contrast learning

Similar Documents

Publication Publication Date Title
CN114036336A (en) Pedestrian image searching method based on semantic division and visual-text attribute alignment
Li et al. Know more say less: Image captioning based on scene graphs
Gu et al. An empirical study of language cnn for image captioning
CN112000818B (en) Text and image-oriented cross-media retrieval method and electronic device
Zhu et al. Content-based visual landmark search via multimodal hypergraph learning
US20200327327A1 (en) Providing a response in a session
CN113065577A (en) Multi-modal emotion classification method for targets
WO2019019935A1 (en) Interaction method, interaction terminal, storage medium, and computer device
Islam et al. Exploring video captioning techniques: A comprehensive survey on deep learning methods
Muhammad et al. Casia-face-africa: A large-scale african face image database
Pappas et al. Multilingual visual sentiment concept matching
Zhou Generative adversarial network for text-to-face synthesis and manipulation
CN116737979A (en) Context-guided multi-modal-associated image text retrieval method and system
Emami et al. Arabic image captioning using pre-training of deep bidirectional transformers
Bansal et al. Visual content based video retrieval on natural language queries
CN113157974B (en) Pedestrian retrieval method based on text expression
CN110659392A (en) Retrieval method and device, and storage medium
CN117057349A (en) News text keyword extraction method, device, computer equipment and storage medium
Wang et al. A novel semantic attribute-based feature for image caption generation
CN115409107A (en) Training method of multi-modal association building model and multi-modal data retrieval method
Attai et al. A survey on arabic image captioning systems using deep learning models
CN115081445A (en) Short text entity disambiguation method based on multitask learning
Upadhyay et al. Mood based music playlist generator using convolutional neural network
Kansal et al. Hierarchical attention image-text alignment network for person re-identification
Runyan et al. A Survey on Learning Objects’ Relationship for Image Captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination