CN114036336A - Pedestrian image searching method based on semantic division and visual-text attribute alignment - Google Patents

Pedestrian image searching method based on semantic division and visual-text attribute alignment

Info

Publication number
CN114036336A
Authority
CN
China
Prior art keywords
global
image
text
local
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111344497.5A
Other languages
Chinese (zh)
Inventor
杨华
杨新新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202111344497.5A
Publication of CN114036336A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian image searching method based on semantic division and visual-text attribute alignment, which comprises the following steps: processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets; extracting features from the data sets with an image feature extraction network and a text feature extraction network, respectively, to obtain the global and local features within the image and text single modalities; converting the global and local single-modality features into embedded features of the corresponding modality with embedding networks; and training the model under the joint constraint of a plurality of loss functions. The method divides the local features at a finer granularity and fully exploits the correspondence between local features, assisting the backbone network in extracting better-aligned global cross-modal embedded features; training under the joint constraint of the overall loss drives the model to converge toward the optimum and improves the performance of pedestrian image search based on natural language description.

Description

Pedestrian image searching method based on semantic division and visual-text attribute alignment
Technical Field
The invention relates to the field of cross-modal alignment, and in particular to a pedestrian image searching method based on semantic division and visual-text attribute alignment.
Background
Person identification technology plays an extremely important role in intelligent video surveillance. Cross-modal image-text retrieval of pedestrians aims to find, in an image gallery, the pedestrian picture that best matches a given natural language description of the target pedestrian. Compared with pedestrian image search based on query pictures or predefined attributes, search based on natural language description can handle the common case in which no known image of the person to be found is available, and is more flexible, convenient and user-friendly in application, at the cost of a more complex and challenging model.
The task poses two main difficulties. First, the same image can be described by texts of very different forms, so the text encoding network must be able to handle highly diversified data. Second, the samples come from two modalities, image and text, between which there is a heterogeneity gap in data composition and a semantic gap in features. Only by overcoming these inter-modal differences can the similarity between feature vectors from the two modalities be effectively measured, ranked and compared.
At present there are two main ideas: joint embedding learning and similarity learning. The former learns a mapping that projects image and text features into a common subspace, where the embedded features of image and text are matched directly; the latter designs a dedicated similarity metric network.
Joint embedding learning is currently the mainstream approach. Its pipeline first extracts independent features within each of the two modalities, then maps them into a feature subspace shared by vision and text (the higher layers of the network), and maximizes the correlation between the different modality representations of the same sample in that common subspace. Existing pedestrian image searching methods based on natural language description fall into the following three categories:
One is based on global features (see Zhang Y, Lu H. Deep cross-modal projection learning for image-text matching [C]// Proceedings of the European Conference on Computer Vision (ECCV), 2018: 686-701), matching a single global feature per image and per text.
The second is based on local features (see Li S, Xiao T, Li H, et al. Person search with natural language description [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 1970-1979): the image is partitioned into a regular grid of regions and the text is divided into individual words or phrases, and an attention mechanism is added to the network to obtain the similarities between local and local, local and global, and global and global. The disadvantage of this category is that it only performs gridded partitioning of the image and the text and does not align meaningful semantic parts across the two modalities.
The third fully explores the correlation between visual and textual semantic components (see Niu K, Huang Y, Ouyang W, et al. Improving description-based person re-identification by multi-granularity image-text alignments [J]. IEEE Transactions on Image Processing, 2020, 29: 5542-5556; Jing Y, Si C, Wang J, et al. Pose-guided multi-granularity attention network for text-based person search [C]// Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020; Wang Z, et al. ViTAA: Visual-textual attributes alignment in person search by natural language [C]// Proceedings of the European Conference on Computer Vision (ECCV), 2020): the image is segmented according to human body parts and the text is divided correspondingly, which promotes alignment between the local features of image and text. However, in this category the division of local features is driven by the image only, and a finer-grained division of the text semantics is missing.
Disclosure of Invention
Aiming at the above defects in the prior art, the object of the invention is to provide a pedestrian image searching method based on semantic division and visual-text attribute alignment.
According to one aspect of the invention, a pedestrian image searching method based on semantic division and visual-text attribute alignment is provided, which comprises the following steps:
processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets in the corresponding modalities;
extracting features from the data sets with an image feature extraction network and a text feature extraction network, respectively, to obtain the corresponding global and local features within each single modality;
converting the global and local single-modality features into embedded features of the corresponding modality with embedding networks;
training the model under the joint constraint of its overall loss;
and searching pedestrian images with the trained model.
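As an illustration of the final retrieval step, the sketch below ranks gallery images by the similarity between their global embedded features and the global embedded feature of a query description. It is a minimal sketch assuming cosine similarity over pre-computed 256-dimensional embeddings; the function and variable names are illustrative and not part of the patent.

```python
import torch
import torch.nn.functional as F

def rank_gallery(text_embed: torch.Tensor, gallery_embeds: torch.Tensor, top_k: int = 10):
    """Rank gallery pedestrian images for one text query.

    text_embed:     (256,)   global embedded feature of the query description.
    gallery_embeds: (N, 256) global embedded features of the N gallery images.
    Returns the indices of the top_k most similar gallery images.
    """
    # Cosine similarity between the query and every gallery image (an assumption:
    # the patent only states that similarity between embedded features is compared).
    sims = F.cosine_similarity(text_embed.unsqueeze(0), gallery_embeds, dim=1)  # (N,)
    return torch.topk(sims, k=top_k).indices

# Usage with random placeholders standing in for real embeddings.
query_embed = torch.randn(256)
gallery_embeds = torch.randn(1000, 256)
top10 = rank_gallery(query_embed, gallery_embeds, top_k=10)
```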
Preferably, processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets comprises:
obtaining a global image representation I_global of the human body and a global text representation T_global;
dividing the original data of the image modality and the text modality according to human body parts and word parts of speech, which comprises:
generating an image segmentation mask based on five body parts of the human body with an existing human body segmentation network: the size-normalized image is input into the human body segmentation network to produce the segmentation mask I_local-label covering the five body parts;
obtaining the textual representation of each body part, namely the noun-based local text description T_noun and the adjective-based local text description T_adj, with a word-to-body-part correspondence table together with the existing natural language processing toolkit NLTK.
preferably, the human body part comprises: head, upper body, lower body, shoes and backpack;
the part of speech of the word comprises: nouns and adjectives;
the word-human body part corresponding table designed aiming at the invention is described;
preferably, the image feature extraction network is FvisualThe first three blocks of Resnet50, with the output dimension set to 1024;
the text feature extraction network is FtextualWhich is Bi-LSTM, the output dimension is set to 512 dimensions.
Preferably, the image feature extraction network F_visual and the text feature extraction network F_textual are shared between global and local features. The global image data and the global and local text data are fed into the feature extraction network of the corresponding modality to extract the corresponding single-modality features. The global image I_global passes through the image feature extraction network F_visual to form the global image feature v_global. The global text T_global, the noun-based local text description T_noun and the adjective-based local text description T_adj pass independently through F_textual to obtain the global text feature t_global, the noun-based local text feature t_noun and the adjective-based local text feature t_adj.
Preferably, converting the global and local single-modality features into embedded features of the corresponding modality with the embedding networks comprises:
constructing a feature embedding network, which comprises: the image global feature embedding network E_visual-global, the image local feature embedding network E_visual-noun corresponding to noun semantics, the image local feature embedding network E_visual-adj corresponding to adjective semantics, the text global feature embedding network E_textual-global, the text local feature embedding network E_textual-noun corresponding to noun semantics, and the text local feature embedding network E_textual-adj corresponding to adjective semantics;
acquiring the global and local embedded features, which comprises:
passing the global image feature v_global through the image global feature embedding network E_visual-global to obtain the global image embedded feature v_global-embed;
passing the global image feature v_global through the image local feature embedding network E_visual-noun corresponding to noun semantics to obtain the local image embedded feature v_noun-embed;
passing the global image feature v_global through the image local feature embedding network E_visual-adj corresponding to adjective semantics to obtain the local image embedded feature v_adj-embed;
passing the global text feature t_global through the text global feature embedding network E_textual-global to obtain the global text embedded feature t_global-embed;
passing the noun-based local text feature t_noun through the text local feature embedding network E_textual-noun corresponding to noun semantics to obtain the local text embedded feature t_noun-embed;
passing the adjective-based local text feature t_adj through the text local feature embedding network E_textual-adj corresponding to adjective semantics to obtain the local text embedded feature t_adj-embed;
and passing the local image embedded features v_noun-embed and v_adj-embed through a deconvolution operation to obtain the class predictions x_noun and x_adj of the image local features.
Preferably, for each modality the sub-embedding networks are independent of each other and do not share parameters; the first element of every image feature embedding network is a fourth large block of Resnet50 (block_4) with its own parameters. The global and local embedded features of the image are therefore obtained by passing the original image through a complete Resnet50, the difference being that the fourth blocks are independent of one another and produce their respective features under the constraint of the corresponding loss functions; the first three blocks are shared between global and local features, and this shared network lets the alignment of local features constrain the extraction of global features during back-propagation.
Preferably, the overall loss comprises: global alignment loss, local alignment loss, and segmentation loss;
preferably, the function of the overall loss is:
Figure BDA0003353490320000047
Figure BDA0003353490320000048
wherein L isglobal-alignFor global alignment loss, for constraining the similarity of global embedded features between modalities;
Figure BDA0003353490320000049
and
Figure BDA00033534903200000410
respectively constraining the similarity of local embedding characteristics corresponding to nouns and adjective meanings among the modes for local alignment loss;
Figure BDA00033534903200000411
and
Figure BDA00033534903200000412
for dividing losses, useThe image local embedding characteristics extracted in the guarantee correspond to five body parts of the human body; lambda [ alpha ]1And λ2Representing the weight of the corresponding loss component.
Compared with the prior art, the invention has the following beneficial effects:
according to the method, the local features are divided in a finer granularity, the corresponding relation among the local features is fully utilized, and global branches (composed of a global feature extraction network and a global feature embedding network) are assisted to extract more aligned global cross-modal embedding features; through training of a plurality of loss functions combined constraint networks, the model is promoted to converge towards the optimal direction, and the performance of pedestrian image search based on natural language description is improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flowchart of a pedestrian image searching method based on semantic division and visual text attribute alignment according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a principle of a pedestrian image searching method based on semantic division and visual text attribute alignment according to an embodiment of the present invention;
FIG. 3 is a graph of performance of different algorithms on the data set CUHK-PEDES according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
As shown in FIG. 1, the flowchart of a pedestrian image searching method based on semantic division and visual-text attribute alignment according to an embodiment of the present invention comprises:
S1, processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets in the corresponding modalities;
S2, constructing an image feature extraction network and a text feature extraction network;
S3, extracting features from the data sets with the image feature extraction network and the text feature extraction network to obtain the global and local features within the image and text single modalities;
S4, constructing the embedding networks;
S5, converting the global and local single-modality features into embedded features of the corresponding modality with the embedding networks;
S6, training the model under the joint constraint of the overall loss.
To better perform the data preprocessing, an embodiment of S1 is provided. In the data preprocessing of this embodiment, a global image representation I_global of the human body and a global text representation T_global are acquired first. Data division is then carried out according to human body parts and word parts of speech: an existing human body segmentation network is used to obtain the image segmentation mask I_local-label based on human body parts, and the manually designed word-to-body-part correspondence table Table-voc is combined with a text parsing model to obtain the local text representations T_noun and T_adj.
As a preferred embodiment, performing S1 comprises:
S101, normalizing the size of the whole image to obtain the global image representation I_global with size 384 x 128;
S102, dividing the human body into five parts, namely head, upper body, lower body, shoes and backpack, and generating the image segmentation masks of the five body parts, collectively denoted I_local-label, with the existing human body segmentation model;
S103, manually collecting the noun table Table-voc (see Table 1) that corresponds nouns to the five body parts of the human body;
S104, obtaining the part of speech and the position of each word in the original sentence with the NLTK library, looking each noun up in Table-voc, and determining the body part it corresponds to;
specifically, this embodiment assumes that all adjectives between the (i-1)-th noun and the i-th noun modify the i-th noun, so the body part corresponding to each adjective can also be determined. The adjectives and nouns that describe each body part in a text description are thus identified, yielding local text information organized by semantics and body part (a sketch of this parsing procedure is given after Table 1);
S105, one-hot encoding the whole and local texts to form the global and local text descriptions. The global text description is unified to length 100 and each local text description to length 15. The whole text is denoted T_global. The local division finally generated for each text description falls into two groups: one group is the local representation T_noun corresponding to the phrase head nouns, which denote the type of an object; the other group is the local representation T_adj corresponding to the adjectives, which describe the nature and state attributes of an object;
S106, converting the global and local texts into vector representations through word embedding.
TABLE 1 Table-voc (the word-to-body-part vocabulary, listing the nouns that refer to each of the five body parts: head, upper body, lower body, shoes and backpack)
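The sketch below illustrates the parsing of steps S103 to S105 under stated assumptions: it relies on NLTK's tokenizer and part-of-speech tagger (their resources must be downloaded), uses a small stand-in vocabulary in place of Table-voc (whose full content is given only as an image), and applies the heuristic that the adjectives preceding a noun modify that noun. All names and the example vocabulary are illustrative, not the patent's actual implementation.

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger resources are installed

# Small stand-in for Table-voc; the patent's actual table is only shown as an image.
TABLE_VOC = {
    "head": {"hair", "hat", "face", "glasses"},
    "upper": {"shirt", "jacket", "coat", "sweater", "top"},
    "lower": {"pants", "jeans", "shorts", "skirt", "trousers"},
    "shoes": {"shoes", "sneakers", "boots", "sandals"},
    "backpack": {"backpack", "bag", "purse", "handbag"},
}

def split_description(sentence: str):
    """Group the nouns and adjectives of a description by body part (S104)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))  # [(word, POS), ...]
    nouns = {part: [] for part in TABLE_VOC}
    adjs = {part: [] for part in TABLE_VOC}
    pending_adjs = []                        # adjectives seen since the last noun
    for word, pos in tagged:
        if pos.startswith("JJ"):             # adjective: attach it to the next noun
            pending_adjs.append(word)
        elif pos.startswith("NN"):           # noun: look it up in Table-voc
            for part, vocab in TABLE_VOC.items():
                if word in vocab:
                    nouns[part].append(word)
                    adjs[part].extend(pending_adjs)  # heuristic described in S104
                    break
            pending_adjs = []                # each adjective group modifies one noun only
    return nouns, adjs

nouns, adjs = split_description("A woman with long hair wearing a red jacket and blue jeans.")
# Typical grouping (exact tags depend on the NLTK tagger):
# nouns -> {'head': ['hair'], 'upper': ['jacket'], 'lower': ['jeans'], ...}
# adjs  -> {'head': ['long'], 'upper': ['red'],    'lower': ['blue'],  ...}
```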
Based on the above S1, S2 is executed to construct the image feature extraction network and the text feature extraction network. To better extract image and text features, a preferred embodiment is provided. In this embodiment, the selected image feature extraction network F_visual is the first three blocks of Resnet50, with its output dimension set to 1024; the text feature extraction network F_textual is a Bi-LSTM, with its output dimension set to 512. Resnet alleviates the vanishing-gradient problem of deep networks and is a mainstream backbone in current image processing. Bi-LSTM is a bidirectional long short-term memory model that can capture and model context information.
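A minimal sketch of these two encoders, assuming a torchvision ResNet-50 truncated after its third residual stage (1024-channel output) and a bidirectional LSTM whose two 256-dimensional directions are concatenated into a 512-dimensional feature. The vocabulary size, word-embedding size and the max-pooling over time are assumptions not specified by the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ImageEncoder(nn.Module):
    """F_visual: the first three residual stages of ResNet-50 (1024-channel output)."""
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)  # load pretrained weights in practice
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.blocks = nn.Sequential(r.layer1, r.layer2, r.layer3)  # stops before layer4

    def forward(self, images):                    # images: (B, 3, 384, 128)
        return self.blocks(self.stem(images))     # (B, 1024, 24, 8) feature map

class TextEncoder(nn.Module):
    """F_textual: a Bi-LSTM over word embeddings with a 512-dimensional output."""
    def __init__(self, vocab_size=12000, embed_dim=300):  # sizes are assumptions
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, 256, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                 # token_ids: (B, L) padded index sequences
        out, _ = self.lstm(self.embed(token_ids)) # (B, L, 512)
        return out.max(dim=1).values              # pool over time -> (B, 512)
```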
Based on the above S2, S3 is executed: the constructed image feature extraction network and text feature extraction network are used to generate the single-modality feature vectors, obtaining the global image feature v_global, the global text feature t_global, the noun-based local text feature t_noun and the adjective-based local text feature t_adj. All feature extraction networks are shared between global and local inputs. This stage produces only the global image feature and the global and local text features; that is, only a global feature is extracted in the image modality, and the global and local image embedded features are generated from it afterwards.
Specifically, as a preferred embodiment, performing S3 comprises:
S301, inputting the global image I_global processed in S1 into the image feature extraction network F_visual to generate the global image feature v_global:
v_global = F_visual(I_global)
S302, inputting the text data T_global, T_noun and T_adj obtained in S1 into the text feature extraction network F_textual to generate the global text feature t_global, the noun-based local text feature t_noun and the adjective-based local text feature t_adj. Specifically, the obtained outputs are:
t_global = F_textual(T_global)
t_noun = F_textual(T_noun)
t_adj = F_textual(T_adj)
based on the above S3, S4 is executed to construct a feature-embedded network. In a multi-modal task, the model performance depends largely on the quality of the embedded features. Since the present process is a crucial loop in the overall network, it is aimed at generating richer embedded tokens on the basis of the aforementioned extracted single-modal feature vectors.
The feature embedding network comprises: the image global feature embedding network E_visual-global, the image local feature embedding network E_visual-noun corresponding to noun semantics, the image local feature embedding network E_visual-adj corresponding to adjective semantics, the text global feature embedding network E_textual-global, the text local feature embedding network E_textual-noun corresponding to noun semantics, and the text local feature embedding network E_textual-adj corresponding to adjective semantics. The six sub-networks of the feature embedding network are independent and unconnected.
In a preferred embodiment, the global and local embedding networks are not shared: a global feature is fed only into the global feature embedding network of its modality, and a local feature only into the corresponding local feature embedding network of its modality. The first element of every image feature embedding network is a fourth large block of Resnet50 (block_4) with its own parameters. The global and local embedded features of the image are therefore obtained by passing the original image through a complete Resnet50, the difference being that the fourth blocks are independent of one another and produce their respective features under the constraint of the corresponding loss functions; the first three blocks are shared between global and local features, and this shared network lets the alignment of local features constrain the extraction of global features during back-propagation.
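A sketch of the six embedding sub-networks under stated assumptions: each image branch is an independent copy of ResNet-50's fourth stage followed by pooling and a linear layer down to 256 dimensions (average pooling for the global branch, max pooling for the local branches, matching the notation explained in S5), while each text branch is a linear projection of the 512-dimensional Bi-LSTM feature to 256 dimensions. The class names and the use of copy.deepcopy to create independent block_4 parameters are illustrative choices.

```python
import copy
import torch.nn as nn
from torchvision.models import resnet50

EMBED_DIM = 256

class ImageEmbedNet(nn.Module):
    """One image embedding branch: its own block_4, then pooling and a linear layer."""
    def __init__(self, block4: nn.Module, pool: str = "avg"):
        super().__init__()
        self.block4 = block4                          # independent copy, own parameters
        self.pool = nn.AdaptiveAvgPool2d(1) if pool == "avg" else nn.AdaptiveMaxPool2d(1)
        self.linear = nn.Linear(2048, EMBED_DIM)      # ResNet-50 layer4 outputs 2048 channels

    def forward(self, v_global):                      # (B, 1024, 24, 8) from F_visual
        x = self.pool(self.block4(v_global)).flatten(1)   # (B, 2048)
        return self.linear(x)                             # (B, 256)

class TextEmbedNet(nn.Module):
    """One text embedding branch: a linear projection of the 512-d text feature."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(512, EMBED_DIM)

    def forward(self, t_feat):                        # (B, 512)
        return self.linear(t_feat)                    # (B, 256)

# Six independent sub-networks; each image branch gets its own copy of block_4.
layer4 = resnet50(weights=None).layer4
E_visual_global = ImageEmbedNet(copy.deepcopy(layer4), pool="avg")
E_visual_noun   = ImageEmbedNet(copy.deepcopy(layer4), pool="max")
E_visual_adj    = ImageEmbedNet(copy.deepcopy(layer4), pool="max")
E_textual_global, E_textual_noun, E_textual_adj = TextEmbedNet(), TextEmbedNet(), TextEmbedNet()
```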
Based on the above S4, S5 is executed to generate the embedded features. Specifically, as a preferred embodiment, performing S5 comprises:
S501, inputting the global image feature obtained in S3 into the feature embedding networks E_visual-global, E_visual-noun and E_visual-adj of S4 to obtain the global image embedded feature v_global-embed and the local image embedded features v_noun-embed and v_adj-embed. The input-output relations are:
v_global-embed = E_visual-global(v_global) = Linear_visual-global(Avgpool(block_4-global(v_global)))
v_noun-embed = E_visual-noun(v_global) = Linear_visual-noun(Maxpool(block_4-noun(v_global)))
v_adj-embed = E_visual-adj(v_global) = Linear_visual-adj(Maxpool(block_4-adj(v_global)))
v_noun-embed and v_adj-embed are then passed through a deconvolution operation to obtain the class predictions x_noun and x_adj of the image local features; the segmentation losses L_noun-seg and L_adj-seg between x_noun, x_adj and the segmentation masks produced by the human body segmentation network are computed later (a sketch of this step is given at the end of S5).
S502, inputting the global and local text features obtained in S3 into the feature embedding networks E_textual-global, E_textual-noun and E_textual-adj of S4 to obtain the global text embedded feature t_global-embed and the local text embedded features t_noun-embed and t_adj-embed. The input-output relations are:
t_global-embed = E_textual-global(t_global) = Linear_textual-global(t_global)
t_noun-embed = E_textual-noun(t_noun) = Linear_textual-noun(t_noun)
t_adj-embed = E_textual-adj(t_adj) = Linear_textual-adj(t_adj)
In these input-output formulas, block_4 denotes the fourth group of large blocks of Resnet50, Avgpool denotes average pooling, Maxpool denotes maximum pooling, Linear denotes a fully connected layer, and partname denotes the corresponding human body part (head / upper / lower / shoes / backpack) that indexes the per-part components of the local features. The dimension of both the global and the local embedded features is 256.
This embodiment makes full use of the text information when dividing the local features. On top of the existing division of local features by human body parts, the local information is further divided, according to the different semantics carried by words of different parts of speech, into two groups corresponding to noun and adjective meanings, producing finer-grained local features based on both body parts and semantics. Since the underlying network of the global and local features is shared, the alignment of the local features can, through the back-propagation optimization iterations, assist the global branch in extracting better-aligned cross-modal global embedded features.
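The sketch below illustrates the deconvolution step of S501 under stated assumptions: a small transposed-convolution head upsamples the local block_4 feature map (taken before pooling, which is an assumption) into per-pixel body-part class scores standing in for x_noun or x_adj, and a pixel-wise cross-entropy against the body-part mask I_local-label stands in for the segmentation loss. The head's depth, channel widths, class count and the cross-entropy choice are all assumptions; the patent only states that a deconvolution produces the class predictions and that a segmentation loss against the body-part masks is computed.

```python
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 6  # five body parts plus background (the background class is an assumption)

class PartSegHead(nn.Module):
    """Deconvolution head: upsamples a local feature map to per-pixel part scores."""
    def __init__(self, in_channels=2048):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, NUM_CLASSES, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, feat_map):             # e.g. (B, 2048, 12, 4) from the local block_4
        return self.deconv(feat_map)         # (B, 6, 48, 16) class scores, i.e. x_noun / x_adj

def segmentation_loss(x_part, part_mask):
    """Assumed pixel-wise cross-entropy between class scores and the body-part mask."""
    # part_mask: (B, H, W) integer labels from the human body segmentation network,
    # resized here to the resolution of the prediction.
    mask = F.interpolate(part_mask.unsqueeze(1).float(), size=x_part.shape[-2:],
                         mode="nearest").squeeze(1).long()
    return F.cross_entropy(x_part, mask)
```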
Based on the above S5, S6 is executed, and training is performed with the overall loss L.
Specifically, as a preferred embodiment, executing S6 includes:
the overall loss function is:
Figure BDA0003353490320000097
Figure BDA0003353490320000098
wherein L isglobal-alignFor constraining the similarity of the global embedded features between modalities,
Figure BDA0003353490320000099
and
Figure BDA00033534903200000910
respectively restricting the similarity of local embedding characteristics corresponding to nouns and adjective meanings between modes,
Figure BDA00033534903200000911
and
Figure BDA00033534903200000912
the image local embedding characteristics used for guaranteeing extraction correspond to five body parts of a human body. Lambda [ alpha ]1And λ2Representing the weight of the corresponding loss component, in this embodiment λ1Take 0.80, λ2Take 0.65.
Each alignment loss component is computed over a training batch from the similarities of positive (matched) and negative (unmatched) image-text pairs. In these expressions, N denotes the batch size, set to 64 in this embodiment; τ_p and τ_n are temperature parameters that adjust the contribution of positive and negative samples to the gradient, and in this embodiment every τ_p is set to 10 and every τ_n to 40; s_i denotes the dot product between the embedded features of the two modalities at the corresponding level, with the superscript '+' marking positive sample pairs and the superscript '-' marking negative sample pairs:
s_i^global = v_global-embed · t_global-embed
s_i^noun = v_noun-embed · t_noun-embed
s_i^adj = v_adj-embed · t_adj-embed
through the joint constraint of the five loss functions, the constraint on the model is enhanced, and the network is promoted to converge towards the optimal direction.
The invention provides a specific application embodiment for pedestrian image search. FIG. 2 is a schematic diagram of the pedestrian image searching method based on semantic division and visual-text attribute alignment according to this embodiment. The CUHK-PEDES data set contains 40206 pictures (34054 for training, 3078 for validation and 3074 for testing) and 80440 natural language text descriptions (68126 for training, 6158 for validation and 6156 for testing); each picture corresponds to about two natural language descriptions on average.
Experiments show that, assisted by the fine-grained local features, the method extracts better-aligned global features and improves the accuracy of cross-modal retrieval. Table 2 and FIG. 3 compare the performance of the method of this embodiment with different algorithms on CUHK-PEDES. FIG. 3 is a cumulative match characteristic (CMC) curve; from top to bottom, the nine curves represent the method of this embodiment and the other methods.
TABLE 2 Retrieval accuracy of the method of this embodiment and of the compared algorithms on the CUHK-PEDES data set
It can be seen that the results obtained by this embodiment greatly improve the performance of cross-modal pedestrian retrieval.
In summary, the method of this embodiment makes full use of the semantic information in the text and generates finer-grained local features on top of the existing division of local features by human body parts. It fully exploits the correspondence between local features and assists the global branch (composed of the global feature extraction network and the global feature embedding network) in extracting better-aligned global cross-modal embedded features. Through the alignment constraint on the local features, the network is helped to learn better-aligned global features; through the joint constraint of the global alignment loss, the local alignment losses and the segmentation losses, the model converges toward the optimum, and the retrieval performance of pedestrian image search based on natural language description is improved.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The above-described preferred features may be used in any combination without conflict with each other.

Claims (10)

1. A pedestrian image searching method based on semantic division and visual-text attribute alignment, characterized by comprising the following steps:
processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets;
extracting features from the data sets with an image feature extraction network and a text feature extraction network, respectively, to obtain the corresponding global and local features within each single modality;
converting the global and local single-modality features into embedded features of the corresponding modality with embedding networks;
training the model under the joint constraint of its overall loss;
and searching pedestrian images with the trained model.
2. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 1, wherein processing the original data of the image modality and the text modality to obtain the global image data set and the global and local text data sets comprises:
obtaining a global image representation I_global of the human body and a global text representation T_global;
dividing the original data of the image modality and the text modality according to human body parts and word parts of speech, which comprises:
generating the image segmentation mask I_local-label based on human body parts with an existing human body segmentation network;
obtaining the textual representation of each body part, namely the noun-based local text description T_noun and the adjective-based local text description T_adj, with a word-to-body-part correspondence table together with the existing natural language processing toolkit NLTK.
3. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 2, wherein
the human body parts comprise: head, upper body, lower body, shoes and backpack;
and the word parts of speech comprise: nouns and adjectives.
4. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 1, wherein the image feature extraction network F_visual consists of the first three blocks of Resnet50, with its output dimension set to 1024;
the text feature extraction network F_textual is a Bi-LSTM, with its output dimension set to 512;
and the image feature extraction network F_visual and the text feature extraction network F_textual are shared between the global and local features.
5. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 4, wherein extracting features from the data sets with the image feature extraction network and the text feature extraction network, respectively, to obtain the corresponding global and local features within each single modality comprises:
feeding the global image data and the global and local text data into the feature extraction network of the corresponding modality to extract the corresponding single-modality features, which comprises:
passing the global image I_global through the image feature extraction network F_visual to form the global image feature v_global;
passing the global text T_global, the noun-based local text description T_noun and the adjective-based local text description T_adj independently through F_textual to obtain the global text feature t_global, the noun-based local text feature t_noun and the adjective-based local text feature t_adj.
6. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 2, wherein the feature embedding network comprises six sub-networks:
the image global feature embedding network E_visual-global, the image local feature embedding network E_visual-noun corresponding to noun semantics, the image local feature embedding network E_visual-adj corresponding to adjective semantics, the text global feature embedding network E_textual-global, the text local feature embedding network E_textual-noun corresponding to noun semantics, and the text local feature embedding network E_textual-adj corresponding to adjective semantics;
and wherein the global and local multi-modal embedded features comprise:
the global image embedded feature v_global-embed, the local image embedded features v_noun-embed and v_adj-embed, the global text embedded feature t_global-embed, and the local text embedded features t_noun-embed and t_adj-embed.
7. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 6, wherein converting the global and local single-modality features into embedded features of the corresponding modality with the embedding networks comprises:
passing the global image feature v_global through the image global feature embedding network E_visual-global to obtain the global image embedded feature v_global-embed;
passing the global image feature v_global through the image local feature embedding network E_visual-noun corresponding to noun semantics to obtain the local image embedded feature v_noun-embed;
passing the global image feature v_global through the image local feature embedding network E_visual-adj corresponding to adjective semantics to obtain the local image embedded feature v_adj-embed;
passing the global text feature t_global through the text global feature embedding network E_textual-global to obtain the global text embedded feature t_global-embed;
passing the noun-based local text feature t_noun through the text local feature embedding network E_textual-noun corresponding to noun semantics to obtain the local text embedded feature t_noun-embed;
passing the adjective-based local text feature t_adj through the text local feature embedding network E_textual-adj corresponding to adjective semantics to obtain the local text embedded feature t_adj-embed;
and passing the local image embedded features v_noun-embed and v_adj-embed through a deconvolution operation to obtain the class predictions x_noun and x_adj of the image local features.
8. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 6 or 7, wherein, for each modality, the six sub-networks of the feature embedding network are independent of each other and do not share parameters;
and the first element of every image feature embedding network is a fourth large block (block_4) of Resnet50 with its own parameters.
9. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 8, wherein
the overall loss comprises: a global alignment loss, local alignment losses and segmentation losses.
10. The pedestrian image searching method based on semantic division and visual-text attribute alignment according to claim 9, wherein
the overall loss function is:
L = L_global-align + λ1(L_noun-align + L_adj-align) + λ2(L_noun-seg + L_adj-seg)
wherein L_global-align is the global alignment loss, which constrains the similarity of the global embedded features between the modalities; L_noun-align and L_adj-align are the local alignment losses, which respectively constrain the similarity of the local embedded features corresponding to noun and adjective semantics between the modalities; L_noun-seg and L_adj-seg are the segmentation losses, which ensure that the extracted local image embedded features correspond to the five body parts of the human body; and λ1 and λ2 are the weights of the corresponding loss components.
CN202111344497.5A 2021-11-15 2021-11-15 Pedestrian image searching method based on semantic division and visual-text attribute alignment Pending CN114036336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111344497.5A CN114036336A (en) Pedestrian image searching method based on semantic division and visual-text attribute alignment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111344497.5A CN114036336A (en) Pedestrian image searching method based on semantic division and visual-text attribute alignment

Publications (1)

Publication Number Publication Date
CN114036336A true CN114036336A (en) 2022-02-11

Family

ID=80137602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111344497.5A Pending CN114036336A (en) Pedestrian image searching method based on semantic division and visual-text attribute alignment

Country Status (1)

Country Link
CN (1) CN114036336A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821770A (en) * 2022-04-11 2022-07-29 华南理工大学 Text-to-image cross-modal pedestrian re-identification method, system, medium, and apparatus
CN114821770B (en) * 2022-04-11 2024-03-26 华南理工大学 Cross-modal pedestrian re-identification method, system, medium and device from text to image
CN115292533A (en) * 2022-08-17 2022-11-04 苏州大学 Cross-modal pedestrian retrieval method driven by visual positioning
CN115292533B (en) * 2022-08-17 2023-06-27 苏州大学 Cross-modal pedestrian retrieval method driven by visual positioning
CN115761222A (en) * 2022-09-27 2023-03-07 阿里巴巴(中国)有限公司 Image segmentation method, remote sensing image segmentation method and device
CN115761222B (en) * 2022-09-27 2023-11-03 阿里巴巴(中国)有限公司 Image segmentation method, remote sensing image segmentation method and device
CN116228897A (en) * 2023-03-10 2023-06-06 北京百度网讯科技有限公司 Image processing method, image processing model and training method
CN116228897B (en) * 2023-03-10 2024-04-23 北京百度网讯科技有限公司 Image processing method, image processing model and training method
WO2024114185A1 (en) * 2023-07-24 2024-06-06 西北工业大学 Pedestrian attribute cross-modal alignment method based on complete attribute identification enhancement
CN117391092A (en) * 2023-12-12 2024-01-12 中南大学 Electronic medical record multi-mode medical semantic alignment method based on contrast learning
CN117391092B (en) * 2023-12-12 2024-03-08 中南大学 Electronic medical record multi-mode medical semantic alignment method based on contrast learning

Similar Documents

Publication Publication Date Title
CN114036336A (en) Pedestrian image searching method based on semantic division and visual-text attribute alignment
Li et al. Know more say less: Image captioning based on scene graphs
Gu et al. An empirical study of language cnn for image captioning
CN112000818B (en) Text and image-oriented cross-media retrieval method and electronic device
Zhu et al. Content-based visual landmark search via multimodal hypergraph learning
US20200327327A1 (en) Providing a response in a session
CN113065577A (en) Multi-modal emotion classification method for targets
WO2019019935A1 (en) Interaction method, interaction terminal, storage medium, and computer device
Islam et al. Exploring video captioning techniques: A comprehensive survey on deep learning methods
Muhammad et al. Casia-face-africa: A large-scale african face image database
Pappas et al. Multilingual visual sentiment concept matching
Zhou Generative adversarial network for text-to-face synthesis and manipulation
CN116737979A (en) Context-guided multi-modal-associated image text retrieval method and system
Emami et al. Arabic image captioning using pre-training of deep bidirectional transformers
Bansal et al. Visual content based video retrieval on natural language queries
CN113157974B (en) Pedestrian retrieval method based on text expression
CN110659392A (en) Retrieval method and device, and storage medium
CN117057349A (en) News text keyword extraction method, device, computer equipment and storage medium
Wang et al. A novel semantic attribute-based feature for image caption generation
CN115409107A (en) Training method of multi-modal association building model and multi-modal data retrieval method
Attai et al. A survey on arabic image captioning systems using deep learning models
CN115081445A (en) Short text entity disambiguation method based on multitask learning
Upadhyay et al. Mood based music playlist generator using convolutional neural network
Kansal et al. Hierarchical attention image-text alignment network for person re-identification
Runyan et al. A Survey on Learning Objects’ Relationship for Image Captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination