CN114119975A - Language-guided cross-modal instance segmentation method - Google Patents

Language-guided cross-modal instance segmentation method

Info

Publication number
CN114119975A
Authority
CN
China
Prior art keywords
modal
features
word
language
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111408303.3A
Other languages
Chinese (zh)
Inventor
王蓉
张文靖
李冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA
Original Assignee
PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA filed Critical PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA
Priority to CN202111408303.3A
Publication of CN114119975A
Legal status: Pending

Classifications

    • G06F 18/214: Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/253: Pattern recognition; Analysing; Fusion techniques of extracted features
    • G06N 3/044: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/048: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Activation functions
    • G06N 3/08: Computing arrangements based on biological models; Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a language-guided cross-modal instance segmentation method comprising the following steps: extracting visual features from the image and language features from the referring expression; capturing global and local detail information of the visual features with an MSFR module and fusing them to enhance the representation capability of the visual features; aligning the entity words of the referring expression with the important image regions with a CMA module, highlighting all entities in the image; and generating the multi-modal feature representation used for mask prediction with a VR module. By constructing a CMVR model for instance segmentation, the method obtains more discriminative feature representations; it captures local and global detail information and attends to details; it adaptively aligns the informative keywords of the expression with the important regions of the input image, facilitating matching between the features of different modalities; and, by reasoning over a graph with the help of the relation words, it realizes instance segmentation effectively.

Description

Language-guided cross-modal instance segmentation method
Technical Field
The invention relates to instance segmentation methods, and in particular to a language-guided cross-modal instance segmentation method.
Background
Instance segmentation based on natural language descriptions, technically known as Referring Image Segmentation (RIS), is a challenging problem. It differs from traditional semantic segmentation in computer vision: instead of being restricted to a fixed set of target semantics and classes, the object to be segmented is specified by a natural language expression, and the central challenge is the gap between the vision and language modalities. The key to RIS is to learn effective multi-modal features across the visual and linguistic modalities in order to accurately identify the referred target. The task has wide application in human-computer interaction, scene perception, image retrieval, and the localization of specific targets in video surveillance, and is one of the important topics in computer vision and pattern recognition.
Earlier methods for this task used Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks to extract visual and linguistic single-modal features, and then fused the visual and language features by concatenation and convolution to generate a segmentation mask. However, cross-modal matching between visual and linguistic features is ineffective in these methods, because they model the linguistic and visual information separately and lack sufficient interaction between the two modalities. Attention mechanisms were later introduced to adaptively focus on informative keywords in the linguistic expression and important regions in the image. However, these methods do not fully exploit the context of the words in the expression, making it difficult to discriminate between multiple entities of the same category. Recent approaches mainly learn multi-modal context representations by using different types of informative keywords in the expression. Although these methods have made great progress, none of them has effective visual reasoning capability for obtaining discriminative multi-modal features.
Disclosure of Invention
In order to overcome the above shortcomings of the prior art, the invention provides a language-guided cross-modal instance segmentation method.
To solve the technical problem, the invention adopts the following technical scheme: a language-guided cross-modal instance segmentation method comprising the following four steps:
step 1, visual features are extracted from an input image through a CNN encoder, and language features are extracted from an expression through a language encoder;
step 2, introducing an MSFR module, which obtains multi-scale refined information, including global and local multi-scale detail information, from the visual features extracted by the CNN encoder and fuses this information to enhance the representation capability of the visual features;
step 3, constructing a CMA (cross-modal attention) module, performing cross-modal interaction between the visual features obtained by the CNN encoder and the language features obtained by the language encoder, computing cross-modal attention, and aligning the informative keywords in the expression with the corresponding regions of the input image;
and 4, fusing the visual features enhanced by the MSFR module with the language features obtained by the language encoder, inputting the fused features into the VR module, and generating the multi-modal feature representation for mask prediction.
Further, in step 1, the specific process of visual feature extraction is as follows:
1.1 A 3-channel image I ∈ R^(W×H×3) is input, where W and H are the width and height of the image and R denotes the real-number space of dimension W×H×3. Visual features x_I ∈ R^(w×h×D_v) are extracted by the CNN backbone network, where w, h and D_v are the width, height and channel dimension of the visual features. To preserve spatial coordinate information, the visual feature x_I is concatenated with an 8-dimensional spatial coordinate feature, and a 1×1 convolution then converts the concatenated feature to the visual feature X ∈ R^(w×h×D_x).
Further, in step 1, the specific process of language feature extraction is as follows:
1.2 The input expression S = {s_l}, l = 1, ..., L, is fed into the language encoder, where s_l is the l-th word and L is the sentence length. First, a pre-trained word embedding model is used to initialize the embedding of each word, as shown in formula 1:
e_l = embedding(s_l)   (formula 1)
where embedding(·) denotes the pre-trained word embedding model and e_l denotes the word embedding of the l-th word. The word embeddings are then fed into a bidirectional LSTM network to encode the context of each word; the context representation h_l of a word is obtained by concatenating its forward and backward hidden state vectors, as shown in formula 2:
h_l = [h_l^fw ; h_l^bw]   (formula 2)
where h_l^fw and h_l^bw denote the hidden states in the forward and backward directions of the bidirectional LSTM network. The language feature of the whole expression is H = {h_l} ∈ R^(L×D_h), where D_h denotes the dimension of the word context and h_l is the hidden state of the l-th word.
Further, in step 2, the MSFR module is constructed to capture the detail information of the visual feature x_I. The module consists of a Local PyConv branch and a Global PyConv branch: the Local PyConv branch is responsible for capturing local detail information, and the Global PyConv branch captures global detail information. The output features of the Local PyConv branch and the Global PyConv branch are fused by a 1×1 convolution to obtain a 1024-dimensional feature map containing both local and global detail information.
Further, the specific processing procedure of step 3 is as follows:
3.1 All the words in the sentence are divided into four types (entities, attributes, positional relationships and unnecessary words), and a four-dimensional vector is computed to represent the probability that a word belongs to each of the four types. The probability vector is defined as shown in formula 3:
q_l = softmax(W_1 σ(W_0 h_l + b_0) + b_1)   (formula 3)
where W_0, W_1, b_0 and b_1 are trainable parameters, σ denotes an activation function, softmax(·) denotes the normalized exponential function, and D_l and D_h denote the number of word types and the feature dimension respectively; the components q_l^ent, q_l^att, q_l^rel and q_l^unn of q_l represent the probabilities that the word s_l is an entity, attribute, relationship or unnecessary word respectively.
3.2 Based on the word features {h_l} and the entity and attribute words (q_l^ent, q_l^att), the cross-modal attention between each word and the different image regions is computed. The weighted normalized attention of a word is defined as shown in formulas 4 and 5:
[formula 4: cross-modal attention α_l]
[formula 5: weighted normalized attention β_l]
where W_x and W_h are transformation matrices, D_n denotes a hyperparameter, D_x and D_h denote the dimensions of the visual and word features respectively, α_l denotes the cross-modal attention, and β_l is the weighted normalized attention representing the weighted probability that the word s_l refers to the relevant image regions. The attention-weighted features of all the words are aggregated to compute the language feature C of the corresponding entities:
[formula 6: entity language feature C aggregated from the attention-weighted word features]
where h_l is the hidden state of the l-th word.
Further, the specific processing procedure of step 4 is as follows:
4.1 The global language feature C of the entities and the visual feature X are fused to obtain the multi-modal feature Y, as shown in formula 7:
Y = L2Norm(tanh(W_p C) ⊙ tanh(W_X X))   (formula 7)
where W_p ∈ R^(D_n×D_m) and W_X ∈ R^(D_v×D_m) are trainable parameters, ⊙ denotes element-wise multiplication of the corresponding matrix entries, and L2Norm(·) denotes L2 normalization; D_n, D_m and D_v denote the dimensions of the global language feature, the multi-modal feature and the visual feature respectively, and R denotes the real-number space of dimension D_n×D_m or D_v×D_m.
4.2 The multi-modal feature Y is reshaped to size C×W×H, where H, W and C denote the height, width and number of channels of Y, and the total number of nodes of the multi-modal graph G is K = W×H. Each position of the multi-modal feature is treated as an image region, each image region is represented as a node of the multi-modal graph G, and the multi-modal feature Y is assigned as the node features F. The relation-word weighted normalized context on the edges of the multi-modal graph is defined as shown in formula 8:
[formula 8: relation-word weighted normalized context r_l]
where the weight matrices are trainable parameters, D_e denotes a hyperparameter, D_l and D_h denote the number of word types and the feature dimension respectively, and δ denotes an activation function.
The affinity matrix between the multi-modal features of the nodes and the weighted relation context of the words is defined as shown in formula 9:
[formula 9: affinity matrix A_F between the node multi-modal features and the weighted relation context of the words]
where the transformation matrices are trainable parameters and the feature vector of the positional relation words is used as input.
The normalized affinity matrices are then given by formulas 10 and 11:
A_1 = softmax(A_F)   (formula 10)
A_2 = softmax(A_F^T)   (formula 11)
where A_1 and A_2 are obtained by normalizing the affinity matrix A_F along its rows and along its columns respectively. The adjacency matrix is then given by formula 12:
A = A_1 A_2   (formula 12)
where the element a_i,j of A represents the normalized amount of information flow between node i and node j of the multi-modal graph G.
4.3 A gate is generated for the nodes and for the edges of the multi-modal graph respectively, so as to restrict the information propagation between nodes. The gate of the edges, g_v, is defined as the sum of the weighted normalized attention values of the individual words in the expression, as shown in formula 13:
[formula 13: edge gate g_v, the sum of the weighted normalized attention values of the words]
where β_l denotes the weighted normalized attention.
The gate of the nodes, g_e, is defined as the sum of all the weighted probabilities of the individual words in the expression, as shown in formula 14:
[formula 14: node gate g_e, the sum of the weighted probabilities of the words]
where r_l denotes the relation-word weighted normalized context on the edges of the multi-modal graph.
4.4 The adjacency matrix A is multiplied by the gate of the edges g_v, and the node features F are multiplied by the gate of the nodes g_e. The node features are then updated by graph convolution operations, as shown in formulas 15-17:
[formulas 15-17: gated graph convolution update of the node features]
where the weight matrices and the biases b_1 and b_2 are all trainable parameters, the two intermediate quantities denote the features of the bidirectional relations, σ(·) denotes the sigmoid activation function, the result of the t-th step is the t-th reasoning multi-modal feature, and A^T denotes the transpose of matrix A. Through multiple graph convolution operations, a more discriminative multi-modal feature representation is generated.
4.5 The multi-modal features are upsampled on the multi-modal feature map by an ASPP decoder using deconvolution, and the segmentation mask is predicted. The whole model is then trained end to end with a pixel-level cross-entropy loss between the predicted segmentation mask and the Ground Truth segmentation mask.
Further, the end-to-end training is divided into two stages:
1) training is first performed at a low-resolution scale without upsampling, and then at high resolution; the basic parameters are set as follows: the word embedding size and the hidden state size are both set to 1024, and the number of dynamic filters is set to 10;
2) training uses the Adam optimizer with an initial learning rate of 1×10^-5 and a batch size of 1; the graph convolution is set to 3 layers, and the loss function is the cross-entropy loss.
The invention discloses a language-guided cross-modal instance segmentation method that obtains more discriminative feature representations for instance segmentation by constructing a cross-modal attention-guided visual reasoning (CMVR) model. First, the model captures local and global detail information and pays more attention to details. Second, the cross-modal attention module adaptively aligns the informative keywords of the expression with the important regions of the input image, facilitating the matching between the features of different modalities. Meanwhile, the constructed visual reasoning module performs graph-based reasoning with the help of the relation words and distinguishes the designated object from other objects of the same category, so that instance segmentation is realized more effectively.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a general framework diagram of the cross-modal attention-directed visual inference model of the present invention.
FIG. 3 is a schematic diagram of multi-scale feature optimization according to the present invention.
FIG. 4 is a schematic diagram of the visual inference segmentation of the present invention.
FIG. 5 is a schematic view of the attention visualization of the present invention.
FIG. 6 is a schematic diagram of the output prediction under the fixed image and the changed language expression according to the present invention.
FIG. 7 is a diagram illustrating the results of the language-guided cross-modal example segmentation in accordance with the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention provides a language-guided cross-modal instance segmentation method that constructs a Cross-Modal Attention Guided Visual Reasoning model (CMVR) composed of three modules. First, a Multi-Scale Feature Refinement (MSFR) module is constructed; it handles both small and large targets, effectively captures multi-scale local and global refined visual features, and fuses them into multi-scale features, thereby enhancing the visual representation. Second, a Cross-Modal Attention (CMA) module is introduced to adaptively align the informative keywords of the expression with the important information in the image; the CMA module highlights all the entities of the linguistic description by computing a weighted normalized attention value for each word, and at the same time computes the relational context of the words in the expression, which serves as a relational cue associated with the related entities in the image. Finally, a Visual Reasoning (VR) module is provided; it constructs a fully connected multi-modal graph from the multi-modal features and the relational cues of the expression, performs visual reasoning on the graph step by step, highlights the correct entities and suppresses the other unrelated entities; the VR module facilitates the generation of a more discriminative multi-modal feature representation for referring segmentation. The language-guided cross-modal instance segmentation method provided by the invention is therefore based on a pre-trained Chinese word embedding matrix, makes full use of the semantic information of the target, combines it with the visual features of the target to form cross-modal attention, and realizes cross-modal instance segmentation based on Chinese context.
As shown in fig. 1, the overall flow framework of the language-guided cross-modal instance segmentation method of the present invention includes the following four parts:
firstly, visual features are extracted from an input image through a CNN encoder, and language features are extracted from an expression through a language encoder;
secondly, introducing an MSFR module, which obtains multi-scale refined information, including global and local multi-scale detail information, from the visual features extracted by the CNN encoder and fuses this information to enhance the representation capability of the visual features;
thirdly, constructing a CMA (cross-modal attention) module, performing cross-modal interaction between the visual features obtained by the CNN encoder and the language features obtained by the language encoder, computing cross-modal attention, and aligning the informative keywords in the expression with the corresponding regions of the input image;
and fourthly, fusing the visual features enhanced by the MSFR module with the language features obtained by the language encoder, inputting the fused features into the VR module, and generating the multi-modal feature representation for mask prediction, i.e. finally outputting the mask map of the segmented target.
As shown in fig. 2, which gives the overall framework of the cross-modal attention-guided visual reasoning model proposed by the invention, the model takes an image and a referring expression as input. First, a CNN encoder (a ResNet101 structure can be used) extracts the visual features (feature maps) of the image, and the spatial location features are concatenated to them to obtain a new visual feature vector. Then, a pre-trained language encoder (word embedding model) processes the referring expression to obtain the word embedding vectors, and the word vectors are fed into the LSTM network to obtain the language feature vectors. The MSFR module takes the visual feature maps as input, obtains multi-scale detail information, namely global multi-scale features (global feature maps) and local multi-scale features (local feature maps), from the extracted visual feature maps, and concatenates and fuses the global and local multi-scale features to enhance the representation capability of the features. The language features and the visual features extracted by the CNN encoder are used as inputs to the CMA module to capture the inter-modal dependencies between the keywords of the referring expression and the different image regions. The CMA module attends to all the entities mentioned in the expression and therefore helps to align the entity words of the expression with the important regions of the image. Then, a fully connected multi-modal graph is constructed on the spatial regions of the image; in the VR module, the multi-modal information of the nodes is propagated through the edge connections between the nodes, and visual reasoning is performed on the multi-modal graph by graph convolution according to the relations in the expression, effectively highlighting the related entities that match the language description and suppressing the other unrelated entities, so as to generate a more discriminative multi-modal feature representation. The multi-modal features are then upsampled by an ASPP (Atrous Spatial Pyramid Pooling) decoder using deconvolution, and the segmentation mask is predicted.
For the CNN encoder and the language encoder, the specific process of feature extraction is as follows:
1) A 3-channel image I ∈ R^(W×H×3) is input, where W and H are the width and height of the image and R denotes the real-number space of dimension W×H×3. Visual features x_I ∈ R^(w×h×D_v) are extracted by the CNN backbone network, where w, h and D_v are the width, height and channel dimension of the visual features. To preserve spatial coordinate information, the visual feature x_I is concatenated with an 8-dimensional spatial coordinate feature, and a 1×1 convolution then converts the concatenated feature to the visual feature X ∈ R^(w×h×D_x).
2) The input expression S = {s_l}, l = 1, ..., L, is fed into the language encoder, where s_l is the l-th word and L is the sentence length. First, a pre-trained word embedding model is used to initialize the embedding of each word, as shown in formula 1:
e_l = embedding(s_l)   (formula 1)
where embedding(·) denotes the pre-trained word embedding model and e_l denotes the word embedding of the l-th word. The word embeddings are then fed into a bidirectional LSTM network to encode the context of each word; the context representation h_l of a word is obtained by concatenating its forward and backward hidden state vectors, as shown in formula 2:
h_l = [h_l^fw ; h_l^bw]   (formula 2)
where h_l^fw and h_l^bw denote the hidden states in the forward and backward directions of the bidirectional LSTM network. The language feature of the whole expression is H = {h_l} ∈ R^(L×D_h), where D_h denotes the dimension of the word context and h_l is the hidden state of the l-th word.
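The two encoders can be sketched roughly as follows. This is a minimal illustration rather than the exact implementation: the ResNet101 output width, the particular 8-channel coordinate map and the names VisualEncoder, LanguageEncoder and coord_map are assumptions; only the overall structure (CNN backbone, 8-dimensional coordinate concatenation, 1×1 projection, word embedding followed by a bidirectional LSTM) follows the text.

```python
# Minimal sketch of the visual and language encoders (assumptions noted above).
import torch
import torch.nn as nn
import torchvision

def coord_map(h, w):
    """8-dimensional spatial coordinate feature of shape (8, h, w)."""
    ys = torch.linspace(-1, 1, h).view(h, 1).expand(h, w)
    xs = torch.linspace(-1, 1, w).view(1, w).expand(h, w)
    feats = [xs, ys, xs * ys, xs ** 2, ys ** 2,
             torch.full((h, w), 1.0 / w), torch.full((h, w), 1.0 / h),
             torch.ones(h, w)]
    return torch.stack(feats, dim=0)

class VisualEncoder(nn.Module):
    def __init__(self, d_x=1024):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep the conv feature map
        self.proj = nn.Conv2d(2048 + 8, d_x, kernel_size=1)        # 1x1 conv after coordinate concat

    def forward(self, image):                      # image: (B, 3, H, W)
        x_i = self.cnn(image)                      # (B, 2048, h, w)
        b, _, h, w = x_i.shape
        coords = coord_map(h, w).to(image.device).expand(b, -1, -1, -1)
        return self.proj(torch.cat([x_i, coords], dim=1))   # X: (B, D_x, h, w)

class LanguageEncoder(nn.Module):
    def __init__(self, vocab_size, d_emb=1024, d_h=1024):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_emb)   # can be loaded from a pre-trained matrix
        self.lstm = nn.LSTM(d_emb, d_h // 2, bidirectional=True, batch_first=True)

    def forward(self, tokens):                     # tokens: (B, L) word indices
        e = self.embedding(tokens)                 # formula 1: word embeddings
        h, _ = self.lstm(e)                        # formula 2: forward/backward states concatenated
        return h                                   # (B, L, D_h)
```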
As shown in fig. 3, the multi-scale feature refinement (MSFR) module captures the detail information of the visual feature x_I. The module consists of a Local PyConv (local pyramid convolution) branch and a Global PyConv (global pyramid convolution) branch, and each branch involves kernels of different sizes. The Local PyConv branch is mainly responsible for capturing local detail information, and the Global PyConv branch captures global detail information. The specific steps are as follows:
1) As shown in the upper dashed portion of fig. 3, the Local PyConv branch is mainly responsible for handling entities of smaller size and capturing local detail information. It contains four convolution layers with kernel sizes of 9×9, 7×7, 5×5 and 3×3 respectively; the kernel size increases from left to right while the kernel depth decreases. Before the feature maps are fed into the Local PyConv branch, a 1×1 convolution first reduces the number of feature maps to 512; pyramid convolutions with the different kernels are then applied to obtain local detail information at four scales; finally, a 1×1 convolution combines the information obtained at the different scales. After the four convolution layers, the branch also includes a normalization layer and a ReLU activation function.
2) As shown in the lower dashed portion of fig. 3, the Global PyConv branch handles entities of large size and captures global detail information, with kernels that can cover the whole feature map. First, adaptive average pooling reduces the spatial size of the input feature maps to 9×9, and a 1×1 convolution reduces the number of feature maps to 512; pyramid convolutions with different kernels are then applied in turn; finally, a 1×1 convolution fuses the information of the different scales, and the feature maps are restored to the original size by upsampling with bilinear interpolation. The output feature maps of the Local PyConv branch and the Global PyConv branch are concatenated to obtain 1024 feature maps containing local and global detail information.
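A rough sketch of the two branches is given below. The 512 intermediate channels, the 9×9 pooled size, the four kernel sizes and the 1024 fused channels follow the text, while the even channel split across the four kernels, the input channel count and the names PyConv and MSFR are assumptions.

```python
# Sketch of the MSFR module: Local and Global pyramid-convolution branches (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyConv(nn.Module):
    """Four parallel convolutions with kernels 9/7/5/3, outputs concatenated and fused by a 1x1 conv."""
    def __init__(self, channels=512):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels // 4, k, padding=k // 2) for k in (9, 7, 5, 3)
        ])
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

class MSFR(nn.Module):
    def __init__(self, in_channels=1024, mid=512, out_channels=1024):
        super().__init__()
        self.local_reduce = nn.Conv2d(in_channels, mid, 1)
        self.local_pyconv = nn.Sequential(PyConv(mid), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.global_reduce = nn.Conv2d(in_channels, mid, 1)
        self.global_pyconv = PyConv(mid)
        self.fuse = nn.Conv2d(2 * mid, out_channels, 1)     # fuse local + global into 1024 maps

    def forward(self, x):                                   # x: (B, C, h, w) visual feature map
        local = self.local_pyconv(self.local_reduce(x))
        g = F.adaptive_avg_pool2d(x, output_size=9)         # reduce spatial size to 9x9
        g = self.global_pyconv(self.global_reduce(g))
        g = F.interpolate(g, size=x.shape[-2:], mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([local, g], dim=1))
```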
In the CMA module, the specific processing procedure is as follows:
3.1 All the words in the sentence are divided into four types (entity, attribute, location, i.e. positional relationship, and unnecessary words), and a four-dimensional vector is computed to represent the probability that a word belongs to each of the four types. The probability vector is defined as shown in formula 3:
q_l = softmax(W_1 σ(W_0 h_l + b_0) + b_1)   (formula 3)
where W_0, W_1, b_0 and b_1 are trainable parameters, σ denotes an activation function, softmax(·) denotes the normalized exponential function, and D_l and D_h denote the number of word types and the feature dimension respectively; the components q_l^ent, q_l^att, q_l^rel and q_l^unn of q_l represent the probabilities that the word s_l is an entity, attribute, relationship or unnecessary word respectively.
3.2 Based on the word features {h_l} and the entity and attribute words (q_l^ent, q_l^att), the cross-modal attention between each word and the different image regions is computed. The weighted normalized attention of a word is defined as shown in formulas 4 and 5:
[formula 4: cross-modal attention α_l]
[formula 5: weighted normalized attention β_l]
where W_x and W_h are transformation matrices, D_n denotes a hyperparameter, D_x and D_h denote the dimensions of the visual and word features respectively, α_l denotes the cross-modal attention, and β_l is the weighted normalized attention representing the weighted probability that the word s_l refers to the relevant image regions. The attention-weighted features of all the words are aggregated to compute the language feature C of the corresponding entities:
[formula 6: entity language feature C aggregated from the attention-weighted word features]
where h_l is the hidden state of the l-th word.
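A compact sketch of the word-type classifier (formula 3) and of the word-to-region attention is given below. Since formulas 4-6 are only available as images here, the dot-product form of the attention, the weighting by the entity and attribute probabilities and the aggregation into C are assumed forms, and the class names WordClassifier and CrossModalAttention are illustrative.

```python
# Sketch of the CMA module: word-type probabilities and word-region attention (assumed forms).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordClassifier(nn.Module):
    """Formula 3: probability of each word being entity / attribute / relation / unnecessary."""
    def __init__(self, d_h=1024, d_mid=256):
        super().__init__()
        self.w0 = nn.Linear(d_h, d_mid)
        self.w1 = nn.Linear(d_mid, 4)

    def forward(self, h):                          # h: (B, L, D_h) word features
        return F.softmax(self.w1(torch.relu(self.w0(h))), dim=-1)   # q: (B, L, 4)

class CrossModalAttention(nn.Module):
    """Formulas 4-6 in an assumed dot-product form: attention of each word over image regions."""
    def __init__(self, d_x=1024, d_h=1024, d_n=512):
        super().__init__()
        self.wx = nn.Linear(d_x, d_n)
        self.wh = nn.Linear(d_h, d_n)

    def forward(self, x, h, q):                    # x: (B, N, D_x) regions, h: (B, L, D_h), q: (B, L, 4)
        alpha = torch.einsum('bnd,bld->bln', self.wx(x), self.wh(h))  # cross-modal attention (B, L, N)
        alpha = F.softmax(alpha / alpha.shape[-1] ** 0.5, dim=-1)
        beta = alpha * (q[..., 0] + q[..., 1]).unsqueeze(-1)          # weight by entity+attribute prob.
        c = torch.einsum('bln,bld->bnd', beta, h)                     # aggregate word features per region
        return c                                                      # language feature C: (B, N, D_h)
```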
Fig. 4 shows the visual reasoning (VR) module, in which a fully connected multi-modal graph is constructed over the spatial regions of the image in order to effectively highlight the related entities that match the language description and suppress the other unrelated entities. The multi-modal graph is denoted G = (V, E, F, A), where V = {v_k} is the set of vertices containing the multi-modal information of the entities, v_k denotes the k-th node and K the number of nodes; E = {e_i,j} is the set of edges, where e_i,j is the edge between neighboring nodes i and j of the multi-modal graph; and F = {F_i} is the set of vertex features, where F_i denotes the multi-modal feature of the i-th node. The multi-modal information of the nodes is propagated through the edge connections between the nodes, and visual reasoning is performed on the multi-modal graph by graph convolution according to the relations in the expression. The specific steps are as follows:
4.1 The global language feature C of the entities and the visual feature X are fused to obtain the multi-modal feature Y, as shown in formula 7:
Y = L2Norm(tanh(W_p C) ⊙ tanh(W_X X))   (formula 7)
where W_p ∈ R^(D_n×D_m) and W_X ∈ R^(D_v×D_m) are trainable parameters, ⊙ denotes element-wise multiplication of the corresponding matrix entries, and L2Norm(·) denotes L2 normalization; D_n, D_m and D_v denote the dimensions of the global language feature, the multi-modal feature and the visual feature respectively, and R denotes the real-number space of dimension D_n×D_m or D_v×D_m.
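Formula 7 projects the language and visual features into a common space, gates them against each other with tanh, and L2-normalizes the result; a minimal rendering, assuming C and X are given as per-position feature matrices, could be:

```python
# Sketch of formula 7: multi-modal fusion with tanh gating and L2 normalization.
import torch
import torch.nn.functional as F

def fuse(c, x, w_p, w_x):
    """c: (N, D_n) language features, x: (N, D_v) visual features, w_p: (D_n, D_m), w_x: (D_v, D_m)."""
    y = torch.tanh(c @ w_p) * torch.tanh(x @ w_x)   # element-wise product in the common D_m space
    return F.normalize(y, p=2, dim=-1)              # L2Norm over the feature dimension
```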
4.2 The multi-modal feature Y is reshaped to size C×W×H, where H, W and C denote the height, width and number of channels of Y, and the total number of nodes of the multi-modal graph G is K = W×H. Each position of the multi-modal feature is treated as an image region, each image region is represented as a node of the multi-modal graph G, and the multi-modal feature Y is assigned as the node features F. The relation-word weighted normalized context on the edges of the multi-modal graph is defined as shown in formula 8:
[formula 8: relation-word weighted normalized context r_l]
where the weight matrices are trainable parameters, D_e denotes a hyperparameter, D_l and D_h denote the number of word types and the feature dimension respectively, and δ denotes an activation function.
The affinity matrix between the multi-modal features of the nodes and the weighted relation context of the words is defined as shown in formula 9:
[formula 9: affinity matrix A_F between the node multi-modal features and the weighted relation context of the words]
where the transformation matrices are trainable parameters and the feature vector of the positional relation words is used as input.
The normalized affinity matrices are then given by formulas 10 and 11:
A_1 = softmax(A_F)   (formula 10)
A_2 = softmax(A_F^T)   (formula 11)
where A_1 and A_2 are obtained by normalizing the affinity matrix A_F along its rows and along its columns respectively. The adjacency matrix is then given by formula 12:
A = A_1 A_2   (formula 12)
where the element a_i,j of A represents the normalized amount of information flow between node i and node j of the multi-modal graph G.
4.3 In the fully connected multi-modal graph, the multi-modal information of the nodes is propagated through the adjacency matrix. However, the information in some nodes is irrelevant, and unconstrained information propagation between nodes may introduce a great deal of noise and redundant information, confusing the result. Therefore, in the invention, a gate is generated for the nodes and for the edges of the multi-modal graph respectively, so as to restrict the information propagation between nodes. The gate of the edges, g_v, is defined as the sum of the weighted normalized attention values of the individual words in the expression, as shown in formula 13:
[formula 13: edge gate g_v, the sum of the weighted normalized attention values of the words]
where β_l denotes the weighted normalized attention.
The gate of the nodes, g_e, is defined as the sum of all the weighted probabilities of the individual words in the expression, as shown in formula 14:
[formula 14: node gate g_e, the sum of the weighted probabilities of the words]
where r_l denotes the relation-word weighted normalized context on the edges of the multi-modal graph.
4.4 The adjacency matrix A is multiplied by the gate of the edges g_v, and the node features F are multiplied by the gate of the nodes g_e. The node features are then updated by graph convolution operations, as shown in formulas 15-17:
[formulas 15-17: gated graph convolution update of the node features]
where the weight matrices and the biases b_1 and b_2 are all trainable parameters, the two intermediate quantities denote the features of the bidirectional relations, σ(·) denotes the sigmoid activation function, the result of the t-th step is the t-th reasoning multi-modal feature, and A^T denotes the transpose of matrix A. Through multiple graph convolution operations, a more discriminative multi-modal feature representation is generated.
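One gated reasoning step might be sketched as below. Because formulas 13-17 appear only as images, the multiplicative application of the gates, the propagation along A and A^T, and the residual update are assumptions; only the presence of edge and node gates, sigmoid activations and bidirectional propagation follows the text.

```python
# Sketch of formulas 13-17: gated information propagation on the multi-modal graph (assumed form).
import torch
import torch.nn as nn

class GraphReasoningStep(nn.Module):
    def __init__(self, d_m=512):
        super().__init__()
        self.w_fwd = nn.Linear(d_m, d_m)   # propagation along A
        self.w_bwd = nn.Linear(d_m, d_m)   # propagation along A^T
        self.update = nn.Linear(2 * d_m, d_m)

    def forward(self, x, adj, gate_edge, gate_node):
        """x: (K, D_m) node features, adj: (K, K), gate_edge: scalar or (K, K), gate_node: (K, 1)."""
        a = adj * gate_edge                          # gate the edges (g_v in the text)
        h = x * gate_node                            # gate the nodes (g_e in the text)
        fwd = torch.sigmoid(self.w_fwd(a @ h))       # relation feature along A
        bwd = torch.sigmoid(self.w_bwd(a.t() @ h))   # relation feature along A^T
        return x + self.update(torch.cat([fwd, bwd], dim=-1))   # residual update of the node features
```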
4.5 The multi-modal features are upsampled on the multi-modal feature map by an ASPP (Atrous Spatial Pyramid Pooling) decoder using deconvolution, and the segmentation mask is predicted. The whole model is then trained end to end with a pixel-level cross-entropy loss between the predicted segmentation mask and the Ground Truth segmentation mask.
The model is trained end to end in two stages. Training is first performed at a low-resolution scale without upsampling, and then at high resolution. The basic parameters are set as follows: the word embedding size and the hidden state size are both set to 1024, and the number of dynamic filters is set to 10. Training uses the Adam optimizer with an initial learning rate of 1×10^-5 and a batch size of 1; the graph convolution is set to 3 layers, and the loss function is the cross-entropy loss.
To better illustrate the advantage of cross-modal attention in the model, the attention heat maps between the image and the expression are visualized, as shown in fig. 5. In the figure, (a) is the original image, (b) and (c) are the attention maps for a single word and for the complete expression of a test sample, (d) is the prediction result of the invention, and (e) is the attention map obtained after manually modifying the expression.
As shown in fig. 6, with the image fixed, the segmentation result is predicted under different expressions, such as "bottom center brown donut" referring to the third donut in the third row and "top row middle rounded donut" referring to the second donut in the first row, and the segmentation is still performed accurately. As can be seen from the figure, the model is able to correctly segment all the instances, which highlights the flexibility and adaptability of the proposed model.
Fig. 7 shows the language-guided cross-modal instance segmentation results of the invention. Each column of the figure shows, respectively, the original image, the Ground Truth, the prediction produced by the RNN-CNN baseline method, the prediction produced by the baseline + CMA module, the prediction produced by the baseline + MSFR module, and the prediction produced by the invention. Column (f) of row 1 of fig. 7 indicates that the CMVR can segment objects that are heavily occluded and have no locative words; column (f) of row 2 shows that the CMVR can distinguish objects against similar backgrounds, whereas the baseline model segments the instance referred to by the language but confuses it with a similar context. As can be seen from row 3 of fig. 7, the baseline model fails to locate the orange in the expression (e.g. "lite orange to the right of the big orange in the back"), while the invention, guided by attention, successfully distinguishes the indicated object in a complex scene. Column (e) of row 1 shows that the CMVR can capture global and local detail information and accurately segment the referent, highlighting its potential for relative-position reasoning. Column (f) of row 4 shows the ability of the CMVR to accurately identify the referred object, which illustrates that the invention can effectively reduce the edge blocking effect and highlight the details of the referred object.
The above embodiments are not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make variations, modifications, additions or substitutions within the technical scope of the present invention.

Claims (7)

1. A language-guided cross-modal instance segmentation method is characterized by comprising the following steps: the method comprises the following four steps:
step 1, visual features are extracted from an input image through a CNN encoder, and language features are extracted from an expression through a language encoder;
step 2, introducing an MSFR module, obtaining multi-scale refining information from the visual features extracted by the CNN encoder, wherein the multi-scale refining information comprises global and local multi-scale detail information, and fusing the information to enhance the representation capability of the visual features;
step 3, constructing a CMA (cross-modal attention) module, performing cross-modal interaction between the visual features obtained by the CNN encoder and the language features obtained by the language encoder, computing cross-modal attention, and aligning the informative keywords in the expression with the corresponding regions of the input image;
and 4, fusing the visual features enhanced by the MSFR module with the language features obtained by the language encoder, inputting the fused features into the VR module, and generating the multi-modal feature representation for mask prediction.
2. The language-guided cross-modal instance segmentation method according to claim 1, wherein: in step 1, the specific process of visual feature extraction is as follows:
1.1 a 3-channel image I ∈ R^(W×H×3) is input, where W and H are the width and height of the image and R denotes the real-number space of dimension W×H×3; visual features x_I ∈ R^(w×h×D_v) are extracted by the CNN backbone network, where w, h and D_v are the width, height and channel dimension of the visual features; to preserve spatial coordinate information, the visual feature x_I is concatenated with an 8-dimensional spatial coordinate feature, and a 1×1 convolution then converts the concatenated feature to the visual feature X ∈ R^(w×h×D_x).
3. The method according to claim 2, wherein: in step 1, the specific process of language feature extraction is as follows:
1.2 the input expression S = {s_l}, l = 1, ..., L, is fed into the language encoder, where s_l is the l-th word and L is the sentence length; first, a pre-trained word embedding model is used to initialize the embedding of each word, as shown in formula 1:
e_l = embedding(s_l)   (formula 1)
where embedding(·) denotes the pre-trained word embedding model and e_l denotes the word embedding of the l-th word; the word embeddings are then fed into a bidirectional LSTM network to encode the context of each word, and the context representation h_l of a word is obtained by concatenating its forward and backward hidden state vectors, as shown in formula 2:
h_l = [h_l^fw ; h_l^bw]   (formula 2)
where h_l^fw and h_l^bw denote the hidden states in the forward and backward directions of the bidirectional LSTM network; the language feature of the whole expression is H = {h_l} ∈ R^(L×D_h), where D_h denotes the dimension of the word context and h_l is the hidden state of the l-th word.
4. The language-guided cross-modal instance segmentation method according to claim 1, wherein: in step 2, the MSFR module is constructed to capture the detail information of the visual feature x_I; the module consists of a Local PyConv branch and a Global PyConv branch; the Local PyConv branch is responsible for capturing local detail information; the Global PyConv branch is used for capturing global detail information; and the output features of the Local PyConv branch and the Global PyConv branch are fused by a 1×1 convolution to obtain a 1024-dimensional feature map containing local and global detail information.
5. The language-guided cross-modal instance segmentation method according to claim 1, wherein the specific processing procedure of step 3 is as follows:
3.1 all the words in the sentence are divided into four types (entities, attributes, positional relationships and unnecessary words), and a four-dimensional vector is computed to represent the probability that a word belongs to each of the four types; the probability vector is defined as shown in formula 3:
q_l = softmax(W_1 σ(W_0 h_l + b_0) + b_1)   (formula 3)
where W_0, W_1, b_0 and b_1 are trainable parameters, σ denotes an activation function, softmax(·) denotes the normalized exponential function, and D_l and D_h denote the number of word types and the feature dimension respectively; the components q_l^ent, q_l^att, q_l^rel and q_l^unn of q_l represent the probabilities that the word s_l is an entity, attribute, relationship or unnecessary word respectively;
3.2 based on the word features {h_l} and the entity and attribute words (q_l^ent, q_l^att), the cross-modal attention between each word and the different image regions is computed; the weighted normalized attention of a word is defined as shown in formulas 4 and 5:
[formula 4: cross-modal attention α_l]
[formula 5: weighted normalized attention β_l]
where W_x and W_h are transformation matrices, D_n denotes a hyperparameter, D_x and D_h denote the dimensions of the visual and word features respectively, α_l denotes the cross-modal attention, and β_l is the weighted normalized attention representing the weighted probability that the word s_l refers to the relevant image regions; the attention-weighted features of all the words are aggregated to compute the language feature C of the corresponding entities:
[formula 6: entity language feature C aggregated from the attention-weighted word features]
where h_l is the hidden state of the l-th word.
6. The method according to claim 4, wherein the specific processing procedure of step 4 is as follows:
4.1 the global language feature C of the entities and the visual feature X are fused to obtain the multi-modal feature Y, as shown in formula 7:
Y = L2Norm(tanh(W_p C) ⊙ tanh(W_X X))   (formula 7)
where W_p ∈ R^(D_n×D_m) and W_X ∈ R^(D_v×D_m) are trainable parameters, ⊙ denotes element-wise multiplication of the corresponding matrix entries, and L2Norm(·) denotes L2 normalization; D_n, D_m and D_v denote the dimensions of the global language feature, the multi-modal feature and the visual feature respectively, and R denotes the real-number space of dimension D_n×D_m or D_v×D_m;
4.2 the multi-modal feature Y is reshaped to size C×W×H, where H, W and C denote the height, width and number of channels of Y, and the total number of nodes of the multi-modal graph G is K = W×H; each position of the multi-modal feature is treated as an image region, each image region is represented as a node of the multi-modal graph G, and the multi-modal feature Y is assigned as the node features F; the relation-word weighted normalized context on the edges of the multi-modal graph is defined as shown in formula 8:
[formula 8: relation-word weighted normalized context r_l]
where the weight matrices are trainable parameters, D_e denotes a hyperparameter, D_l and D_h denote the number of word types and the feature dimension respectively, and δ denotes an activation function;
the affinity matrix between the multi-modal features of the nodes and the weighted relation context of the words is defined as shown in formula 9:
[formula 9: affinity matrix A_F between the node multi-modal features and the weighted relation context of the words]
where the transformation matrices are trainable parameters and the feature vector of the positional relation words is used as input;
the normalized affinity matrices are then given by formulas 10 and 11:
A_1 = softmax(A_F)   (formula 10)
A_2 = softmax(A_F^T)   (formula 11)
where A_1 and A_2 are obtained by normalizing the affinity matrix A_F along its rows and along its columns respectively; the adjacency matrix is then given by formula 12:
A = A_1 A_2   (formula 12)
where the element a_i,j of A represents the normalized amount of information flow between node i and node j of the multi-modal graph G;
4.3 a gate is generated for the nodes and for the edges of the multi-modal graph respectively, so as to restrict the information propagation between nodes; the gate of the edges, g_v, is defined as the sum of the weighted normalized attention values of the individual words in the expression, as shown in formula 13:
[formula 13: edge gate g_v, the sum of the weighted normalized attention values of the words]
where β_l denotes the weighted normalized attention;
the gate of the nodes, g_e, is defined as the sum of all the weighted probabilities of the individual words in the expression, as shown in formula 14:
[formula 14: node gate g_e, the sum of the weighted probabilities of the words]
where r_l denotes the relation-word weighted normalized context on the edges of the multi-modal graph;
4.4 the adjacency matrix A is multiplied by the gate of the edges g_v, and the node features F are multiplied by the gate of the nodes g_e; the node features are then updated by graph convolution operations, as shown in formulas 15-17:
[formulas 15-17: gated graph convolution update of the node features]
where the weight matrices and the biases b_1 and b_2 are all trainable parameters, the two intermediate quantities denote the features of the bidirectional relations, σ(·) denotes the sigmoid activation function, the result of the t-th step is the t-th reasoning multi-modal feature, and A^T denotes the transpose of matrix A; through multiple graph convolution operations, a more discriminative multi-modal feature representation is generated;
4.5 the multi-modal features are upsampled on the multi-modal feature map by an ASPP decoder using deconvolution, and the segmentation mask is predicted; then, the whole model is trained end to end with a pixel-level cross-entropy loss between the predicted segmentation mask and the Ground Truth segmentation mask.
7. The method according to claim 6, wherein the end-to-end training is divided into two stages:
1) training is first performed at a low-resolution scale without upsampling, and then at high resolution; the basic parameters are set as follows: the word embedding size and the hidden state size are both set to 1024, and the number of dynamic filters is set to 10;
2) training uses the Adam optimizer with an initial learning rate of 1×10^-5 and a batch size of 1; the graph convolution is set to 3 layers, and the loss function is the cross-entropy loss.
CN202111408303.3A 2021-11-25 2021-11-25 Language-guided cross-modal instance segmentation method Pending CN114119975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111408303.3A CN114119975A (en) 2021-11-25 2021-11-25 Language-guided cross-modal instance segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111408303.3A CN114119975A (en) 2021-11-25 2021-11-25 Language-guided cross-modal instance segmentation method

Publications (1)

Publication Number Publication Date
CN114119975A true CN114119975A (en) 2022-03-01

Family

ID=80372447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111408303.3A Pending CN114119975A (en) 2021-11-25 2021-11-25 Language-guided cross-modal instance segmentation method

Country Status (1)

Country Link
CN (1) CN114119975A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677515A (en) * 2022-04-25 2022-06-28 电子科技大学 Weak supervision semantic segmentation method based on inter-class similarity
CN114842312A (en) * 2022-05-09 2022-08-02 深圳市大数据研究院 Generation and segmentation method and device for unpaired cross-modal image segmentation model
CN114842312B (en) * 2022-05-09 2023-02-10 深圳市大数据研究院 Generation and segmentation method and device for unpaired cross-modal image segmentation model
CN115019037A (en) * 2022-05-12 2022-09-06 北京百度网讯科技有限公司 Object segmentation method, training method and device of corresponding model and storage medium
CN115049899A (en) * 2022-08-16 2022-09-13 粤港澳大湾区数字经济研究院(福田) Model training method, reference expression generation method and related equipment
CN115049899B (en) * 2022-08-16 2022-11-11 粤港澳大湾区数字经济研究院(福田) Model training method, reference expression generation method and related equipment
CN116644788A (en) * 2023-07-27 2023-08-25 山东交通学院 Local refinement and global reinforcement network for vehicle re-identification
CN116644788B (en) * 2023-07-27 2023-10-03 山东交通学院 Local refinement and global reinforcement network for vehicle re-identification
CN117593527A (en) * 2024-01-18 2024-02-23 厦门大学 Directional 3D instance segmentation method based on chain perception
CN117593527B (en) * 2024-01-18 2024-05-24 厦门大学 Directional 3D instance segmentation method based on chain perception

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN114119975A (en) Language-guided cross-modal instance segmentation method
CN109711426B (en) Pathological image classification device and method based on GAN and transfer learning
CN111325751A (en) CT image segmentation system based on attention convolution neural network
CN111260740A (en) Text-to-image generation method based on generation countermeasure network
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN112288011B (en) Image matching method based on self-attention deep neural network
CN111340814A (en) Multi-mode adaptive convolution-based RGB-D image semantic segmentation method
CN109712108B (en) Visual positioning method for generating network based on diversity discrimination candidate frame
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN111598183A (en) Multi-feature fusion image description method
CN112651940A (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
US20240013564A1 (en) System, devices and/or processes for training encoder and/or decoder parameters for object detection and/or classification
CN114648535A (en) Food image segmentation method and system based on dynamic transform
CN114201592A (en) Visual question-answering method for medical image diagnosis
CN111899203A (en) Real image generation method based on label graph under unsupervised training and storage medium
CN110633706B (en) Semantic segmentation method based on pyramid network
Agrawal et al. Image Caption Generator Using Attention Mechanism
CN115222998A (en) Image classification method
Jian et al. Dual-Branch-UNet: A Dual-Branch Convolutional Neural Network for Medical Image Segmentation.
Yao et al. Transformers and CNNs fusion network for salient object detection
Dong et al. Research on image classification based on capsnet
Sudhakaran et al. Gate-shift-fuse for video action recognition
US11948090B2 (en) Method and apparatus for video coding
CN113159053A (en) Image recognition method and device and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination