CN114119975A - Language-guided cross-modal instance segmentation method - Google Patents

Language-guided cross-modal instance segmentation method

Info

Publication number
CN114119975A
Authority
CN
China
Prior art keywords
modal
features
word
language
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111408303.3A
Other languages
Chinese (zh)
Inventor
王蓉
张文靖
李冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA
Original Assignee
PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA filed Critical PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA
Priority to CN202111408303.3A
Publication of CN114119975A
Legal status: Pending

Classifications

    • G06F 18/214: Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/253: Pattern recognition; Analysing; Fusion techniques of extracted features
    • G06N 3/044: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/048: Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Activation functions
    • G06N 3/08: Computing arrangements based on biological models; Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a language-guided cross-modal instance segmentation method comprising the following steps: extracting visual features from the image and language features from the referring expression; capturing global and local detail information of the visual features with an MSFR module and fusing them to enhance the representation capability of the visual features; aligning the entity words of the referring expression with the important image regions with a CMA module, highlighting all entities in the image; and generating the multi-modal feature representation used for mask prediction with a VR module. By constructing a CMVR model for instance segmentation, the method obtains more discriminative feature representations; it captures local and global detail information and attends to details; it adaptively aligns the informative keywords of the expression with the important regions of the input image, facilitating matching between the features of different modalities; and, by reasoning over a graph with the help of the relation words, it realizes instance segmentation effectively.

Description

Language-guided cross-modal instance segmentation method
Technical Field
The invention relates to instance segmentation methods, and in particular to a language-guided cross-modal instance segmentation method.
Background
Instance segmentation based on natural language descriptions, technically known as Referring Image Segmentation (RIS), is a challenging problem. It differs from traditional semantic segmentation in computer vision: instead of being restricted to a fixed set of target semantics and classes, the object to be segmented is specified by a natural language expression, and the central challenge is the gap between the vision and language modalities. The key to RIS is to learn effective multi-modal features across the visual and linguistic modalities in order to accurately identify the referred target. The task has wide application in human-computer interaction, scene perception, image retrieval, and the localization of specific targets in video surveillance, and is one of the important topics in computer vision and pattern recognition.
Earlier methods for this task used Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks to extract visual and linguistic single-modal features, and then fused the visual and language features by concatenation and convolution to generate a segmentation mask. However, cross-modal matching between visual and linguistic features is ineffective in these methods, because they model the linguistic and visual information separately and lack sufficient interaction between the two modalities. Attention mechanisms were later introduced to adaptively focus on informative keywords in the linguistic expression and important regions in the image. However, these methods do not fully exploit the context of the words in the expression, making it difficult to discriminate between multiple entities of the same category. Recent approaches mainly learn multi-modal context representations by using different types of informative keywords in the expression. Although these methods have made great progress, none of them has effective visual reasoning capability for obtaining discriminative multi-modal features.
Disclosure of Invention
In order to overcome the above shortcomings of the prior art, the invention provides a language-guided cross-modal instance segmentation method.
To solve the technical problem, the invention adopts the following technical scheme: a language-guided cross-modal instance segmentation method comprising the following four steps:
step 1, visual features are extracted from an input image through a CNN encoder, and language features are extracted from an expression through a language encoder;
step 2, introducing an MSFR module, which obtains multi-scale refined information, including global and local multi-scale detail information, from the visual features extracted by the CNN encoder and fuses this information to enhance the representation capability of the visual features;
step 3, constructing a CMA (cross-modal attention) module, performing cross-modal interaction between the visual features obtained by the CNN encoder and the language features obtained by the language encoder, computing cross-modal attention, and aligning the informative keywords in the expression with the corresponding regions of the input image;
and 4, fusing the visual features enhanced by the MSFR module with the language features obtained by the language encoder, inputting the fused features into the VR module, and generating the multi-modal feature representation for mask prediction.
Further, in step 1, the specific process of visual feature extraction is as follows:
1.1 A 3-channel image I ∈ R^(W×H×3) is input, where W and H are the width and height of the image and R denotes the real-number space of dimension W×H×3. Visual features x_I ∈ R^(w×h×D_v) are extracted by the CNN backbone network, where w, h and D_v are the width, height and channel dimension of the visual features. To preserve spatial coordinate information, the visual feature x_I is concatenated with an 8-dimensional spatial coordinate feature, and a 1×1 convolution then converts the concatenated feature to the visual feature X ∈ R^(w×h×D_x).
Further, in step 1, the specific process of language feature extraction is as follows:
1.2 The input expression S = {s_l}, l = 1, ..., L, is fed into the language encoder, where s_l is the l-th word and L is the sentence length. First, a pre-trained word embedding model is used to initialize the embedding of each word, as shown in formula 1:
e_l = embedding(s_l)   (formula 1)
where embedding(·) denotes the pre-trained word embedding model and e_l denotes the word embedding of the l-th word. The word embeddings are then fed into a bidirectional LSTM network to encode the context of each word; the context representation h_l of a word is obtained by concatenating its forward and backward hidden state vectors, as shown in formula 2:
h_l = [h_l^fw ; h_l^bw]   (formula 2)
where h_l^fw and h_l^bw denote the hidden states in the forward and backward directions of the bidirectional LSTM network. The language feature of the whole expression is H = {h_l} ∈ R^(L×D_h), where D_h denotes the dimension of the word context and h_l is the hidden state of the l-th word.
Further, in step 2, the MSFR module is constructed to capture the detail information of the visual feature x_I. The module consists of a Local PyConv branch and a Global PyConv branch: the Local PyConv branch is responsible for capturing local detail information, and the Global PyConv branch captures global detail information. The output features of the Local PyConv branch and the Global PyConv branch are fused by a 1×1 convolution to obtain a 1024-dimensional feature map containing both local and global detail information.
Further, the specific processing procedure of step 3 is as follows:
3.1 All the words in the sentence are divided into four types (entities, attributes, positional relationships and unnecessary words), and a four-dimensional vector is computed to represent the probability that a word belongs to each of the four types. The probability vector is defined as shown in formula 3:
q_l = softmax(W_1 σ(W_0 h_l + b_0) + b_1)   (formula 3)
where W_0, W_1, b_0 and b_1 are trainable parameters, σ denotes an activation function, softmax(·) denotes the normalized exponential function, and D_l and D_h denote the number of word types and the feature dimension respectively; the components q_l^ent, q_l^att, q_l^rel and q_l^unn of q_l represent the probabilities that the word s_l is an entity, attribute, relationship or unnecessary word respectively.
3.2 Based on the word features {h_l} and the entity and attribute words (q_l^ent, q_l^att), the cross-modal attention between each word and the different image regions is computed. The weighted normalized attention of a word is defined as shown in formulas 4 and 5:
[formula 4: cross-modal attention α_l]
[formula 5: weighted normalized attention β_l]
where W_x and W_h are transformation matrices, D_n denotes a hyperparameter, D_x and D_h denote the dimensions of the visual and word features respectively, α_l denotes the cross-modal attention, and β_l is the weighted normalized attention representing the weighted probability that the word s_l refers to the relevant image regions. The attention-weighted features of all the words are aggregated to compute the language feature C of the corresponding entities:
[formula 6: entity language feature C aggregated from the attention-weighted word features]
where h_l is the hidden state of the l-th word.
Further, the specific processing procedure of step 4 is as follows:
4.1 The global language feature C of the entities and the visual feature X are fused to obtain the multi-modal feature Y, as shown in formula 7:
Y = L2Norm(tanh(W_p C) ⊙ tanh(W_X X))   (formula 7)
where W_p ∈ R^(D_n×D_m) and W_X ∈ R^(D_v×D_m) are trainable parameters, ⊙ denotes element-wise multiplication of the corresponding matrix entries, and L2Norm(·) denotes L2 normalization; D_n, D_m and D_v denote the dimensions of the global language feature, the multi-modal feature and the visual feature respectively, and R denotes the real-number space of dimension D_n×D_m or D_v×D_m.
4.2 The multi-modal feature Y is reshaped to size C×W×H, where H, W and C denote the height, width and number of channels of Y, and the total number of nodes of the multi-modal graph G is K = W×H. Each position of the multi-modal feature is treated as an image region, each image region is represented as a node of the multi-modal graph G, and the multi-modal feature Y is assigned as the node features F. The relation-word weighted normalized context on the edges of the multi-modal graph is defined as shown in formula 8:
[formula 8: relation-word weighted normalized context r_l]
where the weight matrices are trainable parameters, D_e denotes a hyperparameter, D_l and D_h denote the number of word types and the feature dimension respectively, and δ denotes an activation function.
The affinity matrix between the multi-modal features of the nodes and the weighted relation context of the words is defined as shown in formula 9:
[formula 9: affinity matrix A_F between the node multi-modal features and the weighted relation context of the words]
where the transformation matrices are trainable parameters and the feature vector of the positional relation words is used as input.
The normalized affinity matrices are then given by formulas 10 and 11:
A_1 = softmax(A_F)   (formula 10)
A_2 = softmax(A_F^T)   (formula 11)
where A_1 and A_2 are obtained by normalizing the affinity matrix A_F along its rows and along its columns respectively. The adjacency matrix is then given by formula 12:
A = A_1 A_2   (formula 12)
where the element a_i,j of A represents the normalized amount of information flow between node i and node j of the multi-modal graph G.
4.3 A gate is generated for the nodes and for the edges of the multi-modal graph respectively, so as to restrict the information propagation between nodes. The gate of the edges, g_v, is defined as the sum of the weighted normalized attention values of the individual words in the expression, as shown in formula 13:
[formula 13: edge gate g_v, the sum of the weighted normalized attention values of the words]
where β_l denotes the weighted normalized attention.
The gate of the nodes, g_e, is defined as the sum of all the weighted probabilities of the individual words in the expression, as shown in formula 14:
[formula 14: node gate g_e, the sum of the weighted probabilities of the words]
where r_l denotes the relation-word weighted normalized context on the edges of the multi-modal graph.
4.4 The adjacency matrix A is multiplied by the gate of the edges g_v, and the node features F are multiplied by the gate of the nodes g_e. The node features are then updated by graph convolution operations, as shown in formulas 15-17:
[formulas 15-17: gated graph convolution update of the node features]
where the weight matrices and the biases b_1 and b_2 are all trainable parameters, the two intermediate quantities denote the features of the bidirectional relations, σ(·) denotes the sigmoid activation function, the result of the t-th step is the t-th reasoning multi-modal feature, and A^T denotes the transpose of matrix A. Through multiple graph convolution operations, a more discriminative multi-modal feature representation is generated.
4.5 The multi-modal features are upsampled on the multi-modal feature map by an ASPP decoder using deconvolution, and the segmentation mask is predicted. The whole model is then trained end to end with a pixel-level cross-entropy loss between the predicted segmentation mask and the Ground Truth segmentation mask.
Further, the end-to-end training is divided into two stages:
1) training is first performed at a low-resolution scale without upsampling, and then at high resolution; the basic parameters are set as follows: the word embedding size and the hidden state size are both set to 1024, and the number of dynamic filters is set to 10;
2) training uses the Adam optimizer with an initial learning rate of 1×10^-5 and a batch size of 1; the graph convolution is set to 3 layers, and the loss function is the cross-entropy loss.
The invention discloses a language-guided cross-modal instance segmentation method that obtains more discriminative feature representations for instance segmentation by constructing a cross-modal attention-guided visual reasoning (CMVR) model. First, the model captures local and global detail information and pays more attention to details. Second, the cross-modal attention module adaptively aligns the informative keywords of the expression with the important regions of the input image, facilitating the matching between the features of different modalities. Meanwhile, the constructed visual reasoning module performs graph-based reasoning with the help of the relation words and distinguishes the designated object from other objects of the same category, so that instance segmentation is realized more effectively.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a general framework diagram of the cross-modal attention-directed visual inference model of the present invention.
FIG. 3 is a schematic diagram of multi-scale feature optimization according to the present invention.
FIG. 4 is a schematic diagram of the visual inference segmentation of the present invention.
FIG. 5 is a schematic view of the attention visualization of the present invention.
FIG. 6 is a schematic diagram of the output prediction under the fixed image and the changed language expression according to the present invention.
FIG. 7 is a diagram illustrating the results of the language-guided cross-modal example segmentation in accordance with the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention provides a language-guided cross-modal instance segmentation method that constructs a Cross-Modal Attention Guided Visual Reasoning model (CMVR) composed of three modules. First, a Multi-Scale Feature Refinement (MSFR) module is constructed; it handles both small and large targets, effectively captures multi-scale local and global refined visual features, and fuses them into multi-scale features, thereby enhancing the visual representation. Second, a Cross-Modal Attention (CMA) module is introduced to adaptively align the informative keywords of the expression with the important information in the image; the CMA module highlights all the entities of the linguistic description by computing a weighted normalized attention value for each word, and at the same time computes the relational context of the words in the expression, which serves as a relational cue associated with the related entities in the image. Finally, a Visual Reasoning (VR) module is provided; it constructs a fully connected multi-modal graph from the multi-modal features and the relational cues of the expression, performs visual reasoning on the graph step by step, highlights the correct entities and suppresses the other unrelated entities; the VR module facilitates the generation of a more discriminative multi-modal feature representation for referring segmentation. The language-guided cross-modal instance segmentation method provided by the invention is therefore based on a pre-trained Chinese word embedding matrix, makes full use of the semantic information of the target, combines it with the visual features of the target to form cross-modal attention, and realizes cross-modal instance segmentation based on Chinese context.
As shown in fig. 1, the overall flow framework of the language-guided cross-modal instance segmentation method of the present invention includes the following four parts:
firstly, visual features are extracted from an input image through a CNN encoder, and language features are extracted from an expression through a language encoder;
secondly, introducing an MSFR module, which obtains multi-scale refined information, including global and local multi-scale detail information, from the visual features extracted by the CNN encoder and fuses this information to enhance the representation capability of the visual features;
thirdly, constructing a CMA (cross-modal attention) module, performing cross-modal interaction between the visual features obtained by the CNN encoder and the language features obtained by the language encoder, computing cross-modal attention, and aligning the informative keywords in the expression with the corresponding regions of the input image;
and fourthly, fusing the visual features enhanced by the MSFR module with the language features obtained by the language encoder, inputting the fused features into the VR module, and generating the multi-modal feature representation for mask prediction, i.e. finally outputting the mask map of the segmented target.
As shown in fig. 2, which gives the overall framework of the cross-modal attention-guided visual reasoning model proposed by the invention, the model takes an image and a referring expression as input. First, a CNN encoder (a ResNet101 structure can be used) extracts the visual features (feature maps) of the image, and the spatial location features are concatenated to them to obtain a new visual feature vector. Then, a pre-trained language encoder (word embedding model) processes the referring expression to obtain the word embedding vectors, and the word vectors are fed into the LSTM network to obtain the language feature vectors. The MSFR module takes the visual feature maps as input, obtains multi-scale detail information, namely global multi-scale features (global feature maps) and local multi-scale features (local feature maps), from the extracted visual feature maps, and concatenates and fuses the global and local multi-scale features to enhance the representation capability of the features. The language features and the visual features extracted by the CNN encoder are used as inputs to the CMA module to capture the inter-modal dependencies between the keywords of the referring expression and the different image regions. The CMA module attends to all the entities mentioned in the expression and therefore helps to align the entity words of the expression with the important regions of the image. Then, a fully connected multi-modal graph is constructed on the spatial regions of the image; in the VR module, the multi-modal information of the nodes is propagated through the edge connections between the nodes, and visual reasoning is performed on the multi-modal graph by graph convolution according to the relations in the expression, effectively highlighting the related entities that match the language description and suppressing the other unrelated entities, so as to generate a more discriminative multi-modal feature representation. The multi-modal features are then upsampled by an ASPP (Atrous Spatial Pyramid Pooling) decoder using deconvolution, and the segmentation mask is predicted.
For the CNN encoder and the language encoder, the specific process of feature extraction is as follows:
1) A 3-channel image I ∈ R^(W×H×3) is input, where W and H are the width and height of the image and R denotes the real-number space of dimension W×H×3. Visual features x_I ∈ R^(w×h×D_v) are extracted by the CNN backbone network, where w, h and D_v are the width, height and channel dimension of the visual features. To preserve spatial coordinate information, the visual feature x_I is concatenated with an 8-dimensional spatial coordinate feature, and a 1×1 convolution then converts the concatenated feature to the visual feature X ∈ R^(w×h×D_x).
2) The input expression S = {s_l}, l = 1, ..., L, is fed into the language encoder, where s_l is the l-th word and L is the sentence length. First, a pre-trained word embedding model is used to initialize the embedding of each word, as shown in formula 1:
e_l = embedding(s_l)   (formula 1)
where embedding(·) denotes the pre-trained word embedding model and e_l denotes the word embedding of the l-th word. The word embeddings are then fed into a bidirectional LSTM network to encode the context of each word; the context representation h_l of a word is obtained by concatenating its forward and backward hidden state vectors, as shown in formula 2:
h_l = [h_l^fw ; h_l^bw]   (formula 2)
where h_l^fw and h_l^bw denote the hidden states in the forward and backward directions of the bidirectional LSTM network. The language feature of the whole expression is H = {h_l} ∈ R^(L×D_h), where D_h denotes the dimension of the word context and h_l is the hidden state of the l-th word.
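The two encoders can be sketched roughly as follows. This is a minimal illustration rather than the exact implementation: the ResNet101 output width, the particular 8-channel coordinate map and the names VisualEncoder, LanguageEncoder and coord_map are assumptions; only the overall structure (CNN backbone, 8-dimensional coordinate concatenation, 1×1 projection, word embedding followed by a bidirectional LSTM) follows the text.

```python
# Minimal sketch of the visual and language encoders (assumptions noted above).
import torch
import torch.nn as nn
import torchvision

def coord_map(h, w):
    """8-dimensional spatial coordinate feature of shape (8, h, w)."""
    ys = torch.linspace(-1, 1, h).view(h, 1).expand(h, w)
    xs = torch.linspace(-1, 1, w).view(1, w).expand(h, w)
    feats = [xs, ys, xs * ys, xs ** 2, ys ** 2,
             torch.full((h, w), 1.0 / w), torch.full((h, w), 1.0 / h),
             torch.ones(h, w)]
    return torch.stack(feats, dim=0)

class VisualEncoder(nn.Module):
    def __init__(self, d_x=1024):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep the conv feature map
        self.proj = nn.Conv2d(2048 + 8, d_x, kernel_size=1)        # 1x1 conv after coordinate concat

    def forward(self, image):                      # image: (B, 3, H, W)
        x_i = self.cnn(image)                      # (B, 2048, h, w)
        b, _, h, w = x_i.shape
        coords = coord_map(h, w).to(image.device).expand(b, -1, -1, -1)
        return self.proj(torch.cat([x_i, coords], dim=1))   # X: (B, D_x, h, w)

class LanguageEncoder(nn.Module):
    def __init__(self, vocab_size, d_emb=1024, d_h=1024):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_emb)   # can be loaded from a pre-trained matrix
        self.lstm = nn.LSTM(d_emb, d_h // 2, bidirectional=True, batch_first=True)

    def forward(self, tokens):                     # tokens: (B, L) word indices
        e = self.embedding(tokens)                 # formula 1: word embeddings
        h, _ = self.lstm(e)                        # formula 2: forward/backward states concatenated
        return h                                   # (B, L, D_h)
```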
As shown in fig. 3, the multi-scale feature refinement (MSFR) module captures the detail information of the visual feature x_I. The module consists of a Local PyConv (local pyramid convolution) branch and a Global PyConv (global pyramid convolution) branch, and each branch involves kernels of different sizes. The Local PyConv branch is mainly responsible for capturing local detail information, and the Global PyConv branch captures global detail information. The specific steps are as follows:
1) As shown in the upper dashed portion of fig. 3, the Local PyConv branch is mainly responsible for handling entities of smaller size and capturing local detail information. It contains four convolution layers with kernel sizes of 9×9, 7×7, 5×5 and 3×3 respectively; the kernel size increases from left to right while the kernel depth decreases. Before the feature maps are fed into the Local PyConv branch, a 1×1 convolution first reduces the number of feature maps to 512; pyramid convolutions with the different kernels are then applied to obtain local detail information at four scales; finally, a 1×1 convolution combines the information obtained at the different scales. After the four convolution layers, the branch also includes a normalization layer and a ReLU activation function.
2) As shown in the lower dashed portion of fig. 3, the Global PyConv branch handles entities of large size and captures global detail information, with kernels that can cover the whole feature map. First, adaptive average pooling reduces the spatial size of the input feature maps to 9×9, and a 1×1 convolution reduces the number of feature maps to 512; pyramid convolutions with different kernels are then applied in turn; finally, a 1×1 convolution fuses the information of the different scales, and the feature maps are restored to the original size by upsampling with bilinear interpolation. The output feature maps of the Local PyConv branch and the Global PyConv branch are concatenated to obtain 1024 feature maps containing local and global detail information.
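A rough sketch of the two branches is given below. The 512 intermediate channels, the 9×9 pooled size, the four kernel sizes and the 1024 fused channels follow the text, while the even channel split across the four kernels, the input channel count and the names PyConv and MSFR are assumptions.

```python
# Sketch of the MSFR module: Local and Global pyramid-convolution branches (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyConv(nn.Module):
    """Four parallel convolutions with kernels 9/7/5/3, outputs concatenated and fused by a 1x1 conv."""
    def __init__(self, channels=512):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels // 4, k, padding=k // 2) for k in (9, 7, 5, 3)
        ])
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

class MSFR(nn.Module):
    def __init__(self, in_channels=1024, mid=512, out_channels=1024):
        super().__init__()
        self.local_reduce = nn.Conv2d(in_channels, mid, 1)
        self.local_pyconv = nn.Sequential(PyConv(mid), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.global_reduce = nn.Conv2d(in_channels, mid, 1)
        self.global_pyconv = PyConv(mid)
        self.fuse = nn.Conv2d(2 * mid, out_channels, 1)     # fuse local + global into 1024 maps

    def forward(self, x):                                   # x: (B, C, h, w) visual feature map
        local = self.local_pyconv(self.local_reduce(x))
        g = F.adaptive_avg_pool2d(x, output_size=9)         # reduce spatial size to 9x9
        g = self.global_pyconv(self.global_reduce(g))
        g = F.interpolate(g, size=x.shape[-2:], mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([local, g], dim=1))
```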
In the CMA module, the specific processing procedure is as follows:
3.1 All the words in the sentence are divided into four types (entity, attribute, location, i.e. positional relationship, and unnecessary words), and a four-dimensional vector is computed to represent the probability that a word belongs to each of the four types. The probability vector is defined as shown in formula 3:
q_l = softmax(W_1 σ(W_0 h_l + b_0) + b_1)   (formula 3)
where W_0, W_1, b_0 and b_1 are trainable parameters, σ denotes an activation function, softmax(·) denotes the normalized exponential function, and D_l and D_h denote the number of word types and the feature dimension respectively; the components q_l^ent, q_l^att, q_l^rel and q_l^unn of q_l represent the probabilities that the word s_l is an entity, attribute, relationship or unnecessary word respectively.
3.2 Based on the word features {h_l} and the entity and attribute words (q_l^ent, q_l^att), the cross-modal attention between each word and the different image regions is computed. The weighted normalized attention of a word is defined as shown in formulas 4 and 5:
[formula 4: cross-modal attention α_l]
[formula 5: weighted normalized attention β_l]
where W_x and W_h are transformation matrices, D_n denotes a hyperparameter, D_x and D_h denote the dimensions of the visual and word features respectively, α_l denotes the cross-modal attention, and β_l is the weighted normalized attention representing the weighted probability that the word s_l refers to the relevant image regions. The attention-weighted features of all the words are aggregated to compute the language feature C of the corresponding entities:
[formula 6: entity language feature C aggregated from the attention-weighted word features]
where h_l is the hidden state of the l-th word.
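A compact sketch of the word-type classifier (formula 3) and of the word-to-region attention is given below. Since formulas 4-6 are only available as images here, the dot-product form of the attention, the weighting by the entity and attribute probabilities and the aggregation into C are assumed forms, and the class names WordClassifier and CrossModalAttention are illustrative.

```python
# Sketch of the CMA module: word-type probabilities and word-region attention (assumed forms).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordClassifier(nn.Module):
    """Formula 3: probability of each word being entity / attribute / relation / unnecessary."""
    def __init__(self, d_h=1024, d_mid=256):
        super().__init__()
        self.w0 = nn.Linear(d_h, d_mid)
        self.w1 = nn.Linear(d_mid, 4)

    def forward(self, h):                          # h: (B, L, D_h) word features
        return F.softmax(self.w1(torch.relu(self.w0(h))), dim=-1)   # q: (B, L, 4)

class CrossModalAttention(nn.Module):
    """Formulas 4-6 in an assumed dot-product form: attention of each word over image regions."""
    def __init__(self, d_x=1024, d_h=1024, d_n=512):
        super().__init__()
        self.wx = nn.Linear(d_x, d_n)
        self.wh = nn.Linear(d_h, d_n)

    def forward(self, x, h, q):                    # x: (B, N, D_x) regions, h: (B, L, D_h), q: (B, L, 4)
        alpha = torch.einsum('bnd,bld->bln', self.wx(x), self.wh(h))  # cross-modal attention (B, L, N)
        alpha = F.softmax(alpha / alpha.shape[-1] ** 0.5, dim=-1)
        beta = alpha * (q[..., 0] + q[..., 1]).unsqueeze(-1)          # weight by entity+attribute prob.
        c = torch.einsum('bln,bld->bnd', beta, h)                     # aggregate word features per region
        return c                                                      # language feature C: (B, N, D_h)
```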
Fig. 4 shows the visual reasoning (VR) module, in which a fully connected multi-modal graph is constructed over the spatial regions of the image in order to effectively highlight the related entities that match the language description and suppress the other unrelated entities. The multi-modal graph is denoted G = (V, E, F, A), where V = {v_k} is the set of vertices containing the multi-modal information of the entities, v_k denotes the k-th node and K the number of nodes; E = {e_i,j} is the set of edges, where e_i,j is the edge between neighboring nodes i and j of the multi-modal graph; and F = {F_i} is the set of vertex features, where F_i denotes the multi-modal feature of the i-th node. The multi-modal information of the nodes is propagated through the edge connections between the nodes, and visual reasoning is performed on the multi-modal graph by graph convolution according to the relations in the expression. The specific steps are as follows:
4.1 The global language feature C of the entities and the visual feature X are fused to obtain the multi-modal feature Y, as shown in formula 7:
Y = L2Norm(tanh(W_p C) ⊙ tanh(W_X X))   (formula 7)
where W_p ∈ R^(D_n×D_m) and W_X ∈ R^(D_v×D_m) are trainable parameters, ⊙ denotes element-wise multiplication of the corresponding matrix entries, and L2Norm(·) denotes L2 normalization; D_n, D_m and D_v denote the dimensions of the global language feature, the multi-modal feature and the visual feature respectively, and R denotes the real-number space of dimension D_n×D_m or D_v×D_m.
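Formula 7 projects the language and visual features into a common space, gates them against each other with tanh, and L2-normalizes the result; a minimal rendering, assuming C and X are given as per-position feature matrices, could be:

```python
# Sketch of formula 7: multi-modal fusion with tanh gating and L2 normalization.
import torch
import torch.nn.functional as F

def fuse(c, x, w_p, w_x):
    """c: (N, D_n) language features, x: (N, D_v) visual features, w_p: (D_n, D_m), w_x: (D_v, D_m)."""
    y = torch.tanh(c @ w_p) * torch.tanh(x @ w_x)   # element-wise product in the common D_m space
    return F.normalize(y, p=2, dim=-1)              # L2Norm over the feature dimension
```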
4.2 The multi-modal feature Y is reshaped to size C×W×H, where H, W and C denote the height, width and number of channels of Y, and the total number of nodes of the multi-modal graph G is K = W×H. Each position of the multi-modal feature is treated as an image region, each image region is represented as a node of the multi-modal graph G, and the multi-modal feature Y is assigned as the node features F. The relation-word weighted normalized context on the edges of the multi-modal graph is defined as shown in formula 8:
[formula 8: relation-word weighted normalized context r_l]
where the weight matrices are trainable parameters, D_e denotes a hyperparameter, D_l and D_h denote the number of word types and the feature dimension respectively, and δ denotes an activation function.
The affinity matrix between the multi-modal features of the nodes and the weighted relation context of the words is defined as shown in formula 9:
[formula 9: affinity matrix A_F between the node multi-modal features and the weighted relation context of the words]
where the transformation matrices are trainable parameters and the feature vector of the positional relation words is used as input.
The normalized affinity matrices are then given by formulas 10 and 11:
A_1 = softmax(A_F)   (formula 10)
A_2 = softmax(A_F^T)   (formula 11)
where A_1 and A_2 are obtained by normalizing the affinity matrix A_F along its rows and along its columns respectively. The adjacency matrix is then given by formula 12:
A = A_1 A_2   (formula 12)
where the element a_i,j of A represents the normalized amount of information flow between node i and node j of the multi-modal graph G.
4.3 In the fully connected multi-modal graph, the multi-modal information of the nodes is propagated through the adjacency matrix. However, the information in some nodes is irrelevant, and unconstrained information propagation between nodes may introduce a great deal of noise and redundant information, confusing the result. Therefore, in the invention, a gate is generated for the nodes and for the edges of the multi-modal graph respectively, so as to restrict the information propagation between nodes. The gate of the edges, g_v, is defined as the sum of the weighted normalized attention values of the individual words in the expression, as shown in formula 13:
[formula 13: edge gate g_v, the sum of the weighted normalized attention values of the words]
where β_l denotes the weighted normalized attention.
The gate of the nodes, g_e, is defined as the sum of all the weighted probabilities of the individual words in the expression, as shown in formula 14:
[formula 14: node gate g_e, the sum of the weighted probabilities of the words]
where r_l denotes the relation-word weighted normalized context on the edges of the multi-modal graph.
4.4 The adjacency matrix A is multiplied by the gate of the edges g_v, and the node features F are multiplied by the gate of the nodes g_e. The node features are then updated by graph convolution operations, as shown in formulas 15-17:
[formulas 15-17: gated graph convolution update of the node features]
where the weight matrices and the biases b_1 and b_2 are all trainable parameters, the two intermediate quantities denote the features of the bidirectional relations, σ(·) denotes the sigmoid activation function, the result of the t-th step is the t-th reasoning multi-modal feature, and A^T denotes the transpose of matrix A. Through multiple graph convolution operations, a more discriminative multi-modal feature representation is generated.
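One gated reasoning step might be sketched as below. Because formulas 13-17 appear only as images, the multiplicative application of the gates, the propagation along A and A^T, and the residual update are assumptions; only the presence of edge and node gates, sigmoid activations and bidirectional propagation follows the text.

```python
# Sketch of formulas 13-17: gated information propagation on the multi-modal graph (assumed form).
import torch
import torch.nn as nn

class GraphReasoningStep(nn.Module):
    def __init__(self, d_m=512):
        super().__init__()
        self.w_fwd = nn.Linear(d_m, d_m)   # propagation along A
        self.w_bwd = nn.Linear(d_m, d_m)   # propagation along A^T
        self.update = nn.Linear(2 * d_m, d_m)

    def forward(self, x, adj, gate_edge, gate_node):
        """x: (K, D_m) node features, adj: (K, K), gate_edge: scalar or (K, K), gate_node: (K, 1)."""
        a = adj * gate_edge                          # gate the edges (g_v in the text)
        h = x * gate_node                            # gate the nodes (g_e in the text)
        fwd = torch.sigmoid(self.w_fwd(a @ h))       # relation feature along A
        bwd = torch.sigmoid(self.w_bwd(a.t() @ h))   # relation feature along A^T
        return x + self.update(torch.cat([fwd, bwd], dim=-1))   # residual update of the node features
```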
4.5 The multi-modal features are upsampled on the multi-modal feature map by an ASPP (Atrous Spatial Pyramid Pooling) decoder using deconvolution, and the segmentation mask is predicted. The whole model is then trained end to end with a pixel-level cross-entropy loss between the predicted segmentation mask and the Ground Truth segmentation mask.
The model is trained end to end in two stages. Training is first performed at a low-resolution scale without upsampling, and then at high resolution. The basic parameters are set as follows: the word embedding size and the hidden state size are both set to 1024, and the number of dynamic filters is set to 10. Training uses the Adam optimizer with an initial learning rate of 1×10^-5 and a batch size of 1; the graph convolution is set to 3 layers, and the loss function is the cross-entropy loss.
To better illustrate the advantage of cross-modal attention in the model, the attention heat maps between the image and the expression are visualized, as shown in fig. 5. In the figure, (a) is the original image, (b) and (c) are the attention maps for a single word and for the complete expression of a test sample, (d) is the prediction result of the invention, and (e) is the attention map obtained after manually modifying the expression.
As shown in fig. 6, with the image fixed, the segmentation result is predicted under different expressions, such as "bottom center brown donut" referring to the third donut in the third row and "top row middle rounded donut" referring to the second donut in the first row, and the segmentation is still performed accurately. As can be seen from the figure, the model is able to correctly segment all the instances, which highlights the flexibility and adaptability of the proposed model.
Fig. 7 shows the language-guided cross-modal instance segmentation results of the invention. Each column of the figure shows, respectively, the original image, the Ground Truth, the prediction produced by the RNN-CNN baseline method, the prediction produced by the baseline + CMA module, the prediction produced by the baseline + MSFR module, and the prediction produced by the invention. Column (f) of row 1 of fig. 7 indicates that the CMVR can segment objects that are heavily occluded and have no locative words; column (f) of row 2 shows that the CMVR can distinguish objects against similar backgrounds, whereas the baseline model segments the instance referred to by the language but confuses it with a similar context. As can be seen from row 3 of fig. 7, the baseline model fails to locate the orange in the expression (e.g. "lite orange to the right of the big orange in the back"), while the invention, guided by attention, successfully distinguishes the indicated object in a complex scene. Column (e) of row 1 shows that the CMVR can capture global and local detail information and accurately segment the referent, highlighting its potential for relative-position reasoning. Column (f) of row 4 shows the ability of the CMVR to accurately identify the referred object, which illustrates that the invention can effectively reduce the edge blocking effect and highlight the details of the referred object.
The above embodiments are not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make variations, modifications, additions or substitutions within the technical scope of the present invention.

Claims (7)

1. A language-guided cross-modal instance segmentation method is characterized by comprising the following steps: the method comprises the following four steps:
step 1, visual features are extracted from an input image through a CNN encoder, and language features are extracted from an expression through a language encoder;
step 2, introducing an MSFR module, obtaining multi-scale refining information from the visual features extracted by the CNN encoder, wherein the multi-scale refining information comprises global and local multi-scale detail information, and fusing the information to enhance the representation capability of the visual features;
step 3, constructing a CMA (cross-modal attention) module, performing cross-modal interaction between the visual features obtained by the CNN encoder and the language features obtained by the language encoder, computing cross-modal attention, and aligning the informative keywords in the expression with the corresponding regions of the input image;
and 4, fusing the visual features enhanced by the MSFR module with the language features obtained by the language encoder, inputting the fused features into the VR module, and generating the multi-modal feature representation for mask prediction.
2. The language-guided cross-modal instance segmentation method according to claim 1, wherein: in step 1, the specific process of visual feature extraction is as follows:
1.1 a 3-channel image I ∈ R^(W×H×3) is input, where W and H are the width and height of the image and R denotes the real-number space of dimension W×H×3; visual features x_I ∈ R^(w×h×D_v) are extracted by the CNN backbone network, where w, h and D_v are the width, height and channel dimension of the visual features; to preserve spatial coordinate information, the visual feature x_I is concatenated with an 8-dimensional spatial coordinate feature, and a 1×1 convolution then converts the concatenated feature to the visual feature X ∈ R^(w×h×D_x).
3. The method according to claim 2, wherein: in step 1, the specific process of language feature extraction is as follows:
1.2 the input expression S = {s_l}, l = 1, ..., L, is fed into the language encoder, where s_l is the l-th word and L is the sentence length; first, a pre-trained word embedding model is used to initialize the embedding of each word, as shown in formula 1:
e_l = embedding(s_l)   (formula 1)
where embedding(·) denotes the pre-trained word embedding model and e_l denotes the word embedding of the l-th word; the word embeddings are then fed into a bidirectional LSTM network to encode the context of each word, and the context representation h_l of a word is obtained by concatenating its forward and backward hidden state vectors, as shown in formula 2:
h_l = [h_l^fw ; h_l^bw]   (formula 2)
where h_l^fw and h_l^bw denote the hidden states in the forward and backward directions of the bidirectional LSTM network; the language feature of the whole expression is H = {h_l} ∈ R^(L×D_h), where D_h denotes the dimension of the word context and h_l is the hidden state of the l-th word.
4. The language-guided cross-modal instance segmentation method according to claim 1, wherein: in step 2, the MSFR module is constructed to capture the detail information of the visual feature x_I; the module consists of a Local PyConv branch and a Global PyConv branch; the Local PyConv branch is responsible for capturing local detail information; the Global PyConv branch is used for capturing global detail information; and the output features of the Local PyConv branch and the Global PyConv branch are fused by a 1×1 convolution to obtain a 1024-dimensional feature map containing local and global detail information.
5. The language-guided cross-modal instance segmentation method according to claim 1, wherein the specific processing procedure of step 3 is as follows:
3.1 all the words in the sentence are divided into four types (entities, attributes, positional relationships and unnecessary words), and a four-dimensional vector is computed to represent the probability that a word belongs to each of the four types; the probability vector is defined as shown in formula 3:
q_l = softmax(W_1 σ(W_0 h_l + b_0) + b_1)   (formula 3)
where W_0, W_1, b_0 and b_1 are trainable parameters, σ denotes an activation function, softmax(·) denotes the normalized exponential function, and D_l and D_h denote the number of word types and the feature dimension respectively; the components q_l^ent, q_l^att, q_l^rel and q_l^unn of q_l represent the probabilities that the word s_l is an entity, attribute, relationship or unnecessary word respectively;
3.2 based on the word features {h_l} and the entity and attribute words (q_l^ent, q_l^att), the cross-modal attention between each word and the different image regions is computed; the weighted normalized attention of a word is defined as shown in formulas 4 and 5:
[formula 4: cross-modal attention α_l]
[formula 5: weighted normalized attention β_l]
where W_x and W_h are transformation matrices, D_n denotes a hyperparameter, D_x and D_h denote the dimensions of the visual and word features respectively, α_l denotes the cross-modal attention, and β_l is the weighted normalized attention representing the weighted probability that the word s_l refers to the relevant image regions; the attention-weighted features of all the words are aggregated to compute the language feature C of the corresponding entities:
[formula 6: entity language feature C aggregated from the attention-weighted word features]
where h_l is the hidden state of the l-th word.
6. The method according to claim 4, wherein the specific processing procedure of step 4 is as follows:
4.1 the global language feature C of the entities and the visual feature X are fused to obtain the multi-modal feature Y, as shown in formula 7:
Y = L2Norm(tanh(W_p C) ⊙ tanh(W_X X))   (formula 7)
where W_p ∈ R^(D_n×D_m) and W_X ∈ R^(D_v×D_m) are trainable parameters, ⊙ denotes element-wise multiplication of the corresponding matrix entries, and L2Norm(·) denotes L2 normalization; D_n, D_m and D_v denote the dimensions of the global language feature, the multi-modal feature and the visual feature respectively, and R denotes the real-number space of dimension D_n×D_m or D_v×D_m;
4.2 the multi-modal feature Y is reshaped to size C×W×H, where H, W and C denote the height, width and number of channels of Y, and the total number of nodes of the multi-modal graph G is K = W×H; each position of the multi-modal feature is treated as an image region, each image region is represented as a node of the multi-modal graph G, and the multi-modal feature Y is assigned as the node features F; the relation-word weighted normalized context on the edges of the multi-modal graph is defined as shown in formula 8:
[formula 8: relation-word weighted normalized context r_l]
where the weight matrices are trainable parameters, D_e denotes a hyperparameter, D_l and D_h denote the number of word types and the feature dimension respectively, and δ denotes an activation function;
the affinity matrix between the multi-modal features of the nodes and the weighted relation context of the words is defined as shown in formula 9:
[formula 9: affinity matrix A_F between the node multi-modal features and the weighted relation context of the words]
where the transformation matrices are trainable parameters and the feature vector of the positional relation words is used as input;
the normalized affinity matrices are then given by formulas 10 and 11:
A_1 = softmax(A_F)   (formula 10)
A_2 = softmax(A_F^T)   (formula 11)
where A_1 and A_2 are obtained by normalizing the affinity matrix A_F along its rows and along its columns respectively; the adjacency matrix is then given by formula 12:
A = A_1 A_2   (formula 12)
where the element a_i,j of A represents the normalized amount of information flow between node i and node j of the multi-modal graph G;
4.3 a gate is generated for the nodes and for the edges of the multi-modal graph respectively, so as to restrict the information propagation between nodes; the gate of the edges, g_v, is defined as the sum of the weighted normalized attention values of the individual words in the expression, as shown in formula 13:
[formula 13: edge gate g_v, the sum of the weighted normalized attention values of the words]
where β_l denotes the weighted normalized attention;
the gate of the nodes, g_e, is defined as the sum of all the weighted probabilities of the individual words in the expression, as shown in formula 14:
[formula 14: node gate g_e, the sum of the weighted probabilities of the words]
where r_l denotes the relation-word weighted normalized context on the edges of the multi-modal graph;
4.4 the adjacency matrix A is multiplied by the gate of the edges g_v, and the node features F are multiplied by the gate of the nodes g_e; the node features are then updated by graph convolution operations, as shown in formulas 15-17:
[formulas 15-17: gated graph convolution update of the node features]
where the weight matrices and the biases b_1 and b_2 are all trainable parameters, the two intermediate quantities denote the features of the bidirectional relations, σ(·) denotes the sigmoid activation function, the result of the t-th step is the t-th reasoning multi-modal feature, and A^T denotes the transpose of matrix A; through multiple graph convolution operations, a more discriminative multi-modal feature representation is generated;
4.5 the multi-modal features are upsampled on the multi-modal feature map by an ASPP decoder using deconvolution, and the segmentation mask is predicted; then, the whole model is trained end to end with a pixel-level cross-entropy loss between the predicted segmentation mask and the Ground Truth segmentation mask.
7. The method according to claim 6, wherein the end-to-end training is divided into two stages:
1) training is first performed at a low-resolution scale without upsampling, and then at high resolution; the basic parameters are set as follows: the word embedding size and the hidden state size are both set to 1024, and the number of dynamic filters is set to 10;
2) training uses the Adam optimizer with an initial learning rate of 1×10^-5 and a batch size of 1; the graph convolution is set to 3 layers, and the loss function is the cross-entropy loss.
CN202111408303.3A 2021-11-25 2021-11-25 Language-guided cross-modal instance segmentation method Pending CN114119975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111408303.3A CN114119975A (en) 2021-11-25 2021-11-25 Language-guided cross-modal instance segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111408303.3A CN114119975A (en) 2021-11-25 2021-11-25 Language-guided cross-modal instance segmentation method

Publications (1)

Publication Number Publication Date
CN114119975A true CN114119975A (en) 2022-03-01

Family

ID=80372447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111408303.3A Pending CN114119975A (en) 2021-11-25 2021-11-25 Language-guided cross-modal instance segmentation method

Country Status (1)

Country Link
CN (1) CN114119975A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677515A (en) * 2022-04-25 2022-06-28 电子科技大学 Weak supervision semantic segmentation method based on inter-class similarity
CN114842312A (en) * 2022-05-09 2022-08-02 深圳市大数据研究院 Generation and segmentation method and device for unpaired cross-modal image segmentation model
CN114842312B (en) * 2022-05-09 2023-02-10 深圳市大数据研究院 Generation and segmentation method and device for unpaired cross-modal image segmentation model
CN115019037A (en) * 2022-05-12 2022-09-06 北京百度网讯科技有限公司 Object segmentation method, training method and device of corresponding model and storage medium
CN115049899A (en) * 2022-08-16 2022-09-13 粤港澳大湾区数字经济研究院(福田) Model training method, reference expression generation method and related equipment
CN115049899B (en) * 2022-08-16 2022-11-11 粤港澳大湾区数字经济研究院(福田) Model training method, reference expression generation method and related equipment
CN116644788A (en) * 2023-07-27 2023-08-25 山东交通学院 Local refinement and global reinforcement network for vehicle re-identification
CN116644788B (en) * 2023-07-27 2023-10-03 山东交通学院 Local refinement and global reinforcement network for vehicle re-identification
CN117593527A (en) * 2024-01-18 2024-02-23 厦门大学 Directional 3D instance segmentation method based on chain perception
CN117593527B (en) * 2024-01-18 2024-05-24 厦门大学 Directional 3D instance segmentation method based on chain perception

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN114119975A (en) Language-guided cross-modal instance segmentation method
CN109711426B (en) Pathological image classification device and method based on GAN and transfer learning
CN111325751A (en) CT image segmentation system based on attention convolution neural network
CN111260740A (en) Text-to-image generation method based on generation countermeasure network
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN112288011B (en) Image matching method based on self-attention deep neural network
CN111340814A (en) Multi-mode adaptive convolution-based RGB-D image semantic segmentation method
CN109712108B (en) Visual positioning method for generating network based on diversity discrimination candidate frame
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN111598183A (en) Multi-feature fusion image description method
CN112651940A (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
US20240013564A1 (en) System, devices and/or processes for training encoder and/or decoder parameters for object detection and/or classification
CN114648535A (en) Food image segmentation method and system based on dynamic transform
CN114201592A (en) Visual question-answering method for medical image diagnosis
CN111899203A (en) Real image generation method based on label graph under unsupervised training and storage medium
CN110633706B (en) Semantic segmentation method based on pyramid network
Agrawal et al. Image Caption Generator Using Attention Mechanism
CN115222998A (en) Image classification method
Jian et al. Dual-Branch-UNet: A Dual-Branch Convolutional Neural Network for Medical Image Segmentation.
Yao et al. Transformers and CNNs fusion network for salient object detection
Dong et al. Research on image classification based on capsnet
Sudhakaran et al. Gate-shift-fuse for video action recognition
US11948090B2 (en) Method and apparatus for video coding
CN113159053A (en) Image recognition method and device and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination