CN115496991A - Reference expression understanding method based on multi-scale cross-modal feature fusion - Google Patents

Reference expression understanding method based on multi-scale cross-modal feature fusion

Info

Publication number
CN115496991A
Authority
CN
China
Prior art keywords
language
feature
scale
features
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211009462.0A
Other languages
Chinese (zh)
Inventor
王鹏
孙梦阳
张艳宁
索伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202211009462.0A
Publication of CN115496991A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/86: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a reference expression understanding method based on multi-scale cross-modal feature fusion, and belongs to the field of language-image multi-modal fusion. In the cross-modal feature fusion process, a linear feature modulation module and a visual-guided language attention module are combined for feature fusion; meanwhile, the model uses language to select and screen multi-scale grid-level features, adaptively selects key clues from the low-level and high-level fused features, and establishes relations between cross-scale information through dynamic routing. Experimental results show that the new model architecture reaches a new state-of-the-art level on several benchmarks and provides new insights and directions for REC research.

Description

Reference expression understanding method based on multi-scale cross-modal feature fusion
Technical Field
The invention belongs to the field of language-image multi-modal fusion, and particularly relates to a reference expression understanding method based on multi-scale cross-modal feature fusion.
Background
Referring Expression Comprehension (REC) is a challenging and significant task in the field of computer vision. The task requires a computer to obtain, through inference and analysis, the target region corresponding to a given natural-language description of a given image; it is a basic task in many fields such as human-computer interaction, visual question answering and image retrieval. Unlike traditional object detection, which classifies a fixed set of objects using predefined class labels, REC involves a comprehensive understanding of complex natural language and variable images. Research on REC focuses mainly on the fusion strategy, the fusion stage and the design of the detection head. However, although REC is a cross-modal task, it remains closely linked to object detection.
Modern object detection models can be roughly divided into three stages: a network backbone, a neck and a detection head. For the neck part, experience with detection models has demonstrated that assembling a well-matched Feature Pyramid Network (FPN) is important to close semantic gaps and improve performance. However, on the one hand, current REC models usually only use an averaging strategy to fuse multi-scale features, or even use single-scale features, and research on multi-scale feature fusion is far from sufficient. On the other hand, because low-level features contain more attribute information such as color and texture, while high-level features carry rich semantic information, multi-scale feature fusion can combine the advantages of both and adapt to different language expressions. The invention provides a multi-scale fusion method suitable for the REC task, which helps the model reason and accurately locate targets of different sizes.
Disclosure of Invention
Technical problem to be solved
In order to overcome the defects of the prior art, the invention provides a reference expression understanding method based on multi-scale cross-modal feature fusion. The method can dynamically allocate and select fine-grained information from the multi-scale feature maps by using a language gate and a joint gate, and is a more effective and reliable method for referring expression comprehension.
Technical scheme
A reference expression understanding method based on multi-scale cross-modal feature fusion, characterized by comprising the following steps:
Step 1: the pictures are first resized to the same size, then feature maps at n scales are extracted through ResNet-101 and mapped to a common dimension d through 1×1 convolution, giving the multi-scale visual features V = {V_i}, V_i ∈ R^{H_i×W_i×d}, i = 1, …, n;
for the language information, it is first decomposed into words, and each word is embedded to obtain its feature vector; the maximum number of words in a sentence is defined as T, and sentences with fewer than T words are padded with PAD tokens; a CLS token is added at the beginning of the sentence and an SEP token at its end; the position-encoded word vectors are fed into a BERT network to obtain the feature vectors of all tokens fused with sentence information, E ∈ R^{T×d};
Step 2: E and V are input into the cross-modal interactive attention module of the model, which consists of two parts, namely a linear feature modulation (FiLM) module and a visual-guided language attention module; in the FiLM module, a feature-based affine transformation is applied to adaptively influence the output of the network; for the given language feature E ∈ R^{T×d}, the sentence-level representation E_F is first obtained through an averaging strategy, and then specifically:
γ_i = Tanh(W_i^γ E_F + b_i^γ)   (1)
β_i = Tanh(W_i^β E_F + b_i^β)   (2)
F_f^i = ReLU(Conv(γ_i ⊙ V_i ⊕ β_i))   (3)
where W_i^γ, W_i^β and b_i^γ, b_i^β are the weights and biases of two multilayer perceptrons (MLPs) with activation function Tanh; in equation (3), ⊙ and ⊕ denote element-wise multiplication and addition, respectively, and Conv with ReLU denotes the standard 3×3 convolution and ReLU operations that are finally applied to generate the multi-level fusion features F_f = {F_f^i};
For the visual-guided language attention module, the visual features V_i are first flattened into Z_i ∈ R^{N_i×d}, where N_i = H_i × W_i is the number of visual tokens; the attention output A_i is then computed from the language feature E and the visual feature Z_i by the following formula:
A_i = softmax((E W_i^Q)(Z_i W_i^K)^T / √d_m)(Z_i W_i^V)   (4)
where W_i^Q, W_i^K and W_i^V are embedding matrices, the queries, keys and values in the attention module are denoted by Q, K and V, respectively, d_m = d/m, m is the number of attention heads and d is the feature dimension; for simplicity, only one language attention module is used for each level of visual features; then, A_i is further encoded by two feed-forward networks (FFN) with residual connections to form the fused output F_t = {F_t^i};
F_f and F_t are concatenated to obtain F_ft, and F_ft is then mapped to dimension d using three 1×1 convolutional layers; finally, the combined multi-scale features are obtained;
Step 3: construct the language-guided feature pyramid module FPN.
First, a routing space of depth K is constructed; in this routing space, the scale factor between adjacent stages is limited to 2. For each routing node, the input consists of two parts: the multi-level feature maps and an attention-based language vector; the grid-level features of each scale in each routing node are selected by the language gate.
The input to the language gate consists of two parts: the multi-level feature maps and an attention-based language vector. The attention-based language vector is obtained by the following formulas:
a_k = softmax(E W_k)   (5)
e_k = Σ_{t=1}^{T} a_{k,t} E_t   (6)
where W_k ∈ R^{256×1} are learnable weights, k represents the depth, and the language vector e_k is shared across every scale and grid feature. The multi-scale feature maps can be expressed as X_{i,k} ∈ R^{N×d}, where i denotes the i-th scale, k the k-th layer, and N = H_i × W_i. Through equations (7) and (8), the language gate uses e_k to dynamically select the grid-level features in X_{i,k}: gate values are computed from X_{i,k} and e_k, and the selected output Y_{i,k} is obtained by re-weighting X_{i,k} with these gate values; in equations (7) and (8), * and · denote the convolution operation and the Hadamard product, respectively, Conv(·) denotes a 3×3 convolutional network, and σ(·) is the activation function; here max(0, tanh(·)) is used as the gate switch, so that when the input is negative the output of the function is always 0 and no additional threshold is needed in the inference phase.
Then, the outputs Y_{i,k} are respectively up-sampled, kept unchanged, or down-sampled to match the smaller and larger scales (equation (9)).
The aggregated output in routing node l is then formed, and its fine-grained features are further refined by the joint gate; specifically, the refined output is calculated by equation (10), where conv is a 1×1 convolutional network that maps the input features into a single channel and l denotes the l-th node. The nodes of the last layer are used for multi-scale fusion: equation (11) summarizes (averages) the information of the different nodes to obtain F_AVG, which is used as the input to the detection head.
and 4, step 4: locating a target using an anchorless sensing head
For output F of step 3 AVG First, a feature map having a shape of w × h × 5 is obtained using a 1 × 1 convolutional layer, and five predicted values { t } are shown x ,t y ,t w ,t h t, where the first two values represent the center offset, t w And t h Respectively, normalized width and height; the last t is a confidence score which represents whether the central point of the object exists at the position or not; finally, applying the cross entropy loss L on the center point t cls, Applying MSE loss L in center offset, width and height off (ii) a Simultaneously, the GIoU loss is used as an auxiliary loss; finally, the whole function is defined as:
Figure BDA0003809446600000048
L off =(Δx-t x ) 2 +(Δ y -t y ) 2 (14)
C ij =1 or 0 denotes whether or not the current cell contains true targetThe center point of the central point is provided with a central point,
Figure BDA0003809446600000049
Figure BDA0003809446600000051
represents the offset of the center point to the center of the grid, wherein x and y refer to
Figure BDA0003809446600000052
int (·) denotes that this operation rounds the fraction to the nearest integer; performing L only on the grid where the truth value center is positioned off (ii) a The total loss function is as follows:
Loss=L clsoff L off +L giou (15)
wherein λ is off Setting to 5, the network selects the central point with the highest score to generate a bounding box; the IoU is a metric used in REC to measure the degree of overlap between prediction and reality.
A computer system, comprising: one or more processors, a computer readable storage medium, for storing one or more programs, which when executed by the one or more processors, cause the one or more processors to implement the above-described method.
A computer-readable storage medium having stored thereon computer-executable instructions for performing the above-described method when executed.
Advantageous effects
The invention provides a reference expression understanding method based on multi-scale cross-modal feature fusion, which performs referring expression comprehension through an innovative and effective multi-scale cross-modal fusion. Unlike previous models, in the cross-modal feature fusion process the model combines linear feature modulation and a visual-guided language attention module for feature fusion. Meanwhile, the model uses language to select and screen multi-scale grid-level features, adaptively selects key clues from the low-level and high-level fused features, and establishes relations between cross-scale information through dynamic routing. Experimental results show that the new model architecture reaches a new state-of-the-art level on several benchmarks and provides new insights and directions for REC research.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a schematic diagram of the one-stage referring expression comprehension network structure of the method of the present invention.
FIG. 2 shows referring expression comprehension results based on the multi-scale cross-modal feature fusion mechanism.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The technical scheme of the invention comprises the following modules: the first part is an extraction and encoding module for the language and image information; the second part is a cross-modal feature fusion module based on linear feature modulation (FiLM) and a language-guided visual attention mechanism; the third part is a language-guided multi-scale feature fusion module; and the fourth part is the localization process of the target. In the first part, a ResNet-101 convolutional neural network and a BERT pre-trained model are used to extract features from the picture information and the language information, respectively. In the second part, FiLM and language-guided visual attention are used to fuse the cross-modal features separately, and the two results are then mapped to obtain the final fused features. In the third part, a dynamic routing strategy is executed with a language gate and a joint gate in the language-guided multi-scale feature fusion module to select and fuse features of different scales. Finally, the fused multi-scale features are sent to a prediction head to obtain the target object region.
Based on the above modules, the embodiment of the present invention provides a one-stage, proposal-free referring expression comprehension method that combines a cross-modal attention mechanism with language-guided multi-scale fusion. The specific process is as follows:
step 1: the pictures are firstly adjusted to be the same in size, then n scales of feature maps are extracted through Resnet-101, and the feature maps are mapped to the same dimension d through 1 multiplied by 1 convolution to obtain the picture
Figure BDA0003809446600000061
For language information, the language information is decomposed into words, and the feature vectors corresponding to the words are obtained after the words are embedded. And (5) specifying that the longest sentence word number is T, and filling the blank of the sentence with less word number than T with PAD marks. CLS marks are added at the beginning positions of the sentences, and SEP marks are added at the ends of the sentences. Inputting the word vectors subjected to position coding into a BERT network to obtain the feature vectors of all words and phrases fused with statement information
Figure BDA0003809446600000062
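For illustration, a minimal PyTorch-style sketch of the two feature extractors in step 1 is given below. The three ResNet stages, the input size 640 × 640 and the common dimension d = 256 follow the embodiment; the class names, the torchvision/HuggingFace helpers and the linear projection of the BERT outputs to d are assumptions.

```python
# Sketch of step 1 (feature extraction); assumes PyTorch, torchvision and
# HuggingFace transformers are installed. Class names are illustrative.
import torch
import torch.nn as nn
import torchvision
from transformers import BertModel, BertTokenizer

D = 256  # common feature dimension d used in the embodiment

class VisualEncoder(nn.Module):
    """ResNet-101 backbone returning three scales mapped to d channels by 1x1 convs."""
    def __init__(self, d=D):
        super().__init__()
        r = torchvision.models.resnet101(weights=None)  # pretrained weights would normally be loaded
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool, r.layer1)
        self.layer2, self.layer3, self.layer4 = r.layer2, r.layer3, r.layer4
        # 1x1 convolutions mapping 512/1024/2048 channels to the common dimension d
        self.proj = nn.ModuleList([nn.Conv2d(c, d, 1) for c in (512, 1024, 2048)])

    def forward(self, img):                      # img: (B, 3, 640, 640)
        c3 = self.layer2(self.stem(img))         # (B,  512, 80, 80)
        c4 = self.layer3(c3)                     # (B, 1024, 40, 40)
        c5 = self.layer4(c4)                     # (B, 2048, 20, 20)
        return [p(c) for p, c in zip(self.proj, (c3, c4, c5))]  # V_1..V_3, each (B, d, H_i, W_i)

class LanguageEncoder(nn.Module):
    """BERT encoder producing token features E of shape (B, T, d)."""
    def __init__(self, d=D, max_len=20):
        super().__init__()
        self.tok = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, d)
        self.max_len = max_len

    def forward(self, sentences):
        # The tokenizer adds [CLS]/[SEP] and pads to T = max_len with [PAD]
        batch = self.tok(sentences, padding="max_length", truncation=True,
                         max_length=self.max_len, return_tensors="pt")
        out = self.bert(**batch).last_hidden_state   # (B, T, 768)
        return self.proj(out)                        # E: (B, T, d)
```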
Step 2: e and V are input into a cross-modal interactive attention module of the model. The module consists of two parts, a linear feature modulation module (FiLM) and a visual-guided language attention module. In the FiLM module, a feature-based affine transformation is applied to adaptively affect the output of the network. For a given language feature
Figure BDA0003809446600000071
Firstly, obtaining the whole expression E through a simple averaging strategy F And then specifically by:
Figure BDA0003809446600000072
Figure BDA0003809446600000073
Figure BDA0003809446600000074
wherein W i γ ,W i β
Figure BDA0003809446600000075
And
Figure BDA0003809446600000076
are the weights and the deviations of the two multi-layer perceptrons MLP with activation functions Tanh. For equation (3). An as
Figure BDA0003809446600000077
Respectively representing element-wise multiplication and addition. Finally, standard 3 × 3 convolution and ReLU operations are applied to generate the multi-level fusion features
Figure BDA0003809446600000078
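The FiLM-style modulation of equations (1)-(3) can be sketched as follows for one visual scale; the single-layer form of each MLP and the padding of the 3×3 convolution are assumptions, and E_F is the average of E over the token dimension as described above.

```python
import torch
import torch.nn as nn

class FiLMFusion(nn.Module):
    """Linear feature modulation of one visual scale by the sentence vector E_F (eqs. (1)-(3))."""
    def __init__(self, d=256):
        super().__init__()
        self.to_gamma = nn.Sequential(nn.Linear(d, d), nn.Tanh())  # W^gamma, b^gamma
        self.to_beta = nn.Sequential(nn.Linear(d, d), nn.Tanh())   # W^beta,  b^beta
        self.out = nn.Sequential(nn.Conv2d(d, d, 3, padding=1), nn.ReLU())  # 3x3 conv + ReLU

    def forward(self, v_i, e_f):                        # v_i: (B, d, H, W), e_f: (B, d)
        gamma = self.to_gamma(e_f)[:, :, None, None]    # (B, d, 1, 1), eq. (1)
        beta = self.to_beta(e_f)[:, :, None, None]      # (B, d, 1, 1), eq. (2)
        return self.out(gamma * v_i + beta)             # eq. (3): modulate element-wise, then conv + ReLU

# Usage: e_f is the average of the token features E over the token dimension,
# e.g. e_f = E.mean(dim=1); f_f_i = FiLMFusion()(v_i, e_f)
```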
For the visual-guided language attention module, the visual features V_i are first flattened into Z_i ∈ R^{N_i×d}, where N_i = H_i × W_i is the number of visual tokens. The attention output A_i is then computed from the language feature E and the visual feature Z_i by the following formula:
A_i = softmax((E W_i^Q)(Z_i W_i^K)^T / √d_m)(Z_i W_i^V)   (4)
where W_i^Q, W_i^K and W_i^V are embedding matrices. The queries, keys and values in the attention module are denoted by Q, K and V, respectively; d_m = d/m, m is the number of attention heads and d is the feature dimension. For simplicity, only one language attention module is used for each level of visual features. Then, A_i is further encoded by two Feed-Forward Networks (FFN) with residual connections to form the fused output F_t = {F_t^i}.
F_f and F_t are concatenated to obtain F_ft, and F_ft is then mapped to dimension d using three 1×1 convolutional layers. Finally, the combined multi-scale features are obtained.
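Equation (4) together with the two residual FFNs can be sketched with the standard multi-head attention primitive; taking the language features as queries and the flattened visual features as keys and values follows item 3 of the example below, while the FFN width, the layer normalization and the class name are assumptions.

```python
import torch
import torch.nn as nn

class LanguageAttentionFusion(nn.Module):
    """Eq. (4): language tokens attend to one flattened visual scale, then two residual FFNs."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn1 = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.ffn2 = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norm = nn.ModuleList([nn.LayerNorm(d) for _ in range(3)])

    def forward(self, e, v_i):                        # e: (B, T, d), v_i: (B, d, H, W)
        z_i = v_i.flatten(2).transpose(1, 2)          # Z_i: (B, N_i, d), N_i = H*W
        a_i, _ = self.attn(query=e, key=z_i, value=z_i)   # eq. (4)
        x = self.norm[0](e + a_i)                     # residual connection
        x = self.norm[1](x + self.ffn1(x))            # first FFN with residual
        f_t = self.norm[2](x + self.ffn2(x))          # second FFN with residual -> F_t^i
        return f_t                                    # (B, T, d)
```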
Step 3: a language-guided feature pyramid module (FPN) is constructed.
First, a routing space of depth K is constructed. In this routing space, the scale factor between adjacent stages is limited to 2. For each routing node, the input consists of two parts: the multi-level feature maps and an attention-based language vector. The grid-level features of each scale in each routing node are hard-selected by the language gate. In addition, since REC is a cross-modal task, a data-dependent joint gate module is established in each node, which further refines the grid-level features based on the aggregated information.
The input to the language gate consists of two parts: the multi-level feature maps and an attention-based language vector. The attention-based language vector is obtained by the following formulas:
a_k = softmax(E W_k)   (5)
e_k = Σ_{t=1}^{T} a_{k,t} E_t   (6)
where W_k ∈ R^{256×1} are learnable weights, k represents the depth, and the language vector e_k is shared across every scale and grid feature. The multi-scale feature maps can be expressed as X_{i,k} ∈ R^{N×d}, where i denotes the i-th scale, k the k-th layer, and N = H_i × W_i. Through equations (7) and (8), the language gate uses e_k to dynamically select the grid-level features in X_{i,k}: gate values are computed from X_{i,k} and e_k, and the selected output Y_{i,k} is obtained by re-weighting X_{i,k} with these gate values. In equations (7) and (8), * and · denote the convolution operation and the Hadamard product, respectively; Conv(·) denotes a 3×3 convolutional network and σ(·) is the activation function. Here max(0, tanh(·)) is used as the gate switch: when the input is negative, the output of the function is always 0, so no additional threshold is needed in the inference phase.
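Equations (7) and (8) appear only as formula images in the original filing, so the following sketch is one plausible reading built from the stated ingredients (a 3×3 convolution, the Hadamard product with the shared language vector e_k, and the max(0, tanh(·)) gate switch); it is not the patent's exact formula.

```python
import torch
import torch.nn as nn

class LanguageGate(nn.Module):
    """One plausible reading of eqs. (7)-(8): per-grid gates from X_{i,k} and the language vector e_k."""
    def __init__(self, d=256):
        super().__init__()
        self.conv = nn.Conv2d(d, 1, 3, padding=1)    # 3x3 conv mapping features to one gate channel

    def forward(self, x_ik, e_k):                    # x_ik: (B, d, H, W), e_k: (B, d)
        joint = x_ik * e_k[:, :, None, None]         # Hadamard product with the shared language vector
        gate = torch.clamp(torch.tanh(self.conv(joint)), min=0.0)  # max(0, tanh(.)) gate switch
        return gate * x_ik                           # selected grid-level features Y_{i,k}
```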
Then, the outputs Y_{i,k} are respectively up-sampled, kept unchanged, or down-sampled to match the smaller and larger scales (equation (9)).
The aggregated output in routing node l is then formed; to improve the efficiency of this deep routing network, a bottleneck module with residual connections is used. The fine-grained features of the aggregated output are further refined by the joint gate: specifically, the refined output is calculated by equation (10), where conv is a 1×1 convolutional network that maps the input features into a single channel and l denotes the l-th node. The nodes of the last layer are used for multi-scale fusion: equation (11) summarizes (averages) the information of the different nodes to obtain F_AVG, which is used as the input to the detection head.
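Likewise, equations (9)-(11) are only available as images; the sketch below shows one way a routing node could aggregate the resized gated features and refine them with a joint gate built from a single-channel 1×1 convolution. The aggregation by summation, the sigmoid activation of the joint gate and the bottleneck layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointGateNode(nn.Module):
    """One plausible routing node: aggregate resized neighbours, refine with a joint gate (eq. (10))."""
    def __init__(self, d=256):
        super().__init__()
        self.gate_conv = nn.Conv2d(d, 1, 1)          # 1x1 conv mapping features to one gate channel
        self.bottleneck = nn.Sequential(             # residual bottleneck for the deep routing network
            nn.Conv2d(d, d // 4, 1), nn.ReLU(),
            nn.Conv2d(d // 4, d // 4, 3, padding=1), nn.ReLU(),
            nn.Conv2d(d // 4, d, 1),
        )

    def forward(self, gated_features, target_hw):
        # gated_features: list of Y_{i,k} tensors (B, d, H_i, W_i) from neighbouring scales (eq. (9))
        resized = [F.interpolate(y, size=target_hw, mode="bilinear", align_corners=False)
                   for y in gated_features]
        agg = torch.stack(resized, dim=0).sum(dim=0)      # aggregated output of the node
        gate = torch.sigmoid(self.gate_conv(agg))         # joint gate from the aggregated information
        refined = gate * agg                              # eq. (10): refine the fine-grained features
        return refined + self.bottleneck(refined)         # residual bottleneck

# Final fusion (eq. (11)): resize the last-layer node outputs to a common size and average them, e.g.
# f_avg = torch.stack([F.interpolate(x, size=common_hw, mode="bilinear", align_corners=False)
#                      for x in last_layer_nodes]).mean(dim=0)
```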
Step 4: the target is located using an anchor-free detection head. For the output F_AVG of step 3, a feature map of shape w × h × 5 is first obtained using a 1×1 convolutional layer, giving five predicted values {t_x, t_y, t_w, t_h, t}, where the first two values represent the center offset and t_w and t_h are the normalized width and height, respectively. The last value t is the confidence score, indicating whether the center point of the object exists at that position. Finally, a cross-entropy loss L_cls is applied on the center confidence t, and an MSE loss L_off is applied on the center offset, width and height. At the same time, the GIoU loss is used as a supplemental loss. The loss terms are defined as follows, with the center-offset term
L_off = (Δx − t_x)² + (Δy − t_y)²   (14)
where C_ij = 1 or 0 indicates whether the current grid cell contains the center point of the ground-truth target; (Δx, Δy) represents the offset of the ground-truth center point to the center of the grid, x and y refer to the coordinates of the ground-truth center, and int(·) rounds the fraction to the nearest integer. L_off is computed only on the grid cell where the ground-truth center is located. The total loss function is as follows:
Loss = L_cls + λ_off L_off + L_giou   (15)
where λ_off is set to 5, and the network selects the center point with the highest score to generate the bounding box. The Intersection over Union (IoU) is the metric used in REC to measure the degree of overlap between the prediction and the ground truth. Following previous work, the present invention uses IoU@0.5 to measure prediction accuracy.
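A sketch of the anchor-free head and the total loss of equation (15) follows; the binary cross-entropy form of L_cls, the inclusion of width and height in the MSE term, and the precomputed GIoU term are assumptions consistent with the description rather than reproductions of the filed formulas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorFreeHead(nn.Module):
    """1x1 conv producing a (B, 5, w, h) map: t_x, t_y, t_w, t_h and a centre confidence t."""
    def __init__(self, d=256):
        super().__init__()
        self.pred = nn.Conv2d(d, 5, 1)

    def forward(self, f_avg):                 # f_avg: (B, d, w, h)
        return self.pred(f_avg)               # (B, 5, w, h)

def rec_loss(pred, centre_mask, offset_wh_target, giou_loss, lambda_off=5.0):
    """Eq. (15): Loss = L_cls + lambda_off * L_off + L_giou (sketch).

    pred:             (B, 5, w, h) output of AnchorFreeHead
    centre_mask:      (B, w, h) float tensor with C_ij = 1 at the ground-truth centre cell, 0 elsewhere
    offset_wh_target: (B, 4) target (dx, dy, w, h) for the ground-truth centre cell
    giou_loss:        precomputed auxiliary GIoU loss (scalar tensor)
    """
    conf = pred[:, 4]                                              # centre confidence t
    l_cls = F.binary_cross_entropy_with_logits(conf, centre_mask)  # cross-entropy on the centre map
    b = pred.shape[0]
    idx = centre_mask.flatten(1).argmax(dim=1)                     # index of the ground-truth centre cell
    box = pred[:, :4].flatten(2)[torch.arange(b), :, idx]          # (B, 4) predictions at that cell
    l_off = F.mse_loss(box, offset_wh_target)                      # MSE on offset, width and height
    return l_cls + lambda_off * l_off + giou_loss
```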
Example 1:
1. image feature extraction
Given a picture of a natural scene, the whole picture is resized to 640 × 640 and fed into the feature extraction network for forward propagation. This embodiment adopts ResNet-101 to extract the image features. Feature maps of three scales, 20 × 20 × 2048, 40 × 40 × 1024 and 80 × 80 × 512, are obtained, and the three feature maps are then mapped to the same dimension d = 256 using 1 × 1 convolutions.
2. Extraction of language features
The sentence is decomposed into words, and each word is embedded to obtain its feature vector. In this embodiment, the maximum number of words is 20. The position-encoded word vectors are then input into a BERT network to obtain the feature vectors E of all tokens fused with sentence information, where E ∈ R^{20×256}.
3. Feature fusion with cross-modal attention
The image features are flattened into matrices of size 400 × 256, 1600 × 256 and 6400 × 256 and input into the cross-modal fusion module together with the language features (20 × 256). In the language-guided visual attention module, the language features are input as Q and the image features are input as K and V. Each module consists of 2 identical self-attention layers with 8 attention heads. In FiLM, an affine transformation conditioned on the given language feature is applied to obtain a fused feature. Finally, the two fused features are concatenated and remapped to dimension d using 1 × 1 convolutions to obtain the final fused features.
4. Multi-scale fusion based on language guidance
The feature maps of the three scales fused in the previous stage and the language features are input into the routing nodes. The language first screens the grids of the feature maps through the language gate; up-sampling and down-sampling are then performed, and the original-resolution, up-sampled and down-sampled feature maps are sent to the joint gate for information aggregation. A bottleneck module with residual connections between the routing nodes of each layer ensures the effectiveness of the deep network; there are six layers in total. In the last layer of routing nodes, the feature maps of the three scales are combined into the final fused feature through an averaging strategy.
5. Target localization
For the output of the previous stage, a w × h × 5 feature map is first obtained using a 1 × 1 convolutional layer, giving five predicted values {t_x, t_y, t_w, t_h, t}, where the first two values represent the center offset and t_w and t_h represent the normalized width and height, respectively. The last value t is the confidence score, indicating whether the center point of the object exists at that position. During inference, the network selects the center point with the highest score to generate a bounding box.
6. Model training
The whole training process is end-to-end. In this embodiment, the four datasets RefCOCO, RefCOCO+, RefCOCOg and ReferItGame are used for model training and evaluation. The Adam method is used as the optimizer, the batch size is set to 8, and the initial learning rate is set to 1e-4. This example performs 20 rounds of training on a 1080Ti GPU, with the learning rate halved every 10 rounds.
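A minimal optimizer and schedule sketch matching these hyper-parameters is shown below; `model` and `train_loader` are placeholders for the full network and data pipeline of this embodiment.

```python
import torch

# Assumed placeholders: `model` is the full REC network, `train_loader` yields (images, sentences, targets)
# with a batch size of 8 set in the DataLoader.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)                          # Adam, initial lr 1e-4
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)    # halve lr every 10 epochs

for epoch in range(20):                              # 20 training rounds
    for images, sentences, targets in train_loader:
        loss = model(images, sentences, targets)     # end-to-end forward pass returning the total loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```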
7. Model application
After the training process, several models are obtained, and the optimal model (the one with the best result on the test set) is selected for application. For an input image and sentence, the image only needs to be resized to 640 × 640 and normalized, and the sentence tokenized, to serve as the input to the model. The parameters of the whole network model are fixed; only the image data and language data are input and propagated forward. The image and language feature vectors V and E are obtained in turn and then automatically passed through the cross-modal feature module, the language-guided multi-scale fusion module and the localization module to directly obtain the prediction result. The actual effect is shown in FIG. 2; based on this method, the precise position in the image of the target described by the given sentence can be given efficiently.
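For illustration, the inference-time preprocessing described above could look as follows; the ImageNet normalization statistics and the tokenizer settings are assumptions, since the patent only states that the image is resized to 640 × 640 and normalized and the sentence is tokenized.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import BertTokenizer

# Assumed ImageNet normalization statistics.
preprocess = transforms.Compose([
    transforms.Resize((640, 640)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def predict(model, image_path, sentence):
    """Run one fixed, trained model on a single image/sentence pair (sketch)."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)   # (1, 3, 640, 640)
    tokens = tokenizer(sentence, padding="max_length", truncation=True,
                       max_length=20, return_tensors="pt")
    with torch.no_grad():                       # parameters are fixed; forward propagation only
        box = model(image, tokens)              # predicted bounding box for the described target
    return box
```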
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present disclosure.

Claims (3)

1. A reference expression understanding method based on multi-scale cross-modal feature fusion, characterized by comprising the following steps:
step 1: first resizing the pictures to the same size, then extracting feature maps at n scales through ResNet-101, and mapping the feature maps to a common dimension d through 1×1 convolution to obtain the multi-scale visual features V = {V_i}, V_i ∈ R^{H_i×W_i×d}, i = 1, …, n;
for the language information, first decomposing it into words and embedding the words to obtain the feature vector corresponding to each word; defining the maximum number of words in a sentence as T, and padding sentences with fewer than T words with PAD tokens; adding a CLS token at the beginning of the sentence and an SEP token at its end; inputting the position-encoded word vectors into a BERT network to obtain the feature vectors of all tokens fused with sentence information, E ∈ R^{T×d};
step 2: inputting E and V into the cross-modal interactive attention module of the model, wherein the cross-modal interactive attention module consists of two parts, namely a linear feature modulation (FiLM) module and a visual-guided language attention module; in the FiLM module, a feature-based affine transformation is applied to adaptively influence the output of the network; for the given language feature E ∈ R^{T×d}, the sentence-level representation E_F is obtained through an averaging strategy, and then specifically:
γ_i = Tanh(W_i^γ E_F + b_i^γ)   (1)
β_i = Tanh(W_i^β E_F + b_i^β)   (2)
F_f^i = ReLU(Conv(γ_i ⊙ V_i ⊕ β_i))   (3)
wherein W_i^γ, W_i^β and b_i^γ, b_i^β are the weights and biases of two multi-layer perceptron MLPs with activation function Tanh; in equation (3), ⊙ and ⊕ denote element-wise multiplication and addition, respectively, and Conv with ReLU denotes the standard 3×3 convolution and ReLU operations that are finally applied to generate the multi-level fusion features F_f = {F_f^i};
for the visual-guided language attention module, the visual features V_i are first flattened into Z_i ∈ R^{N_i×d}, wherein N_i = H_i × W_i is the number of visual tokens, and the attention output A_i is then calculated from the language feature E and the visual feature Z_i by the following formula:
A_i = softmax((E W_i^Q)(Z_i W_i^K)^T / √d_m)(Z_i W_i^V)   (4)
wherein W_i^Q, W_i^K and W_i^V are embedding matrices, the queries, keys and values in the attention module are denoted by Q, K and V, respectively, d_m = d/m, m is the number of attention heads and d is the feature dimension; for simplicity, only one language attention module is used for each level of visual features; then, A_i is further encoded by two feed-forward networks FFN with residual connections to form the fused output F_t = {F_t^i};
F_f and F_t are concatenated to obtain F_ft, and F_ft is then mapped to dimension d using three 1×1 convolutional layers; finally, the combined multi-scale features are obtained;
step 3: constructing a language-guided feature pyramid module FPN
first, constructing a routing space of depth K, wherein in the routing space the scale factor between adjacent stages is limited to 2; for each routing node, the input consists of two parts: the multi-level feature maps and an attention-based language vector; the grid-level features of each scale in each routing node are hard-selected by the language gate;
the input to the language gate consists of two parts: the multi-level feature maps and an attention-based language vector; the attention-based language vector is obtained by the following formulas:
a_k = softmax(E W_k)   (5)
e_k = Σ_{t=1}^{T} a_{k,t} E_t   (6)
wherein W_k ∈ R^{256×1} are learnable weights, k represents the depth, and the language vector e_k is shared across every scale and grid feature; the multi-scale feature maps can be expressed as X_{i,k} ∈ R^{N×d}, wherein i denotes the i-th scale, k the k-th layer, and N = H_i × W_i; through equations (7) and (8), the language gate uses e_k to dynamically select the grid-level features in X_{i,k}: gate values are computed from X_{i,k} and e_k, and the selected output Y_{i,k} is obtained by re-weighting X_{i,k} with these gate values; in equations (7) and (8), * and · denote the convolution operation and the Hadamard product, respectively, Conv(·) denotes a 3×3 convolutional network, and σ(·) is the activation function; here max(0, tanh(·)) is used as the gate switch; when the input is negative, the output of the function is always 0, so that no additional threshold is needed in the inference phase;
then, the output Y is output i,k Respectively carrying out up-sampling, unchanged holding and down-sampling operations from a small scale to a large scale; the specific operation is as follows:
Figure FDA0003809446590000031
use of
Figure FDA0003809446590000032
To represent the aggregated output in routing node I,
Figure FDA0003809446590000033
the fine-grained features in (1) are further refined by the associative gate, specifically,
Figure FDA0003809446590000034
will be calculated by:
Figure FDA0003809446590000035
Figure FDA0003809446590000036
where conv is a 1 x 1 convolutional network that maps the input features into a channel, l denotes the ith node; the nodes of the last layer are used for multi-scale fusion, and the fusion mode is as follows:
Figure FDA0003809446590000037
formula (11) summarizes the information of different nodes to obtain F AVG It will be used as an input to the detection head;
and 4, step 4: locating a target using an anchorless sensing head
For the output F of step 3 AVG First, a feature map having a shape of w × h × 5 is obtained using a 1 × 1 convolutional layer, and five predicted values { t } are shown x ,t y ,t w ,t h t, where the first two values represent the center offset, t w And t h Respectively, normalized width and height; the last t is a confidence score representing whether the center point of the object exists at the position; finally, applying the cross entropy loss L on the center point t cls Applying MSE loss L in center offset, width and height off (ii) a Simultaneously, the GIoU loss is used as an auxiliary loss; finally, the whole function is defined as:
Figure FDA0003809446590000038
L off =(Δx-t x ) 2 +(Δy-t y ) 2 (14)
C ij =1 or 0 indicates whether the current grid contains the center point of the true target,
Figure FDA0003809446590000039
Figure FDA00038094465900000310
represents the offset of the center point to the center of the grid, wherein x and y refer to
Figure FDA00038094465900000311
int (·) denotes that this operation rounds the fraction to the nearest integer; performing L only on the grid where the truth value center is positioned off (ii) a The total loss function is as follows:
Loss=L clsoFF L off +L giou (15)
wherein λ is off Setting to 5, the network selects the central point of the highest score to generate a bounding box; the IoU is a metric used in REC to measure the degree of overlap between prediction and reality.
2. A computer system, comprising: one or more processors, a computer readable storage medium, for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
3. A computer-readable storage medium having stored thereon computer-executable instructions, which when executed, perform the method of claim 1.
CN202211009462.0A 2022-08-22 2022-08-22 Reference expression understanding method based on multi-scale cross-modal feature fusion Pending CN115496991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211009462.0A CN115496991A (en) 2022-08-22 2022-08-22 Reference expression understanding method based on multi-scale cross-modal feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211009462.0A CN115496991A (en) 2022-08-22 2022-08-22 Reference expression understanding method based on multi-scale cross-modal feature fusion

Publications (1)

Publication Number Publication Date
CN115496991A true CN115496991A (en) 2022-12-20

Family

ID=84465769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211009462.0A Pending CN115496991A (en) 2022-08-22 2022-08-22 Reference expression understanding method based on multi-scale cross-modal feature fusion

Country Status (1)

Country Link
CN (1) CN115496991A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117764085A (en) * 2024-01-11 2024-03-26 华中师范大学 Machine reading understanding method based on cross-graph collaborative modeling



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination