CN115496991A - Reference expression understanding method based on multi-scale cross-modal feature fusion - Google Patents
- Publication number
- CN115496991A (application CN202211009462.0A)
- Authority
- CN
- China
- Prior art keywords
- language
- feature
- scale
- features
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/86—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention relates to a referring expression understanding method based on multi-scale cross-modal feature fusion, belonging to the field of language-image multi-modal fusion. In the cross-modal feature fusion process, linear feature modulation is combined with a visually guided language attention module for feature fusion; meanwhile, the model uses language to select and screen multi-scale grid-level features, adaptively selects key clues from low-level and high-level fused features, and establishes relations across scales through dynamic routing. Experimental results show that the new architecture achieves new state-of-the-art performance on multiple benchmarks and provides new insights and directions for REC research.
Description
Technical Field
The invention belongs to the field of language-image multi-modal fusion, and particularly relates to a referring expression understanding method based on multi-scale cross-modal feature fusion.
Background
Referring Expression Comprehension (REC) is a challenging and significant task in the field of computer vision. The task requires a computer to infer, from a given image and a natural-language description, the target region corresponding to that description; it is a basic task in many fields such as human-computer interaction, visual question answering, and image retrieval. Unlike traditional object detection, which classifies a fixed set of objects using predefined class labels, REC requires a comprehensive understanding of complex natural language and variable images. Research on REC focuses mainly on the fusion strategy, the fusion stage, and the design of detection heads. Although REC is a cross-modal task, it remains closely linked to object detection.
Modern object detection models can be roughly divided into three stages: a network backbone, a neck, and a detection head. For the neck, experience with detection models has shown that a well-matched Feature Pyramid Network (FPN) is important for closing semantic gaps and improving performance. However, current REC models usually fuse multi-scale features with a simple averaging strategy, or even use single-scale features; research on multi-scale feature fusion is far from sufficient. Moreover, because low-level features contain more attribute information such as color and texture, while high-level features carry rich semantic information, multi-scale feature fusion can combine the advantages of both to suit different language expressions. The invention provides a multi-scale fusion method suited to the REC task, which helps the model reason and accurately localize targets of different sizes.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, the invention provides a referring expression understanding method based on multi-scale cross-modal feature fusion. The method can dynamically allocate and select fine-grained information from multi-scale feature maps using a language gate and a joint gate, yielding a more effective and reliable approach to referring expression understanding.
Technical scheme
A referring expression understanding method based on multi-scale cross-modal feature fusion, characterized by comprising the following steps:
step 1: the pictures are first resized to the same size, then feature maps at n scales are extracted with ResNet-101 and mapped to the same dimension d by 1×1 convolution, giving the visual features V = {V_i}, V_i ∈ R^{H_i×W_i×d}; the language information is first split into words, and word embedding yields a feature vector for each word; the maximum sentence length is defined as T, and sentences with fewer than T words are padded with PAD markers; a CLS marker is added at the beginning of the sentence and an SEP marker at the end; the position-encoded word vectors are fed into a BERT network to obtain the feature vectors E ∈ R^{T×d} of all words fused with sentence information
step 2: E and V are input into the model's cross-modal interactive attention module, which consists of two parts: a linear feature modulation (FiLM) module and a visually guided language attention module; in the FiLM module, a feature-wise affine transformation is applied to adaptively influence the output of the network; for the given language features E, the whole-sentence representation E_F is obtained by an averaging strategy, and then specifically:

γ_i = Tanh(W_i^γ E_F + b_i^γ)   (1)
β_i = Tanh(W_i^β E_F + b_i^β)   (2)
F̂_i = γ_i ⊙ V_i ⊕ β_i   (3)

where W_i^γ, W_i^β and b_i^γ, b_i^β are the weights and biases of two multilayer perceptrons (MLPs) with Tanh activation; in equation (3), ⊙ and ⊕ denote element-wise multiplication and addition, respectively; finally, a standard 3×3 convolution and ReLU are applied to F̂_i to generate the multi-level fused features F_i^f
For the visually guided language attention module, the visual features V_i are first flattened into Z_i ∈ R^{N_i×d}, where N_i = H_i × W_i is the number of visual tokens; the attended features A_i are then computed from the language features E and the visual features Z_i by:

A_i = softmax((Z_i W_i^Q)(E W_i^K)^T / √(d/m)) (E W_i^V)   (4)

where W_i^Q, W_i^K and W_i^V are embedding matrices, with queries, keys and values in the attention module denoted by Q, K and V, respectively; m is the number of attention heads and d is the feature dimension; for simplicity, only one language attention module is used for each level of visual features; then, A_i is further encoded by two feed-forward networks (FFN) with residual connections to form the fused output F_i^t
F^f and F^t are concatenated to obtain F^{ft}, and three 1×1 convolutional layers then map F^{ft} back to dimension d; finally, the joint features F = {F_i} are obtained
step 3: construct the language-guided feature pyramid (FPN) module
Firstly, constructing a routing space with the depth of K, wherein in the routing space, the scale factor between adjacent stages is limited to 2; for each routing node, the input consists of two parts: a multi-level feature map and an attention-based language vector; the grid-level features of each scale in each routing node are selected by the language gate;
first, the input through the language gate consists of two parts: a multi-level feature map and an attention-based language vector; the language vector based on attention mechanism is obtained by the following formula:
a k =softmax(EW k ) (5)
where W_k ∈ R^{256×1} are learned weights and k denotes the depth; the pooled language vector e_k = a_k^T E is shared across all scales and grid features; the multi-scale feature maps can be expressed as X_{i,k} ∈ R^{N×d}, where i indexes the scale, k the layer, and N = H_i × W_i; the language gate dynamically selects grid-level features from X_{i,k} via e_k, operating as follows:

Y_{i,k} = σ(W_g * X_{i,k} ⊕ e_k) · X_{i,k}   (6)

where * and · denote convolution and the Hadamard product, respectively; the convolution is a 3×3 convolutional network and σ(·) is the activation function; here ReLU-Tanh, i.e., max(0, Tanh(·)), is used as the gate switch; when the input is negative the output of this function is always 0, so no additional threshold is needed at the inference stage;
then, each output Y_{i,k} is up-sampled toward the smaller scale, kept unchanged at its own scale, or down-sampled toward the larger scale, the scale factor of each resampling being 2 to match the routing space;
Ȳ_{i,k} is used to denote the aggregated output at routing node l; the fine-grained features in Ȳ_{i,k} are further refined by the joint gate; specifically, X_{i,k+1} is computed by:

X_{i,k+1} = σ(conv(Ȳ_{i,k})) · Ȳ_{i,k}   (10)

where conv is a 1×1 convolutional network that maps the input features to one channel, and l denotes the l-th node; the nodes of the last layer are used for multi-scale fusion, as follows:

F_AVG = (1/n) Σ_{i=1}^{n} X_{i,K}   (11)

equation (11) summarizes the information of the different nodes to obtain F_AVG, which serves as the input to the detection head;
and 4, step 4: locating a target using an anchorless sensing head
For the output F_AVG of step 3, a 1×1 convolutional layer first produces a feature map of shape w × h × 5 giving five predicted values {t_x, t_y, t_w, t_h, t}, where the first two values are the center offsets, t_w and t_h are the normalized width and height, and the last value t is a confidence score indicating whether the center point of an object exists at that location; finally, a cross-entropy loss L_cls is applied to the center confidence t, and an MSE loss L_off to the center offsets, width and height; the GIoU loss is used as an auxiliary loss; the loss terms are defined as:

L_off = (Δx − t_x)² + (Δy − t_y)²   (14)

C_ij = 1 or 0 denotes whether the current grid cell contains the center point of the ground-truth target; (Δx, Δy) represents the offset of the true center relative to its grid cell, derived from the center coordinates x and y after division by the stride, with int(·) denoting rounding to the nearest integer; L_off is computed only on the grid cell where the ground-truth center lies; the total loss function is as follows:
Loss=L cls +λ off L off +L giou (15)
where λ_off is set to 5; the network selects the center point with the highest score to generate the bounding box; IoU (Intersection-over-Union) is the metric used in REC to measure the degree of overlap between the prediction and the ground truth.
A computer system, comprising: one or more processors; and a computer-readable storage medium storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the above method.
A computer-readable storage medium having stored thereon computer-executable instructions which, when executed, perform the above method.
Advantageous effects
The invention provides a referring expression understanding method based on multi-scale cross-modal feature fusion, which uses an innovative and effective multi-scale cross-modal fusion scheme to perform referring expression understanding. Unlike previous models, this model combines linear feature modulation with a visually guided language attention module for feature fusion in the cross-modal feature fusion process. Meanwhile, the model uses language to select and screen multi-scale grid-level features, adaptively selects key clues from low-level and high-level fused features, and establishes relations across scales through dynamic routing. Experimental results show that the new architecture achieves new state-of-the-art performance on multiple benchmarks and provides new insights and directions for REC research.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a schematic diagram of the one-stage referring expression understanding network structure of the method of the present invention.
FIG. 2 shows referring expression understanding results based on the multi-scale cross-modal feature fusion mechanism.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The technical scheme of the invention comprises the following modules: the first part is an extraction and encoding module for language and image information; the second part is a cross-modal feature fusion module based on linear feature modulation (FiLM) and a language-guided visual attention mechanism; the third part is a language-guided multi-scale feature fusion module; and the fourth part is the target localization process. In the first part, a ResNet-101 convolutional neural network and a BERT pre-trained model extract features from the image and language information, respectively. In the second part, FiLM and language-guided visual attention each fuse the cross-modal features, which are then mapped to obtain the final fused features. In the third part, the language-guided multi-scale fusion module uses a language gate and a joint gate to execute a dynamic routing strategy that selects features of different scales for fusion. Finally, the fused multi-scale features are fed into a prediction head to obtain the target object region.
Based on the above modules, the embodiment of the present invention provides a proposal-free referring expression understanding method combining a one-stage cross-modal attention mechanism with language-guided multi-scale fusion; the specific process is as follows:
step 1: the pictures are firstly adjusted to be the same in size, then n scales of feature maps are extracted through Resnet-101, and the feature maps are mapped to the same dimension d through 1 multiplied by 1 convolution to obtain the pictureFor language information, the language information is decomposed into words, and the feature vectors corresponding to the words are obtained after the words are embedded. And (5) specifying that the longest sentence word number is T, and filling the blank of the sentence with less word number than T with PAD marks. CLS marks are added at the beginning positions of the sentences, and SEP marks are added at the ends of the sentences. Inputting the word vectors subjected to position coding into a BERT network to obtain the feature vectors of all words and phrases fused with statement information
Step 2: e and V are input into a cross-modal interactive attention module of the model. The module consists of two parts, a linear feature modulation module (FiLM) and a visual-guided language attention module. In the FiLM module, a feature-based affine transformation is applied to adaptively affect the output of the network. For a given language featureFirstly, obtaining the whole expression E through a simple averaging strategy F And then specifically by:
wherein W i γ ,W i β ,Andare the weights and the deviations of the two multi-layer perceptrons MLP with activation functions Tanh. For equation (3). An asRespectively representing element-wise multiplication and addition. Finally, standard 3 × 3 convolution and ReLU operations are applied to generate the multi-level fusion features
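A minimal numpy sketch of the FiLM modulation of equations (1)-(3) follows. The single-layer Tanh "MLPs", the toy dimensions, and the omission of the final 3×3 convolution are assumptions made for brevity.

```python
# Minimal numpy sketch of FiLM modulation, eqs. (1)-(3): the pooled sentence
# vector E_F is mapped to per-channel gamma and beta, which scale and shift
# the visual feature map V element-wise along the channel axis.
# Single Tanh layers stand in for the two MLPs (an assumption).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # feature dimension (toy size)
E = rng.normal(size=(20, d))         # word features from BERT
V = rng.normal(size=(4, 4, d))       # one visual feature map H_i x W_i x d

E_F = E.mean(axis=0)                 # averaging strategy -> sentence vector

W_gamma, b_gamma = rng.normal(size=(d, d)), np.zeros(d)
W_beta, b_beta = rng.normal(size=(d, d)), np.zeros(d)

gamma = np.tanh(E_F @ W_gamma + b_gamma)     # eq. (1)
beta = np.tanh(E_F @ W_beta + b_beta)        # eq. (2)

F_mod = gamma * V + beta                     # eq. (3): modulate each cell
F_f = np.maximum(F_mod, 0.0)                 # ReLU (3x3 conv omitted here)
```

Broadcasting applies the same (gamma, beta) pair to every spatial position, which is what makes FiLM a feature-wise (channel-wise) affine transformation.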
For the visually guided language attention module, the visual features V_i are first flattened into Z_i ∈ R^{N_i×d}, where N_i = H_i × W_i is the number of visual tokens. The attended features A_i are then computed from the language features E and the visual features Z_i by:

A_i = softmax((Z_i W_i^Q)(E W_i^K)^T / √(d/m)) (E W_i^V)   (4)

where W_i^Q, W_i^K and W_i^V are embedding matrices. The queries, keys and values in the attention module are denoted by Q, K and V, respectively; m is the number of attention heads and d is the feature dimension. For simplicity, only one language attention module is used for each level of visual features. Then, A_i is further encoded by two feed-forward networks (FFN) with residual connections to form the fused output F_i^t.
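The cross-modal attention of equation (4) can be sketched with a single head (the real module uses m heads plus residual FFNs; the single head, the toy sizes, and the function name are assumptions):

```python
# Hedged single-head sketch of the visually guided language attention,
# eq. (4): flattened visual tokens Z form the queries, the word features E
# the keys and values, so each grid cell gathers language context.
import numpy as np

def cross_attention(Z, E, Wq, Wk, Wv):
    """softmax(Z Wq (E Wk)^T / sqrt(d)) (E Wv) -- one head for simplicity."""
    d = Wq.shape[1]
    scores = (Z @ Wq) @ (E @ Wk).T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ (E @ Wv)

rng = np.random.default_rng(1)
d, N, T = 8, 16, 20                       # dim, visual tokens, words
Z = rng.normal(size=(N, d))               # flattened visual features
E = rng.normal(size=(T, d))               # language features
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
A = cross_attention(Z, E, Wq, Wk, Wv)     # N x d attended output
```

The output keeps the visual token count N, which is why F_i^t can later be concatenated with the spatially shaped FiLM output F_i^f.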
F^f and F^t are concatenated to obtain F^{ft}, and three 1×1 convolutional layers then map F^{ft} back to dimension d. Finally, the joint features F = {F_i} are obtained.
And step 3: a language-guided feature pyramid module (FPN) is constructed.
First, a routing space of depth K is constructed. In this routing space, the scale factor between adjacent stages is limited to 2. For each routing node, the input consists of two parts: multi-level feature maps and an attention-based language vector. The grid-level features of each scale in each routing node are hard-selected by the language gate. In addition, since REC is a cross-modal task, a data-dependent joint-gate module is established in the node, which further refines the grids based on the aggregated information.
First, the input through the language gate consists of two parts: multi-level feature maps and attention-based language vectors. The language vector based on attention mechanism is obtained by the following formula:
a k =softmax(EW k ) (5)
where W_k ∈ R^{256×1} are learned weights and k denotes the depth; the pooled language vector e_k = a_k^T E is shared across all scales and grid features. The multi-scale feature maps can be expressed as X_{i,k} ∈ R^{N×d}, where i indexes the scale, k the layer, and N = H_i × W_i. The language gate dynamically selects grid-level features from X_{i,k} via e_k; the specific operation is:

Y_{i,k} = σ(W_g * X_{i,k} ⊕ e_k) · X_{i,k}   (6)

where * and · denote convolution and the Hadamard product, respectively; the convolution is a 3×3 convolutional network and σ(·) is the activation function. Here ReLU-Tanh, i.e., max(0, Tanh(·)), is used as the gate switch. When the input is negative, the output of this function is always 0, so no additional threshold is needed in the inference phase.
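The language-gate behaviour can be sketched as below. Scoring grid cells by a dot product with the pooled language vector is an assumption standing in for the 3×3 convolution; the ReLU-Tanh switch max(0, tanh(·)) follows the text.

```python
# Illustrative numpy sketch of the language gate: an attention-pooled
# language vector produces a scalar gate per grid cell via the ReLU-Tanh
# switch max(0, tanh(.)), zeroing out grid features irrelevant to the
# expression. The dot-product scoring is an assumption.
import numpy as np

def language_gate(X, e):
    """X: (N, d) grid features, e: (d,) language vector -> gated features."""
    gate = np.maximum(0.0, np.tanh(X @ e))   # ReLU-Tanh gate switch, (N,)
    return gate[:, None] * X                 # Hadamard product with features

rng = np.random.default_rng(2)
d, N = 8, 16
E = rng.normal(size=(20, d))
a = np.exp(E @ rng.normal(size=d)); a /= a.sum()   # softmax(E W_k), eq. (5)
e = a @ E                                          # pooled language vector
X = rng.normal(size=(N, d))
Y = language_gate(X, e)
```

Because negative pre-activations gate to exactly zero, irrelevant cells are hard-dropped with no extra threshold at inference, as the text notes.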
Then, each output Y_{i,k} is up-sampled toward the smaller scale, kept unchanged at its own scale, or down-sampled toward the larger scale; the scale factor of each resampling is 2, matching the routing space.
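A toy shape-level sketch of the factor-2 resampling between adjacent scales (real models would use learned or bilinear resampling; nearest-neighbour and stride-2 subsampling here are assumptions):

```python
# Toy nearest-neighbour resampling used to align neighbouring scales (the
# scale factor between adjacent stages is fixed to 2). Only a shape-level
# sketch: learned/bilinear resampling would replace these in practice.
import numpy as np

def upsample2(x):
    """Repeat each spatial cell 2x2 (nearest neighbour), x: (H, W, d)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2(x):
    """Keep every second cell (stride-2 subsampling), x: (H, W, d)."""
    return x[::2, ::2]

x = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
up, down = upsample2(x), downsample2(x)
```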
Ȳ_{i,k} is used to denote the aggregated output at routing node l. To improve the efficiency of this module's deep network, a bottleneck module with residual connections is used, in which the fine-grained features of Ȳ_{i,k} are further refined by the joint gate. Specifically, X_{i,k+1} is computed by:

X_{i,k+1} = σ(conv(Ȳ_{i,k})) · Ȳ_{i,k}   (10)

where conv is a 1×1 convolutional network that maps the input features to one channel, and l denotes the l-th node. The nodes of the last layer are used for multi-scale fusion, as follows:

F_AVG = (1/n) Σ_{i=1}^{n} X_{i,K}   (11)

Equation (11) summarizes the information of the different nodes to obtain F_AVG, which serves as the input to the detection head.
Step 4: an anchor-free detection head is used to locate the target. For the output F_AVG of step 3, a 1×1 convolutional layer first produces a feature map of shape w × h × 5 giving five predicted values {t_x, t_y, t_w, t_h, t}, where the first two values are the center offsets and t_w and t_h are the normalized width and height. The last value t is the confidence score, indicating whether the center point of an object exists at that location. Finally, a cross-entropy loss L_cls is applied to the center confidence t, and an MSE loss L_off to the center offsets, width and height. At the same time, the GIoU loss is used as an auxiliary loss. The loss terms are defined as:

L_off = (Δx − t_x)² + (Δy − t_y)²   (14)

C_ij = 1 or 0 indicates whether the current grid cell contains the center point of the ground-truth target. (Δx, Δy) is the offset of the true center relative to its grid cell, derived from the center coordinates x and y after division by the stride, with int(·) denoting rounding to the nearest integer. L_off is computed only on the grid cell in which the ground-truth center lies. The total loss function is as follows:
Loss=L cls +λ off L off +L giou (15)
where λ_off is set to 5; the network selects the center point with the highest score to generate the bounding box. Intersection-over-Union (IoU) is the metric used in REC to measure the degree of overlap between the prediction and the ground truth. Following previous work, the invention uses IoU@0.5 to measure prediction accuracy.
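The offset loss of equation (14) and the IoU metric can be sketched as follows; the (x1, y1, x2, y2) box format and the toy numbers are assumptions, and a prediction would count as correct when IoU ≥ 0.5.

```python
# Hedged sketch of the offset loss in eq. (14) and the IoU metric used for
# evaluation. Box format (x1, y1, x2, y2) is an assumption.
def offset_loss(delta, t):
    """L_off = (dx - t_x)^2 + (dy - t_y)^2, eq. (14)."""
    return float((delta[0] - t[0]) ** 2 + (delta[1] - t[1]) ** 2)

def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

l_off = offset_loss((0.5, 0.5), (0.3, 0.7))
score = iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection 1, union 7
```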
Example 1:
1. image feature extraction
Given a picture in a natural scene, the whole picture is adjusted to 640 x 640 and input into a feature extraction network for forward propagation. The embodiment adopts Resnet-101 to extract the image features. Three scale features 20 × 20 × 2048, 40 × 40 × 1024 and 80 × 80 × 512 are obtained, and then the three scale feature maps are mapped to the same dimension d =256 using 1 × 1 convolution.
2. Extraction of language features
The sentence is split into words, and word embedding yields the feature vector for each word. In this embodiment the maximum sentence length is 20. The position-encoded word vectors are then input into a BERT network to obtain the feature vectors E of all words fused with sentence information, where E ∈ R^{20×256}.
3. Feature fusion with cross-modal attention
The image features are flattened into vectors of dimensions (400 × 256), (1600 × 256) and (6400 × 256) and input into the cross-modal fusion module together with the language features (20 × 256). In the visually guided language attention module, the image features serve as the queries Q and the language features as the keys K and values V. Each module consists of 2 identical attention layers with 8 attention heads. In FiLM, an affine transformation conditioned on the given language features produces the modulated features. Finally, the two fused features are concatenated and remapped to dimension d with a 1×1 convolution to obtain the final fused features.
4. Multi-scale fusion based on language guidance
The fused feature maps of three scales and the language features are input into the routing nodes. The language gate first screens the grids of the feature maps using language, then up-sampling and down-sampling are performed, and the original-resolution, up-sampled and down-sampled feature maps are sent to the joint gate for aggregation. A bottleneck module with residual connections between the routing nodes of each layer ensures the effectiveness of the deep network; there are six layers in total. In the last layer of routing nodes, the feature maps of the three scales are combined into the final fused features by an averaging strategy.
5. Target localization
For the output of the previous stage, a 1×1 convolutional layer first produces a w × h × 5 feature map giving five predicted values {t_x, t_y, t_w, t_h, t}, where the first two values are the center offsets and t_w and t_h are the normalized width and height. The last value t is the confidence score, indicating whether the center point of the object exists at that location. During inference, the network selects the center point with the highest score to generate the bounding box.
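Decoding a box from the w × h × 5 map can be sketched as below. The grid-to-pixel convention (stride, offsets measured within the cell, width/height normalized by the 640-pixel image size) is an assumption for illustration.

```python
# Toy decoding of the w x h x 5 prediction map: pick the cell with the
# highest confidence, then turn its offsets and normalised width/height
# into a box. The stride and normalisation conventions are assumptions.
import numpy as np

def decode(pred, stride=32, img_size=640):
    """pred: (h, w, 5) of (t_x, t_y, t_w, t_h, conf) -> (cx, cy, bw, bh)."""
    conf = pred[..., 4]
    gy, gx = np.unravel_index(np.argmax(conf), conf.shape)  # best cell
    tx, ty, tw, th = pred[gy, gx, :4]
    cx = (gx + tx) * stride          # cell index plus center offset, pixels
    cy = (gy + ty) * stride
    return cx, cy, tw * img_size, th * img_size

pred = np.zeros((20, 20, 5))
pred[3, 5] = [0.5, 0.25, 0.1, 0.2, 0.9]   # one confident cell
box = decode(pred)
```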
6. Model training
The whole training process is end-to-end. In this embodiment, the four datasets RefCOCO, RefCOCO+, RefCOCOg and ReferItGame are used for model training and evaluation. Adam is used as the optimizer, the batch size is set to 8, and the initial learning rate to 1e-4. This embodiment performs 20 epochs of training on a 1080Ti GPU, with the learning rate halved every 10 epochs.
7. Model application
After the training process, multiple models are obtained and the optimal model (best performance on the test set) is selected for application. For an input image and sentence, the image only needs to be resized to 640 × 640 and normalized, and the sentence tokenized, to serve as model input. The parameters of the whole network model are fixed; only the image and language data are input and propagated forward. The image and language feature vectors V and E are obtained in turn and then passed through the cross-modal feature module, the language-guided multi-scale fusion module, and the localization module to directly obtain the prediction result. The actual results are shown in FIG. 2; based on this method, the accurate position in the image described by the sentence can be given efficiently.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present disclosure.
Claims (3)
1. A referring expression understanding method based on multi-scale cross-modal feature fusion, characterized by comprising the following steps:
step 1: first, the pictures are resized to the same size, then feature maps at n scales are extracted with ResNet-101 and mapped to the same dimension d by 1×1 convolution, giving the visual features V = {V_i}, V_i ∈ R^{H_i×W_i×d}; the language information is first split into words, and word embedding yields a feature vector for each word; the maximum sentence length is defined as T, and sentences with fewer than T words are padded with PAD markers; a CLS marker is added at the beginning of the sentence and an SEP marker at the end; the position-encoded word vectors are fed into a BERT network to obtain the feature vectors E ∈ R^{T×d} of all words fused with sentence information
step 2: E and V are input into the model's cross-modal interactive attention module, which consists of two parts: a linear feature modulation (FiLM) module and a visually guided language attention module; in the FiLM module, a feature-wise affine transformation is applied to adaptively influence the output of the network; for the given language features E, the whole-sentence representation E_F is obtained by an averaging strategy, and then specifically:

γ_i = Tanh(W_i^γ E_F + b_i^γ)   (1)
β_i = Tanh(W_i^β E_F + b_i^β)   (2)
F̂_i = γ_i ⊙ V_i ⊕ β_i   (3)

where W_i^γ, W_i^β and b_i^γ, b_i^β are the weights and biases of two multilayer perceptrons (MLPs) with Tanh activation; in equation (3), ⊙ and ⊕ denote element-wise multiplication and addition, respectively; finally, a standard 3×3 convolution and ReLU are applied to F̂_i to generate the multi-level fused features F_i^f
For the vision-guided language attention module, the visual features V_i are first flattened into Z_i ∈ R^(N_i × d), where N_i = H_i × W_i is the number of visual tokens; the attended language features A_i are then computed from the language features E and the visual features Z_i by:

A_i = softmax(Q K^T / √(d/m)) V, with Q = Z_i W_q, K = E W_k, V = E W_v (4)

where W_q, W_k and W_v are embedding matrices, and queries, keys and values in the attention module are denoted by Q, K and V, respectively; m is the number of attention heads and d is the feature dimension. For simplicity, only one language attention module is used for each level of visual features. Then, A_i is further encoded by two feed-forward networks (FFN) with residual connections to form the fused output F_t;
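A single-head numpy sketch of the vision-guided language attention: visual tokens act as queries, language tokens supply keys and values. The single head (so the scale is simply √d) and the random embedding matrices are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N_i, T = 8, 16, 6                  # feature dim, visual tokens, words
Z_i = rng.normal(size=(N_i, d))       # flattened visual features (queries)
E = rng.normal(size=(T, d))           # language features (keys/values)

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = Z_i @ W_q, E @ W_k, E @ W_v
attn = softmax(Q @ K.T / np.sqrt(d))  # each visual token attends over the words
A_i = attn @ V                        # language content gathered per visual token
```

Each row of A_i is a language summary tailored to one visual position, which is why the output can be fused position-wise with the visual features.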
F_f and F_t are connected to obtain F_ft, and three 1 × 1 convolutional layers are then used to map F_ft back to dimension d; finally, the combined features are obtained;
Step 3: constructing a language-guided feature pyramid network (FPN) module
Firstly, a routing space of depth K is constructed, in which the scale factor between adjacent stages is limited to 2. For each routing node, the input consists of two parts: the multi-level feature maps and an attention-based language vector; the grid-level features at each scale in each routing node are hard-selected by the language gate;
The input to the language gate consists of two parts: the multi-level feature maps and an attention-based language vector. The attention-based language vector is obtained by the following formula:
a_k = softmax(E W_k) (5)
where W_k ∈ R^(256×1) are learned weights and k denotes the depth; the resulting language vector is shared across every scale and grid feature. The multi-scale feature maps can be expressed as X_{i,k}, where i is the i-th scale, k is the k-th layer and N = H_i × W_i. The language gate uses the language vector to dynamically select the grid-level features in X_{i,k}, operating specifically as follows:
* and · denote the convolution operation and the Hadamard product, respectively; Conv(·) represents a 3 × 3 convolutional network, and σ(·) is the activation function. Here ReLU-Tanh, i.e. max(0, Tanh(·)), is used as the gate switch; when the input is negative, the output of this function is always 0, so no additional threshold is needed in the inference phase;
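The gate switch itself can be sketched directly. The max(0, Tanh(·)) nonlinearity follows the claim; the score map standing in for the convolution of the features with the language vector is a hypothetical placeholder.

```python
import numpy as np

def gate_switch(x):
    # max(0, tanh(x)): any negative input yields exactly 0, so no extra
    # threshold is needed at inference time, and positive inputs saturate
    # smoothly toward 1.
    return np.maximum(0.0, np.tanh(x))

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 4, 3))        # grid-level features at one scale
scores = rng.normal(size=(4, 4, 1))   # stand-in for the language-conditioned conv output
Y = gate_switch(scores) * X           # Hadamard product: gated grid features
```

Grid cells whose language-conditioned score is negative are zeroed out entirely, which is the "hard selection" behavior the routing nodes rely on.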
Then, the outputs Y_{i,k} are respectively up-sampled, kept unchanged, or down-sampled, from the small scale to the large scale; the specific operation is as follows:
Let H_l represent the aggregated output in routing node l; the fine-grained features in H_l are further refined by the associated gate, which is specifically calculated by:
where conv is a 1 × 1 convolutional network that maps the input features to a single channel, and l denotes the l-th node; the nodes of the last layer are used for multi-scale fusion, in the following manner:
Formula (11) summarizes the information of the different nodes to obtain F_AVG, which is used as the input to the detection head;
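Formula (11) itself is not reproduced in the text above; the following numpy sketch shows one plausible reading of the multi-scale summarization, assuming the last-layer node outputs are resized to a common scale and averaged (the name F_AVG suggests averaging; `resize_nn` is a hypothetical nearest-neighbour stand-in for the claimed up-/down-sampling operations).

```python
import numpy as np

def resize_nn(x, out_h, out_w):
    # Nearest-neighbour up/down-sampling of an (H, W, C) feature map.
    h, w, _ = x.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[rows][:, cols]

rng = np.random.default_rng(3)
# Last-layer routing-node outputs at three scales (scale factor 2 between stages).
nodes = [rng.normal(size=(s, s, 4)) for s in (8, 16, 32)]

aligned = [resize_nn(f, 16, 16) for f in nodes]   # bring all nodes to one scale
F_AVG = np.mean(aligned, axis=0)                  # summarize the node information
```

Averaging after alignment keeps the fused map's channel count and dynamic range unchanged, so it can be fed directly to the detection head.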
and 4, step 4: locating a target using an anchorless sensing head
For the output F_AVG of step 3, a feature map of shape w × h × 5 is first obtained using a 1 × 1 convolutional layer, giving five predicted values {t_x, t_y, t_w, t_h, t}, where the first two values represent the center offset, t_w and t_h are the normalized width and height respectively, and the last value t is a confidence score indicating whether the center point of the object is located at that position. Finally, a cross-entropy loss L_cls is applied to the center point t, and an MSE loss L_off is applied to the center offset, width and height; at the same time, a GIoU loss is used as an auxiliary loss, where:
L_off = (Δx − t_x)² + (Δy − t_y)² (14)
C_ij = 1 or 0 indicates whether the current grid cell contains the center point of the ground-truth target; (Δx, Δy) represents the offset of the ground-truth center point to the center of its grid cell, where x and y are the center coordinates; int(·) denotes the operation that rounds a fraction to the nearest integer. L_off is computed only on the grid cell in which the ground-truth center is located. The total loss function is as follows:
Loss = L_cls + λ_off·L_off + L_giou (15)
where λ_off is set to 5; the network selects the center point with the highest score to generate a bounding box. IoU is the metric used in REC to measure the degree of overlap between the prediction and the ground truth.
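A hedged numpy sketch of the anchor-free head's decoding and offset loss. The `offset_loss` follows equation (14); the normalization of t_w and t_h by the image size and the exact grid-to-image mapping in `decode_box` are assumptions not fixed by the claim text.

```python
import numpy as np

def decode_box(pred, img_w, img_h):
    # pred: (h, w, 5) map of {t_x, t_y, t_w, t_h, t}. Select the grid cell
    # with the highest confidence t and turn its predictions into a box.
    h, w, _ = pred.shape
    conf = pred[..., 4]
    gy, gx = np.unravel_index(conf.argmax(), conf.shape)
    t_x, t_y, t_w, t_h = pred[gy, gx, :4]
    cx = (gx + t_x) / w * img_w          # grid index plus predicted center offset
    cy = (gy + t_y) / h * img_h
    bw, bh = t_w * img_w, t_h * img_h    # normalized width/height scaled to the image
    return cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2

def offset_loss(dx, dy, t_x, t_y):
    # Eq. (14): squared error on the center offsets, computed only on the
    # grid cell containing the ground-truth center.
    return (dx - t_x) ** 2 + (dy - t_y) ** 2
```

For example, a 4 × 4 prediction map whose only confident cell is (1, 2) with offsets (0.5, 0.5) and normalized size (0.25, 0.25) decodes, for a 64 × 64 image, to a 16 × 16 box centered at (40, 24).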
2. A computer system, comprising: one or more processors and a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
3. A computer-readable storage medium having stored thereon computer-executable instructions, which when executed, perform the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211009462.0A CN115496991A (en) | 2022-08-22 | 2022-08-22 | Reference expression understanding method based on multi-scale cross-modal feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115496991A true CN115496991A (en) | 2022-12-20 |
Family
ID=84465769
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211009462.0A Pending CN115496991A (en) | 2022-08-22 | 2022-08-22 | Reference expression understanding method based on multi-scale cross-modal feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115496991A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117764085A (en) * | 2024-01-11 | 2024-03-26 | 华中师范大学 | Machine reading understanding method based on cross-graph collaborative modeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||