CN115496991A - Reference expression understanding method based on multi-scale cross-modal feature fusion - Google Patents

Reference expression understanding method based on multi-scale cross-modal feature fusion

Info

Publication number
CN115496991A
Authority
CN
China
Prior art keywords
language
feature
scale
features
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211009462.0A
Other languages
Chinese (zh)
Inventor
王鹏
孙梦阳
张艳宁
索伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202211009462.0A
Publication of CN115496991A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/86: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a reference expression understanding method based on multi-scale cross-modal feature fusion, and belongs to the field of language-image multi-modal fusion. In the cross-modal feature fusion process, a linear feature modulation module and a visual-guided language attention module are combined for feature fusion; meanwhile, the model uses language to select and screen multi-scale grid-level features, adaptively selects key clues from the low-level and high-level fused features, and establishes relations between cross-scale information through dynamic routing. Experimental results show that the new model architecture reaches a new state-of-the-art level on several benchmarks and provides new insights and directions for REC research.

Description

Reference expression understanding method based on multi-scale cross-modal feature fusion
Technical Field
The invention belongs to the field of language-image multi-modal fusion, and particularly relates to a reference expression understanding method based on multi-scale cross-modal feature fusion.
Background
Referring Expression Comprehension (REC) is a challenging and significant task in the field of computer vision. The task requires a computer to obtain, through inference and analysis, the target region corresponding to a given natural-language description of a given image; it is a basic task in many fields such as human-computer interaction, visual question answering and image retrieval. Unlike traditional object detection, which classifies a fixed set of objects using predefined class labels, REC involves a comprehensive understanding of complex natural language and variable images. Research on REC focuses mainly on the fusion strategy, the fusion stage and the design of the detection head. However, although REC is a cross-modal task, it remains closely linked to object detection.
Modern object detection models can be roughly divided into three stages: a network backbone, a neck and a detection head. For the neck part, experience with detection models has demonstrated that assembling a well-matched Feature Pyramid Network (FPN) is important to close semantic gaps and improve performance. However, on the one hand, current REC models usually only use an averaging strategy to fuse multi-scale features, or even use single-scale features, and research on multi-scale feature fusion is far from sufficient. On the other hand, because low-level features contain more attribute information such as color and texture, while high-level features carry rich semantic information, multi-scale feature fusion can combine the advantages of both and adapt to different language expressions. The invention provides a multi-scale fusion method suitable for the REC task, which helps the model reason and accurately locate targets of different sizes.
Disclosure of Invention
Technical problem to be solved
In order to overcome the defects of the prior art, the invention provides a reference expression understanding method based on multi-scale cross-modal feature fusion. The method can dynamically allocate and select fine-grained information from the multi-scale feature maps by using a language gate and a joint gate, and is a more effective and reliable method for referring expression comprehension.
Technical scheme
A reference expression understanding method based on multi-scale cross-modal feature fusion, characterized by comprising the following steps:
Step 1: the pictures are first resized to the same size, then feature maps at n scales are extracted through ResNet-101 and mapped to a common dimension d through 1×1 convolution, giving the multi-scale visual features V = {V_i}, V_i ∈ R^{H_i×W_i×d}, i = 1, …, n;
for the language information, it is first decomposed into words, and each word is embedded to obtain its feature vector; the maximum number of words in a sentence is defined as T, and sentences with fewer than T words are padded with PAD tokens; a CLS token is added at the beginning of the sentence and an SEP token at its end; the position-encoded word vectors are fed into a BERT network to obtain the feature vectors of all tokens fused with sentence information, E ∈ R^{T×d};
Step 2: E and V are input into the cross-modal interactive attention module of the model, which consists of two parts, namely a linear feature modulation (FiLM) module and a visual-guided language attention module; in the FiLM module, a feature-based affine transformation is applied to adaptively influence the output of the network; for the given language feature E ∈ R^{T×d}, the sentence-level representation E_F is first obtained through an averaging strategy, and then specifically:
γ_i = Tanh(W_i^γ E_F + b_i^γ)   (1)
β_i = Tanh(W_i^β E_F + b_i^β)   (2)
F_f^i = ReLU(Conv(γ_i ⊙ V_i ⊕ β_i))   (3)
where W_i^γ, W_i^β and b_i^γ, b_i^β are the weights and biases of two multilayer perceptrons (MLPs) with activation function Tanh; in equation (3), ⊙ and ⊕ denote element-wise multiplication and addition, respectively, and Conv with ReLU denotes the standard 3×3 convolution and ReLU operations that are finally applied to generate the multi-level fusion features F_f = {F_f^i};
For the visual-guided language attention module, the visual features V_i are first flattened into Z_i ∈ R^{N_i×d}, where N_i = H_i × W_i is the number of visual tokens; the attention output A_i is then computed from the language feature E and the visual feature Z_i by the following formula:
A_i = softmax((E W_i^Q)(Z_i W_i^K)^T / √d_m)(Z_i W_i^V)   (4)
where W_i^Q, W_i^K and W_i^V are embedding matrices, the queries, keys and values in the attention module are denoted by Q, K and V, respectively, d_m = d/m, m is the number of attention heads and d is the feature dimension; for simplicity, only one language attention module is used for each level of visual features; then, A_i is further encoded by two feed-forward networks (FFN) with residual connections to form the fused output F_t = {F_t^i};
F_f and F_t are concatenated to obtain F_ft, and F_ft is then mapped to dimension d using three 1×1 convolutional layers; finally, the combined multi-scale features are obtained;
Step 3: construct the language-guided feature pyramid module FPN.
First, a routing space of depth K is constructed; in this routing space, the scale factor between adjacent stages is limited to 2. For each routing node, the input consists of two parts: the multi-level feature maps and an attention-based language vector; the grid-level features of each scale in each routing node are selected by the language gate.
The input to the language gate consists of two parts: the multi-level feature maps and an attention-based language vector. The attention-based language vector is obtained by the following formulas:
a_k = softmax(E W_k)   (5)
e_k = Σ_{t=1}^{T} a_{k,t} E_t   (6)
where W_k ∈ R^{256×1} are learnable weights, k represents the depth, and the language vector e_k is shared across every scale and grid feature. The multi-scale feature maps can be expressed as X_{i,k} ∈ R^{N×d}, where i denotes the i-th scale, k the k-th layer, and N = H_i × W_i. Through equations (7) and (8), the language gate uses e_k to dynamically select the grid-level features in X_{i,k}: gate values are computed from X_{i,k} and e_k, and the selected output Y_{i,k} is obtained by re-weighting X_{i,k} with these gate values; in equations (7) and (8), * and · denote the convolution operation and the Hadamard product, respectively, Conv(·) denotes a 3×3 convolutional network, and σ(·) is the activation function; here max(0, tanh(·)) is used as the gate switch, so that when the input is negative the output of the function is always 0 and no additional threshold is needed in the inference phase.
Then, the outputs Y_{i,k} are respectively up-sampled, kept unchanged, or down-sampled to match the smaller and larger scales (equation (9)).
The aggregated output in routing node l is then formed, and its fine-grained features are further refined by the joint gate; specifically, the refined output is calculated by equation (10), where conv is a 1×1 convolutional network that maps the input features into a single channel and l denotes the l-th node. The nodes of the last layer are used for multi-scale fusion: equation (11) summarizes (averages) the information of the different nodes to obtain F_AVG, which is used as the input to the detection head.
and 4, step 4: locating a target using an anchorless sensing head
For output F of step 3 AVG First, a feature map having a shape of w × h × 5 is obtained using a 1 × 1 convolutional layer, and five predicted values { t } are shown x ,t y ,t w ,t h t, where the first two values represent the center offset, t w And t h Respectively, normalized width and height; the last t is a confidence score which represents whether the central point of the object exists at the position or not; finally, applying the cross entropy loss L on the center point t cls, Applying MSE loss L in center offset, width and height off (ii) a Simultaneously, the GIoU loss is used as an auxiliary loss; finally, the whole function is defined as:
Figure BDA0003809446600000048
L off =(Δx-t x ) 2 +(Δ y -t y ) 2 (14)
C ij =1 or 0 denotes whether or not the current cell contains true targetThe center point of the central point is provided with a central point,
Figure BDA0003809446600000049
Figure BDA0003809446600000051
represents the offset of the center point to the center of the grid, wherein x and y refer to
Figure BDA0003809446600000052
int (·) denotes that this operation rounds the fraction to the nearest integer; performing L only on the grid where the truth value center is positioned off (ii) a The total loss function is as follows:
Loss=L clsoff L off +L giou (15)
wherein λ is off Setting to 5, the network selects the central point with the highest score to generate a bounding box; the IoU is a metric used in REC to measure the degree of overlap between prediction and reality.
A computer system, comprising: one or more processors, a computer readable storage medium, for storing one or more programs, which when executed by the one or more processors, cause the one or more processors to implement the above-described method.
A computer-readable storage medium having stored thereon computer-executable instructions for performing the above-described method when executed.
Advantageous effects
The invention provides a reference expression understanding method based on multi-scale cross-modal feature fusion, which performs referring expression comprehension through an innovative and effective multi-scale cross-modal fusion. Unlike previous models, in the cross-modal feature fusion process the model combines linear feature modulation and a visual-guided language attention module for feature fusion. Meanwhile, the model uses language to select and screen multi-scale grid-level features, adaptively selects key clues from the low-level and high-level fused features, and establishes relations between cross-scale information through dynamic routing. Experimental results show that the new model architecture reaches a new state-of-the-art level on several benchmarks and provides new insights and directions for REC research.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a schematic diagram of the one-stage referring expression comprehension network structure of the method of the present invention.
FIG. 2 shows referring expression comprehension results based on the multi-scale cross-modal feature fusion mechanism.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The technical scheme of the invention comprises the following modules: the first part is an extraction and encoding module for the language and image information; the second part is a cross-modal feature fusion module based on linear feature modulation (FiLM) and a language-guided visual attention mechanism; the third part is a language-guided multi-scale feature fusion module; and the fourth part is the localization process of the target. In the first part, a ResNet-101 convolutional neural network and a BERT pre-trained model are used to extract features from the picture information and the language information, respectively. In the second part, FiLM and language-guided visual attention are used to fuse the cross-modal features separately, and the two results are then mapped to obtain the final fused features. In the third part, a dynamic routing strategy is executed with a language gate and a joint gate in the language-guided multi-scale feature fusion module to select and fuse features of different scales. Finally, the fused multi-scale features are sent to a prediction head to obtain the target object region.
Based on the above modules, the embodiment of the present invention provides a one-stage, proposal-free referring expression comprehension method that combines a cross-modal attention mechanism with language-guided multi-scale fusion. The specific process is as follows:
step 1: the pictures are firstly adjusted to be the same in size, then n scales of feature maps are extracted through Resnet-101, and the feature maps are mapped to the same dimension d through 1 multiplied by 1 convolution to obtain the picture
Figure BDA0003809446600000061
For language information, the language information is decomposed into words, and the feature vectors corresponding to the words are obtained after the words are embedded. And (5) specifying that the longest sentence word number is T, and filling the blank of the sentence with less word number than T with PAD marks. CLS marks are added at the beginning positions of the sentences, and SEP marks are added at the ends of the sentences. Inputting the word vectors subjected to position coding into a BERT network to obtain the feature vectors of all words and phrases fused with statement information
Figure BDA0003809446600000062
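For illustration, a minimal PyTorch-style sketch of the two feature extractors in step 1 is given below. The three ResNet stages, the input size 640 × 640 and the common dimension d = 256 follow the embodiment; the class names, the torchvision/HuggingFace helpers and the linear projection of the BERT outputs to d are assumptions.

```python
# Sketch of step 1 (feature extraction); assumes PyTorch, torchvision and
# HuggingFace transformers are installed. Class names are illustrative.
import torch
import torch.nn as nn
import torchvision
from transformers import BertModel, BertTokenizer

D = 256  # common feature dimension d used in the embodiment

class VisualEncoder(nn.Module):
    """ResNet-101 backbone returning three scales mapped to d channels by 1x1 convs."""
    def __init__(self, d=D):
        super().__init__()
        r = torchvision.models.resnet101(weights=None)  # pretrained weights would normally be loaded
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool, r.layer1)
        self.layer2, self.layer3, self.layer4 = r.layer2, r.layer3, r.layer4
        # 1x1 convolutions mapping 512/1024/2048 channels to the common dimension d
        self.proj = nn.ModuleList([nn.Conv2d(c, d, 1) for c in (512, 1024, 2048)])

    def forward(self, img):                      # img: (B, 3, 640, 640)
        c3 = self.layer2(self.stem(img))         # (B,  512, 80, 80)
        c4 = self.layer3(c3)                     # (B, 1024, 40, 40)
        c5 = self.layer4(c4)                     # (B, 2048, 20, 20)
        return [p(c) for p, c in zip(self.proj, (c3, c4, c5))]  # V_1..V_3, each (B, d, H_i, W_i)

class LanguageEncoder(nn.Module):
    """BERT encoder producing token features E of shape (B, T, d)."""
    def __init__(self, d=D, max_len=20):
        super().__init__()
        self.tok = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, d)
        self.max_len = max_len

    def forward(self, sentences):
        # The tokenizer adds [CLS]/[SEP] and pads to T = max_len with [PAD]
        batch = self.tok(sentences, padding="max_length", truncation=True,
                         max_length=self.max_len, return_tensors="pt")
        out = self.bert(**batch).last_hidden_state   # (B, T, 768)
        return self.proj(out)                        # E: (B, T, d)
```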
Step 2: e and V are input into a cross-modal interactive attention module of the model. The module consists of two parts, a linear feature modulation module (FiLM) and a visual-guided language attention module. In the FiLM module, a feature-based affine transformation is applied to adaptively affect the output of the network. For a given language feature
Figure BDA0003809446600000071
Firstly, obtaining the whole expression E through a simple averaging strategy F And then specifically by:
Figure BDA0003809446600000072
Figure BDA0003809446600000073
Figure BDA0003809446600000074
wherein W i γ ,W i β
Figure BDA0003809446600000075
And
Figure BDA0003809446600000076
are the weights and the deviations of the two multi-layer perceptrons MLP with activation functions Tanh. For equation (3). An as
Figure BDA0003809446600000077
Respectively representing element-wise multiplication and addition. Finally, standard 3 × 3 convolution and ReLU operations are applied to generate the multi-level fusion features
Figure BDA0003809446600000078
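The FiLM-style modulation of equations (1)-(3) can be sketched as follows for one visual scale; the single-layer form of each MLP and the padding of the 3×3 convolution are assumptions, and E_F is the average of E over the token dimension as described above.

```python
import torch
import torch.nn as nn

class FiLMFusion(nn.Module):
    """Linear feature modulation of one visual scale by the sentence vector E_F (eqs. (1)-(3))."""
    def __init__(self, d=256):
        super().__init__()
        self.to_gamma = nn.Sequential(nn.Linear(d, d), nn.Tanh())  # W^gamma, b^gamma
        self.to_beta = nn.Sequential(nn.Linear(d, d), nn.Tanh())   # W^beta,  b^beta
        self.out = nn.Sequential(nn.Conv2d(d, d, 3, padding=1), nn.ReLU())  # 3x3 conv + ReLU

    def forward(self, v_i, e_f):                        # v_i: (B, d, H, W), e_f: (B, d)
        gamma = self.to_gamma(e_f)[:, :, None, None]    # (B, d, 1, 1), eq. (1)
        beta = self.to_beta(e_f)[:, :, None, None]      # (B, d, 1, 1), eq. (2)
        return self.out(gamma * v_i + beta)             # eq. (3): modulate element-wise, then conv + ReLU

# Usage: e_f is the average of the token features E over the token dimension,
# e.g. e_f = E.mean(dim=1); f_f_i = FiLMFusion()(v_i, e_f)
```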
For the visual-guided language attention module, the visual features V_i are first flattened into Z_i ∈ R^{N_i×d}, where N_i = H_i × W_i is the number of visual tokens. The attention output A_i is then computed from the language feature E and the visual feature Z_i by the following formula:
A_i = softmax((E W_i^Q)(Z_i W_i^K)^T / √d_m)(Z_i W_i^V)   (4)
where W_i^Q, W_i^K and W_i^V are embedding matrices. The queries, keys and values in the attention module are denoted by Q, K and V, respectively; d_m = d/m, m is the number of attention heads and d is the feature dimension. For simplicity, only one language attention module is used for each level of visual features. Then, A_i is further encoded by two Feed-Forward Networks (FFN) with residual connections to form the fused output F_t = {F_t^i}.
F_f and F_t are concatenated to obtain F_ft, and F_ft is then mapped to dimension d using three 1×1 convolutional layers. Finally, the combined multi-scale features are obtained.
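Equation (4) together with the two residual FFNs can be sketched with the standard multi-head attention primitive; taking the language features as queries and the flattened visual features as keys and values follows item 3 of the example below, while the FFN width, the layer normalization and the class name are assumptions.

```python
import torch
import torch.nn as nn

class LanguageAttentionFusion(nn.Module):
    """Eq. (4): language tokens attend to one flattened visual scale, then two residual FFNs."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn1 = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.ffn2 = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norm = nn.ModuleList([nn.LayerNorm(d) for _ in range(3)])

    def forward(self, e, v_i):                        # e: (B, T, d), v_i: (B, d, H, W)
        z_i = v_i.flatten(2).transpose(1, 2)          # Z_i: (B, N_i, d), N_i = H*W
        a_i, _ = self.attn(query=e, key=z_i, value=z_i)   # eq. (4)
        x = self.norm[0](e + a_i)                     # residual connection
        x = self.norm[1](x + self.ffn1(x))            # first FFN with residual
        f_t = self.norm[2](x + self.ffn2(x))          # second FFN with residual -> F_t^i
        return f_t                                    # (B, T, d)
```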
Step 3: a language-guided feature pyramid module (FPN) is constructed.
First, a routing space of depth K is constructed. In this routing space, the scale factor between adjacent stages is limited to 2. For each routing node, the input consists of two parts: the multi-level feature maps and an attention-based language vector. The grid-level features of each scale in each routing node are hard-selected by the language gate. In addition, since REC is a cross-modal task, a data-dependent joint gate module is established in each node, which further refines the grid-level features based on the aggregated information.
The input to the language gate consists of two parts: the multi-level feature maps and an attention-based language vector. The attention-based language vector is obtained by the following formulas:
a_k = softmax(E W_k)   (5)
e_k = Σ_{t=1}^{T} a_{k,t} E_t   (6)
where W_k ∈ R^{256×1} are learnable weights, k represents the depth, and the language vector e_k is shared across every scale and grid feature. The multi-scale feature maps can be expressed as X_{i,k} ∈ R^{N×d}, where i denotes the i-th scale, k the k-th layer, and N = H_i × W_i. Through equations (7) and (8), the language gate uses e_k to dynamically select the grid-level features in X_{i,k}: gate values are computed from X_{i,k} and e_k, and the selected output Y_{i,k} is obtained by re-weighting X_{i,k} with these gate values. In equations (7) and (8), * and · denote the convolution operation and the Hadamard product, respectively; Conv(·) denotes a 3×3 convolutional network and σ(·) is the activation function. Here max(0, tanh(·)) is used as the gate switch: when the input is negative, the output of the function is always 0, so no additional threshold is needed in the inference phase.
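Equations (7) and (8) appear only as formula images in the original filing, so the following sketch is one plausible reading built from the stated ingredients (a 3×3 convolution, the Hadamard product with the shared language vector e_k, and the max(0, tanh(·)) gate switch); it is not the patent's exact formula.

```python
import torch
import torch.nn as nn

class LanguageGate(nn.Module):
    """One plausible reading of eqs. (7)-(8): per-grid gates from X_{i,k} and the language vector e_k."""
    def __init__(self, d=256):
        super().__init__()
        self.conv = nn.Conv2d(d, 1, 3, padding=1)    # 3x3 conv mapping features to one gate channel

    def forward(self, x_ik, e_k):                    # x_ik: (B, d, H, W), e_k: (B, d)
        joint = x_ik * e_k[:, :, None, None]         # Hadamard product with the shared language vector
        gate = torch.clamp(torch.tanh(self.conv(joint)), min=0.0)  # max(0, tanh(.)) gate switch
        return gate * x_ik                           # selected grid-level features Y_{i,k}
```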
Then, the outputs Y_{i,k} are respectively up-sampled, kept unchanged, or down-sampled to match the smaller and larger scales (equation (9)).
The aggregated output in routing node l is then formed; to improve the efficiency of this deep routing network, a bottleneck module with residual connections is used. The fine-grained features of the aggregated output are further refined by the joint gate: specifically, the refined output is calculated by equation (10), where conv is a 1×1 convolutional network that maps the input features into a single channel and l denotes the l-th node. The nodes of the last layer are used for multi-scale fusion: equation (11) summarizes (averages) the information of the different nodes to obtain F_AVG, which is used as the input to the detection head.
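Likewise, equations (9)-(11) are only available as images; the sketch below shows one way a routing node could aggregate the resized gated features and refine them with a joint gate built from a single-channel 1×1 convolution. The aggregation by summation, the sigmoid activation of the joint gate and the bottleneck layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointGateNode(nn.Module):
    """One plausible routing node: aggregate resized neighbours, refine with a joint gate (eq. (10))."""
    def __init__(self, d=256):
        super().__init__()
        self.gate_conv = nn.Conv2d(d, 1, 1)          # 1x1 conv mapping features to one gate channel
        self.bottleneck = nn.Sequential(             # residual bottleneck for the deep routing network
            nn.Conv2d(d, d // 4, 1), nn.ReLU(),
            nn.Conv2d(d // 4, d // 4, 3, padding=1), nn.ReLU(),
            nn.Conv2d(d // 4, d, 1),
        )

    def forward(self, gated_features, target_hw):
        # gated_features: list of Y_{i,k} tensors (B, d, H_i, W_i) from neighbouring scales (eq. (9))
        resized = [F.interpolate(y, size=target_hw, mode="bilinear", align_corners=False)
                   for y in gated_features]
        agg = torch.stack(resized, dim=0).sum(dim=0)      # aggregated output of the node
        gate = torch.sigmoid(self.gate_conv(agg))         # joint gate from the aggregated information
        refined = gate * agg                              # eq. (10): refine the fine-grained features
        return refined + self.bottleneck(refined)         # residual bottleneck

# Final fusion (eq. (11)): resize the last-layer node outputs to a common size and average them, e.g.
# f_avg = torch.stack([F.interpolate(x, size=common_hw, mode="bilinear", align_corners=False)
#                      for x in last_layer_nodes]).mean(dim=0)
```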
Step 4: the target is located using an anchor-free detection head. For the output F_AVG of step 3, a feature map of shape w × h × 5 is first obtained using a 1×1 convolutional layer, giving five predicted values {t_x, t_y, t_w, t_h, t}, where the first two values represent the center offset and t_w and t_h are the normalized width and height, respectively. The last value t is the confidence score, indicating whether the center point of the object exists at that position. Finally, a cross-entropy loss L_cls is applied on the center confidence t, and an MSE loss L_off is applied on the center offset, width and height. At the same time, the GIoU loss is used as a supplemental loss. The loss terms are defined as follows, with the center-offset term
L_off = (Δx − t_x)² + (Δy − t_y)²   (14)
where C_ij = 1 or 0 indicates whether the current grid cell contains the center point of the ground-truth target; (Δx, Δy) represents the offset of the ground-truth center point to the center of the grid, x and y refer to the coordinates of the ground-truth center, and int(·) rounds the fraction to the nearest integer. L_off is computed only on the grid cell where the ground-truth center is located. The total loss function is as follows:
Loss = L_cls + λ_off L_off + L_giou   (15)
where λ_off is set to 5, and the network selects the center point with the highest score to generate the bounding box. The Intersection over Union (IoU) is the metric used in REC to measure the degree of overlap between the prediction and the ground truth. Following previous work, the present invention uses IoU@0.5 to measure prediction accuracy.
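A sketch of the anchor-free head and the total loss of equation (15) follows; the binary cross-entropy form of L_cls, the inclusion of width and height in the MSE term, and the precomputed GIoU term are assumptions consistent with the description rather than reproductions of the filed formulas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorFreeHead(nn.Module):
    """1x1 conv producing a (B, 5, w, h) map: t_x, t_y, t_w, t_h and a centre confidence t."""
    def __init__(self, d=256):
        super().__init__()
        self.pred = nn.Conv2d(d, 5, 1)

    def forward(self, f_avg):                 # f_avg: (B, d, w, h)
        return self.pred(f_avg)               # (B, 5, w, h)

def rec_loss(pred, centre_mask, offset_wh_target, giou_loss, lambda_off=5.0):
    """Eq. (15): Loss = L_cls + lambda_off * L_off + L_giou (sketch).

    pred:             (B, 5, w, h) output of AnchorFreeHead
    centre_mask:      (B, w, h) float tensor with C_ij = 1 at the ground-truth centre cell, 0 elsewhere
    offset_wh_target: (B, 4) target (dx, dy, w, h) for the ground-truth centre cell
    giou_loss:        precomputed auxiliary GIoU loss (scalar tensor)
    """
    conf = pred[:, 4]                                              # centre confidence t
    l_cls = F.binary_cross_entropy_with_logits(conf, centre_mask)  # cross-entropy on the centre map
    b = pred.shape[0]
    idx = centre_mask.flatten(1).argmax(dim=1)                     # index of the ground-truth centre cell
    box = pred[:, :4].flatten(2)[torch.arange(b), :, idx]          # (B, 4) predictions at that cell
    l_off = F.mse_loss(box, offset_wh_target)                      # MSE on offset, width and height
    return l_cls + lambda_off * l_off + giou_loss
```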
Example 1:
1. image feature extraction
Given a picture of a natural scene, the whole picture is resized to 640 × 640 and fed into the feature extraction network for forward propagation. This embodiment adopts ResNet-101 to extract the image features. Feature maps of three scales, 20 × 20 × 2048, 40 × 40 × 1024 and 80 × 80 × 512, are obtained, and the three feature maps are then mapped to the same dimension d = 256 using 1 × 1 convolutions.
2. Extraction of language features
The sentence is decomposed into words, and each word is embedded to obtain its feature vector. In this embodiment, the maximum number of words is 20. The position-encoded word vectors are then input into a BERT network to obtain the feature vectors E of all tokens fused with sentence information, where E ∈ R^{20×256}.
3. Feature fusion with cross-modal attention
The image features are flattened into matrices of size 400 × 256, 1600 × 256 and 6400 × 256 and input into the cross-modal fusion module together with the language features (20 × 256). In the language-guided visual attention module, the language features are input as Q and the image features are input as K and V. Each module consists of 2 identical self-attention layers with 8 attention heads. In FiLM, an affine transformation conditioned on the given language feature is applied to obtain a fused feature. Finally, the two fused features are concatenated and remapped to dimension d using 1 × 1 convolutions to obtain the final fused features.
4. Multi-scale fusion based on language guidance
The feature maps of the three scales fused in the previous stage and the language features are input into the routing nodes. The language first screens the grids of the feature maps through the language gate; up-sampling and down-sampling are then performed, and the original-resolution, up-sampled and down-sampled feature maps are sent to the joint gate for information aggregation. A bottleneck module with residual connections between the routing nodes of each layer ensures the effectiveness of the deep network; there are six layers in total. In the last layer of routing nodes, the feature maps of the three scales are combined into the final fused feature through an averaging strategy.
5. Target localization
For the output of the previous stage, a w × h × 5 feature map is first obtained using a 1 × 1 convolutional layer, giving five predicted values {t_x, t_y, t_w, t_h, t}, where the first two values represent the center offset and t_w and t_h represent the normalized width and height, respectively. The last value t is the confidence score, indicating whether the center point of the object exists at that position. During inference, the network selects the center point with the highest score to generate a bounding box.
6. Model training
The whole training process is end-to-end. In this embodiment, the four datasets RefCOCO, RefCOCO+, RefCOCOg and ReferItGame are used for model training and evaluation. The Adam method is used as the optimizer, the batch size is set to 8, and the initial learning rate is set to 1e-4. This example performs 20 rounds of training on a 1080Ti GPU, with the learning rate halved every 10 rounds.
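A minimal optimizer and schedule sketch matching these hyper-parameters is shown below; `model` and `train_loader` are placeholders for the full network and data pipeline of this embodiment.

```python
import torch

# Assumed placeholders: `model` is the full REC network, `train_loader` yields (images, sentences, targets)
# with a batch size of 8 set in the DataLoader.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)                          # Adam, initial lr 1e-4
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)    # halve lr every 10 epochs

for epoch in range(20):                              # 20 training rounds
    for images, sentences, targets in train_loader:
        loss = model(images, sentences, targets)     # end-to-end forward pass returning the total loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```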
7. Model application
After the training process, several models are obtained, and the optimal model (the one with the best result on the test set) is selected for application. For an input image and sentence, the image only needs to be resized to 640 × 640 and normalized, and the sentence tokenized, to serve as the input to the model. The parameters of the whole network model are fixed; only the image data and language data are input and propagated forward. The image and language feature vectors V and E are obtained in turn and then automatically passed through the cross-modal feature module, the language-guided multi-scale fusion module and the localization module to directly obtain the prediction result. The actual effect is shown in FIG. 2; based on this method, the precise position in the image of the target described by the given sentence can be given efficiently.
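For illustration, the inference-time preprocessing described above could look as follows; the ImageNet normalization statistics and the tokenizer settings are assumptions, since the patent only states that the image is resized to 640 × 640 and normalized and the sentence is tokenized.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import BertTokenizer

# Assumed ImageNet normalization statistics.
preprocess = transforms.Compose([
    transforms.Resize((640, 640)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def predict(model, image_path, sentence):
    """Run one fixed, trained model on a single image/sentence pair (sketch)."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)   # (1, 3, 640, 640)
    tokens = tokenizer(sentence, padding="max_length", truncation=True,
                       max_length=20, return_tensors="pt")
    with torch.no_grad():                       # parameters are fixed; forward propagation only
        box = model(image, tokens)              # predicted bounding box for the described target
    return box
```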
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present disclosure.

Claims (3)

1. A reference expression understanding method based on multi-scale cross-modal feature fusion, characterized by comprising the following steps:
step 1: first resizing the pictures to the same size, then extracting feature maps at n scales through ResNet-101, and mapping the feature maps to a common dimension d through 1×1 convolution to obtain the multi-scale visual features V = {V_i}, V_i ∈ R^{H_i×W_i×d}, i = 1, …, n;
for the language information, first decomposing it into words and embedding the words to obtain the feature vector corresponding to each word; defining the maximum number of words in a sentence as T, and padding sentences with fewer than T words with PAD tokens; adding a CLS token at the beginning of the sentence and an SEP token at its end; inputting the position-encoded word vectors into a BERT network to obtain the feature vectors of all tokens fused with sentence information, E ∈ R^{T×d};
step 2: inputting E and V into the cross-modal interactive attention module of the model, wherein the cross-modal interactive attention module consists of two parts, namely a linear feature modulation (FiLM) module and a visual-guided language attention module; in the FiLM module, a feature-based affine transformation is applied to adaptively influence the output of the network; for the given language feature E ∈ R^{T×d}, the sentence-level representation E_F is obtained through an averaging strategy, and then specifically:
γ_i = Tanh(W_i^γ E_F + b_i^γ)   (1)
β_i = Tanh(W_i^β E_F + b_i^β)   (2)
F_f^i = ReLU(Conv(γ_i ⊙ V_i ⊕ β_i))   (3)
wherein W_i^γ, W_i^β and b_i^γ, b_i^β are the weights and biases of two multi-layer perceptron MLPs with activation function Tanh; in equation (3), ⊙ and ⊕ denote element-wise multiplication and addition, respectively, and Conv with ReLU denotes the standard 3×3 convolution and ReLU operations that are finally applied to generate the multi-level fusion features F_f = {F_f^i};
for the visual-guided language attention module, the visual features V_i are first flattened into Z_i ∈ R^{N_i×d}, wherein N_i = H_i × W_i is the number of visual tokens, and the attention output A_i is then calculated from the language feature E and the visual feature Z_i by the following formula:
A_i = softmax((E W_i^Q)(Z_i W_i^K)^T / √d_m)(Z_i W_i^V)   (4)
wherein W_i^Q, W_i^K and W_i^V are embedding matrices, the queries, keys and values in the attention module are denoted by Q, K and V, respectively, d_m = d/m, m is the number of attention heads and d is the feature dimension; for simplicity, only one language attention module is used for each level of visual features; then, A_i is further encoded by two feed-forward networks FFN with residual connections to form the fused output F_t = {F_t^i};
F_f and F_t are concatenated to obtain F_ft, and F_ft is then mapped to dimension d using three 1×1 convolutional layers; finally, the combined multi-scale features are obtained;
step 3: constructing a language-guided feature pyramid module FPN
first, constructing a routing space of depth K, wherein in the routing space the scale factor between adjacent stages is limited to 2; for each routing node, the input consists of two parts: the multi-level feature maps and an attention-based language vector; the grid-level features of each scale in each routing node are hard-selected by the language gate;
the input to the language gate consists of two parts: the multi-level feature maps and an attention-based language vector; the attention-based language vector is obtained by the following formulas:
a_k = softmax(E W_k)   (5)
e_k = Σ_{t=1}^{T} a_{k,t} E_t   (6)
wherein W_k ∈ R^{256×1} are learnable weights, k represents the depth, and the language vector e_k is shared across every scale and grid feature; the multi-scale feature maps can be expressed as X_{i,k} ∈ R^{N×d}, wherein i denotes the i-th scale, k the k-th layer, and N = H_i × W_i; through equations (7) and (8), the language gate uses e_k to dynamically select the grid-level features in X_{i,k}: gate values are computed from X_{i,k} and e_k, and the selected output Y_{i,k} is obtained by re-weighting X_{i,k} with these gate values; in equations (7) and (8), * and · denote the convolution operation and the Hadamard product, respectively, Conv(·) denotes a 3×3 convolutional network, and σ(·) is the activation function; here max(0, tanh(·)) is used as the gate switch; when the input is negative, the output of the function is always 0, so that no additional threshold is needed in the inference phase;
then, the output Y is output i,k Respectively carrying out up-sampling, unchanged holding and down-sampling operations from a small scale to a large scale; the specific operation is as follows:
Figure FDA0003809446590000031
use of
Figure FDA0003809446590000032
To represent the aggregated output in routing node I,
Figure FDA0003809446590000033
the fine-grained features in (1) are further refined by the associative gate, specifically,
Figure FDA0003809446590000034
will be calculated by:
Figure FDA0003809446590000035
Figure FDA0003809446590000036
where conv is a 1 x 1 convolutional network that maps the input features into a channel, l denotes the ith node; the nodes of the last layer are used for multi-scale fusion, and the fusion mode is as follows:
Figure FDA0003809446590000037
formula (11) summarizes the information of different nodes to obtain F AVG It will be used as an input to the detection head;
and 4, step 4: locating a target using an anchorless sensing head
For the output F of step 3 AVG First, a feature map having a shape of w × h × 5 is obtained using a 1 × 1 convolutional layer, and five predicted values { t } are shown x ,t y ,t w ,t h t, where the first two values represent the center offset, t w And t h Respectively, normalized width and height; the last t is a confidence score representing whether the center point of the object exists at the position; finally, applying the cross entropy loss L on the center point t cls Applying MSE loss L in center offset, width and height off (ii) a Simultaneously, the GIoU loss is used as an auxiliary loss; finally, the whole function is defined as:
Figure FDA0003809446590000038
L off =(Δx-t x ) 2 +(Δy-t y ) 2 (14)
C ij =1 or 0 indicates whether the current grid contains the center point of the true target,
Figure FDA0003809446590000039
Figure FDA00038094465900000310
represents the offset of the center point to the center of the grid, wherein x and y refer to
Figure FDA00038094465900000311
int (·) denotes that this operation rounds the fraction to the nearest integer; performing L only on the grid where the truth value center is positioned off (ii) a The total loss function is as follows:
Loss=L clsoFF L off +L giou (15)
wherein λ is off Setting to 5, the network selects the central point of the highest score to generate a bounding box; the IoU is a metric used in REC to measure the degree of overlap between prediction and reality.
2. A computer system, comprising: one or more processors, a computer readable storage medium, for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
3. A computer-readable storage medium having stored thereon computer-executable instructions, which when executed, perform the method of claim 1.
CN202211009462.0A 2022-08-22 2022-08-22 Reference expression understanding method based on multi-scale cross-modal feature fusion Pending CN115496991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211009462.0A CN115496991A (en) 2022-08-22 2022-08-22 Reference expression understanding method based on multi-scale cross-modal feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211009462.0A CN115496991A (en) 2022-08-22 2022-08-22 Reference expression understanding method based on multi-scale cross-modal feature fusion

Publications (1)

Publication Number Publication Date
CN115496991A true CN115496991A (en) 2022-12-20

Family

ID=84465769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211009462.0A Pending CN115496991A (en) 2022-08-22 2022-08-22 Reference expression understanding method based on multi-scale cross-modal feature fusion

Country Status (1)

Country Link
CN (1) CN115496991A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117764085A (en) * 2024-01-11 2024-03-26 华中师范大学 Machine reading understanding method based on cross-graph collaborative modeling



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination