CN116543146A - Image dense description method based on window self-attention and multi-scale mechanism - Google Patents

Image dense description method based on window self-attention and multi-scale mechanism

Info

Publication number
CN116543146A
Authority
CN
China
Prior art keywords: image, feature, scale, follows, region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310822911.1A
Other languages
Chinese (zh)
Other versions
CN116543146B (en)
Inventor
邓宏宇
王崎
王建军
吴雪
张邦梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University filed Critical Guizhou University
Priority to CN202310822911.1A priority Critical patent/CN116543146B/en
Publication of CN116543146A publication Critical patent/CN116543146A/en
Application granted granted Critical
Publication of CN116543146B publication Critical patent/CN116543146B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/34: Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/809: Fusion of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/811: Fusion of classification results where the classifiers operate on different input data, e.g. multi-modal recognition
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image dense description method based on window self-attention and a multi-scale mechanism, which is composed of a target detector and a region description generator. In the target detector, the input image undergoes image characterization learning and extraction via a window-attention-based feature encoder formed by stacking 12 ViT modules; in each module, the image feature map is divided into several windows of equal size and an in-window attention operation is performed. From the feature encoder output, 5 image features at different scales are calculated, the position information of the key regions is predicted through a target detection head, and the model cuts regional features from the multi-scale features accordingly. The region description generator takes a pre-trained BERT model as its core and generates the region descriptions in an autoregressive manner according to the input global characterization and regional features. The invention can accurately capture multiple key objects of an image and generate high-quality descriptions.

Description

Image dense description method based on window self-attention and multi-scale mechanism
Technical Field
The invention relates to the field of computer vision and natural language processing, in particular to an image dense description method based on a window self-attention network and multi-scale features.
Background
Image dense description (dense image captioning) is a higher-level task than open-world object detection: it requires a model to detect the salient regions of an input image and describe the content of each region with a short sentence. It is an artificial intelligence method that combines computer vision technology with natural language processing technology.
Compared with current conventional target detection methods, the image dense description method has stronger image recognition capability and a wider object recognition range, and can recognize object categories outside the training set. In operation, the image dense description method describes the recognized objects in the form of human language, which is closer to the way humans think about and perceive the world, and is an important technology for building strong artificial intelligence in the future.
In contrast to the conventional image description method, the dense image description method does not understand and summarize the global content of the image; instead, it locates multiple ROIs (regions of interest) of the image and generates a description for each. This working mode retains the key information of the image more effectively and conveys the content of interest to the user.
Image dense description techniques may be used for image retrieval tasks, searching for images that contain particular visual concepts or scenes by generating natural language descriptions for the various regions of an image; for image understanding and analysis tasks, assisting in analyzing and understanding complex images containing multiple objects, actions and interactions; and for image editing and modification tasks, assisting the user in editing and manipulating images by providing natural language commands based on the dense descriptions.
At present, image dense description methods mainly rely on a convolutional neural network for feature extraction from the input image and on a recurrent neural network for describing the regional features. This approach, while easy to implement, has several problems:
1. The convolutional neural network has limited image extraction and cognitive ability: it has strong local feature extraction ability, but the global characterization of the image is difficult to capture. In addition, when the convolutional neural network architecture becomes too complex, the training difficulty of the model increases significantly and the performance is difficult to improve further. These drawbacks mean that dense description models based on convolutional neural networks cannot process complex input images;
2. Because the recurrent neural network cannot generate the region descriptions in parallel, and its computation and time consumption are large, the image dense description process is long and inefficient. Furthermore, recurrent neural networks have inherent drawbacks with long sequences, resulting in poor quality of the generated descriptions.
Chinese patent application publication No. CN114037831A, published on 11 February 2022, discloses an image depth dense description method, system and storage medium that uses a basic convolutional neural network for feature extraction, with low performance and low efficiency. That application directly uses an RPN network to extract regions of interest on the image feature map, so regions of interest of different sizes are not explored comprehensively enough; it also uses an LSTM network to generate a description for each region of interest, with only average speed and low description quality.
Disclosure of Invention
The invention aims to overcome the above defects and provide an image dense description method based on a window self-attention network and multi-scale features, which can accurately capture multiple key objects of an image and generate high-quality descriptions.
The invention discloses an image dense description method based on a window self-attention network and multi-scale features, which comprises the following steps:
Step 1, coarse processing of the input image X: an image X of size 1024×1024 is input, X is divided into a number of image blocks of size k, and coarse processing is performed with a convolution kernel of size k to obtain the coarse image feature X';
Step 2, calculation of the global image characterization V_f: the coarse image feature X' is input into a pre-trained ViT model serving as the feature encoder of the image to obtain the global characterization V_f of the image; the ViT model is formed by stacking multiple Transformer modules; in each Transformer module the image representation is divided into windows of size α and attention is computed only between the pixels within each window; the global characterization V_f is finally obtained through the multi-layer network computation;
Step 3, multi-scale feature acquisition: the global characterization V_f obtained in the previous step is passed through 5 different convolutional neural network branches to obtain the multi-scale feature set F = {f_1, f_2, f_3, f_4, f_5}, so as to adapt to target detection at different sizes;
Step 4, salient target prediction and regional feature extraction:
Step 4.1, salient target prediction: for the multi-scale feature set F = {f_1, f_2, f_3, f_4, f_5}, 5 independent prediction network heads are used to identify the targets contained in each image feature;
The input of the i-th prediction network head is f_i; local features are extracted from f_i with a convolution layer of kernel size 3, then processed by GroupNorm (group normalization), and finally passed through the ReLU activation function, as follows:
f_i' = ReLU(GroupNorm(Conv(f_i)))  (Equation 7)
The above procedure is repeated 4 times;
Each prediction network head sets learnable parameters A_i and M_i, which are respectively added to and multiplied with f_i', as follows:
f_i'' = (f_i' + A_i)·M_i  (Equation 8)
For the spatial feature f_i'', two convolutional network branches are adopted to obtain the predicted spatial coordinates bbox_i and the confidence agn_i at this scale, as follows:
bbox_i = ReLU(Conv(f_i''))  (Equation 9)
agn_i = Conv(f_i'')  (Equation 10)
Step 4.2, target detector training and loss function: for the salient region prediction result BBOX = {bbox_1, bbox_2, bbox_3, bbox_4, bbox_5} of the target detector, the target closest to each prediction result is found in the training data set, and this target set is defined as Target; the metric CIOU is used to measure the difference between a prediction result and the actual target, and is proportional to the performance of the target detector; with the prediction region defined as g and the actual target region as t, CIOU is calculated as follows:
CIOU = IOU - ρ²(g,t)/c² - βv  (Equation 11)
IOU = |g∩t| / |g∪t|  (Equation 12)
v = 4/π²·(arctan(w_t/m_t) - arctan(w/m))²  (Equation 13)
β = v / (1 - IOU + v)  (Equation 14)
where ρ denotes the Euclidean distance between the center points of the prediction region g and the actual target region t, c denotes the diagonal distance of the minimum enclosing region of g and t, w_t and m_t denote the width and height of the actual target region t, and w and m denote the width and height of the prediction region g;
The loss function L_dec for training the target detector is calculated as follows:
L_dec = 1 - CIOU  (Equation 15)
Step 4.3, regional feature extraction: according to the salient region prediction result BBOX of the target detector, the features of the corresponding regions are cropped from the multi-scale feature set F; the regional feature set is denoted R;
Step 5, image dense description generation:
Step 5.1, acquisition of the text feature T': according to the target set Target determined in step 4.2, the natural language descriptions corresponding to the regions are collected from the training data set, and this set of natural language descriptions is defined as TargetText; the word embedding layer of a pre-trained BERT model converts TargetText into word vector features, defined as T; for the n-dimensional word vector feature T, the word vector position encoding PE_n is calculated and superimposed on T to obtain the text feature T', as follows:
PE_n = {PE_(pos,2i) = sin(pos/1000^(2i/n)), PE_(pos,2i+1) = cos(pos/1000^(2i/n))}  (Equation 16)
T' = T + PE_n  (Equation 17)
where pos ∈ [1, 2, …] and i ∈ [0, 1, …, n/2];
Step 5.2, description generation: the regional feature set R is mapped to a high-dimensional space with a fully connected layer and denoted as the high-dimensional regional feature R'; the global characterization V_f, the high-dimensional regional feature R' and the text feature T' are concatenated to obtain the multi-modal feature H, as follows:
H = Concat(V_f, R', T')  (Equation 18)
The description generator takes the multi-modal feature H as input and fuses the multi-modal information with a pre-trained BERT model; the BERT model is formed by stacking multiple Transformer network layers, each of which performs self-attention computation on the input multi-modal feature H; the calculation result of the BERT model is denoted H';
With the scale of the vocabulary built into the model defined as E_voc, a fully connected layer maps H' to the E_voc-dimensional space and the result is processed with a softmax function; the output is defined as pro_l, as follows:
pro_l = softmax(Linear(H'))  (Equation 25)
where l is the maximum length of the generated region description; pro_l^i is defined as the predicted probability of each word at the i-th position of the generated region description, and the word with the maximum probability is taken as the candidate word w_i for that position; finally, the region description W = {w_1, w_2, …, w_l} is generated;
Step 5.3, description generator training and loss function: the natural language description set TargetText is converted into one-hot codes of length E_voc for calculating the loss function; label smoothing is performed on the one-hot codes; with a one-hot code defined as h, the label smoothing result h' is obtained as follows:
h' = (1.0 - eps)·h + eps/E_voc  (Equation 26)
where eps is a small constant chosen in this technical scheme;
After this, the loss L_ce of the dense description is calculated with the cross-entropy function; L_ce is calculated as follows:
L_ce = -Σ_{i=1}^{N} log(p(y_i* | y_{1:i-1}*))  (Equation 27)
where y_{1:N}* is a region description from TargetText of length N, p is the probability predicted by the description generator, and y_i* denotes the character at position i of the region description.
In the above image dense description method based on a window self-attention network and multi-scale features, in step 2, with the window size of the i-th Transformer layer set to α, the input feature V_i of the network layer is first padded at the edges so that its size is an integer multiple of the window size, and the padded feature is evenly divided into several window feature sets of equal size, denoted V_i'; then the window feature set V_i' is passed through three fully connected layers to compute the query vector q_i, the key vector k_i and the value vector v_i, which are evenly divided into nhead parts along the last dimension, as follows:
q_i = Div(Linear(V_i'), nhead)  (Equation 1)
k_i = Div(Linear(V_i'), nhead)  (Equation 2)
v_i = Div(Linear(V_i'), nhead)  (Equation 3)
The query vector q_i is multiplied by the transpose k_i^T of the key vector k_i and processed with a softmax function to compute the attention matrix Attn_i between the pixels within the window, as follows:
Attn_i = softmax(q_i·k_i^T)  (Equation 4)
With the size of the last dimension of the value vector v_i set to d, Attn_i is multiplied with v_i as follows:
A_{i+1} = Attn_i/d^{1/2}·v_i  (Equation 5)
The calculation result A_{i+1} is restored, according to the position of each window, to the same shape as the input feature V_i and denoted A_{i+1}'; A_{i+1}' then passes through the subsequent feed-forward network module FFN_i so that a better image representation V_{i+1} is learned, as follows:
V_{i+1} = FFN_i(A_{i+1}') = Linear(ReLU(Linear(A_{i+1}')))  (Equation 6)
Through the multi-layer network computation, the global characterization V_f is finally obtained.
In the above image dense description method based on a window self-attention network and multi-scale features, the multi-scale feature set in step 3 is obtained through 5 different convolutional neural network branches as follows:
Step 3.1, feature acquisition at 1/8 scale: the global characterization V_f is up-sampled with a deconvolution layer of kernel size 2, mapped with a convolution layer of kernel size 1, and the image feature f_1 at 1/8 scale is then extracted with a convolution layer of kernel size 3;
Step 3.2, feature acquisition at 1/16 scale: the global characterization V_f is mapped with a convolution layer of kernel size 1, and the image feature f_2 at 1/16 scale is extracted with a convolution layer of kernel size 3;
Step 3.3, feature acquisition at 1/32 scale: the global characterization V_f is down-sampled by max pooling, mapped with a convolution layer of kernel size 1, and the image feature f_3 at 1/32 scale is extracted with a convolution layer of kernel size 3;
Step 3.4, feature acquisition at 1/64 scale: f_3 is down-sampled with a convolution layer of kernel size 2 and stride 2 to obtain the image feature f_4 at 1/64 scale;
Step 3.5, feature acquisition at 1/128 scale: f_4 is processed with the ReLU activation function and then down-sampled with a convolution layer of kernel size 2 and stride 2 to obtain the image feature f_5 at 1/128 scale.
In the above image dense description method based on a window self-attention network and multi-scale features, for the i-th Transformer layer of the BERT model in step 5.2, the input feature H_i of the network layer is passed through three fully connected layers to compute the query vector Hq_i, the key vector Hk_i and the value vector Hv_i, which are evenly divided into nhead parts along the last dimension, as follows:
Hq_i = Div(Linear(H_i), nhead)  (Equation 19)
Hk_i = Div(Linear(H_i), nhead)  (Equation 20)
Hv_i = Div(Linear(H_i), nhead)  (Equation 21)
The query vector Hq_i is multiplied by the transpose Hk_i^T of the key vector Hk_i and processed with a softmax function to compute the attention matrix HAttn_i, as follows:
HAttn_i = softmax(Hq_i·Hk_i^T)  (Equation 22)
With the size of the last dimension of the value vector Hv_i set to hd, HAttn_i is multiplied with Hv_i as follows:
HA_{i+1} = HAttn_i/hd^{1/2}·Hv_i  (Equation 23)
HA_{i+1} then passes through the subsequent feed-forward network module FFN_i so that a better multi-modal representation H_{i+1} is learned, as follows:
H_{i+1} = FFN_i(HA_{i+1}) = Linear(ReLU(Linear(HA_{i+1})))  (Equation 24)
Through the multi-layer network computation, the multi-modal characterization H' is finally obtained.
Compared with the prior art, the invention has obvious beneficial effects, as can be seen from the above technical scheme: the invention is composed of a target detector and a region description generator. The target detector is used to explore the key regions of the input image, predict the spatial coordinates of the regions, and extract regional feature maps. Inside the target detector, the input image undergoes image characterization learning and extraction via a window-attention-based feature encoder, which is formed by stacking 12 ViT (Vision Transformer) modules; in each module, the image feature map is divided into several windows of equal size and an in-window attention operation is performed. From the global image characterization output by the feature encoder, 5 image features at different scales are calculated through a feature pyramid, and the position information of the key regions is predicted through the target detection heads. Based on the predicted key region position information, the model cuts regional features from the multi-scale features and inputs them into the region description generator. The region description generator takes a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model as its core and generates the region descriptions in an autoregressive manner according to the input global characterization and regional features. The invention uses a Vision Transformer as the feature extraction network and introduces an image window mechanism on top of it, computing attention only within each window; its performance is superior to that of a traditional convolutional neural network and its efficiency is higher. The invention uses convolutional neural networks to convert the image feature map into several features at different scales and identifies regions of interest with multiple parallel detection heads, so that regions of interest of different sizes can be discovered more comprehensively. The invention uses a BERT network to generate descriptions in an autoregressive manner, which is faster and yields higher description quality.
Drawings
FIG. 1 is a schematic diagram of a target detector of the present invention;
FIG. 2 is a schematic diagram of the description generator of the present invention.
Detailed Description
The following describes in detail specific embodiments of the image dense description method proposed by the present invention.
An image dense description method based on a window self-attention network and multi-scale features comprises the following steps:
Step 1, coarse processing of the input image X:
The data set used in the training process is the Visual Genome data set, which contains 108,077 pictures in total; in each picture, several target objects are framed by manual annotation, and a corresponding description is attached to each of them. In the training process, the ratio of the training set to the test set is set to 20:1 and the number of training iterations is set to 180,000; in the training stage, the proposed image dense description method processes four pictures at a time;
The size of the input image X is calculated; if the height or width of the image is larger than 1024 pixels, X is cropped and scaled so that the input size is 1024×1024; X is divided into a number of image blocks of size 16, and coarse processing is performed with a convolution kernel of size 16 to obtain the coarse image feature X', whose number of channels is 768; the processed coarse image feature then enters the target detector;
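As a minimal sketch of this coarse-processing step (a ViT-style patch embedding), the following PyTorch snippet is illustrative only; the class name PatchEmbed is an assumption, while the patch size 16, stride 16 and 768 output channels follow the embodiment above.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Coarse processing: split a 1024x1024 image into 16x16 blocks with a stride-16 convolution."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        # A k x k convolution with stride k embeds each k x k image block into one feature vector.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, 3, 1024, 1024) -> X': (B, 768, 64, 64)
        return self.proj(x)

if __name__ == "__main__":
    img = torch.randn(1, 3, 1024, 1024)
    coarse = PatchEmbed()(img)
    print(coarse.shape)  # torch.Size([1, 768, 64, 64])
```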
Step 2, calculation of the global image characterization V_f:
The processed coarse image feature enters the target detector, as shown in FIG. 1. The coarse image feature X' is input into a pre-trained ViT model serving as the feature encoder of the image to obtain the global characterization V_f of the image. The ViT model is formed by stacking 12 Transformer modules; a window mechanism that divides the input feature into several windows is adopted in all Transformer modules except the 3rd, 6th, 9th and 12th layers. The image representation is divided into windows of size α, and only the attention between pixels within each window is calculated. The window size α is set to 14.
Taking the i-th Transformer layer as an example, with the window size set to α, the input feature V_i of the network layer is first padded at the edges so that its size is an integer multiple of the window size. The padded feature is evenly divided into several window feature sets of equal size, denoted V_i'.
Then, the window feature set V_i' is passed through three fully connected layers to compute the query vector q_i, the key vector k_i and the value vector v_i, which are evenly divided into nhead parts along the last dimension, as follows:
q_i = Div(Linear(V_i'), nhead)  (Equation 1)
k_i = Div(Linear(V_i'), nhead)  (Equation 2)
v_i = Div(Linear(V_i'), nhead)  (Equation 3)
nhead is set to 12, i.e. the query vector q_i, the key vector k_i and the value vector v_i are each evenly divided into 12 parts.
The query vector q_i is multiplied by the transpose k_i^T of the key vector k_i and processed with a softmax function to compute the attention matrix Attn_i between the pixels within the window, as follows:
Attn_i = softmax(q_i·k_i^T)  (Equation 4)
With the size of the last dimension of the value vector v_i set to d, Attn_i is multiplied with v_i as follows:
A_{i+1} = Attn_i/d^{1/2}·v_i  (Equation 5)
The calculation result A_{i+1} is restored, according to the position of each window, to the same shape as the input feature V_i and denoted A_{i+1}'; A_{i+1}' then passes through the subsequent feed-forward network module FFN_i so that a better image representation V_{i+1} is learned, as follows:
V_{i+1} = FFN_i(A_{i+1}') = Linear(ReLU(Linear(A_{i+1}')))  (Equation 6)
Through the multi-layer network computation, the global characterization V_f is finally obtained.
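The in-window attention of Equations 1-6 can be sketched as follows, assuming the feature map has already been padded to a multiple of the window size; the helper names window_partition and WindowAttention and the 4x feed-forward expansion are assumptions, while the window size 14, 12 heads and 768 channels follow the embodiment (residual connections and normalization inside the ViT block are omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def window_partition(x, win):
    # x: (B, H, W, C) -> (num_windows*B, win*win, C); H and W are assumed multiples of win.
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)

class WindowAttention(nn.Module):
    """Multi-head attention restricted to pixels inside each window (Equations 1-5)."""
    def __init__(self, dim=768, nhead=12):
        super().__init__()
        self.nhead, self.d = nhead, dim // nhead
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        # FFN_i of Equation 6; the 4x hidden width is an assumption.
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, windows):                              # windows: (N, L, C), L = win*win
        N, L, C = windows.shape
        split = lambda t: t.view(N, L, self.nhead, self.d).transpose(1, 2)
        q, k, v = split(self.q(windows)), split(self.k(windows)), split(self.v(windows))
        attn = F.softmax(q @ k.transpose(-2, -1), dim=-1)    # Equation 4
        a = (attn / self.d ** 0.5) @ v                       # Equation 5
        a = a.transpose(1, 2).reshape(N, L, C)
        return self.ffn(a)                                   # Equation 6

if __name__ == "__main__":
    feat = torch.randn(1, 56, 56, 768)                       # already padded to a multiple of 14
    out = WindowAttention()(window_partition(feat, 14))
    print(out.shape)                                         # torch.Size([16, 196, 768])
```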
Step 3, multi-scale feature acquisition:
taking the global characterization V obtained in the previous step f Respectively obtaining a multi-scale feature set F= { F through 5 different convolution neural network branches 1 ,f 2 ,f 3 ,f 4 ,f 5 -to adapt to target detection of different sizes;
step 3.1 feature acquisition with a scale of 1/8, and global characterization V by using a deconvolution layer with a convolution kernel size of 2 f After up-sampling, mapping is carried out by using a convolution layer with the convolution kernel size of 1, and then the image feature f with the scale of 1/8 is extracted by using a convolution layer with the convolution kernel size of 3 1
Step 3.2 feature acquisition with a 1/16 scale, globally characterizing V with a convolution layer with a convolution kernel size of 1 f Mapping, and extracting image feature f with scale of 1/16 by using convolution layer with convolution kernel size of 3 2
Step 3.3 feature acquisition with 1/32 of the scale, global characterization V f After downsampling by the sampling maximum pooling method, mapping by using a convolution layer with a convolution kernel size of 1, and extracting image features f with a scale of 1/32 by using a convolution layer with a convolution kernel size of 3 3
Step 3.4 feature acquisition with 1/64 of the scale for f 3 Downsampling is carried out by adopting a convolution layer with the convolution kernel size of 2 and the sampling step length of 2, and the image characteristic f with the scale of 1/64 is obtained 4
Step 3.5 feature acquisition with 1/128 of the scale, f 4 After processing by using an activation function ReLU, downsampling is carried out by using a convolution layer with a convolution kernel size of 2 and a sampling step length of 2, and an image feature f with a scale of 1/128 is obtained 5
f 1 ,f 2 ,f 3 ,f 4 ,f 5 The number of channels is unified to 256.
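A sketch of the five scale branches in steps 3.1-3.5 follows; the module name MultiScaleNeck and the padding choices are assumptions, while the kernel sizes, the max pooling, the stride-2 down-sampling layers and the unified 256 output channels follow the text above.

```python
import torch
import torch.nn as nn

class MultiScaleNeck(nn.Module):
    """Derive f1..f5 (strides 8 to 128) from the 1/16-scale global characterization Vf."""
    def __init__(self, in_dim=768, out_dim=256):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_dim, in_dim, kernel_size=2, stride=2)    # step 3.1: 2x upsample
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)                        # step 3.3: 2x downsample
        # Shared pattern for the first three branches: 1x1 mapping followed by a 3x3 convolution.
        self.map1, self.map2, self.map3 = (nn.Sequential(
            nn.Conv2d(in_dim, out_dim, 1), nn.Conv2d(out_dim, out_dim, 3, padding=1)) for _ in range(3))
        self.down4 = nn.Conv2d(out_dim, out_dim, kernel_size=2, stride=2)        # step 3.4
        self.down5 = nn.Sequential(nn.ReLU(), nn.Conv2d(out_dim, out_dim, 2, 2)) # step 3.5

    def forward(self, vf):                      # vf: (B, 768, 64, 64) for a 1024x1024 input
        f1 = self.map1(self.up(vf))             # 1/8 scale
        f2 = self.map2(vf)                      # 1/16 scale
        f3 = self.map3(self.pool(vf))           # 1/32 scale
        f4 = self.down4(f3)                     # 1/64 scale
        f5 = self.down5(f4)                     # 1/128 scale
        return [f1, f2, f3, f4, f5]

if __name__ == "__main__":
    feats = MultiScaleNeck()(torch.randn(1, 768, 64, 64))
    print([f.shape[-1] for f in feats])         # [128, 64, 32, 16, 8]
```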
Step 4, salient target prediction and regional feature extraction:
Step 4.1, salient target prediction: for the multi-scale feature set F = {f_1, f_2, f_3, f_4, f_5}, 5 independent prediction network heads are used to identify the targets contained in each image feature.
Taking the i-th prediction network head as an example, its input is f_i. f_i is processed by a convolutional network to extract the spatial information of each salient target. Specifically, local features are extracted from f_i with a convolution layer of kernel size 3, then processed by GroupNorm (group normalization), and finally passed through the ReLU activation function, as follows:
f_i' = ReLU(GroupNorm(Conv(f_i)))  (Equation 7)
The above procedure is repeated 4 times.
An implicit knowledge learning mechanism is introduced to improve the prediction head, so that the detection of different targets is improved through implicit parameters. Specifically, each prediction network head sets learnable parameters A_i and M_i, which are respectively added to and multiplied with f_i', as follows:
f_i'' = (f_i' + A_i)·M_i  (Equation 8)
For the spatial feature f_i'', two convolutional network branches are adopted to obtain the predicted spatial coordinates bbox_i and the confidence agn_i at this scale, as follows:
bbox_i = ReLU(Conv(f_i''))  (Equation 9)
agn_i = Conv(f_i'')  (Equation 10)
Each spatial coordinate has length 4, representing the abscissa of the upper-left corner of the predicted region, the ordinate of the upper-left corner, the region length, and the region width, respectively.
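One prediction head (Equations 7-10) can be sketched as follows; the class name, the GroupNorm group count and the parameter initialization are assumptions, while the repeated Conv-GroupNorm-ReLU stem, the implicit parameters A_i and M_i, and the parallel bbox/agn branches follow the text above.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """One of the five detection heads: 4x (Conv-GroupNorm-ReLU), implicit add/mul parameters,
    then parallel branches for box coordinates and confidence."""
    def __init__(self, dim=256, num_groups=32):
        super().__init__()
        self.stem = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GroupNorm(num_groups, dim), nn.ReLU())
            for _ in range(4)])                                   # Equation 7, repeated 4 times
        self.A = nn.Parameter(torch.zeros(1, dim, 1, 1))          # implicit additive knowledge A_i
        self.M = nn.Parameter(torch.ones(1, dim, 1, 1))           # implicit multiplicative knowledge M_i
        self.bbox_branch = nn.Conv2d(dim, 4, 3, padding=1)        # x, y of the top-left corner, length, width
        self.agn_branch = nn.Conv2d(dim, 1, 3, padding=1)         # confidence agn_i

    def forward(self, f):
        f = self.stem(f)                                          # f_i'
        f = (f + self.A) * self.M                                 # Equation 8 -> f_i''
        bbox = torch.relu(self.bbox_branch(f))                    # Equation 9
        agn = self.agn_branch(f)                                  # Equation 10
        return bbox, agn

if __name__ == "__main__":
    bbox, agn = PredictionHead()(torch.randn(1, 256, 64, 64))
    print(bbox.shape, agn.shape)  # torch.Size([1, 4, 64, 64]) torch.Size([1, 1, 64, 64])
```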
Step 4.2, target detector training and loss function: for the salient region prediction result BBOX = {bbox_1, bbox_2, bbox_3, bbox_4, bbox_5} of the target detector, the target closest to each prediction result is found in the training data set, and this target set is defined as Target; the metric CIOU is used to measure the difference between a prediction result and the actual target, and is proportional to the performance of the target detector; with the prediction region defined as g and the actual target region as t, CIOU is calculated as follows:
CIOU = IOU - ρ²(g,t)/c² - βv  (Equation 11)
IOU = |g∩t| / |g∪t|  (Equation 12)
v = 4/π²·(arctan(w_t/m_t) - arctan(w/m))²  (Equation 13)
β = v / (1 - IOU + v)  (Equation 14)
where ρ denotes the Euclidean distance between the center points of the prediction region g and the actual target region t, c denotes the diagonal distance of the minimum enclosing region of g and t, w_t and m_t denote the width and height of the actual target region t, and w and m denote the width and height of the prediction region g;
The loss function L_dec for training the target detector is calculated as follows:
L_dec = 1 - CIOU  (Equation 15)
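The CIoU metric and detector loss of Equations 11-15 can be sketched as below; representing the boxes in (x1, y1, x2, y2) corner form, rather than the corner-plus-size form used above, is an assumption made for brevity.

```python
import math
import torch

def ciou_loss(g, t, eps=1e-7):
    """CIoU-based detector loss L_dec = 1 - CIOU (Equations 11-15).
    g, t: (N, 4) predicted and target boxes as (x1, y1, x2, y2)."""
    # Intersection over union (Equation 12)
    x1 = torch.max(g[:, 0], t[:, 0]); y1 = torch.max(g[:, 1], t[:, 1])
    x2 = torch.min(g[:, 2], t[:, 2]); y2 = torch.min(g[:, 3], t[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_g = (g[:, 2] - g[:, 0]) * (g[:, 3] - g[:, 1])
    area_t = (t[:, 2] - t[:, 0]) * (t[:, 3] - t[:, 1])
    iou = inter / (area_g + area_t - inter + eps)

    # Squared center distance rho^2 and squared diagonal c^2 of the minimum enclosing region
    rho2 = ((g[:, 0] + g[:, 2] - t[:, 0] - t[:, 2]) ** 2 +
            (g[:, 1] + g[:, 3] - t[:, 1] - t[:, 3]) ** 2) / 4
    cw = torch.max(g[:, 2], t[:, 2]) - torch.min(g[:, 0], t[:, 0])
    ch = torch.max(g[:, 3], t[:, 3]) - torch.min(g[:, 1], t[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term v and trade-off beta (Equations 13-14)
    w_g, h_g = g[:, 2] - g[:, 0], g[:, 3] - g[:, 1]
    w_t, h_t = t[:, 2] - t[:, 0], t[:, 3] - t[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_g / (h_g + eps))) ** 2
    beta = v / (1 - iou + v + eps)

    ciou = iou - rho2 / c2 - beta * v                             # Equation 11
    return (1 - ciou).mean()                                      # Equation 15

if __name__ == "__main__":
    pred = torch.tensor([[10., 10., 50., 60.]])
    target = torch.tensor([[12., 8., 48., 64.]])
    print(ciou_loss(pred, target))
```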
Step 4.3, regional feature extraction: according to the salient region prediction result BBOX of the target detector, the features of the corresponding regions are cropped from the multi-scale feature set F. All the cropped regional features are recombined into a 4-dimensional tensor R with 256 channels. The tensor R is mapped to 768 dimensions by a fully connected layer.
Step 5, generating image dense description:
step 5.1 text feature T According to the Target set determined in the step 4.2 being Target, collecting natural language description corresponding to the region from the training data set, and defining the natural language description set as TargetText; the word embedding layer using the pre-trained BERT model converts TargetText into word vector features, defined as T. The number of word vector feature T-lanes is 768. For word vector feature T, word vector position encoding PE is calculated n And superimposed on the word vector feature T to finally obtain the text feature T The formula is as follows:
PE n ={PE (pos,2i) =sin(pos/1000 (2i/n) ), PE (pos,2i+1) =cos(pos/1000 (2i/n) ) ' formula 16)
T =T+ PE n (equation 17)
Wherein pos is [1,2, … ], i is [0,1, …, n/2];
Step 5.2, description generation: the regional feature set R is mapped to a high-dimensional space with a fully connected layer and denoted as the high-dimensional regional feature R'; the global characterization V_f, the high-dimensional regional feature R' and the text feature T' are concatenated to obtain the multi-modal feature H, as follows:
H = Concat(V_f, R', T')  (Equation 18)
The structure of the description generator is shown in FIG. 2. It takes the multi-modal feature H as input and fuses the multi-modal information with a pre-trained BERT model. The BERT model is formed by stacking 6 Transformer layers; for the i-th Transformer layer, the input feature H_i of the network layer is passed through three fully connected layers to compute the query vector Hq_i, the key vector Hk_i and the value vector Hv_i, which are evenly divided into nhead parts along the last dimension, as follows:
Hq_i = Div(Linear(H_i), nhead)  (Equation 19)
Hk_i = Div(Linear(H_i), nhead)  (Equation 20)
Hv_i = Div(Linear(H_i), nhead)  (Equation 21)
The query vector Hq_i is multiplied by the transpose Hk_i^T of the key vector Hk_i and processed with a softmax function to compute the attention matrix HAttn_i, as follows:
HAttn_i = softmax(Hq_i·Hk_i^T)  (Equation 22)
With the size of the last dimension of the value vector Hv_i set to hd, HAttn_i is multiplied with Hv_i as follows:
HA_{i+1} = HAttn_i/hd^{1/2}·Hv_i  (Equation 23)
HA_{i+1} then passes through the subsequent feed-forward network module FFN_i so that a better multi-modal representation H_{i+1} is learned, as follows:
H_{i+1} = FFN_i(HA_{i+1}) = Linear(ReLU(Linear(HA_{i+1})))  (Equation 24)
Through the multi-layer network computation, the multi-modal characterization H' is finally obtained.
The scale E_voc of the vocabulary built into the model is set to 30522; a fully connected layer maps H' to the E_voc-dimensional space and the result is processed with a softmax function; the output is defined as pro_l, as follows:
pro_l = softmax(Linear(H'))  (Equation 25)
where l is the maximum length of the generated region description; pro_l^i is defined as the predicted probability of each word at the i-th position of the generated region description, and the word with the maximum probability is taken as the candidate word w_i for that position; finally, the region description W = {w_1, w_2, …, w_l} is generated;
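A minimal sketch of the vocabulary projection and greedy word selection of Equation 25 follows; the name vocab_head and the explicit per-position argmax are illustrative assumptions (in the full method the candidate words are produced autoregressively).

```python
import torch
import torch.nn as nn

E_VOC, HIDDEN, MAX_LEN = 30522, 768, 20
vocab_head = nn.Linear(HIDDEN, E_VOC)        # maps H' to the E_voc-dimensional vocabulary space

def greedy_decode(h_prime):
    """h_prime: (l, HIDDEN) fused BERT output for the text positions.
    Returns the indices of the candidate words w_1..w_l (Equation 25 + argmax)."""
    pro = torch.softmax(vocab_head(h_prime), dim=-1)   # predicted probability of each word per position
    return pro.argmax(dim=-1)                          # word with the maximum probability

if __name__ == "__main__":
    h_prime = torch.randn(MAX_LEN, HIDDEN)
    words = greedy_decode(h_prime)
    print(words.shape)                                 # torch.Size([20])
```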
Step 5.3, description generator training and loss function: to prevent the description generator from over-fitting during training, regularization is applied with the label smoothing technique. First, the natural language description set TargetText is converted into one-hot codes of length E_voc for calculating the loss function; label smoothing is performed on the one-hot codes; with a one-hot code defined as h, the label smoothing result h' is obtained as follows:
h' = (1.0 - eps)·h + eps/E_voc  (Equation 26)
where eps is a small constant chosen in this technical scheme and is set to 10^-12;
After this, the loss L_ce of the dense description is calculated with the cross-entropy function; L_ce is calculated as follows:
L_ce = -Σ_{i=1}^{N} log(p(y_i* | y_{1:i-1}*))  (Equation 27)
where y_{1:N}* is a region description from TargetText of length N, p is the probability predicted by the description generator, and y_i* denotes the character at position i of the region description.
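The label smoothing and cross-entropy loss of Equations 26-27 can be sketched as one function; combining them and averaging over positions are presentation assumptions, while eps = 1e-12 and the vocabulary size 30522 follow the embodiment.

```python
import torch
import torch.nn.functional as F

def smoothed_caption_loss(logits, target_ids, e_voc=30522, eps=1e-12):
    """Label-smoothed cross entropy for the description generator (Equations 26-27).
    logits: (N, e_voc) predictions for the N target positions; target_ids: (N,) gold word indices."""
    one_hot = F.one_hot(target_ids, num_classes=e_voc).float()
    smoothed = (1.0 - eps) * one_hot + eps / e_voc          # Equation 26
    log_p = F.log_softmax(logits, dim=-1)                   # log p(y_i* | y_1:i-1*)
    return -(smoothed * log_p).sum(dim=-1).mean()           # Equation 27, averaged over positions

if __name__ == "__main__":
    logits = torch.randn(12, 30522)
    targets = torch.randint(0, 30522, (12,))
    print(smoothed_caption_loss(logits, targets))
```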
In the prediction stage, the image dense description method generates dense descriptions for one image at a time. The working process of the model for extracting regional features is the same as in the training stage. In this stage, there is no need to calculate the loss value L_dec. Furthermore, the input to the description generator consists only of the global characterization V_f and the high-dimensional regional feature R', from which the corresponding dense description results are generated.
The dense description result generated in the prediction stage contains the position information of the predicted salient regions and the corresponding region descriptions. The prediction results with high confidence are screened out and drawn on the input image.
The foregoing description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any simple modification, equivalent change or variation made to the above embodiment according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (4)

1. An image dense description method based on window self-attention and a multi-scale mechanism, comprising the following steps:
Step 1, coarse processing of the input image X: an image X of size 1024×1024 is input, X is divided into a number of image blocks of size k, and coarse processing is performed with a convolution kernel of size k to obtain the coarse image feature X';
Step 2, calculation of the global image characterization V_f: the coarse image feature X' is input into a pre-trained ViT model serving as the feature encoder of the image to obtain the global characterization V_f of the image; the ViT model is formed by stacking multiple Transformer modules; in each Transformer module the image representation is divided into windows of size α and attention is computed only between the pixels within each window; the global characterization V_f is finally obtained through the multi-layer network computation;
Step 3, multi-scale feature acquisition: the global characterization V_f obtained in the previous step is passed through 5 different convolutional neural network branches to obtain the multi-scale feature set F = {f_1, f_2, f_3, f_4, f_5}, so as to adapt to target detection at different sizes;
Step 4, salient target prediction and regional feature extraction:
Step 4.1, salient target prediction: for the multi-scale feature set F = {f_1, f_2, f_3, f_4, f_5}, 5 independent prediction network heads are used to identify the targets contained in each image feature;
The input of the i-th prediction network head is f_i; local features are extracted from f_i with a convolution layer of kernel size 3, then processed by GroupNorm (group normalization), and finally passed through the ReLU activation function, as follows:
f_i' = ReLU(GroupNorm(Conv(f_i)))
The above procedure is repeated 4 times;
Each prediction network head sets learnable parameters A_i and M_i, which are respectively added to and multiplied with f_i', as follows:
f_i'' = (f_i' + A_i)·M_i
For the spatial feature f_i'', two convolutional network branches are adopted to obtain the predicted spatial coordinates bbox_i and the confidence agn_i at this scale, as follows:
bbox_i = ReLU(Conv(f_i''))
agn_i = Conv(f_i'')
Step 4.2, target detector training and loss function: for the salient region prediction result BBOX = {bbox_1, bbox_2, bbox_3, bbox_4, bbox_5} of the target detector, the target closest to each prediction result is found in the training data set, and this target set is defined as Target; the metric CIOU is used to measure the difference between a prediction result and the actual target, and is proportional to the performance of the target detector; with the prediction region defined as g and the actual target region as t, CIOU is calculated as follows:
CIOU = IOU - ρ²(g,t)/c² - βv
IOU = |g∩t| / |g∪t|
v = 4/π²·(arctan(w_t/m_t) - arctan(w/m))²
β = v / (1 - IOU + v)
where ρ denotes the Euclidean distance between the center points of the prediction region g and the actual target region t, c denotes the diagonal distance of the minimum enclosing region of g and t, w_t and m_t denote the width and height of the actual target region t, and w and m denote the width and height of the prediction region g;
The loss function L_dec for training the target detector is calculated as follows:
L_dec = 1 - CIOU
Step 4.3, regional feature extraction: according to the salient region prediction result BBOX of the target detector, the features of the corresponding regions are cropped from the multi-scale feature set F; the regional feature set is denoted R;
Step 5, image dense description generation:
Step 5.1, acquisition of the text feature T': according to the target set Target determined in step 4.2, the natural language descriptions corresponding to the regions are collected from the training data set, and this set of natural language descriptions is defined as TargetText; the word embedding layer of a pre-trained BERT model converts TargetText into word vector features, defined as T; for the n-dimensional word vector feature T, the word vector position encoding PE_n is calculated and superimposed on T to obtain the text feature T', as follows:
PE_n = {PE_(pos,2i) = sin(pos/1000^(2i/n)), PE_(pos,2i+1) = cos(pos/1000^(2i/n))}
T' = T + PE_n
where pos ∈ [1, 2, …] and i ∈ [0, 1, …, n/2];
Step 5.2, description generation: the regional feature set R is mapped to a high-dimensional space with a fully connected layer and denoted as the high-dimensional regional feature R'; the global characterization V_f, the high-dimensional regional feature R' and the text feature T' are concatenated to obtain the multi-modal feature H, as follows:
H = Concat(V_f, R', T')
The description generator takes the multi-modal feature H as input and fuses the multi-modal information with a pre-trained BERT model; the BERT model is formed by stacking multiple Transformer network layers, each of which performs self-attention computation on the input multi-modal feature H; the calculation result of the BERT model is denoted H';
With the scale of the vocabulary built into the model defined as E_voc, a fully connected layer maps H' to the E_voc-dimensional space and the result is processed with a softmax function; the output is defined as pro_l, as follows:
pro_l = softmax(Linear(H'))
where l is the maximum length of the generated region description; pro_l^i is defined as the predicted probability of each word at the i-th position of the generated region description, and the word with the maximum probability is taken as the candidate word w_i for that position; finally, the region description W = {w_1, w_2, …, w_l} is generated;
Step 5.3, description generator training and loss function: the natural language description set TargetText is converted into one-hot codes of length E_voc for calculating the loss function; label smoothing is performed on the one-hot codes; with a one-hot code defined as h, the label smoothing result h' is obtained as follows:
h' = (1.0 - eps)·h + eps/E_voc
where eps is a small constant chosen in this technical scheme and is set to 10^-12;
After this, the loss L_ce of the dense description is calculated with the cross-entropy function; L_ce is calculated as follows:
L_ce = -Σ_{i=1}^{N} log(p(y_i* | y_{1:i-1}*))
where y_{1:N}* is a region description from TargetText of length N, p is the probability predicted by the description generator, and y_i* denotes the character at position i of the region description.
2. The image dense description method based on window self-attention and a multi-scale mechanism according to claim 1, wherein the global characterization V_f in step 2 is obtained as follows: in each Transformer module, taking a layer as the i-th Transformer layer with window size α, the input feature V_i of the network layer is first padded at the edges so that its size is an integer multiple of the window size, and the padded feature is divided into several window feature sets of equal size, denoted V_i'; then the window feature set V_i' is passed through three fully connected layers to compute the query vector q_i, the key vector k_i and the value vector v_i, which are evenly divided into nhead parts along the last dimension, as follows:
q_i = Div(Linear(V_i'), nhead)
k_i = Div(Linear(V_i'), nhead)
v_i = Div(Linear(V_i'), nhead)
The query vector q_i is multiplied by the transpose k_i^T of the key vector k_i and processed with a softmax function to compute the attention matrix Attn_i between the pixels within the window, as follows:
Attn_i = softmax(q_i·k_i^T)
With the size of the last dimension of the value vector v_i set to d, Attn_i is multiplied with v_i as follows:
A_{i+1} = Attn_i/d^{1/2}·v_i
The calculation result A_{i+1} is restored, according to the position of each window, to the same shape as the input feature V_i and denoted A_{i+1}'; A_{i+1}' then passes through the subsequent feed-forward network module FFN_i so that a better image representation V_{i+1} is learned, as follows:
V_{i+1} = FFN_i(A_{i+1}') = Linear(ReLU(Linear(A_{i+1}')))
Through the multi-layer network computation, the global characterization V_f is finally obtained.
3. The image dense description method based on window self-attention and a multi-scale mechanism according to claim 1, wherein in step 3 the multi-scale feature set is obtained through 5 different convolutional neural network branches as follows:
Step 3.1, feature acquisition at 1/8 scale: the global characterization V_f is up-sampled with a deconvolution layer of kernel size 2, mapped with a convolution layer of kernel size 1, and the image feature f_1 at 1/8 scale is then extracted with a convolution layer of kernel size 3;
Step 3.2, feature acquisition at 1/16 scale: the global characterization V_f is mapped with a convolution layer of kernel size 1, and the image feature f_2 at 1/16 scale is extracted with a convolution layer of kernel size 3;
Step 3.3, feature acquisition at 1/32 scale: the global characterization V_f is down-sampled by max pooling, mapped with a convolution layer of kernel size 1, and the image feature f_3 at 1/32 scale is extracted with a convolution layer of kernel size 3;
Step 3.4, feature acquisition at 1/64 scale: f_3 is down-sampled with a convolution layer of kernel size 2 and stride 2 to obtain the image feature f_4 at 1/64 scale;
Step 3.5, feature acquisition at 1/128 scale: f_4 is processed with the ReLU activation function and then down-sampled with a convolution layer of kernel size 2 and stride 2 to obtain the image feature f_5 at 1/128 scale.
4. The image dense description method based on window self-attention and a multi-scale mechanism according to claim 1, wherein the self-attention computation for the input multi-modal feature H in step 5.2 is as follows: for the i-th Transformer layer of the BERT model, the input feature H_i of the network layer is passed through three fully connected layers to compute the query vector Hq_i, the key vector Hk_i and the value vector Hv_i, which are evenly divided into nhead parts along the last dimension, as follows:
Hq_i = Div(Linear(H_i), nhead)
Hk_i = Div(Linear(H_i), nhead)
Hv_i = Div(Linear(H_i), nhead)
The query vector Hq_i is multiplied by the transpose Hk_i^T of the key vector Hk_i and processed with a softmax function to compute the attention matrix HAttn_i, as follows:
HAttn_i = softmax(Hq_i·Hk_i^T)
With the size of the last dimension of the value vector Hv_i set to hd, HAttn_i is multiplied with Hv_i as follows:
HA_{i+1} = HAttn_i/hd^{1/2}·Hv_i
HA_{i+1} then passes through the subsequent feed-forward network module FFN_i so that a better multi-modal representation H_{i+1} is learned, as follows:
H_{i+1} = FFN_i(HA_{i+1}) = Linear(ReLU(Linear(HA_{i+1})))
Through the multi-layer network computation, the multi-modal characterization H' is finally obtained.
CN202310822911.1A 2023-07-06 2023-07-06 Image dense description method based on window self-attention and multi-scale mechanism Active CN116543146B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310822911.1A CN116543146B (en) 2023-07-06 2023-07-06 Image dense description method based on window self-attention and multi-scale mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310822911.1A CN116543146B (en) 2023-07-06 2023-07-06 Image dense description method based on window self-attention and multi-scale mechanism

Publications (2)

Publication Number Publication Date
CN116543146A true CN116543146A (en) 2023-08-04
CN116543146B CN116543146B (en) 2023-09-26

Family

ID=87451029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310822911.1A Active CN116543146B (en) 2023-07-06 2023-07-06 Image dense description method based on window self-attention and multi-scale mechanism

Country Status (1)

Country Link
CN (1) CN116543146B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118212537A (en) * 2024-05-21 2024-06-18 Guizhou University Crop counting method based on quantity supervision

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701495A (en) * 2016-01-05 2016-06-22 贵州大学 Image texture feature extraction method
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
US10198671B1 (en) * 2016-11-10 2019-02-05 Snap Inc. Dense captioning with joint interference and visual context
CN109543699A (en) * 2018-11-28 2019-03-29 北方工业大学 Image abstract generation method based on target detection
CN109670576A (en) * 2018-11-29 2019-04-23 中山大学 A kind of multiple scale vision concern Image Description Methods
WO2020108165A1 (en) * 2018-11-30 2020-06-04 腾讯科技(深圳)有限公司 Image description information generation method and device, and electronic device
CN111814844A (en) * 2020-03-17 2020-10-23 同济大学 Intensive video description method based on position coding fusion
WO2021139069A1 (en) * 2020-01-09 2021-07-15 南京信息工程大学 General target detection method for adaptive attention guidance mechanism
CN113158735A (en) * 2021-01-20 2021-07-23 北京工业大学 Dense event description method based on graph neural network
CN113674334A (en) * 2021-07-06 2021-11-19 复旦大学 Texture recognition method based on depth self-attention network and local feature coding
CN113946706A (en) * 2021-05-20 2022-01-18 广西师范大学 Image description generation method based on reference preposition description
CN114758203A (en) * 2022-03-31 2022-07-15 长江三峡技术经济发展有限公司 Residual dense visual transformation method and system for hyperspectral image classification
CN115311465A (en) * 2022-08-10 2022-11-08 北京印刷学院 Image description method based on double attention models
CN115775316A (en) * 2022-11-23 2023-03-10 长春理工大学 Image semantic segmentation method based on multi-scale attention mechanism
CN116129124A (en) * 2023-03-16 2023-05-16 泰州市人民医院 Image segmentation method, system and equipment


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIAYU JIAO et al., "DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition", arXiv, pp. 1-15 *
PHOENIXTREE_DONGZHAO, "Paper quick read: FAIR's latest ViT model improves multi-scale ViT - Improved Multiscale Vision Transformers", pp. 1-10, retrieved from the Internet <URL:https://blog.csdn.net/u014546828/article/details/122077941> *
LIU Qingru et al., "Research on dense image description generation based on multiple attention structures", Acta Automatica Sinica (《自动化学报》), vol. 48, no. 10, pp. 2537-2548 *
ZHOU Yuhui, "Research on image description of public environments", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》), no. 2, pp. 138-1625 *


Also Published As

Publication number Publication date
CN116543146B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN113609896B (en) Object-level remote sensing change detection method and system based on dual-related attention
CN114241422B (en) Student classroom behavior detection method based on ESRGAN and improved YOLOv s
CN109754007A (en) Peplos intelligent measurement and method for early warning and system in operation on prostate
Wang et al. Advanced Multimodal Deep Learning Architecture for Image-Text Matching
CN113177549B (en) Few-sample target detection method and system based on dynamic prototype feature fusion
CN116311483B (en) Micro-expression recognition method based on local facial area reconstruction and memory contrast learning
CN117611576A (en) Image-text fusion-based contrast learning prediction method
CN116543146B (en) Image dense description method based on window self-attention and multi-scale mechanism
CN115965818A (en) Small sample image classification method based on similarity feature fusion
CN117909922A (en) Depth feature fusion and optimization method and system for multi-mode data
Wang et al. Self-supervised learning for high-resolution remote sensing images change detection with variational information bottleneck
CN114187506B (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
Zhao et al. Attention-based Multi-scale Feature Fusion for Efficient Surface Defect Detection
CN113343966B (en) Infrared and visible light image text description generation method
CN117829243A (en) Model training method, target detection device, electronic equipment and medium
Rezaei et al. Systematic review of image segmentation using complex networks
Zhang et al. MFFSSD: an enhanced SSD for underwater object detection
Sharma et al. Generating point cloud augmentations via class-conditioned diffusion model
Ong et al. Enhanced symbol recognition based on advanced data augmentation for engineering diagrams
CN116824333A (en) Nasopharyngeal carcinoma detecting system based on deep learning model
Zhao et al. Object-Preserving Siamese Network for Single-Object Tracking on Point Clouds
CN113469962A (en) Feature extraction and image-text fusion method and system for cancer lesion detection
CN118365886B (en) Semi-supervised image instance segmentation method, system and medium
Cheng et al. Bidirectional Trained Tree-Structured Decoder for Handwritten Mathematical Expression Recognition
CN116503674B (en) Small sample image classification method, device and medium based on semantic guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant