CN116543146B - Image dense description method based on window self-attention and multi-scale mechanism - Google Patents
Image dense description method based on window self-attention and multi-scale mechanism
- Publication number
- CN116543146B (CN202310822911.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/34—Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
- G06V10/811—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses an image dense description method based on window self-attention and a multi-scale mechanism. The method combines a target detector and a region description generator. Inside the target detector, a window-attention-based feature encoder learns and extracts the characterization of the input image. The feature encoder is formed by stacking 12 ViT layers; in each layer, the image feature map is divided into several windows of equal size and attention is computed within each window. From the output of the feature encoder, 5 image features at different scales are computed, and a target detection head predicts the position information of the key regions. According to the predicted positions, the model crops region features from the multi-scale features. The region description generator uses a pre-trained BERT model as its core and generates region descriptions autoregressively from the input global characterization and the region features. The invention can accurately capture multiple key objects in an image and generate high-quality descriptions.
Description
Technical Field
The invention relates to the field of computer vision and natural language processing, in particular to an image dense description method based on a window self-attention network and multi-scale features.
Background
Image dense description (dense image captioning) is a higher-level task than open-world object detection: it requires a model to detect salient regions in the input image and to describe the content of each region with a short sentence. It is an artificial intelligence method that combines computer vision with natural language processing.
Compared with current conventional target detection methods, image dense description methods have stronger image recognition capability and a wider range of recognizable objects, and they can recognize object categories outside the training set. Because they describe the recognized objects in human language, they are closer to the way humans perceive the world, and they are an important technology for building strong artificial intelligence in the future.
In contrast to conventional image description methods, image dense description does not understand and summarize the global content of the image. Instead, it locates multiple regions of interest (ROIs) in the image and generates a description for each of them. This way of working retains the key information of the image more effectively and conveys the content of interest to the user.
Image dense description techniques can be applied to several tasks. For image retrieval, generating natural language descriptions for the regions of an image makes it possible to search for images that contain particular visual concepts or scenes. For image understanding and analysis, they assist in analyzing and understanding complex images that contain multiple objects, actions, and interactions. For image editing and modification, they help the user edit and manipulate images through natural language commands based on the dense descriptions.
At present, image dense description methods mainly use a convolutional neural network to extract features from the input image and a recurrent neural network to generate descriptions for the region features. Although easy to implement, this approach has several problems:
1. Convolutional neural networks are limited in their ability to extract and understand image content: they are strong at local feature extraction, but have difficulty capturing the global characterization of the image. In addition, when the convolutional architecture becomes too complex, the training difficulty of the model increases markedly and its performance is hard to improve. As a result, dense description models based on convolutional neural networks cannot process complex input images;
2. Because a recurrent neural network cannot generate region descriptions in parallel, and its computation and time costs are high, the image dense description process is slow and inefficient. Furthermore, recurrent neural networks are inherently weak at handling long sequences, which lowers the quality of the generated descriptions.
Chinese patent application publication No. CN114037831A, published on February 11, 2022, discloses an image depth dense description method, system and storage medium. It uses a basic convolutional neural network for feature extraction, which gives low performance and low efficiency. It applies an RPN network directly to the image feature map to extract regions of interest, so regions of interest of different sizes are not discovered comprehensively enough. It uses an LSTM network to generate a description for each region of interest, so the speed is mediocre and the description quality is low.
Disclosure of Invention
The invention aims to overcome the above defects and to provide an image dense description method based on a window self-attention network and multi-scale features, which can accurately capture multiple key objects in an image and generate high-quality descriptions.
The invention discloses an image dense description method based on a window self-attention network and multi-scale features, which comprises the following steps:
Step 1, coarse processing of the input image X: input an image X of size 1024 × 1024, divide X into image blocks of size k, and perform coarse processing with a convolution kernel of size k to obtain the coarse image feature X';
Step 2, calculation of the global characterization $V_f$ of the image: input the coarse image feature X' and use a pre-trained ViT model as the image feature encoder to obtain the global characterization $V_f$. The ViT model is formed by stacking multiple Transformer layers. In each Transformer layer, the image representation is divided into windows of size α and attention is computed only between pixels within the same window. The global characterization $V_f$ is finally obtained through the multi-layer network computation;
Step 3, multi-scale feature acquisition: take the global characterization $V_f$ obtained in the previous step and pass it through 5 different convolutional neural network branches to obtain the multi-scale feature set $F = \{f_1, f_2, f_3, f_4, f_5\}$, so as to adapt to the detection of targets of different sizes;
Step 4, salient target prediction and regional feature extraction:
Step 4.1, salient target prediction: for the multi-scale feature set $F = \{f_1, f_2, f_3, f_4, f_5\}$, 5 independent prediction network heads are used to identify the targets contained in the image features;
The input of the i-th prediction network head is $f_i$. Local features are extracted from $f_i$ using a convolution layer with a convolution kernel size of 3, then processed by GroupNorm (group normalization), and finally passed through the ReLU activation function, as follows:
$f_i' = \mathrm{ReLU}(\mathrm{GroupNorm}(\mathrm{Conv}(f_i)))$ (Equation 7)
The above process is repeated 4 times;
Each prediction network head sets learnable parameters $A_i$ and $M_i$, which are respectively added to and multiplied with $f_i'$, as follows:
$f_i'' = (f_i' + A_i) \cdot M_i$ (Equation 8)
For the spatial feature $f_i''$, two convolutional network branches are used to obtain the predicted spatial coordinates $bbox_i$ and the confidence $agn_i$ at this scale, as follows:
$bbox_i = \mathrm{ReLU}(\mathrm{Conv}(f_i''))$ (Equation 9)
$agn_i = \mathrm{Conv}(f_i'')$ (Equation 10)
Step 4.2, target detector training and loss function: for the salient region prediction results of the target detector, $BBOX = \{bbox_1, bbox_2, bbox_3, bbox_4, bbox_5\}$, find in the training data set the target closest to each prediction result, and define the set of these targets as Target. The difference between a prediction result and the actual target is measured with the CIOU metric; a higher CIOU indicates better target detector performance. Let the predicted region be g and the actual target region be t. The CIOU is calculated as follows:
$\mathrm{CIOU} = \mathrm{IOU} - \rho^2(g, t) / c^2 - \beta v$ (Equation 11)
$\mathrm{IOU} = |g \cap t| / |g \cup t|$ (Equation 12)
$v = \frac{4}{\pi^2} \left( \arctan(w_t / m_t) - \arctan(w / m) \right)^2$ (Equation 13)
$\beta = v / (1 - \mathrm{IOU} + v)$ (Equation 14)
where ρ denotes the Euclidean distance between the center points of the predicted region g and the actual target region t, c denotes the diagonal length of the smallest region enclosing both g and t, $w_t$ and $m_t$ denote the width and height of the actual target region t, and w and m denote the width and height of the predicted region g;
The loss function $L_{dec}$ for training the target detector is calculated as follows:
$L_{dec} = 1 - \mathrm{CIOU}$ (Equation 15)
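The following Python sketch illustrates Equations 11 to 15 for a single pair of boxes. It is a minimal illustration, not the patented implementation: the corner format `(x1, y1, x2, y2)`, the function names `ciou` and `detector_loss`, and the guard that sets β to 0 when its denominator is 0 (identical boxes) are assumptions introduced here. Boxes are assumed to have positive width and height.

```python
import math

def ciou(g, t):
    """CIOU between predicted box g and target box t, each (x1, y1, x2, y2)."""
    gx1, gy1, gx2, gy2 = g
    tx1, ty1, tx2, ty2 = t
    w, m = gx2 - gx1, gy2 - gy1        # width and height of g
    w_t, m_t = tx2 - tx1, ty2 - ty1    # width and height of t

    # Equation 12: intersection over union.
    inter_w = max(0.0, min(gx2, tx2) - max(gx1, tx1))
    inter_h = max(0.0, min(gy2, ty2) - max(gy1, ty1))
    inter = inter_w * inter_h
    union = w * m + w_t * m_t - inter
    iou = inter / union

    # rho^2: squared Euclidean distance between the two box centers.
    rho2 = ((gx1 + gx2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((gy1 + gy2) / 2 - (ty1 + ty2) / 2) ** 2
    # c^2: squared diagonal of the smallest box enclosing g and t.
    c2 = (max(gx2, tx2) - min(gx1, tx1)) ** 2 \
       + (max(gy2, ty2) - min(gy1, ty1)) ** 2

    # Equation 13: aspect-ratio consistency term.
    v = 4 / math.pi ** 2 * (math.atan(w_t / m_t) - math.atan(w / m)) ** 2
    # Equation 14, with an assumed guard for the 0/0 case of identical boxes.
    denom = 1 - iou + v
    beta = v / denom if denom > 0 else 0.0

    # Equation 11.
    return iou - rho2 / c2 - beta * v

def detector_loss(g, t):
    """Equation 15: L_dec = 1 - CIOU."""
    return 1 - ciou(g, t)
```

For two identical boxes the CIOU is 1 and the loss is 0; the loss grows as the boxes overlap less, drift apart, or differ in aspect ratio.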
Step 4.3, regional feature extraction: according to the salient region prediction result BBOX of the target detector, crop the features of the corresponding regions from the multi-scale feature set F; the resulting region feature set is denoted R;
Step 5, generation of the image dense description:
Step 5.1, text feature extraction: according to the Target set determined in step 4.2, collect the natural language descriptions corresponding to the regions from the training data set, and define this set of descriptions as TargetText. Convert TargetText into word vector features, denoted T, using the word embedding layer of a pre-trained BERT model. For the n-dimensional word vector feature T, calculate the word vector position encoding $PE_n$ and add it to T to obtain the text feature T', as follows:
$PE_n = \{ PE_{(pos, 2i)} = \sin(pos / 1000^{2i/n}),\ PE_{(pos, 2i+1)} = \cos(pos / 1000^{2i/n}) \}$ (Equation 16)
$T' = T + PE_n$ (Equation 17)
where $pos \in [1, 2, \ldots]$ and $i \in [0, 1, \ldots, n/2]$;
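As an illustration of Equations 16 and 17, the sketch below computes the position encoding for one position and adds it to a list of word vectors. The function names and the default argument are hypothetical. The base 1000 follows Equation 16 as written in this text (the widely used Transformer formulation uses 10000). The loop runs over i = 0 … n/2 − 1 so that exactly n values are produced, which is an interpretation of the stated range, and n is assumed to be even.

```python
import math

def position_encoding(pos, n, base=1000.0):
    """Equation 16: n-dimensional sinusoidal encoding for position pos."""
    if n % 2 != 0:
        raise ValueError("n is assumed to be even in this sketch")
    pe = []
    for i in range(n // 2):
        angle = pos / base ** (2 * i / n)
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

def add_position(T):
    """Equation 17: T' = T + PE_n, with positions counted from 1."""
    n = len(T[0])
    return [
        [x + p for x, p in zip(vec, position_encoding(pos, n))]
        for pos, vec in enumerate(T, start=1)
    ]
```

Because the encoding depends only on the position and the dimension n, it can be precomputed once for the maximum description length.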
Step 5.2, description generation: map the region feature set R into a high-dimensional space with a fully connected layer, and denote the result as the high-dimensional region feature R'. Concatenate the global characterization $V_f$, the high-dimensional region feature R', and the text feature T' to obtain the multi-modal feature H, as follows:
$H = \mathrm{Concat}(V_f, R', T')$ (Equation 18)
The description generator takes the multi-modal feature H as input and uses a pre-trained BERT model to fuse the multi-modal information. The BERT model is formed by stacking multiple Transformer layers, each of which performs self-attention on its input multi-modal feature. The output of the BERT model is denoted H';
Let the size of the vocabulary built into the model be $E_{voc}$. Map H' to an $E_{voc}$-dimensional space with a fully connected layer and process it with the softmax function; the output is denoted $pro_l$, as follows:
$pro_l = \mathrm{softmax}(\mathrm{Linear}(H'))$ (Equation 25)
where l is the maximum length of the generated region description. $pro_l^i$ is the predicted probability of each word at the i-th position of the generated region description, and the word with the maximum probability is taken as the candidate word $w_i$ for that position. Finally, the region description $W = \{w_1, w_2, \ldots, w_l\}$ is generated;
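The word selection described above can be sketched as a softmax followed by an argmax at each position. The function names and the toy vocabulary are hypothetical, and the per-position scores (the output of the Linear layer in Equation 25) are taken as given; in autoregressive generation each position's scores would depend on the words already generated.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_words(logits_per_position, vocab):
    """Pick, at each position, the vocabulary word with the highest probability."""
    words = []
    for logits in logits_per_position:
        probs = softmax(logits)  # probabilities over the E_voc vocabulary entries
        best = max(range(len(probs)), key=probs.__getitem__)
        words.append(vocab[best])
    return words
```

For example, with `vocab = ["a", "red", "car"]`, the scores `[[0.1, 2.0, 0.3], [0.0, 0.5, 3.0]]` select "red" and then "car".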
Step 5.3, description generator training and loss function: take the natural language description set TargetText and convert it into one-hot encodings of length $E_{voc}$ for calculating the loss function. Label smoothing is applied to the one-hot encodings: for a one-hot encoding h, the smoothed result h' is given by:
$h' = (1.0 - eps) \cdot h + eps / E_{voc}$ (Equation 26)
where eps is a small constant set by this technical solution;
After this, the loss $L_{ce}$ of the dense description is calculated with the cross-entropy function, as follows:
$L_{ce} = -\sum_{i=1}^{N} \log p(y_i^* \mid y_{1:i-1}^*)$ (Equation 27)
where $y_{1:N}^*$ is a region description of length N from TargetText, p is the probability predicted by the description generator, and $y_i^*$ denotes the token at position i of the region description.
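A small sketch of Equations 26 and 27 follows. The function names and the value `eps=0.1` are assumptions made for illustration, since the text only says that eps is a small constant. How the smoothed targets enter the cross-entropy is not spelled out in the text, so the loss function below implements Equation 27 exactly as written, using the probability assigned to each ground-truth token.

```python
import math

def smooth_one_hot(index, e_voc, eps=0.1):
    """Equation 26: h' = (1 - eps) * h + eps / E_voc for a one-hot vector h."""
    h = [0.0] * e_voc
    h[index] = 1.0
    return [(1.0 - eps) * x + eps / e_voc for x in h]

def description_loss(token_probs):
    """Equation 27: token_probs[i] is p(y_i* | y_1:i-1*) for the i-th target token."""
    return -sum(math.log(p) for p in token_probs)
```

With a vocabulary of 4 words and eps = 0.1, the target entry becomes 0.925 and every other entry 0.025, so the smoothed vector still sums to 1.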
In the above image dense description method based on a window self-attention network and multi-scale features, in the i-th Transformer layer in step 2, the window size is set to α. The input feature $V_i$ of the layer is first padded at the edges so that its size is an integer multiple of the window size, and the padded feature is evenly divided into window features of equal size, denoted $V_i'$. The window feature set $V_i'$ is then passed through three fully connected layers to compute the query vector $q_i$, the key vector $k_i$, and the value vector $v_i$, each of which is evenly divided into nhead parts along the last dimension, as follows:
$q_i = \mathrm{Div}(\mathrm{Linear}(V_i'), nhead)$ (Equation 1)
$k_i = \mathrm{Div}(\mathrm{Linear}(V_i'), nhead)$ (Equation 2)
$v_i = \mathrm{Div}(\mathrm{Linear}(V_i'), nhead)$ (Equation 3)
The query vector $q_i$ is multiplied by the transpose $k_i^T$ of the key vector and processed with the softmax function to compute the attention matrix $Attn_i$ between the pixels in the window, as follows:
$Attn_i = \mathrm{softmax}(q_i \cdot k_i^T)$ (Equation 4)
Let the size of the last dimension of the value vector $v_i$ be d. $Attn_i$ is then multiplied with $v_i$, as follows:
$A_{i+1} = \frac{Attn_i}{\sqrt{d}} \cdot v_i$ (Equation 5)
The result $A_{i+1}$ is restored, according to the position of each window, to the same shape as the input feature $V_i$ and denoted $A_{i+1}'$. $A_{i+1}'$ is then mapped by the subsequent feed-forward network module $FFN_i$ to learn a better image representation $V_{i+1}$, as follows:
$V_{i+1} = FFN_i(A_{i+1}') = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(A_{i+1}')))$ (Equation 6)
Through the multi-layer network computation, the global characterization $V_f$ is finally obtained.
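The attention computation of Equations 4 and 5 can be illustrated for a single window and a single head with plain Python lists. The function names are hypothetical and the Linear projections are omitted, so q, k and v are taken as given. The sketch follows the equations as written, applying the division by the square root of d after the softmax; in the widely used scaled dot-product attention the scores are instead divided before the softmax.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def window_attention(q, k, v):
    """q, k, v: one window's tokens, each a list of d-dimensional rows."""
    d = len(v[0])
    k_t = [list(col) for col in zip(*k)]                         # transpose of k
    attn = [softmax(row) for row in matmul(q, k_t)]              # Equation 4
    scaled = [[a / math.sqrt(d) for a in row] for row in attn]   # Attn / sqrt(d)
    return matmul(scaled, v)                                     # Equation 5
```

Because attention is restricted to the tokens of one window, the size of the attention matrix depends on the window size rather than on the full image.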
In the above image dense description method based on a window self-attention network and multi-scale features, the multi-scale feature acquisition in step 3 comprises the following steps:
Step 3.1, acquisition of the feature at scale 1/8: upsample the global characterization $V_f$ with a deconvolution layer with a convolution kernel size of 2, map the result with a convolution layer with a convolution kernel size of 1, and then extract the image feature $f_1$ at scale 1/8 with a convolution layer with a convolution kernel size of 3;
Step 3.2, acquisition of the feature at scale 1/16: map the global characterization $V_f$ with a convolution layer with a convolution kernel size of 1, and then extract the image feature $f_2$ at scale 1/16 with a convolution layer with a convolution kernel size of 3;
Step 3.3, acquisition of the feature at scale 1/32: downsample the global characterization $V_f$ by max pooling, map the result with a convolution layer with a convolution kernel size of 1, and then extract the image feature $f_3$ at scale 1/32 with a convolution layer with a convolution kernel size of 3;
Step 3.4, acquisition of the feature at scale 1/64: downsample $f_3$ with a convolution layer with a convolution kernel size of 2 and a stride of 2 to obtain the image feature $f_4$ at scale 1/64;
Step 3.5, acquisition of the feature at scale 1/128: process $f_4$ with the ReLU activation function, and then downsample it with a convolution layer with a convolution kernel size of 2 and a stride of 2 to obtain the image feature $f_5$ at scale 1/128.
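The spatial sizes implied by steps 3.1 to 3.5 can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function name is hypothetical, the block size 16 is taken from the embodiment described later in this document, and the stride 2 of the deconvolution and of the max pooling is inferred from the stated scales rather than given explicitly. The kernel-size-1 and kernel-size-3 convolutions are assumed to preserve the spatial size.

```python
def multi_scale_sizes(input_size=1024, patch_size=16):
    """Side length of f_1..f_5 for a square input image."""
    base = input_size // patch_size  # V_f at scale 1/16
    f1 = base * 2                    # deconvolution, assumed stride 2 -> scale 1/8
    f2 = base                        # mapping only -> scale 1/16
    f3 = base // 2                   # max pooling, assumed stride 2 -> scale 1/32
    f4 = f3 // 2                     # convolution, kernel 2, stride 2 -> scale 1/64
    f5 = f4 // 2                     # convolution, kernel 2, stride 2 -> scale 1/128
    return [f1, f2, f3, f4, f5]
```

For a 1024 × 1024 input this gives side lengths of 128, 64, 32, 16 and 8, which match the scales 1/8 to 1/128.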
In the above image dense description method based on a window self-attention network and multi-scale features, in step 5.2, the i-th Transformer layer of the BERT model passes its input feature $H_i$ through three fully connected layers to compute the query vector $Hq_i$, the key vector $Hk_i$, and the value vector $Hv_i$, each of which is evenly divided into nhead parts along the last dimension, as follows:
$Hq_i = \mathrm{Div}(\mathrm{Linear}(H_i), nhead)$ (Equation 19)
$Hk_i = \mathrm{Div}(\mathrm{Linear}(H_i), nhead)$ (Equation 20)
$Hv_i = \mathrm{Div}(\mathrm{Linear}(H_i), nhead)$ (Equation 21)
The query vector $Hq_i$ is multiplied by the transpose $Hk_i^T$ of the key vector and processed with the softmax function to compute the attention matrix $HAttn_i$ between the elements of the input feature, as follows:
$HAttn_i = \mathrm{softmax}(Hq_i \cdot Hk_i^T)$ (Equation 22)
Let the size of the last dimension of the value vector $Hv_i$ be $hd$. $HAttn_i$ is then multiplied with $Hv_i$, as follows:
$HA_{i+1} = \frac{HAttn_i}{\sqrt{hd}} \cdot Hv_i$ (Equation 23)
$HA_{i+1}$ is then mapped by the subsequent feed-forward network module $FFN_i$ to learn a better multi-modal representation $H_{i+1}$, as follows:
$H_{i+1} = FFN_i(HA_{i+1}) = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(HA_{i+1})))$ (Equation 24)
Through the multi-layer network computation, the multi-modal representation H' is finally obtained.
Compared with the prior art, the invention has obvious beneficial effects, as can be seen from the above technical solution. The invention is composed of a target detector and a region description generator. The target detector explores the key regions of the input image, predicts the spatial coordinates of the regions, and extracts region feature maps. Inside the target detector, a window-attention-based feature encoder learns and extracts the characterization of the input image. The feature encoder is formed by stacking 12 ViT (Vision Transformer) layers; in each layer, the image feature map is divided into several windows of equal size and attention is computed within each window. From the global image characterization output by the feature encoder, a feature pyramid (Feature Pyramid) computes 5 image features at different scales, and a target detection head predicts the position information of the key regions. Based on the predicted positions, the model crops region features from the multi-scale features and inputs them to the region description generator. The region description generator takes a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model as its core and generates region descriptions autoregressively from the input global characterization and region features. The invention uses the Vision Transformer as the feature extraction network and introduces an image window mechanism on top of it, so that only the attention within each window is computed; this gives better performance and higher efficiency than a conventional convolutional neural network.
The invention uses convolutional neural networks to convert the image feature map into several features at different scales, and identifies the regions of interest with several parallel detection heads, so that regions of interest of different sizes are discovered more comprehensively. The invention uses a BERT network to generate descriptions autoregressively, which is faster and yields higher description quality.
Drawings
FIG. 1 is a schematic diagram of the target detector of the present invention;
FIG. 2 is a schematic diagram of the description generator of the present invention.
Detailed Description
The following describes in detail specific embodiments of the image dense description method proposed by the present invention.
An image dense description method based on a window self-attention network and multi-scale features comprises the following steps:
Step 1, coarse processing of the input image X:
The data set used for training is the Visual Genome data set, which contains 108,077 pictures in total. In each picture, multiple target objects are boxed by manual annotation, and corresponding descriptions are attached. During training, the ratio of the training set to the test set is set to 20:1, the number of training iterations is set to 180,000, and the proposed image dense description method processes four pictures at a time;
Calculate the size of the input image X. If the height or width of the image is larger than 1024 pixels, crop and downscale X so that the input size is 1024 × 1024. Divide X into image blocks of size 16 and perform coarse processing with a convolution kernel of size 16 to obtain the coarse image feature X', whose channel number is 768. The coarse image feature then enters the target detector;
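As a quick check of the sizes involved, the sketch below computes how many image blocks the coarse processing produces. The function name is hypothetical, and it is assumed that the convolution uses a stride equal to its kernel size so that the blocks do not overlap.

```python
def patch_grid(image_size=1024, patch_size=16):
    """Return (blocks per side, total blocks) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the block size")
    side = image_size // patch_size
    return side, side * side
```

For a 1024 × 1024 input and a block size of 16, `patch_grid()` returns `(64, 4096)`, so the coarse image feature X' is a 64 × 64 grid of 768-channel vectors.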
Step 2, calculation of the global characterization $V_f$ of the image:
The coarse image feature enters the target detector, as shown in FIG. 1. Taking the coarse image feature X' as input, a pre-trained ViT model is used as the image feature encoder to obtain the global characterization $V_f$. The ViT model is formed by stacking 12 Transformer layers. All Transformer layers except the 3rd, 6th, 9th, and 12th adopt a window mechanism that divides the input feature into several windows: the image representation is divided into windows of size α, and attention is computed only between pixels within the same window. The window size α is set to 14.
Taking the i-th Transformer layer as an example, with the window size set to α, the input feature $V_i$ of the layer is first padded at the edges so that its size is an integer multiple of the window size. The padded feature is then evenly divided into window features of equal size, denoted $V_i'$.
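The padding and window division can be sketched on a two-dimensional grid of scalars. The function name is hypothetical, and padding on the bottom and right edges with zeros is an assumption, since the text only states that the edges are padded.

```python
import math

def pad_and_partition(grid, alpha, fill=0):
    """Pad an H x W grid to multiples of alpha and split it into alpha x alpha windows."""
    h, w = len(grid), len(grid[0])
    ph = math.ceil(h / alpha) * alpha  # padded height
    pw = math.ceil(w / alpha) * alpha  # padded width
    padded = [row + [fill] * (pw - w) for row in grid]
    padded += [[fill] * pw for _ in range(ph - h)]
    windows = []
    for top in range(0, ph, alpha):
        for left in range(0, pw, alpha):
            windows.append([r[left:left + alpha] for r in padded[top:top + alpha]])
    return windows
```

With the 64 × 64 feature grid of the embodiment and α = 14, the grid is padded to 70 × 70 and divided into 25 windows of 14 × 14 = 196 positions each.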
The window feature set $V_i'$ is then passed through three fully connected layers to compute the query vector $q_i$, the key vector $k_i$, and the value vector $v_i$, each of which is evenly divided into nhead parts along the last dimension, as follows:
$q_i = \mathrm{Div}(\mathrm{Linear}(V_i'), nhead)$ (Equation 1)
$k_i = \mathrm{Div}(\mathrm{Linear}(V_i'), nhead)$ (Equation 2)
$v_i = \mathrm{Div}(\mathrm{Linear}(V_i'), nhead)$ (Equation 3)
nhead is set to 12, i.e. the query vector $q_i$, the key vector $k_i$, and the value vector $v_i$ are each evenly divided into 12 parts.
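The Div operation can be illustrated on a single token vector. The function name is hypothetical, and the vector length 768 used in the example assumes that the projected vectors keep the channel number stated in step 1.

```python
def div(vector, nhead):
    """Split one vector evenly into nhead parts along its (last) dimension."""
    n = len(vector)
    if n % nhead != 0:
        raise ValueError("dimension must be divisible by nhead")
    size = n // nhead
    return [vector[j * size:(j + 1) * size] for j in range(nhead)]
```

Under that assumption, `div(list(range(768)), 12)` yields 12 parts of 64 values each, so each attention head works on a 64-dimensional slice.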
The query vector $q_i$ is multiplied by the transpose $k_i^T$ of the key vector $k_i$, and the product is processed with the softmax function to obtain the attention matrix $Attn_i$ between pixels within the window:
$Attn_i = \mathrm{softmax}(q_i \cdot k_i^T)$ (equation 4)
Let d be the size of the last dimension of the value vector $v_i$. $Attn_i$ is scaled and multiplied by $v_i$:
$A_{i+1} = \frac{Attn_i}{d^{1/2}} \cdot v_i$ (equation 5)
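A minimal single-head sketch of the attention computation inside one window is given below. It follows equations 4 and 5 as written, applying the division by the square root of d after the softmax; note that the commonly used scaled dot-product attention applies this scaling to the scores before the softmax. The function names and the tiny example values are illustrative assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(q, k, v):
    """Attention over the tokens of one window for a single head.

    q, k, v: lists of equal-length float vectors, one per token.
    """
    d = len(v[0])
    scale = math.sqrt(d)
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        attn = softmax(scores)  # equation 4
        out.append([
            sum(attn[j] / scale * v[j][c] for j in range(len(v)))  # equation 5
            for c in range(d)
        ])
    return out


q = [[0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[2.0, 0.0], [0.0, 2.0]]
print(window_attention(q, k, v))  # each entry is close to 1/sqrt(2)
```

In the example the query is orthogonal to nothing in particular (all scores are zero), so the attention weights are uniform and the output is the mean of the value vectors divided by the square root of d.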
The calculation result $A_{i+1}$ is restored, according to the position of each window, to the same shape as the input feature $V_i$ and is denoted $A_{i+1}'$. $A_{i+1}'$ is then mapped by the subsequent feed-forward network module $FFN_i$ to learn a better image representation $V_{i+1}$:
$V_{i+1} = FFN_i(A_{i+1}') = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(A_{i+1}')))$ (equation 6)
Through multi-layer network calculation, the global characterization $V_f$ is finally obtained.
Step 3, multi-scale feature acquisition:
The global characterization $V_f$ obtained in the previous step is passed through 5 different convolutional neural network branches to obtain the multi-scale feature set $F = \{f_1, f_2, f_3, f_4, f_5\}$, so as to adapt to the detection of targets of different sizes;
Step 3.1, feature acquisition at scale 1/8: the global characterization $V_f$ is up-sampled by a deconvolution layer with a kernel size of 2, mapped by a convolution layer with a kernel size of 1, and then passed through a convolution layer with a kernel size of 3 to extract the image feature $f_1$ at scale 1/8;
Step 3.2, feature acquisition at scale 1/16: the global characterization $V_f$ is mapped by a convolution layer with a kernel size of 1 and then passed through a convolution layer with a kernel size of 3 to extract the image feature $f_2$ at scale 1/16;
Step 3.3, feature acquisition at scale 1/32: the global characterization $V_f$ is down-sampled by max pooling, mapped by a convolution layer with a kernel size of 1, and then passed through a convolution layer with a kernel size of 3 to extract the image feature $f_3$ at scale 1/32;
Step 3.4, feature acquisition at scale 1/64: $f_3$ is down-sampled by a convolution layer with a kernel size of 2 and a stride of 2 to obtain the image feature $f_4$ at scale 1/64;
Step 3.5, feature acquisition at scale 1/128: $f_4$ is processed by the activation function ReLU and then down-sampled by a convolution layer with a kernel size of 2 and a stride of 2 to obtain the image feature $f_5$ at scale 1/128.
The number of channels of $f_1, f_2, f_3, f_4, f_5$ is unified to 256.
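The spatial sizes produced by the five branches can be traced with a short calculation. The sketch assumes a 64 × 64 global characterization (a 1024 × 1024 image with 16 × 16 blocks), a stride of 2 for the deconvolution and for the max pooling, and padding that keeps the resolution unchanged in the 1 × 1 and 3 × 3 convolutions; none of these strides or paddings are stated explicitly in the text.

```python
def branch_sizes(vf_side=64):
    """Side length of each multi-scale feature map f1..f5."""
    f1 = vf_side * 2   # deconvolution, assumed stride 2, doubles the resolution
    f2 = vf_side       # 1x1 and 3x3 convolutions, assumed to keep the resolution
    f3 = vf_side // 2  # max pooling, assumed stride 2
    f4 = f3 // 2       # convolution with kernel 2, stride 2
    f5 = f4 // 2       # ReLU, then convolution with kernel 2, stride 2
    return [f1, f2, f3, f4, f5]


sizes = branch_sizes()
print(sizes)                        # [128, 64, 32, 16, 8]
print([1024 // s for s in sizes])   # [8, 16, 32, 64, 128]
```

The second line confirms that, relative to a 1024-pixel input, the five maps sit at scales 1/8, 1/16, 1/32, 1/64 and 1/128.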
Step 4, salient target prediction and regional feature extraction:
Step 4.1, salient target prediction: for the multi-scale feature set $F = \{f_1, f_2, f_3, f_4, f_5\}$, 5 independent prediction network heads are used, one per scale, to identify the targets contained in the image features.
Taking the i-th prediction network head as an example, its input is $f_i$. $f_i$ is processed by a convolutional network to extract the spatial information of each salient target. Specifically, local features are extracted from $f_i$ by a convolution layer with a kernel size of 3, then processed by GroupNorm (group normalization) and finally by the activation function ReLU:
$f_i' = \mathrm{ReLU}(\mathrm{GroupNorm}(\mathrm{Conv}(f_i)))$ (equation 7)
The above procedure is repeated 4 times.
An implicit knowledge learning mechanism is introduced to improve the prediction head, so that implicit parameters improve the detection of different targets. Specifically, each prediction network head has learnable parameters $A_i$ and $M_i$, which are added to and multiplied with $f_i'$, respectively:
$f_i'' = (f_i' + A_i) \cdot M_i$ (equation 8)
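The effect of the implicit parameters can be illustrated on a single feature vector. The sketch assumes that the additive and multiplicative parameters hold one value per channel and are applied identically at every spatial position; the text does not specify their shape, and `apply_implicit` is a hypothetical name.

```python
def apply_implicit(feature, add_param, mul_param):
    """Equation 8 for one spatial position: (f' + A) * M, channel by channel.

    Assumption: add_param and mul_param hold one value per channel.
    """
    return [(f + a) * m for f, a, m in zip(feature, add_param, mul_param)]


print(apply_implicit([1.0, 2.0], [0.5, -1.0], [2.0, 3.0]))  # [3.0, 3.0]
```

During training the two parameter vectors would be updated by gradient descent together with the convolution weights.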
For the spatial feature $f_i''$, two convolutional network branches are used to obtain the predicted spatial coordinates $bbox_i$ at this scale and the confidence $agn_i$:
$bbox_i = \mathrm{ReLU}(\mathrm{Conv}(f_i''))$ (equation 9)
$agn_i = \mathrm{Conv}(f_i'')$ (equation 10)
Each spatial coordinate has length 4, representing the abscissa of the upper left corner of the predicted region, the ordinate of the upper left corner, the region length, and the region width, respectively.
Step 4.2, target detector training and loss function: for the salient region prediction result $BBOX = \{bbox_1, bbox_2, bbox_3, bbox_4, bbox_5\}$ of the target detector, the target in the training dataset closest to each prediction result is found, and the resulting target set is defined as Target. The difference between the prediction result and the actual target is measured with the metric CIOU; a higher CIOU indicates better performance of the target detector. With the prediction region defined as g and the actual target region as t, CIOU is calculated as follows:
$CIOU = IOU - \frac{\rho^2(g, t)}{c^2} - \beta v$ (equation 11)
$IOU = \frac{|g \cap t|}{|g \cup t|}$ (equation 12)
$v = \frac{4}{\pi^2}\left(\arctan\frac{w_t}{m_t} - \arctan\frac{w}{m}\right)^2$ (equation 13)
$\beta = \frac{v}{1 - IOU + v}$ (equation 14)
where ρ represents the Euclidean distance between the center points of the prediction region g and the actual target region t, c represents the diagonal length of the smallest region enclosing both g and t, $w_t$ and $m_t$ represent the width and height of the actual target region t, and w and m represent the width and height of the prediction region g;
The loss function $L_{dec}$ for training the target detector is calculated as:
$L_{dec} = 1 - CIOU$ (equation 15)
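Equations 11 to 15 can be checked numerically with a short sketch. Boxes are given as (left, top, width, height), matching the coordinate layout described for the predicted regions; the sketch assumes strictly positive widths and heights, and it sets β to 0 when its denominator vanishes (identical boxes), which is a guard added here and not part of the text.

```python
import math


def ciou(g, t):
    """CIOU between a predicted box g and a target box t, each (x, y, w, h)."""
    gx, gy, gw, gh = g
    tx, ty, tw, th = t
    # Equation 12: intersection over union.
    inter_w = max(0.0, min(gx + gw, tx + tw) - max(gx, tx))
    inter_h = max(0.0, min(gy + gh, ty + th) - max(gy, ty))
    inter = inter_w * inter_h
    union = gw * gh + tw * th - inter
    iou = inter / union
    # Squared distance between the two box centers (rho squared).
    rho2 = ((gx + gw / 2) - (tx + tw / 2)) ** 2 + ((gy + gh / 2) - (ty + th / 2)) ** 2
    # Squared diagonal of the smallest box enclosing both (c squared).
    enc_w = max(gx + gw, tx + tw) - min(gx, tx)
    enc_h = max(gy + gh, ty + th) - min(gy, ty)
    c2 = enc_w ** 2 + enc_h ** 2
    # Equations 13 and 14: aspect-ratio term and its weight.
    v = 4 / math.pi ** 2 * (math.atan(tw / th) - math.atan(gw / gh)) ** 2
    denom = 1 - iou + v
    beta = v / denom if denom > 0 else 0.0  # guard for identical boxes
    return iou - rho2 / c2 - beta * v       # equation 11


def detector_loss(g, t):
    """Equation 15."""
    return 1 - ciou(g, t)


print(detector_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0
```

For two 2 × 2 boxes offset by one unit in each direction, the IOU is 1/7, the center penalty is 2/18 and the aspect-ratio term is 0, so the CIOU is 1/7 − 1/9.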
Step 4.3, regional feature extraction: according to the salient region prediction result BBOX of the target detector, the features of the corresponding regions are cropped from the multi-scale feature set F. All cropped region features are recombined into a 4-dimensional tensor R with 256 channels. The tensor R is then mapped to 768 dimensions by a fully connected layer.
Step 5, generating image dense description:
Step 5.1, text feature T' extraction: according to the target set Target determined in step 4.2, the natural language descriptions corresponding to the regions are collected from the training dataset, and the set of these descriptions is defined as TargetText. The word embedding layer of a pre-trained BERT model converts TargetText into word vector features, defined as T. The word vector feature T has 768 channels. For the word vector feature T, the word vector position encoding $PE_n$ is calculated and added to T to obtain the text feature T':
$PE_n = \{PE_{(pos,2i)} = \sin(pos/1000^{2i/n}),\ PE_{(pos,2i+1)} = \cos(pos/1000^{2i/n})\}$ (equation 16)
$T' = T + PE_n$ (equation 17)
where pos ∈ [1, 2, …] and i ∈ [0, 1, …, n/2];
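The position encoding of equations 16 and 17 can be sketched as follows, using the base of 1000 given in the text (the widely used Transformer encoding uses 10000). The sketch assumes an even dimension n and lets i run from 0 to n/2 − 1 so that the indices 2i and 2i + 1 stay inside the n dimensions; the function names are illustrative.

```python
import math


def position_encoding(num_positions, n, base=1000.0):
    """Return one n-dimensional encoding per position, pos = 1..num_positions."""
    table = []
    for pos in range(1, num_positions + 1):
        row = [0.0] * n
        for i in range(n // 2):
            angle = pos / base ** (2 * i / n)
            row[2 * i] = math.sin(angle)      # even index: sine
            row[2 * i + 1] = math.cos(angle)  # odd index: cosine
        table.append(row)
    return table


def add_position(word_vectors, base=1000.0):
    """Equation 17: T' = T + PE_n, element by element."""
    n = len(word_vectors[0])
    pe = position_encoding(len(word_vectors), n, base)
    return [[t + p for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(word_vectors, pe)]


encoded = add_position([[0.0] * 4, [0.0] * 4])
print(len(encoded), len(encoded[0]))  # 2 4
```

With zero word vectors as input, the result is the encoding table itself, which makes the sine and cosine pattern easy to inspect.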
Step 5.2, description generation: the region feature set R is mapped into a high-dimensional space by a fully connected layer and denoted as the high-dimensional region feature R'. The global characterization $V_f$, the high-dimensional region feature R' and the text feature T' are concatenated to obtain the multi-modal feature H:
$H = \mathrm{Concat}(V_f, R', T')$ (equation 18)
The structure of the description generator is shown in fig. 2. Taking the multi-modal feature H as input, a pre-trained BERT model is used to fuse the multi-modal information. The BERT model is a stack of 6 Transformer layers. For the i-th Transformer layer, the input feature $H_i$ of the network layer is passed through three fully connected layers to calculate the query vector $Hq_i$, the key vector $Hk_i$ and the value vector $Hv_i$, each of which is divided evenly into nhead parts along the last dimension:
$Hq_i = \mathrm{Div}(\mathrm{Linear}(H_i), nhead)$ (equation 19)
$Hk_i = \mathrm{Div}(\mathrm{Linear}(H_i), nhead)$ (equation 20)
$Hv_i = \mathrm{Div}(\mathrm{Linear}(H_i), nhead)$ (equation 21)
The query vector $Hq_i$ is multiplied by the transpose $Hk_i^T$ of the key vector $Hk_i$, and the product is processed with the softmax function to obtain the attention matrix $HAttn_i$ between pixels in the window:
$HAttn_i = \mathrm{softmax}(Hq_i \cdot Hk_i^T)$ (equation 22)
Let hd be the size of the last dimension of the value vector $Hv_i$. $HAttn_i$ is scaled and multiplied by $Hv_i$:
$HA_{i+1} = \frac{HAttn_i}{hd^{1/2}} \cdot Hv_i$ (equation 23)
$HA_{i+1}$ is then mapped by the subsequent feed-forward network module $FFN_i$ to learn a better multi-modal representation $H_{i+1}$:
$H_{i+1} = FFN_i(HA_{i+1}) = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(HA_{i+1})))$ (equation 24)
Through multi-layer network calculation, the multi-modal representation H' is finally obtained.
The size $E_{voc}$ of the model's built-in vocabulary is set to 30522. H' is mapped to the $E_{voc}$-dimensional space by a fully connected layer and processed with the softmax function, and the output is defined as $pro_l$:
$pro_l = \mathrm{softmax}(\mathrm{Linear}(H'))$ (equation 25)
where l is the maximum length of the generated region description; $pro_l^i$ is defined as the predicted probability of each word at the i-th position of the generated region description, and the word with the maximum probability is taken as the candidate word $w_i$ for that position. Finally, the region description $W = \{w_1, w_2, …, w_l\}$ is generated;
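Selecting the most probable word at each position amounts to a per-position argmax, as in the sketch below. The three-word vocabulary and the probability values are made up for illustration; in the method the vocabulary has 30522 entries and the probabilities come from the softmax output.

```python
def greedy_decode(probabilities, vocabulary):
    """Pick, for every position, the vocabulary word with the highest probability.

    probabilities: one list of per-word probabilities per position.
    """
    words = []
    for dist in probabilities:
        best = max(range(len(dist)), key=dist.__getitem__)
        words.append(vocabulary[best])
    return words


vocab = ["a", "dog", "runs"]  # hypothetical toy vocabulary
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
print(greedy_decode(probs, vocab))  # ['a', 'dog', 'runs']
```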
Step 5.3, description generator training and loss function: to prevent the description generator from overfitting during training, label smoothing is used for regularization. First, the natural language description set TargetText is converted into one-hot codes of length $E_{voc}$, which are used to calculate the loss function. Label smoothing is then applied to the one-hot codes. With a given one-hot code defined as h and its label-smoothed result as h', label smoothing is calculated as:
$h' = (1.0 - eps) \cdot h + eps / E_{voc}$ (equation 26)
where eps is a small constant defined by this technical solution and is set to $10^{-12}$.
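Equation 26 can be tried out on a toy vocabulary. The example below uses eps = 0.1 and a vocabulary of 4 words only so that the effect is visible; with the value of $10^{-12}$ given in the text the smoothed code is numerically almost identical to the one-hot code. The function name is illustrative.

```python
def smooth_one_hot(index, vocab_size, eps=1e-12):
    """Label-smoothed one-hot code: (1 - eps) * h + eps / vocab_size."""
    uniform = eps / vocab_size
    return [(1.0 - eps) * (1.0 if j == index else 0.0) + uniform
            for j in range(vocab_size)]


smoothed = smooth_one_hot(2, 4, eps=0.1)
print(smoothed)
```

The entry for the correct word becomes 0.9 + 0.025 and every other entry becomes 0.025, so the smoothed code still sums to 1.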
After this, the loss $L_{ce}$ of the dense description is calculated using the cross-entropy function:
$L_{ce} = -\sum_{i=1}^{N} \log\left(p(y_i^* \mid y_{1:i-1}^*)\right)$ (equation 27)
where $y_{1:N}^*$ is a region description from TargetText with length N, p is the probability predicted by the description generator, and $y_i^*$ is the character at position i of the region description.
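Equation 27 sums the negative log-probabilities that the generator assigns to the reference words. The sketch below evaluates it for hard (unsmoothed) targets, assuming that the probability distribution at each position has already been computed given the preceding reference words; the names and values are illustrative.

```python
import math


def description_loss(step_probs, target_ids):
    """Cross-entropy of equation 27 for one region description.

    step_probs: per-position probability distributions over the vocabulary.
    target_ids: index of the reference word at each position.
    """
    return -sum(math.log(dist[y]) for dist, y in zip(step_probs, target_ids))


loss = description_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
print(round(loss, 4))  # 0.9808
```

The value equals −(log 0.5 + log 0.75); a generator that assigns probability 1 to every reference word would reach a loss of 0.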
In the prediction stage, the image dense description method generates dense descriptions for one image at a time. The model extracts regional features in the same way as in the training stage, except that the loss value $L_{dec}$ does not need to be calculated. In addition, the input to the description generator consists only of the global characterization $V_f$ and the high-dimensional region feature R', from which the corresponding dense description result is generated.
The dense description result generated in the prediction stage contains the position information of the predicted salient regions and the corresponding region descriptions. The prediction results with high confidence are selected and drawn on the input image.
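The final selection step can be sketched as a simple threshold filter. The text does not give a threshold or a data layout, so the cut-off of 0.5, the dictionary keys and the example values below are all assumptions made for illustration.

```python
def filter_predictions(predictions, threshold=0.5):
    """Keep only predictions whose confidence reaches the (assumed) threshold."""
    return [p for p in predictions if p["confidence"] >= threshold]


predictions = [  # hypothetical detector and generator outputs
    {"bbox": (10, 20, 100, 50), "confidence": 0.91, "text": "a dog on the grass"},
    {"bbox": (200, 40, 30, 30), "confidence": 0.12, "text": "a blurry object"},
]
for p in filter_predictions(predictions):
    print(p["bbox"], p["text"])
```

Only the retained boxes and their descriptions would then be drawn on the input image.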
The foregoing description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way. Any simple modification, equivalent change or variation of the above embodiment made according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.
Claims (3)
1. An image dense description method based on window self-attention and a multi-scale mechanism, comprising the following steps:
step 1, coarse processing of the input image X: an image X with a size of 1024 × 1024 is input, X is divided into a plurality of square image blocks with a side length of k, and coarse processing is performed with a convolution kernel of size k × k to obtain a coarse image feature X';
step 2, calculation of the global characterization $V_f$ of the image: the coarse image feature X' is input, and a pre-trained ViT model is used as the feature encoder of the image to obtain the global characterization $V_f$ of the image; the ViT model is formed by stacking multiple layers of Transformer modules, within each of which the image representation is divided into square windows with a side length of α and attention is calculated only between pixels within a window; through multi-layer network calculation, the global characterization $V_f$ is finally obtained, and the method of obtaining the global characterization $V_f$ is as follows:
in each layer of the Transformer module, taking one layer as the i-th Transformer layer, with the window size set to α × α, the input feature $V_i$ of the network layer is padded at its edges so that its size is an integer multiple of the window size, and the padded feature is divided evenly into a plurality of window feature sets of equal size, denoted $V_i'$; then, the window feature set $V_i'$ is passed through three fully connected layers to calculate the query vector $q_i$, the key vector $k_i$ and the value vector $v_i$, each of which is divided evenly into nhead parts along the last dimension, according to the following formulas:
$q_i = \mathrm{Div}(\mathrm{Linear}(V_i'), nhead)$
$k_i = \mathrm{Div}(\mathrm{Linear}(V_i'), nhead)$
$v_i = \mathrm{Div}(\mathrm{Linear}(V_i'), nhead)$
the query vector $q_i$ is multiplied by the transpose $k_i^T$ of the key vector $k_i$, and the product is processed with the softmax function to calculate the attention matrix $Attn_i$ between pixels within the window, according to the following formula:
$Attn_i = \mathrm{softmax}(q_i \cdot k_i^T)$
with d being the size of the last dimension of the value vector $v_i$, $Attn_i$ is multiplied by $v_i$ according to the following formula:
$A_{i+1} = \frac{Attn_i}{d^{1/2}} \cdot v_i$
the calculation result $A_{i+1}$ is restored, according to the position of each window, to the same shape as the input feature $V_i$ and is denoted $A_{i+1}'$; $A_{i+1}'$ is mapped by the subsequent feed-forward network module $FFN_i$ to learn a better image representation $V_{i+1}$, according to the following formula:
$V_{i+1} = FFN_i(A_{i+1}') = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(A_{i+1}')))$
through multi-layer network calculation, the global characterization $V_f$ is finally obtained;
step 3, multi-scale feature acquisition: the global characterization $V_f$ obtained in the previous step is passed through 5 different convolutional neural network branches to obtain the multi-scale feature set $F = \{f_1, f_2, f_3, f_4, f_5\}$, so as to adapt to the detection of targets of different sizes;
step 4, salient target prediction and regional feature extraction:
step 4.1, salient target prediction: for the multi-scale feature set $F = \{f_1, f_2, f_3, f_4, f_5\}$, 5 independent prediction network heads are used, respectively, to identify the targets contained in the image features;
for the i-th prediction network head, the input is $f_i$; local features are extracted from $f_i$ by a convolution layer with a kernel size of 3 × 3, then processed by GroupNorm (group normalization) and finally by the activation function ReLU, according to the following formula:
$f_i' = \mathrm{ReLU}(\mathrm{GroupNorm}(\mathrm{Conv}(f_i)))$
the above procedure is repeated 4 times;
each prediction network head has learnable parameters $A_i$ and $M_i$, which are added to and multiplied with $f_i'$, respectively, according to the following formula:
$f_i'' = (f_i' + A_i) \cdot M_i$
for the spatial feature $f_i''$, two convolutional network branches are used to obtain the predicted spatial coordinates $bbox_i$ at this scale and the confidence $agn_i$, according to the following formulas:
$bbox_i = \mathrm{ReLU}(\mathrm{Conv}(f_i''))$
$agn_i = \mathrm{Conv}(f_i'')$
step 4.2, target detector training and loss function: for the salient region prediction result $BBOX = \{bbox_1, bbox_2, bbox_3, bbox_4, bbox_5\}$ of the target detector, the target in the training dataset closest to each prediction result is found, and the resulting target set is defined as Target; the difference between the prediction result and the actual target is measured with the metric CIOU, a higher CIOU indicating better performance of the target detector; with the prediction region defined as g and the actual target region as t, CIOU is calculated as follows:
$CIOU = IOU - \frac{\rho^2(g, t)}{c^2} - \beta v$
$IOU = \frac{|g \cap t|}{|g \cup t|}$
$v = \frac{4}{\pi^2}\left(\arctan\frac{w_t}{m_t} - \arctan\frac{w}{m}\right)^2$
$\beta = \frac{v}{1 - IOU + v}$
wherein ρ represents the Euclidean distance between the center points of the prediction region g and the actual target region t, c represents the diagonal length of the smallest region enclosing both g and t, $w_t$ and $m_t$ represent the width and height of the actual target region t, and w and m represent the width and height of the prediction region g;
the loss function $L_{dec}$ for training the target detector is calculated as follows:
$L_{dec} = 1 - CIOU$
step 4.3, regional feature extraction: according to the salient region prediction result BBOX of the target detector, the features of the corresponding regions are cropped from the multi-scale feature set F, and the resulting region feature set is denoted R;
step 5, generating image dense description:
step 5.1, text feature T' extraction: according to the target set Target determined in step 4.2, the natural language descriptions corresponding to the regions are collected from the training dataset, and the set of these descriptions is defined as TargetText; the word embedding layer of a pre-trained BERT model converts TargetText into word vector features, defined as T; for the n-dimensional word vector feature T, the word vector position encoding $PE_n$ is calculated and added to T to obtain the text feature T', according to the following formulas:
$PE_n = \{PE_{(pos,2i)} = \sin(pos/1000^{2i/n}),\ PE_{(pos,2i+1)} = \cos(pos/1000^{2i/n})\}$
$T' = T + PE_n$
wherein pos ∈ [1, 2, …] and i ∈ [0, 1, …, n/2];
step 5.2, description generation: the region feature set R is mapped into a high-dimensional space by a fully connected layer and denoted as the high-dimensional region feature R'; the global characterization $V_f$, the high-dimensional region feature R' and the text feature T' are concatenated to obtain the multi-modal feature H, according to the following formula:
$H = \mathrm{Concat}(V_f, R', T')$
the description generator takes the multi-modal feature H as input and uses a pre-trained BERT model to fuse the multi-modal information; the BERT model is formed by stacking multiple Transformer network layers, each of which performs self-attention calculation on the input multi-modal feature H, and the calculation result of the BERT model is denoted H';
with the size of the model's built-in vocabulary defined as $E_{voc}$, H' is mapped to the $E_{voc}$-dimensional space by a fully connected layer and processed with the softmax function, and the output is defined as $pro_l$, according to the following formula:
$pro_l = \mathrm{softmax}(\mathrm{Linear}(H'))$
wherein l is the maximum length of the generated region description; $pro_l^i$ is defined as the predicted probability of each word at the i-th position of the generated region description, and the word with the maximum probability is taken as the candidate word $w_i$ for that position; finally, the region description $W = \{w_1, w_2, …, w_l\}$ is generated;
step 5.3, description generator training and loss function: the natural language description set TargetText is converted into one-hot codes of length $E_{voc}$, which are used to calculate the loss function; label smoothing is applied to the one-hot codes, and with a given one-hot code defined as h and its label-smoothed result as h', label smoothing is calculated as follows:
$h' = (1.0 - eps) \cdot h + eps / E_{voc}$
wherein eps is a small constant of $10^{-12}$;
after this, the loss $L_{ce}$ of the dense description is calculated using the cross-entropy function, according to the following formula:
$L_{ce} = -\sum_{i=1}^{N} \log\left(p(y_i^* \mid y_{1:i-1}^*)\right)$
wherein $y_{1:N}^*$ is a region description from TargetText with length N, p is the probability predicted by the description generator, and $y_i^*$ is the character at position i of the region description.
2. The image dense description method based on window self-attention and a multi-scale mechanism according to claim 1, wherein in step 3 the multi-scale feature set is obtained through 5 different convolutional neural network branches as follows:
step 3.1, feature acquisition at scale 1/8: the global characterization $V_f$ is up-sampled by a deconvolution layer with a kernel size of 2 × 2, mapped by a convolution layer with a kernel size of 1 × 1, and then passed through a convolution layer with a kernel size of 3 × 3 to extract the image feature $f_1$ at scale 1/8;
step 3.2, feature acquisition at scale 1/16: the global characterization $V_f$ is mapped by a convolution layer with a kernel size of 1 × 1 and then passed through a convolution layer with a kernel size of 3 × 3 to extract the image feature $f_2$ at scale 1/16;
step 3.3, feature acquisition at scale 1/32: the global characterization $V_f$ is down-sampled by max pooling, mapped by a convolution layer with a kernel size of 1 × 1, and then passed through a convolution layer with a kernel size of 3 × 3 to extract the image feature $f_3$ at scale 1/32;
step 3.4, feature acquisition at scale 1/64: $f_3$ is down-sampled by a convolution layer with a kernel size of 2 × 2 and a stride of 2 to obtain the image feature $f_4$ at scale 1/64;
step 3.5, feature acquisition at scale 1/128: $f_4$ is processed by the activation function ReLU and then down-sampled by a convolution layer with a kernel size of 2 × 2 and a stride of 2 to obtain the image feature $f_5$ at scale 1/128.
3. The image dense description method based on window self-attention and a multi-scale mechanism according to claim 1, wherein the self-attention calculation on the input multi-modal feature H in step 5.2 is performed as follows: for the i-th Transformer layer of the BERT model, the input feature $H_i$ of the network layer is passed through three fully connected layers to calculate the query vector $Hq_i$, the key vector $Hk_i$ and the value vector $Hv_i$, each of which is divided evenly into nhead parts along the last dimension, according to the following formulas:
$Hq_i = \mathrm{Div}(\mathrm{Linear}(H_i), nhead)$
$Hk_i = \mathrm{Div}(\mathrm{Linear}(H_i), nhead)$
$Hv_i = \mathrm{Div}(\mathrm{Linear}(H_i), nhead)$
the query vector $Hq_i$ is multiplied by the transpose $Hk_i^T$ of the key vector $Hk_i$, and the product is processed with the softmax function to calculate the attention matrix $HAttn_i$ between pixels in the window, according to the following formula:
$HAttn_i = \mathrm{softmax}(Hq_i \cdot Hk_i^T)$
with hd being the size of the last dimension of the value vector $Hv_i$, $HAttn_i$ is multiplied by $Hv_i$ according to the following formula:
$HA_{i+1} = \frac{HAttn_i}{hd^{1/2}} \cdot Hv_i$
$HA_{i+1}$ is mapped by the subsequent feed-forward network module $FFN_i$ to learn a better multi-modal representation $H_{i+1}$, according to the following formula:
$H_{i+1} = FFN_i(HA_{i+1}) = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(HA_{i+1})))$
through multi-layer network calculation, the multi-modal representation H' is finally obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310822911.1A CN116543146B (en) | 2023-07-06 | 2023-07-06 | Image dense description method based on window self-attention and multi-scale mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116543146A (en) | 2023-08-04 |
CN116543146B (en) | 2023-09-26 |
Family
ID=87451029
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310822911.1A (Active) CN116543146B (en) | 2023-07-06 | 2023-07-06 | Image dense description method based on window self-attention and multi-scale mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116543146B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118212537B (en) * | 2024-05-21 | 2024-07-23 | 贵州大学 | Crop counting method based on quantity supervision |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105701495A (en) * | 2016-01-05 | 2016-06-22 | 贵州大学 | Image texture feature extraction method |
US10198671B1 (en) * | 2016-11-10 | 2019-02-05 | Snap Inc. | Dense captioning with joint interference and visual context |
CN109543699A (en) * | 2018-11-28 | 2019-03-29 | 北方工业大学 | Image abstract generation method based on target detection |
CN109670576A (en) * | 2018-11-29 | 2019-04-23 | 中山大学 | A kind of multiple scale vision concern Image Description Methods |
WO2020108165A1 (en) * | 2018-11-30 | 2020-06-04 | 腾讯科技(深圳)有限公司 | Image description information generation method and device, and electronic device |
CN111814844A (en) * | 2020-03-17 | 2020-10-23 | 同济大学 | Intensive video description method based on position coding fusion |
WO2021139069A1 (en) * | 2020-01-09 | 2021-07-15 | 南京信息工程大学 | General target detection method for adaptive attention guidance mechanism |
CN113158735A (en) * | 2021-01-20 | 2021-07-23 | 北京工业大学 | Dense event description method based on graph neural network |
CN113674334A (en) * | 2021-07-06 | 2021-11-19 | 复旦大学 | Texture recognition method based on depth self-attention network and local feature coding |
CN113946706A (en) * | 2021-05-20 | 2022-01-18 | 广西师范大学 | Image description generation method based on reference preposition description |
CN114758203A (en) * | 2022-03-31 | 2022-07-15 | 长江三峡技术经济发展有限公司 | Residual dense visual transformation method and system for hyperspectral image classification |
CN115311465A (en) * | 2022-08-10 | 2022-11-08 | 北京印刷学院 | Image description method based on double attention models |
CN115775316A (en) * | 2022-11-23 | 2023-03-10 | 长春理工大学 | Image semantic segmentation method based on multi-scale attention mechanism |
CN116129124A (en) * | 2023-03-16 | 2023-05-16 | 泰州市人民医院 | Image segmentation method, system and equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9965705B2 (en) * | 2015-11-03 | 2018-05-08 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering |
Non-Patent Citations (3)
Title |
---|
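DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition; Jiayu Jiao et al.; arXiv; pp. 1-15 * |
Research on image description in public environments; Zhou Yuhui; China Master's Theses Full-text Database, Information Science and Technology; (No. 2); pp. I138-1625 * |
Research on dense image description generation methods based on multiple attention structures; Liu Qingru et al.; Acta Automatica Sinica; Vol. 48 (No. 10); pp. 2537-2548 * |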
Also Published As
Publication number | Publication date |
---|---|
CN116543146A (en) | 2023-08-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||