WO2024037664A1 - Referring target detection and positioning method based on dynamic adaptive reasoning - Google Patents

Referring target detection and positioning method based on dynamic adaptive reasoning

Info

Publication number
WO2024037664A1
WO2024037664A1 PCT/CN2023/123906 CN2023123906W
Authority
WO
WIPO (PCT)
Prior art keywords
reasoning
text
reward
round
image
Prior art date
Application number
PCT/CN2023/123906
Other languages
French (fr)
Chinese (zh)
Inventor
张艳宁 (Zhang Yanning)
王鹏 (Wang Peng)
张志鹏 (Zhang Zhipeng)
魏至民 (Wei Zhimin)
Original Assignee
西北工业大学 (Northwestern Polytechnical University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西北工业大学 (Northwestern Polytechnical University)
Publication of WO2024037664A1 publication Critical patent/WO2024037664A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G06V 30/41 Analysis of document content
    • G06V 30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G06V 30/41 Analysis of document content
    • G06V 30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Definitions

  • The invention belongs to the field of multi-modal vision-and-language technology, and specifically relates to a method for detecting and locating a referred target.
  • Referring target detection and positioning locates a target region in an image from a natural-language description: given an image and an accompanying text description, the machine is expected to fuse and reason over the multi-modal information of language and image and automatically determine the image region that the description refers to.
  • This requires the machine to comprehensively understand complex natural-language semantics and visual scene information, and to mine the implicit multi-modal semantic couplings through multi-step reasoning. It is one of the basic research problems on the way to machine intelligence and has a wide range of application scenarios, such as autonomous robot navigation: a home robot must first automatically find and locate the target region in the visual scene from textual information such as a command before it can perform any further action.
  • Referring target detection and positioning is therefore a fundamental building block of machine intelligence; it has great practical and commercial value and has attracted considerable attention from academia and industry in recent years.
  • The text description varies in length; it can be a single word, a phrase or even a long passage, so the number of reasoning steps required is not fixed.
  • The strength of the implicit coupling between text and image differs from sample to sample, and so does the number of reasoning steps required.
  • Some complex cases require more than 10 reasoning steps, while simpler ones need only 3-5 steps.
  • Existing one-stage methods all use a fixed number of inference steps. For short texts this produces redundant inference steps and increases time complexity; for long texts the number of steps may be insufficient, the target region may not be fully determined, and the result is wrong.
  • The present invention provides a referring target detection and positioning method based on dynamic adaptive reasoning: a convolutional DarkNet pre-trained model extracts the image representation and a BERT pre-trained model extracts the language representation, a multi-modal fusion attention mechanism fuses the image and text features, and a reinforcement-learning reward mechanism finally drives dynamic adaptive reasoning to detect and locate the position of the referred target in the image.
  • The invention achieves higher accuracy and faster running speed, making marked progress over previous models in both precision and speed.
  • Step 1: Feature encoding of the text and image information.
  • Step 1-1: The image is encoded by the Darknet-53 convolutional neural network (input size 256×256×3), and the whole-image feature vector is denoted V.
  • W and H are the width and height of the image, respectively.
  • v_k denotes the k-th image block region of the whole-image feature V.
  • Step 1-2: The BERT pre-trained model is used to encode the text features.
  • n denotes the word at the n-th position of the sentence.
  • e_n is the word vector of each word.
  • d is the dimension of the word vector.
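  • The two encoders of Step 1 can be pictured with the minimal sketch below. It is an illustration only, not the patented implementation: Darknet-53 weights are not bundled with torchvision, so a small stand-in CNN is used for the visual branch, and the public bert-base-uncased checkpoint stands in for the BERT text encoder; the 256×256×3 input size, the 20-word cap and the 768-dimensional word vectors follow the embodiment described later in this document.

```python
# Illustrative sketch only: a stand-in CNN replaces Darknet-53, and the public
# bert-base-uncased checkpoint stands in for the patent's BERT text encoder.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast


class VisualEncoder(nn.Module):
    """Encodes a 256x256x3 image into a grid of region features (the v_k)."""

    def __init__(self, out_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for Darknet-53
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.Conv2d(256, out_dim, 3, stride=2, padding=1), nn.BatchNorm2d(out_dim), nn.ReLU(),
        )

    def forward(self, image):                     # image: (B, 3, 256, 256)
        fmap = self.backbone(image)               # (B, out_dim, H', W')
        return fmap.flatten(2).transpose(1, 2)    # V: (B, H'*W', out_dim), rows are the v_k


tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")


def encode_text(sentence, max_len=20):
    """Returns the word vectors e_n (d = 768) and the CLS head vector e_cls."""
    toks = tokenizer(sentence, padding="max_length", truncation=True,
                     max_length=max_len, return_tensors="pt")
    out = bert(**toks)
    E = out.last_hidden_state                     # (1, max_len, 768)
    return E[:, 1:, :], E[:, 0, :]                # word vectors, CLS vector
```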
  • Step 2: Multi-modal feature fusion based on an attention mechanism.
  • Step 2-1: E and V are input into the attention-based multi-modal feature fusion module.
  • The attention-based multi-modal feature fusion module comprises a visually guided language-text attention module and a text-guided visual fusion feature enhancement module; at the t-th reasoning step, after the module is updated, it outputs the visual features V_t and the re-weighted text features.
  • Step 2-2: For the visually guided language-text attention module, an attention mechanism constructs a weight score for each word, and a historical accumulation term is introduced; the scores are computed as follows:
  • where i indexes the accumulation over the first t-1 rounds; the weight term denotes the weight of the n-th word at the t-th reasoning round; the visual term is the average pooling of the visual feature vector V_{t-1} output by the previous round t-1; · denotes the dot product; the remaining matrices are distinct learnable model parameters; and the weights take values in the range 0-1.
  • The word vectors are then updated accordingly after the t-th round.
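  • One possible reading of this visually guided word-weighting step is sketched below (the patent's formula images are not reproduced in this text, so the exact expression is an assumption): each word receives a dot-product score against the average-pooled visual features of the previous round, the scores of earlier rounds are accumulated so that historical reasoning is not forgotten, and a sigmoid squashes the accumulated score into a 0-1 weight that re-weights the word vectors. The projection layers and the sigmoid are illustrative choices.

```python
import torch
import torch.nn as nn


class WordAttention(nn.Module):
    """Visually guided word weighting with accumulation over reasoning rounds (sketch)."""

    def __init__(self, text_dim=512, vis_dim=512):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, vis_dim)  # assumed learnable parameters
        self.proj_vis = nn.Linear(vis_dim, vis_dim)

    def forward(self, E, V_prev, score_hist):
        # E: (B, N, text_dim) word vectors; V_prev: (B, K, vis_dim) visual features
        # from round t-1; score_hist: (B, N) accumulated scores of rounds 1..t-1.
        v_pool = self.proj_vis(V_prev.mean(dim=1))               # average pooling over regions
        s_t = (self.proj_text(E) * v_pool.unsqueeze(1)).sum(-1)  # dot-product score per word
        score_hist = score_hist + s_t                            # keep the history of earlier rounds
        w_t = torch.sigmoid(score_hist)                          # weights in the 0-1 range
        return E * w_t.unsqueeze(-1), w_t, score_hist            # re-weighted word vectors
```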
  • Step 2-3: For the text-guided visual fusion feature enhancement module, a multi-head self-attention module is used to fuse the language and image features.
  • [:] refers to the concat operation
  • ConvBNReLU refers to the convolution, BatchNormalize and ReLU activation function operations
  • Resize refers to the resizing operation.
  • Step 2-4: The final prediction output is t_x, t_y, t_w, t_h, conf, where t_x, t_y, t_w and t_h give the position of the predicted box in the image and conf is the model's confidence.
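  • Steps 2-3 and 2-4 might be realized along the following lines: the weighted word vectors and the visual features are concatenated, passed through a 6-layer, 8-head transformer encoder (the "6-layer-8-head" structure mentioned in the description), and the visual part of the output is refined by a ConvBNReLU block before a small head predicts t_x, t_y, t_w, t_h and conf. The use of nn.TransformerEncoder, the single linear head and the assumption that the number of regions is a perfect square are illustrative, not the patented architecture.

```python
import torch
import torch.nn as nn


class FusionAndHead(nn.Module):
    """Text-guided visual fusion (6 layers, 8 heads) plus box prediction head (sketch)."""

    def __init__(self, dim=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=6)
        self.conv_bn_relu = nn.Sequential(             # the ConvBNReLU block of Step 2-3
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.ReLU())
        self.head = nn.Linear(dim, 5)                  # t_x, t_y, t_w, t_h, conf

    def forward(self, E_t, V):
        # E_t: (B, N, dim) weighted word vectors; V: (B, K, dim) region features
        fused = self.fusion(torch.cat([E_t, V], dim=1))   # [:] concat, then self-attention
        V_t = fused[:, E_t.size(1):, :]                   # keep only the visual part
        B, K, C = V_t.shape
        side = int(K ** 0.5)                              # assumes K is a perfect square
        grid = V_t.transpose(1, 2).reshape(B, C, side, side)
        grid = self.conv_bn_relu(grid)                    # Conv + BatchNorm + ReLU
        pred = self.head(grid.flatten(2).mean(-1))        # (B, 5)
        box, conf = pred[:, :4], torch.sigmoid(pred[:, 4])
        return V_t, box, conf
```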
  • Step 3: Reasoning with a dynamic reward mechanism.
  • Step 3-1: For different texts and images, a dynamic reward module decides, from the current visual-text state at round t, whether to continue reasoning.
  • action is the most likely action under actions_prob, i.e. either continue reasoning or stop reasoning.
  • actions_prob is computed from the text vector and the visual vector of the current reasoning round and predicts the probability of continuing; e_cls denotes the CLS head vector produced by the BERT encoder.
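  • The continue/stop decision of Step 3-1 can be pictured as a tiny policy network over the pooled visual features of the current round and the BERT CLS vector; actions_prob is then a two-way softmax over {continue, stop}. The concatenation, the two-layer MLP and the action coding below are assumptions made only for illustration.

```python
import torch
import torch.nn as nn


class DynamicRewardGate(nn.Module):
    """Predicts actions_prob, the probability of continuing vs. stopping at round t (sketch)."""

    def __init__(self, vis_dim=512, txt_dim=768):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, 256), nn.ReLU(),
            nn.Linear(256, 2))                          # logits for [continue, stop]

    def forward(self, V_t, e_cls):
        # V_t: (B, K, vis_dim) visual features after round t; e_cls: (B, txt_dim) CLS vector
        state = torch.cat([V_t.mean(dim=1), e_cls], dim=-1)
        actions_prob = torch.softmax(self.policy(state), dim=-1)
        action = actions_prob.argmax(dim=-1)            # 0 = continue, 1 = stop (assumed coding)
        return action, actions_prob
```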
  • Step 3-2: The final reward is computed from the difference between this round's inference result and the ground-truth box, i.e. from the candidate box O generated in this round, and is defined as follows:
  • IoU is computed during training between the candidate box O of this round's final inference and the ground-truth training target box.
  • Step 3-3: The instant reward assigns a reward score for a correct association:
  • Score_t is the association score between the visual vector and the text-description word vectors at round t; the reward is 1 when the t-th fusion step improves the multi-modal association, and a penalty is applied otherwise.
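  • The two reward signals could be computed roughly as below. The final reward is derived from the IoU between the round's candidate box O and the ground-truth box (available only during training); the instant reward is positive when the visual-text association score of round t improves over round t-1 and a penalty otherwise. The 0.5 IoU threshold and the -1 penalty value are assumptions, since the patent text does not reproduce the exact formulas.

```python
import torch


def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) tensors."""
    x1 = torch.max(box_a[..., 0], box_b[..., 0])
    y1 = torch.max(box_a[..., 1], box_b[..., 1])
    x2 = torch.min(box_a[..., 2], box_b[..., 2])
    y2 = torch.min(box_a[..., 3], box_b[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[..., 2] - box_a[..., 0]) * (box_a[..., 3] - box_a[..., 1])
    area_b = (box_b[..., 2] - box_b[..., 0]) * (box_b[..., 3] - box_b[..., 1])
    return inter / (area_a + area_b - inter + 1e-6)


def final_reward(candidate_box, gt_box, iou_thresh=0.5):
    """+1 when the round's candidate box O overlaps the ground truth enough (assumed rule)."""
    iou = box_iou(candidate_box, gt_box)
    return torch.where(iou >= iou_thresh, torch.ones_like(iou), -torch.ones_like(iou))


def instant_reward(score_t, score_prev):
    """+1 when the visual-text association score improved this round, else a penalty."""
    return torch.where(score_t > score_prev,
                       torch.ones_like(score_t), -torch.ones_like(score_t))
```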
  • Step 3-4: To train the dynamic adaptive reasoning globally, the total score of the dynamic reward module at round t is computed as follows:
  • Step 3-5: CrossEntropyLoss is used as the training loss, computed from the difference between the predicted box and the ground-truth box for each region of the image; the dynamic reward module decides from the post-inference visual features whether to continue, and reasoning stops when both the final reward and the instant reward are positive activations and the confidence of the predicted box is 1.
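  • Putting the pieces together, one possible outer reasoning loop is sketched below: each round updates the fusion module, the gate proposes continue or stop, and reasoning halts once the instant reward is positive and the predicted confidence reaches the threshold (the patent states a confidence of 1; a threshold close to 1 is used here for numerical practicality). The max_rounds cap, the stand-in association score and the module names, which refer to the sketches above, are illustrative assumptions.

```python
import torch


def adaptive_reasoning(E, e_cls, V, word_attn, fusion_head, gate,
                       max_rounds=12, conf_thresh=0.99):
    """Dynamic adaptive reasoning loop (illustrative; uses the sketch modules defined above)."""
    B, N, _ = E.shape
    score_hist = torch.zeros(B, N)
    V_t, prev_assoc = V, torch.zeros(B)
    box = conf = None
    for t in range(1, max_rounds + 1):
        E_t, w_t, score_hist = word_attn(E, V_t, score_hist)   # Step 2-2
        V_t, box, conf = fusion_head(E_t, V_t)                 # Steps 2-3 / 2-4
        assoc = w_t.mean(dim=1)                                # stand-in for Score_t
        r_inst = instant_reward(assoc, prev_assoc)             # Step 3-3
        # Step 3-2: during training the final reward would also use the ground-truth
        # box; Step 3-4: the total reward score would be fed back into the word
        # attention of the next round.
        action, _ = gate(V_t, e_cls)                           # Step 3-1
        stop = (r_inst > 0) & (conf >= conf_thresh)            # Step 3-5 stopping rule
        if bool(stop.all()) or int(action[0]) == 1:
            break
        prev_assoc = assoc
    return box, conf, t
```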
  • Preferably, d is 768.
  • The present invention uses an innovative and efficient dynamic adaptive reasoning method for referring target detection and positioning. Unlike previous models, it uses image and language information directly to fuse, reason and predict bounding boxes dynamically and continuously; it does not require a two-stage pipeline that first generates a series of candidate boxes for the image, and it resolves the insufficient reasoning or redundant computation caused by the fixed number of inference steps in existing one-stage methods, thereby achieving higher accuracy and faster running speed. Experimental results show that the model architecture of the present invention makes marked progress over previous models in both precision and speed.
  • Figure 1 is a structural diagram of the method of the present invention.
  • Figure 2 shows actual test results on three different pictures according to an embodiment of the present invention, where 1 marks the ground-truth box, 2 the result of the present method, and 3 the result of the best existing method.
  • Figure 3 shows the distribution of image attention at different reasoning steps according to an embodiment of the present invention.
  • The present invention provides a referring target detection and positioning method based on dynamic adaptive reasoning.
  • The method dynamically determines the number of reasoning steps from the text and image features, and is both faster and more accurate.
  • The system consists of four parts.
  • The first part is the feature encoding of the text and image information.
  • The second part is the attention-based multi-modal feature fusion.
  • The third part is the dynamic reasoning process that automatically determines the number of steps.
  • The pre-trained convolutional Darknet-53 is used to encode the image information.
  • The BERT pre-trained model is used to encode the text information.
  • An attention-based multi-modal fusion reasoning mechanism lets the visual features reason over the different words of the text to increase the weight of the key informative words, and lets the text enhance the target-region features of the visual information.
  • Reinforcement learning provides a dynamic reward mechanism that controls whether each reasoning step is correct and dynamically judges whether the current round of reasoning is sufficient; if it is not, the second part is iterated again, and reasoning stops once the image and text features have been fused sufficiently to yield the correct answer.
  • A referring target detection and positioning method based on dynamic adaptive reasoning includes the following steps:
  • Step 1: Feature encoding of the text and image information.
  • Step 1-1: The image is encoded by the Darknet-53 convolutional neural network (input size 256×256×3), and the whole-image feature vector is denoted V.
  • W and H are the width and height of the image, respectively.
  • v_k denotes the k-th image block region of the whole-image feature V.
  • Step 1-2: The BERT pre-trained model is used to encode the text features.
  • n denotes the word at the n-th position of the sentence.
  • e_n is the word vector of each word.
  • d is the dimension of the word vector.
  • Step 2: Multi-modal feature fusion based on an attention mechanism.
  • Step 2-1: E and V are input into the attention-based multi-modal feature fusion module.
  • The attention-based multi-modal feature fusion module comprises a visually guided language-text attention module and a text-guided visual fusion feature enhancement module; the fusion module is interleaved with the reasoning steps, and each reasoning step performs fusion reasoning while updating both parts of the module; at the t-th reasoning step, after the module is updated, it outputs the visual features V_t and the re-weighted text features.
  • Step 2-2: For the visually guided language-text attention module, an attention mechanism constructs a weight score for each word, and a historical accumulation term is introduced to avoid the model forgetting its historical reasoning scores; the scores are computed as follows:
  • where i indexes the accumulation over the first t-1 rounds; the weight term denotes the weight of the n-th word at the t-th reasoning round; the visual term is the average pooling of the visual feature vector V_{t-1} output by the previous round t-1; · denotes the dot product; the remaining matrices are distinct learnable model parameters; and the weights take values in the range 0-1.
  • The word vectors are then updated accordingly after the t-th round.
  • Step 2-3: For the text-guided visual fusion feature enhancement module, a multi-head self-attention module is used to fuse the language and image features so as to establish a deeper connection between language and image.
  • The visual vector is updated under the fusion control of the text vector; in this way the visual vector output by each reasoning round is strongly coupled to the text information, which guarantees the validity of the reasoning. The details are as follows:
  • Step 2-4: The final prediction output is t_x, t_y, t_w, t_h, conf, where t_x, t_y, t_w and t_h give the position of the predicted box in the image and conf is the model's confidence. The features obtained from the cross-modal attention module are spliced together; after cross-modal fusion, only the output V_t of the visual branch is relied upon.
  • Step 3: Reasoning with a dynamic reward mechanism.
  • Step 3-1: The above fusion reasoning steps are iterated over multiple rounds. For different texts and images, a dynamic reward module decides, from the current visual-text state at round t, whether to continue reasoning.
  • actions_prob is computed from the text vector and the visual vector of the current reasoning round and predicts the probability of continuing; e_cls denotes the CLS head vector produced by the BERT encoder.
  • Step 3-2: The final reward is computed from the difference between this round's inference result and the ground-truth box, i.e. from the candidate box O generated in this round, and is defined as follows:
  • IoU is computed during training between the candidate box O of this round's final inference and the ground-truth training target box (the dataset does not prescribe a fixed number of reasoning steps per sample).
  • At test time no ground-truth box is available, so IoU is fixed at 1.
  • Step 3-3: The instant reward encourages positive influence during the training-time reasoning process.
  • The fusion module above ties the weight of each word more closely to the features of the different visual regions; the instant reward therefore assigns a reward score for a correct association:
  • Score_t is the association score between the visual vector and the text-description word vectors at round t.
  • Step 3-4: To train the dynamic adaptive reasoning globally, the total score of the dynamic reward module at round t is computed as follows:
  • Step 3-5: CrossEntropyLoss is used as the training loss, computed from the difference between the predicted box and the ground-truth box for each region of the image; the dynamic reward module decides from the post-inference visual features whether to continue, and reasoning stops when both the final reward and the instant reward are positive activations and the confidence of the predicted box is 1.
  • The maximum sentence length is set to 20 words.
  • The position-encoded word vectors of the text description are fed into the BERT network to obtain the fused sentence feature vectors e ∈ R^d, where e is the representation of each word, d is the word-vector dimension (768), and N is at most 20.
  • The image features are expanded to 512 dimensions, the language features are likewise expanded through an MLP network to 20×512 and given positional encodings, and both are then fed into the multi-modal attention module together.
  • This module consists of two parts: the visual features enhance the inference weights of the different words in the text, and the text enhances the regional features of the visual information.
  • The text branch mainly computes the weight w_n of the word at each position, using formula (1) of the attention module.
  • The initial instant-reward score is 1.
  • The visual branch uses the weighted text word vectors, concatenates the language features and the image features enhanced in the previous stage, and feeds them into the multi-head self-attention layers.
  • the number of multi-head self-attention layers is 6, and the number of attention heads is 8.
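  • One way to read this input preparation: the flattened image features are projected to 512 dimensions, the 20 BERT word vectors are mapped by an MLP to the same 20×512 shape, positional encodings are added to both, and the two sequences are then handed to the 6-layer, 8-head attention module. The learned (rather than sinusoidal) positional embeddings and the projection sizes other than 512 are assumptions.

```python
import torch
import torch.nn as nn


class FusionInputPrep(nn.Module):
    """Projects image and text features to a shared 512-d space and adds positional encodings (sketch)."""

    def __init__(self, img_dim=1024, txt_dim=768, dim=512, num_regions=64, num_words=20):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_mlp = nn.Sequential(nn.Linear(txt_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.img_pos = nn.Parameter(torch.zeros(1, num_regions, dim))  # assumed learned pos. enc.
        self.txt_pos = nn.Parameter(torch.zeros(1, num_words, dim))

    def forward(self, V_raw, E_raw):
        # V_raw: (B, num_regions, img_dim) flattened image features
        # E_raw: (B, num_words, txt_dim) BERT word vectors
        V = self.img_proj(V_raw) + self.img_pos   # image features expanded to 512-d
        E = self.txt_mlp(E_raw) + self.txt_pos    # language features as a 20x512 matrix
        return V, E
```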
  • The gating function combines the two reward functions and the visual candidate box; it is activated only when both rewards equal 1 and the confidence of the predicted box is 1.
  • The entire training process is end to end, using four datasets, RefCOCO, RefCOCO+, RefCOCOg and ReferReasoning, for model training and evaluation.
  • The batch size is set to 8 and the initial learning rate to 1e-4.
  • The model is trained for 100 epochs on 8 TitanX GPUs.
  • The learning rate is halved every 10 epochs, and the Adam optimizer is used for gradient descent.
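  • The training recipe reported here (batch size 8, initial learning rate 1e-4, Adam, 100 epochs on 8 GPUs, learning rate halved every 10 epochs, CrossEntropyLoss) maps naturally onto a standard PyTorch setup such as the sketch below; the StepLR scheduler and the placeholder model/train_loader interfaces are illustrative assumptions, not the patent's code.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR


def train(model, train_loader, epochs=100, device="cuda"):
    """End-to-end training sketch matching the reported hyper-parameters."""
    model.to(device)
    optimizer = Adam(model.parameters(), lr=1e-4)            # initial learning rate 1e-4
    scheduler = StepLR(optimizer, step_size=10, gamma=0.5)   # halve the LR every 10 epochs
    criterion = torch.nn.CrossEntropyLoss()                  # training loss named in Step 3-5
    for epoch in range(epochs):                              # 100 epochs in the embodiment
        for images, texts, targets in train_loader:          # batches of 8 per the embodiment
            optimizer.zero_grad()
            logits = model(images.to(device), texts)         # per-region logits (assumed interface)
            loss = criterion(logits, targets.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()
```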
  • The text corresponding to the leftmost image in Figure 2 is "A gray laptop on the desk with its page open", and the text corresponding to the middle image is "A bear in the woods, with one cub on a rock and one cub climbing a tree".
  • The text corresponding to the image on the right is "The sandwich with lettuce hanging over the upper-left corner of the bread".
  • The experimental results show that referring target detection and positioning based on dynamic adaptive reasoning can efficiently give the accurate position in the image of the region described by the sentence.
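  • For completeness, the application flow described in the embodiment (resize the input image to 256×256, normalize it, tokenize the sentence, run a forward pass with frozen weights, and read off the predicted box once the adaptive reasoning stops) could look roughly as follows; the model interface refers to the sketches above, and the ImageNet normalization constants are an assumption.

```python
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),                        # resize to 256x256 as in the embodiment
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],      # assumed ImageNet normalization
                         std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def locate(model, image_path, sentence):
    """Returns the predicted box (t_x, t_y, t_w, t_h), confidence and rounds used."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    box, conf, rounds = model(image, sentence)            # forward pass with frozen parameters
    return box, conf, rounds
```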

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a referring target detection and positioning method based on dynamic adaptive reasoning. A convolutional DarkNet pre-trained model is used for the image and a BERT pre-trained model for the text, so as to extract visual and linguistic representations respectively; image and text features are fused with a multi-modal attention mechanism; and finally dynamic adaptive reasoning is performed with a reinforcement-learning reward mechanism, thereby detecting and positioning the location of the referred target in the image. The present invention obtains higher accuracy and faster running speed, making marked progress over previous models in both precision and speed.

Description

A referring target detection and positioning method based on dynamic adaptive reasoning
Technical field
The invention belongs to the field of multi-modal vision-and-language technology, and specifically relates to a method for detecting and locating a referred target.
Background
Referring target detection and positioning locates a target region in an image from a natural-language description: given an image and an accompanying text description, the machine is expected to fuse and reason over the multi-modal information of language and image and automatically determine the image region that the description refers to. This requires the machine to comprehensively understand complex natural-language semantics and visual scene information, and to mine the implicit multi-modal semantic couplings through multi-step reasoning. It is one of the basic research problems on the way to machine intelligence and has a wide range of application scenarios, such as autonomous robot navigation: a home robot must first automatically find and locate the target region in the visual scene from textual information such as a command before it can perform any further action. It can also be applied to other vision-language multi-modal tasks such as visual question answering and visual dialogue. Referring target detection and positioning is therefore a fundamental building block of machine intelligence; it has great practical and commercial value and has attracted considerable attention from academia and industry in recent years.
Most early referring target detection and positioning methods adopted a two-stage approach: a target detector first extracts a set of candidate regions, and the region with the highest probability is then selected from the candidates as the final answer. It was later found that two-stage methods are limited by the first stage: if the target is not recognized in the first stage, the second stage is useless. In addition, in terms of time complexity, the candidate regions bring a large amount of redundant feature computation, making the computational cost considerable. In recent years researchers have proposed one-stage methods that directly extract global image features and then perform multi-step fusion and reasoning with the text information to determine the specific region in the image. However, because the text description varies in length (it can be a word, a phrase or even a long passage), the number of reasoning steps required is not fixed. In fact, the implicit text-image coupling varies in strength, and so does the number of reasoning steps required: some complex cases need more than 10 steps, while simpler ones need only 3-5. Existing one-stage methods all use a fixed number of inference steps, which yields redundant steps and higher time complexity for short texts, while for long texts the number of steps may be insufficient to fully determine the final target region, producing wrong results.
Summary of the invention
To overcome the shortcomings of the prior art, the present invention provides a referring target detection and positioning method based on dynamic adaptive reasoning: a convolutional DarkNet pre-trained model extracts the image representation and a BERT pre-trained model extracts the language representation, a multi-modal fusion attention mechanism fuses the image and text features, and a reinforcement-learning reward mechanism finally drives dynamic adaptive reasoning to detect and locate the position of the referred target in the image. The invention achieves higher accuracy and faster running speed, making marked progress over previous models in both precision and speed.
The technical solution adopted by the present invention to solve this technical problem comprises the following steps:
Step 1: Feature encoding of the text and image information.
Step 1-1: The image is encoded by the Darknet-53 convolutional neural network (input size 256×256×3), and the whole-image feature vector is denoted V, where W and H are the width and height of the image respectively and v_k denotes the k-th image block region of the whole-image feature V.
Step 1-2: The BERT pre-trained model encodes the text features.
A text description consisting of N words is encoded into word vectors e_n ∈ R^d, where n indexes the n-th word of the sentence, e_n is the word vector of that word, and d is the word-vector dimension.
Step 2: Multi-modal feature fusion based on an attention mechanism.
Step 2-1: E and V are input into the attention-based multi-modal feature fusion module.
The attention-based multi-modal feature fusion module comprises a visually guided language-text attention module and a text-guided visual fusion feature enhancement module. At the t-th reasoning step, after the module is updated, it outputs the visual features V_t and the re-weighted text features.
Step 2-2: For the visually guided language-text attention module, an attention mechanism constructs a weight score for each word, and a historical accumulation term is introduced; the scores are computed as follows:
where i indexes the accumulation over the first t-1 rounds; the weight term denotes the weight of the n-th word at the t-th reasoning round; the visual term is the average pooling of the visual feature vector V_{t-1} output by the previous round t-1; · denotes the dot product; the remaining matrices are distinct learnable model parameters; and the weights take values in the range 0-1.
The word vectors after the t-th round are therefore updated as:
Step 2-3: For the text-guided visual fusion feature enhancement module, a multi-head self-attention module fuses the language and image features.
The basic transformer structure is used with a 6-layer, 8-head configuration, as follows:
where the first term denotes the weighted text word vectors, [:] denotes the concat operation, ConvBNReLU denotes convolution, BatchNorm and ReLU activation, and Resize denotes the resizing operation.
Step 2-4: The final prediction output is t_x, t_y, t_w, t_h, conf, where t_x, t_y, t_w and t_h give the position of the predicted box in the image and conf is the model's confidence.
Step 3: Reasoning with a dynamic reward mechanism.
Step 3-1: For different texts and images, a dynamic reward module decides, from the current visual-text state at round t, whether to continue reasoning.
The visual and text vectors at round t are computed as follows:
where action is the most likely action under actions_prob, i.e. either continue reasoning or stop reasoning; actions_prob is computed from the text vector and the visual vector of the current reasoning round and predicts the probability of continuing; and e_cls denotes the CLS head vector produced by the BERT encoder.
Two reinforcement-learning reward mechanisms are used: a final reward and an instant reward.
Step 3-2: The final reward is computed from the difference between this round's inference result and the ground-truth box, i.e. from the candidate box O generated in this round, and is defined as follows:
where IoU is computed during training between the candidate box O of this round's final inference and the ground-truth training target box.
Step 3-3: The instant reward assigns a reward score for a correct association:
where Score_t is the association score between the visual vector and the text-description word vectors at round t; it indicates whether the degree of multi-modal association has improved: the reward is 1 if the t-th fusion step yields a positive improvement, and a penalty is applied otherwise.
Step 3-4: To train the dynamic adaptive reasoning globally, the total score of the dynamic reward module at round t is computed as follows:
The reward weight weight_t is fed into the attention module of the language-text branch of the next round's fusion reasoning module for the next reasoning step.
Step 3-5: CrossEntropyLoss is used as the training loss, computed from the difference between the predicted box and the ground-truth box for each region of the image. The dynamic reward module decides from the post-inference visual features whether to continue, and reasoning stops when both the final reward and the instant reward are positive activations and the confidence of the predicted box is 1.
Preferably, d is 768.
The beneficial effects of the present invention are as follows:
The present invention uses an innovative and efficient dynamic adaptive reasoning method for referring target detection and positioning. Unlike previous models, it uses image and language information directly to fuse, reason and predict bounding boxes dynamically and continuously; it does not require a two-stage pipeline that first generates a series of candidate boxes for the image, and it resolves the insufficient reasoning or redundant computation caused by the fixed number of inference steps in existing one-stage methods, thereby achieving higher accuracy and faster running speed. Experimental results show that the model architecture of the present invention makes marked progress over previous models in both precision and speed.
Brief description of the drawings
Figure 1 is a structural diagram of the method of the present invention.
Figure 2 shows actual test results on three different pictures according to an embodiment of the present invention, where 1 marks the ground-truth box, 2 the result of the present method, and 3 the result of the best existing method.
Figure 3 shows the distribution of image attention at different reasoning steps according to an embodiment of the present invention.
Detailed description
The present invention is further described below in conjunction with the drawings and embodiments.
As shown in Figure 1, the present invention provides a referring target detection and positioning method based on dynamic adaptive reasoning. The method dynamically determines the number of reasoning steps from the text and image features, and is both faster and more accurate.
Technical solution of the present invention: the system consists of four parts. The first part is the feature encoding of the text and image information, the second part is the attention-based multi-modal feature fusion, and the third part is the dynamic reasoning process that automatically determines the number of steps. In the first part, a pre-trained convolutional Darknet-53 encodes the image information and the BERT pre-trained model encodes the text information. In the second part, an attention-based multi-modal fusion reasoning mechanism lets the visual features reason over the different words of the text to increase the weight of the key informative words, and lets the text enhance the target-region features of the visual information. In the third part, reinforcement learning provides a dynamic reward mechanism that controls whether each reasoning step is correct and dynamically judges whether the current round of reasoning is sufficient; if it is not, the second part is iterated again, and reasoning stops once the image and text features have been fused sufficiently to yield the correct answer.
A referring target detection and positioning method based on dynamic adaptive reasoning comprises the following steps:
Step 1: Feature encoding of the text and image information.
Step 1-1: The image is encoded by the Darknet-53 convolutional neural network (input size 256×256×3), and the whole-image feature vector is denoted V, where W and H are the width and height of the image respectively and v_k denotes the k-th image block region of the whole-image feature V.
Step 1-2: The BERT pre-trained model encodes the text features.
A text description consisting of N words is encoded into word vectors e_n ∈ R^d, where n indexes the n-th word of the sentence, e_n is the word vector of that word, and d is the word-vector dimension.
Step 2: Multi-modal feature fusion based on an attention mechanism.
Step 2-1: E and V are input into the attention-based multi-modal feature fusion module.
The attention-based multi-modal feature fusion module comprises a visually guided language-text attention module and a text-guided visual fusion feature enhancement module. The fusion module is interleaved with the reasoning steps: each reasoning step performs fusion reasoning while updating both parts of the module. At the t-th reasoning step, after the module is updated, it outputs the visual features V_t and the re-weighted text features.
Step 2-2: For the visually guided language-text attention module, an attention mechanism constructs a weight score for each word, and a historical accumulation term is introduced to avoid the model forgetting its historical reasoning scores; the scores are computed as follows:
where i indexes the accumulation over the first t-1 rounds; the weight term denotes the weight of the n-th word at the t-th reasoning round; the visual term is the average pooling of the visual feature vector V_{t-1} output by the previous round t-1; · denotes the dot product; the remaining matrices are distinct learnable model parameters; and the weights take values in the range 0-1.
The word vectors after the t-th round are therefore updated as:
Step 2-3: For the text-guided visual fusion feature enhancement module, a multi-head self-attention module fuses the language and image features so as to establish a deeper connection between language and image.
The basic transformer structure is used with a 6-layer, 8-head configuration, and the visual vector is updated under the fusion control of the text vector. In this way the visual vector output by each reasoning round is strongly coupled to the text information, which guarantees the validity of the reasoning. The details are as follows:
where the first term denotes the weighted text word vectors and [:] denotes the concat operation.
Step 2-4: The final prediction output is t_x, t_y, t_w, t_h, conf, where t_x, t_y, t_w and t_h give the position of the predicted box in the image and conf is the model's confidence. The features obtained from the cross-modal attention module are spliced together; after cross-modal fusion, only the output V_t of the visual branch is relied upon.
Step 3: Reasoning with a dynamic reward mechanism.
Step 3-1: The above fusion reasoning steps are iterated over multiple rounds. For different texts and images, a dynamic reward module decides, from the current visual-text state at round t, whether to continue reasoning.
The visual and text vectors at round t are computed as follows:
where action determines whether to continue reasoning; actions_prob is computed from the text vector and the visual vector of the current reasoning round and predicts the probability of continuing; and e_cls denotes the CLS head vector produced by the BERT encoder.
Two reinforcement-learning reward mechanisms are used: a final reward and an instant reward.
Step 3-2: The final reward is computed from the difference between this round's inference result and the ground-truth box, i.e. from the candidate box O generated in this round, and is defined as follows:
where IoU is computed during training between the candidate box O of this round's final inference and the ground-truth training target box (the dataset does not prescribe a fixed number of reasoning steps per sample); at test time no ground-truth box is available, so IoU is fixed at 1.
Step 3-3: The instant reward encourages positive influence during the training-time reasoning process; the fusion module above ties the weight of each word more closely to the features of the different visual regions. The instant reward therefore assigns a reward score for a correct association:
where Score_t is the association score between the visual vector and the text-description word vectors at round t; it indicates whether the degree of multi-modal association has improved: the reward is 1 if the t-th fusion step yields a positive improvement, and a penalty is applied otherwise.
Step 3-4: To train the dynamic adaptive reasoning globally, the total score of the dynamic reward module at round t is computed as follows:
The reward weight weight_t is fed into the attention module of the language-text branch of the next round's fusion reasoning module for the next reasoning step.
Step 3-5: CrossEntropyLoss is used as the training loss, computed from the difference between the predicted box and the ground-truth box for each region of the image. The dynamic reward module decides from the post-inference visual features whether to continue, and reasoning stops when both the final reward and the instant reward are positive activations and the confidence of the predicted box is 1.
Specific embodiment:
1. Image features
Given a picture of a natural scene, the whole image is resized to 256×256×3 and fed into the pre-trained feature-extraction network Darknet-53 to encode the image features.
2. Text features
The maximum sentence length is set to 20 words. The position-encoded word vectors of the text description are fed into the BERT network to obtain the fused sentence feature vectors e ∈ R^d, where e is the representation of each word, d is the word-vector dimension (768), and N is at most 20.
3. Multi-modal feature fusion enhancement with the attention mechanism
The image features are expanded to 512 dimensions, the language features are likewise expanded through an MLP network to 20×512 and given positional encodings, and both are then fed into the multi-modal attention module together. This module consists of two parts: the visual features enhance the inference weights of the different words in the text, and the text enhances the regional features of the visual information. The text branch mainly computes the weight w_n of the word at each position, using formula (1) of the attention module. The initial instant-reward score is 1. The visual branch uses the weighted text word vectors, concatenates the language features and the image features enhanced in the previous stage, and feeds them into the multi-head self-attention layers. The number of multi-head self-attention layers is 6 and the number of attention heads is 8.
4. Dynamic reasoning steps under the reward mechanism
Given the fused features, the visual feature part and the weighted text part are selected; the instant-reward and final-reward mechanisms compute weight scores that are returned to the text-weight reasoning part. The gating function combines the two reward functions and the visual candidate box, and is activated only when both rewards equal 1 and the confidence of the predicted box is 1.
5. Model training
The entire training process is end to end, using four datasets, RefCOCO, RefCOCO+, RefCOCOg and ReferReasoning, for model training and evaluation. The batch size is set to 8 and the initial learning rate to 1e-4. The model is trained for 100 epochs on 8 TitanX GPUs, the learning rate is halved every 10 epochs, and the Adam optimizer is used for gradient descent.
6. Model application
After the above training process a model is saved at every step, and the best one (the one performing best on the test set) is selected for application. For an input image and sentence, the image only needs to be resized to 256×256 and normalized, and the sentence tokenized, to form the model input. The parameters of the whole network are kept fixed; the image data and language data are simply fed forward. The model automatically performs dynamic reasoning according to the strength of the implicit text-image coupling and finally produces the prediction after an appropriate amount of reasoning. Actual experimental results are shown in Figures 2 and 3. The text for the leftmost image in Figure 2 is "A gray laptop on the desk with its page open", the text for the middle image is "A bear in the woods, with one cub on a rock and one cub climbing a tree", and the text for the right image is "The sandwich with lettuce hanging over the upper-left corner of the bread". The experimental results show that referring target detection and positioning based on dynamic adaptive reasoning can efficiently give the accurate position in the image of the region described by the sentence.

Claims (2)

  1. 一种基于动态自适应推理的指称目标检测定位方法,其特征在于,包括如下步骤:A reference target detection and positioning method based on dynamic adaptive reasoning, which is characterized by including the following steps:
    步骤1:对文本和图像信息的特征编码;Step 1: Encoding features of text and image information;
    步骤1-1:图像经过Darknet-53卷积神经网络编码得到256*256*3维度,整个图像特征向量并记为V,其中W和H分别是图像的宽和高,vk是指整个图像特征V的第k个图像块区域;Step 1-1: The image is encoded by Darknet-53 convolutional neural network to obtain 256*256*3 dimensions, and the entire image feature vector is recorded as V. Where W and H are the width and height of the image respectively, v k refers to the k-th image block area of the entire image feature V;
    步骤1-2:使用BERT预训练模型进行文本特征编码;Step 1-2: Use the BERT pre-trained model to encode text features;
    对于由N个单词组成的文本语言描述,则经过编码后变成en∈Rd,其中n代表句子中第n个位置的词,en为每个单词的词向量,d为词向量的维度;For a textual language description consisting of N words, after encoding, it becomes e n ∈R d , where n represents the word at the nth position in the sentence, e n is the word vector of each word, and d is the dimension of the word vector;
    步骤2:基于注意力机制的多模态特征融合;Step 2: Multi-modal feature fusion based on attention mechanism;
    步骤2-1:将E和V输入到基于注意力机制的多模态特征融合模块中;Step 2-1: Input E and V into the multi-modal feature fusion module based on the attention mechanism;
    基于注意力机制的多模态特征融合模块包括视觉控制下的的语言文本注意力模块和文本控制下的视觉融合特征强化模块;第t次推理时,在更新多模态特征融合模块后,多模态特征融合模块分别输出Vt The multi-modal feature fusion module based on the attention mechanism includes a language text attention module under visual control and a visual fusion feature enhancement module under text control; during the t-th reasoning, after updating the multi-modal feature fusion module, the multi-modal feature fusion module The modal feature fusion module outputs V t and
    步骤2-2:对于视觉控制下的的语言文本注意力模块,采用注意力机制构建一个权重分数给每个单词,并引入历史累计计算分数计算如下:

    Step 2-2: For the language text attention module under visual control, use the attention mechanism to construct a weight score Calculate the score for each word and introduce the historical cumulative and The calculation is as follows:

    其中,i则是遍历前t-1次的累积计算,是指第n个词在第t轮推理下的权重,是指上一轮t-1轮推理输出的视觉特征向量Vt-1的平均池化,·是指点积运算,为模型不同的学习参数;的取值在0-1范围;Among them, i is the cumulative calculation of the first t-1 traversals, refers to the weight of the nth word in the tth round of reasoning, refers to the average pooling of the visual feature vector V t-1 output by the previous round t-1 round of inference, · refers to the dot product operation, and Different learning parameters for the model; and The value is in the range of 0-1;
    因此词向量经过第t轮更新为:
    Therefore, the word vector is updated after the tth round as:
    步骤2-3:对于文本控制下的视觉融合特征强化模块,采用多头自注意力模块对语言及图像特征进行融合;Step 2-3: For the visual fusion feature enhancement module under text control, a multi-head self-attention module is used to fuse language and image features;
    使用transformer基本结构,采用6-layer-8-head层,具体如下:
    Use the basic structure of transformer and adopt 6-layer-8-head layer, as follows:
    其中,是指被权重计算过后的文本词向量,[:]是指concat操作;ConvBNReLU指卷积、BatchNormalize和ReLU激活函数操作;Resize指更改大小操作;in, Refers to the text word vector after weight calculation, [:] refers to the concat operation; ConvBNReLU refers to the convolution, BatchNormalize and ReLU activation function operations; Resize refers to the change size operation;
    步骤2-4:最终预测输出为tx,ty,tw,th,conf,其中tx,ty,tw,th为图像中预测框的位置信息,conf为模型的自信度;Step 2-4: The final prediction output is t x , t y , t w , t h , conf, where t x , t y , t w , and t h are the position information of the prediction box in the image, and conf is the confidence of the model. ;
    步骤3:采用动态奖励机制进行推理; Step 3: Use a dynamic reward mechanism for reasoning;
    步骤3-1:针对不同的文本和图像,提出了动态奖励模块,根据第t轮中的视觉-文本向量现状,决定是否继续推理;Step 3-1: For different texts and images, a dynamic reward module is proposed to decide whether to continue reasoning based on the current status of the visual-text vector in round t;
    第t轮中的视觉和文本向量计算如下:
    The visual and text vectors in round t are calculated as follows:
    其中,action是actions_prob中可能性最高的动作,取继续推理或和停止推理,actions_prob是根据本轮推理的文本向量和视觉向量计算得到预测继续推理的可能性;ecls是指BERT编码后的头向量CLS;Among them, action is the most likely action in actions_prob, which is to continue reasoning or stop reasoning. actions_prob is calculated based on the text vector and visual vector of this round of reasoning to predict the possibility of continuing reasoning; e cls refers to the header after BERT encoding vectorCLS;
    使用两种强化学习奖励机制,分别为最终奖励和即时奖励;Use two reinforcement learning reward mechanisms, namely final reward and immediate reward;
    步骤3-2:最终奖励是根据计算本轮推理结果与真实框之间差异得出的奖励值,即根据本轮产生的候选框O计算,定义如下:
    Step 3-2: The final reward is the reward value calculated based on the difference between the inference result of this round and the real box, that is, calculated based on the candidate box O generated in this round, and is defined as follows:
    其中,IoU是指训练时计算本轮最终推理中的候选框O和真实训练目标框的差值;Among them, IoU refers to the difference between the candidate box O in the final reasoning of this round and the real training target box calculated during training;
    步骤3-3:即时奖励计算正确关联下的奖励分数:
    Step 3-3: Instant Reward Calculate Reward Points for Correct Association:
    其中,Scoret是计算第t轮下视觉向量和文本描述词向量之间的关联分数,最终代表多模态的关联程度是否提升,如果第t步融合推理导致逐步提升正向影响,则为1;反之,则产生惩罚机制;Among them, Score t is the correlation score between the visual vector and the text description word vector calculated in the tth round. Finally, Represents whether the degree of multi-modal correlation is improved. If the t-th step of fusion reasoning leads to a gradual increase in positive influence, it is 1; otherwise, a penalty mechanism is generated;
    Step 3-4: to train the dynamic adaptive reasoning globally, the total score of the dynamic reward module in round t is computed as follows:
    The reward weight weight_t is fed into the attention module of the language-text part of the next round's fusion reasoning module for the next reasoning step;
    Step 3-5: CrossEntropyLoss is used as the training loss, obtained by computing the difference between the predicted box and the ground-truth box for each region of the image; the dynamic reward module judges from the visual features after reasoning whether to continue reasoning, and reasoning stops when the final reward and the immediate reward are both positively activated and the confidence of the predicted box is 1.
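    A minimal sketch of the adaptive stopping rule at inference time follows, assuming the round-t total score is the sum of the final and immediate rewards and that "confidence of 1" is checked against a near-1 threshold (the total-score formula appears as an image in the published text):

        # Minimal sketch; `rounds` is any callable: rounds(t) -> (box, conf, r_final, r_immediate).
        def adaptive_inference(rounds, max_rounds=5, conf_thresh=0.999):
            for t in range(1, max_rounds + 1):
                box, conf, r_final, r_immediate = rounds(t)
                total = r_final + r_immediate      # assumed round-t total score; would be fed
                                                   # back as weight_t to the next round's text
                                                   # attention (not shown here)
                if r_final > 0 and r_immediate > 0 and conf >= conf_thresh:
                    return box, t                  # both rewards positive, confidence ~ 1
            return box, t                          # fall back to the last round's prediction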
  2. The referring target detection and positioning method based on dynamic adaptive reasoning according to claim 1, wherein d is 768 dimensions.
PCT/CN2023/123906 2022-10-20 2023-10-11 Referring target detection and positioning method based on dynamic adaptive reasoning WO2024037664A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211286108.2 2022-10-20
CN202211286108.2A CN115661842A (en) 2022-10-20 2022-10-20 Dynamic adaptive inference-based nominal target detection and positioning method

Publications (1)

Publication Number Publication Date
WO2024037664A1 true WO2024037664A1 (en) 2024-02-22

Family

ID=84989042

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/123906 WO2024037664A1 (en) 2022-10-20 2023-10-11 Referring target detection and positioning method based on dynamic adaptive reasoning

Country Status (2)

Country Link
CN (1) CN115661842A (en)
WO (1) WO2024037664A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661842A (en) * 2022-10-20 2023-01-31 西北工业大学 Dynamic adaptive inference-based nominal target detection and positioning method
CN117196546B (en) * 2023-11-08 2024-07-09 杭州实在智能科技有限公司 RPA flow executing system and method based on page state understanding and large model driving

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241191A (en) * 2021-12-19 2022-03-25 西北工业大学 Cross-modal self-attention-based non-candidate-box expression understanding method
CN115062174A (en) * 2022-06-16 2022-09-16 电子科技大学 End-to-end image subtitle generating method based on semantic prototype tree
CN115661842A (en) * 2022-10-20 2023-01-31 西北工业大学 Dynamic adaptive inference-based nominal target detection and positioning method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG ZHIPENG, WEI ZHIMIN, HUANG ZHONGZHEN, NIU RUI, WANG PENG: "One for all: One-stage referring expression comprehension with dynamic reasoning", NEUROCOMPUTING, ELSEVIER, AMSTERDAM, NL, vol. 518, 27 October 2022 (2022-10-27), AMSTERDAM, NL, pages 523 - 532, XP093140789, ISSN: 0925-2312, DOI: 10.1016/j.neucom.2022.10.022 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117875407A (en) * 2024-03-11 2024-04-12 中国兵器装备集团自动化研究所有限公司 Multi-mode continuous learning method, device, equipment and storage medium
CN117875407B (en) * 2024-03-11 2024-06-04 中国兵器装备集团自动化研究所有限公司 Multi-mode continuous learning method, device, equipment and storage medium
CN118314169A (en) * 2024-04-16 2024-07-09 华东师范大学 Visual target tracking method based on multi-mode large language model
CN118379592A (en) * 2024-06-21 2024-07-23 浙江核新同花顺网络信息股份有限公司 Clothing detection method, device and equipment for virtual person and readable storage medium
CN118429899A (en) * 2024-07-03 2024-08-02 杭州梯度安全服务有限公司 Zero-order learning intelligent early warning method based on multi-mode deep learning drive
CN118506107A (en) * 2024-07-17 2024-08-16 烟台大学 Robot classification detection method and system based on multi-mode and multi-task learning

Also Published As

Publication number Publication date
CN115661842A (en) 2023-01-31

Similar Documents

Publication Publication Date Title
WO2024037664A1 (en) Referring target detection and positioning method based on dynamic adaptive reasoning
KR102532749B1 (en) Method and apparatus for hierarchical learning of neural networks based on weak supervised learning
CN108416065B (en) Hierarchical neural network-based image-sentence description generation system and method
CN112560432B (en) Text emotion analysis method based on graph attention network
CN113705597B (en) Image processing method, device, computer equipment and readable storage medium
CN109299262A (en) A kind of text implication relation recognition methods for merging more granular informations
CN113158875A (en) Image-text emotion analysis method and system based on multi-mode interactive fusion network
CN112650886B (en) Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN111444968A (en) Image description generation method based on attention fusion
CN105718890A (en) Method for detecting specific videos based on convolution neural network
CN114882488B (en) Multisource remote sensing image information processing method based on deep learning and attention mechanism
CN112527966A (en) Network text emotion analysis method based on Bi-GRU neural network and self-attention mechanism
WO2024032010A1 (en) Transfer learning strategy-based real-time few-shot object detection method
CN117034961B (en) BERT-based medium-method inter-translation quality assessment method
CN114925232B (en) Cross-modal time domain video positioning method under text segment question-answering framework
CN116186250A (en) Multi-mode learning level mining method, system and medium under small sample condition
CN114780775B (en) Image description text generation method based on content selection and guiding mechanism
CN110727844A (en) Online commented commodity feature viewpoint extraction method based on generation countermeasure network
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN114241191A (en) Cross-modal self-attention-based non-candidate-box expression understanding method
CN114332288B (en) Method for generating text generation image of confrontation network based on phrase drive and network
CN116662591A (en) Robust visual question-answering model training method based on contrast learning
CN112269876A (en) Text classification method based on deep learning
KR102331803B1 (en) Vision and language navigation system
CN116151226B (en) Machine learning-based deaf-mute sign language error correction method, equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23854559

Country of ref document: EP

Kind code of ref document: A1