WO2024037664A1 - Referring target detection and positioning method based on dynamic adaptive reasoning - Google Patents
Referring target detection and positioning method based on dynamic adaptive reasoning Download PDFInfo
- Publication number
- WO2024037664A1 WO2024037664A1 PCT/CN2023/123906 CN2023123906W WO2024037664A1 WO 2024037664 A1 WO2024037664 A1 WO 2024037664A1 CN 2023123906 W CN2023123906 W CN 2023123906W WO 2024037664 A1 WO2024037664 A1 WO 2024037664A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- reasoning
- text
- reward
- round
- image
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 26
- 230000003044 adaptive effect Effects 0.000 title claims abstract description 14
- 238000001514 detection method Methods 0.000 title claims abstract description 12
- 230000004927 fusion Effects 0.000 claims abstract description 41
- 230000007246 mechanism Effects 0.000 claims abstract description 34
- 238000012549 training Methods 0.000 claims abstract description 20
- 238000013527 convolutional neural network Methods 0.000 claims abstract description 6
- 230000002787 reinforcement Effects 0.000 claims abstract description 6
- 239000013598 vector Substances 0.000 claims description 44
- 230000000007 visual effect Effects 0.000 claims description 44
- 238000004364 calculation method Methods 0.000 claims description 12
- 230000009471 action Effects 0.000 claims description 6
- 230000001186 cumulative effect Effects 0.000 claims description 6
- 230000004913 activation Effects 0.000 claims description 5
- 238000001994 activation Methods 0.000 claims description 5
- 230000006870 function Effects 0.000 claims description 4
- 238000011176 pooling Methods 0.000 claims description 3
- 230000008859 change Effects 0.000 claims description 2
- 238000004422 calculation algorithm Methods 0.000 abstract description 2
- 230000008569 process Effects 0.000 description 5
- 230000008878 coupling Effects 0.000 description 4
- 238000010168 coupling process Methods 0.000 description 4
- 238000005859 coupling reaction Methods 0.000 description 4
- 238000012360 testing method Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 2
- 241000208822 Lactuca Species 0.000 description 1
- 235000003228 Lactuca sativa Nutrition 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 235000008429 bread Nutrition 0.000 description 1
- 230000009194 climbing Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000007499 fusion processing Methods 0.000 description 1
- 238000012821 model calculation Methods 0.000 description 1
- 230000000644 propagated effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 239000011435 rock Substances 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/412—Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
Definitions
- the invention belongs to the field of multi-modal vision-language technology and specifically relates to a referring target detection and positioning method.
- Referring target detection and positioning is the task of locating a target region in an image from a natural language description. That is, given an image and a corresponding textual description, the machine is expected to perform fused reasoning over the multi-modal information of language and image and automatically determine the image region that the description refers to.
- This requires the machine to comprehensively understand complex natural-language semantics and visual scene information, and to mine the implicit multi-modal semantic coupling in depth through multi-step reasoning. It is one of the basic research problems in realizing machine intelligence within artificial intelligence and has a wide range of application scenarios, such as autonomous robot navigation: a home robot must first automatically find and locate the target region in the visual scene according to textual description information such as a command, and only then can it perform other actions.
- referring target detection and positioning is therefore a very important basic link in realizing machine intelligence; it has huge practical and commercial value and has attracted much attention in academia and industry in recent years.
- the textual description varies in length: it can be a word, a phrase or even a long passage, so the number of reasoning steps required is not fixed.
- the implicit coupling between text and image differs in strength, and so does the number of reasoning steps required.
- Some complex cases require more than 10 reasoning steps, while some are relatively simple and only require 3-5 steps.
- existing one-stage methods all use a fixed number of inference steps, which leads to redundant inference steps and increased time complexity for short texts; for long texts the number of inference steps may be insufficient, so the final target region is not fully determined and the result is wrong.
- the present invention provides a referring target detection and positioning method based on dynamic adaptive reasoning: a convolutional-neural-network-based DarkNet pre-trained model for images and a BERT pre-trained model for text extract the image and language representations respectively, a multi-modal fusion attention mechanism fuses the features of the image and text information, and finally a reinforcement-learning reward-mechanism algorithm performs dynamic adaptive reasoning to detect and locate the position of the referred target in the image.
- This invention achieves higher accuracy and running speed and makes outstanding progress over previous models in both precision and speed.
- Step 1 Encoding features of text and image information
- Step 1-1 The image is encoded by the Darknet-53 convolutional neural network from a 256*256*3 input, and the whole-image feature vector is denoted V.
- W and H are the width and height of the image respectively
- v k refers to the k-th image block area of the entire image feature V;
- Step 1-2 Use the BERT pre-trained model to encode text features
- n the word at the nth position in the sentence
- e n the word vector of each word
- d the dimension of the word vector
- Step 2 Multi-modal feature fusion based on attention mechanism
- Step 2-1 Input E and V into the multi-modal feature fusion module based on the attention mechanism
- the attention-based multi-modal feature fusion module includes a language-text attention module under visual control and a visual fusion feature enhancement module under text control; at the t-th reasoning step, after the module is updated, it outputs the updated visual features V_t and the updated text word vectors
- Step 2-2 For the language-text attention module under visual control, use the attention mechanism to construct a weight score for each word and introduce a historically accumulated score; the calculation is as follows:
- i indexes the accumulation over the first t-1 rounds; the weight of the n-th word in the t-th round of reasoning is computed from the average pooling of the visual feature vector V_{t-1} output by the previous (t-1)-th round of inference; · refers to the dot-product operation; the remaining terms are distinct learnable parameters of the model, and the scores take values in the range 0-1;
- the word vector is updated after the t-th round as:
- Step 2-3 For the visual fusion feature enhancement module under text control, use the multi-head self-attention module to fuse the language and image features;
- [:] refers to the concat operation
- ConvBNReLU refers to the convolution, BatchNormalize and ReLU activation function operations
- Resize refers to the resizing operation
- Step 2-4 The final prediction output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h give the position of the prediction box in the image and conf is the confidence of the model;
- Step 3 Use a dynamic reward mechanism for reasoning
- Step 3-1 For different texts and images, a dynamic reward module is proposed to decide whether to continue reasoning based on the current status of the visual-text vector in round t;
- action is the action with the highest probability in actions_prob, i.e. either continue reasoning or stop reasoning.
- actions_prob is calculated from the text vector and the visual vector of this round of reasoning and predicts the probability of continuing to reason; e_cls refers to the CLS head vector produced by the BERT encoder;
- Step 3-2 The final reward is the reward value calculated from the difference between this round's inference result and the ground-truth box, i.e. it is calculated from the candidate box O generated in this round, and is defined as follows:
- IoU refers to the value computed during training between the candidate box O of this round's final reasoning and the ground-truth training target box
- Step 3-3 The instant reward computes the reward score under correct association:
- Score t is the correlation score between the visual vector and the text description word vector calculated in the tth round.
- Step 3-4 In order to globally train dynamic adaptive reasoning, the total score of the t-th round dynamic reward module is calculated as follows:
- Step 3-5 Use CrossEntropyLoss as the training loss, obtained by calculating the difference between the predicted box and the ground-truth box for each region of the image; the dynamic reward module decides from the post-reasoning visual features whether to continue, and inference stops when both the final reward and the instant reward are positively activated and the confidence of the prediction box is 1.
- the dimension d is 768.
- the present invention uses an innovative and efficient dynamic adaptive reasoning method to achieve referring target detection and positioning. Unlike previous models, this model directly uses the image and language information to dynamically and continuously fuse, reason and predict the bounding box; it does not require a two-stage pipeline that first generates a series of candidate boxes for the picture, and it also resolves the insufficient reasoning or redundant computation caused by the fixed number of inference steps in existing one-stage methods, thereby achieving higher accuracy and running speed. Experimental results show that the model architecture of the present invention makes outstanding progress over previous models in both accuracy and speed.
- Figure 1 is a structural diagram of the method of the present invention.
- Figure 2 shows the actual test results on three different pictures according to the embodiment of the present invention, where 1 denotes the ground-truth box, 2 the result obtained by the method of the present invention, and 3 the result obtained by the best existing method.
- Figure 3 is a diagram showing the distribution effect of image attention under different reasoning steps according to the embodiment of the present invention.
- the present invention provides a reference target detection and positioning method based on dynamic adaptive reasoning.
- This method can dynamically determine the number of reasoning steps based on text and image features, with faster speed and higher accuracy.
- the system consists of three parts.
- the first part is the feature encoding process of text and image information
- the second part is the multi-modal feature fusion process based on the attention mechanism
- the third part is the dynamic reasoning process that automatically determines the number of reasoning steps.
- the pre-trained Darknet-53 based on the convolutional neural network is used to feature encode the image information
- the BERT pre-trained model is used to feature encode the text information.
- a multi-modal fusion reasoning mechanism based on the attention mechanism is used: the visual features reason over the different words in the text information to strengthen the weights of key information words, and the text information enhances the target-region features of the visual information.
- reinforcement learning is used to propose a dynamic reward mechanism that controls whether each step of reasoning is correct and dynamically judges whether the current round of reasoning is sufficient; if it is not sufficient, the second part is iterated, and reasoning stops once the image and text features have been fully fused and the correct answer is obtained.
- a reference target detection and positioning method based on dynamic adaptive reasoning including the following steps:
- Step 1 Encoding features of text and image information
- Step 1-1 The image is encoded by the Darknet-53 convolutional neural network from a 256*256*3 input, and the whole-image feature vector is denoted V.
- W and H are the width and height of the image respectively
- v k refers to the k-th image block area of the entire image feature V;
- Step 1-2 Use the BERT pre-trained model to encode text features
- n the word at the nth position in the sentence
- e n the word vector of each word
- d the dimension of the word vector
- Step 2 Multi-modal feature fusion based on attention mechanism
- Step 2-1 Input E and V into the multi-modal feature fusion module based on the attention mechanism
- the attention-based multi-modal feature fusion module includes the language-text attention module under visual control and the visual fusion feature enhancement module under text control; the fusion module is interleaved with the reasoning steps, and each reasoning step updates the fusion reasoning of the two parts of the module; at the t-th reasoning step, after the module is updated, it outputs the updated visual features V_t and the updated text word vectors
- Step 2-2 For the language-text attention module under visual control, use the attention mechanism to construct a weight score for each word and introduce a historically accumulated score to avoid the model forgetting historical reasoning scores; the calculation is as follows:
- i indexes the accumulation over the first t-1 rounds; the weight of the n-th word in the t-th round of reasoning is computed from the average pooling of the visual feature vector V_{t-1} output by the previous (t-1)-th round of inference; · refers to the dot-product operation; the remaining terms are distinct learnable parameters of the model, and the scores take values in the range 0-1;
- the word vector is updated after the tth round as:
- Step 2-3 For the visual fusion feature enhancement module under text control, in order to establish a deeper connection between language and images, a multi-head self-attention module is used to fuse language and image features;
- the visual vector is updated under the fusion control of the text vector; in this way the visual vector output by each round of reasoning is strongly coupled with the text information, which guarantees the validity of the reasoning; the details are as follows:
- Step 2-4 The final prediction output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h give the position of the prediction box in the image and conf is the confidence of the model; the features obtained from the cross-modal attention module are spliced, and after cross-modal fusion only the output V_t of the visual part is relied upon;
- Step 3 Use a dynamic reward mechanism for reasoning
- Step 3-1 The above steps of fusion reasoning require multiple iterations. For different texts and images, a dynamic reward module is proposed to decide whether to continue reasoning based on the current status of the visual-text vector in round t;
- actions_prob is calculated based on the text vector and visual vector of this round of reasoning to predict the possibility of continuing reasoning; e cls refers to the head vector CLS encoded by BERT;
- Step 3-2 The final reward is the reward value calculated based on the difference between the inference result of this round and the real box, that is, calculated based on the candidate box O generated in this round, and is defined as follows:
- IoU refers to the value computed during training between the candidate box O of this round's final reasoning and the ground-truth training target box (the data set does not prescribe a fixed number of reasoning steps per sample).
- at test time the ground-truth box value is 0, so IoU is fixed at 1;
- Step 3-3 The instant reward is meant to encourage positive influences during training-time reasoning.
- the fusion module above is meant to tie each word's weight more closely to the features of different visual regions; the instant reward therefore computes the reward score under correct association:
- Score t is the correlation score between the visual vector and the text description word vector calculated in the tth round.
- Step 3-4 In order to globally train dynamic adaptive reasoning, the total score of the t-th round dynamic reward module is calculated as follows:
- Step 3-5 Use CrossEntropyLoss as the training loss, obtained by calculating the difference between the predicted box and the ground-truth box for each region of the image; the dynamic reward module decides from the post-reasoning visual features whether to continue, and inference stops when both the final reward and the instant reward are positively activated and the confidence of the prediction box is 1.
- the maximum sentence length is set to 20 words.
- the position-encoded word vectors of the text description are fed into the BERT network to obtain the feature vectors of the fused sentence, e ∈ R^d, where e is the representation of each word, d is the word-vector dimension (768), and the maximum value of N is 20.
- the image features are expanded to 512 dimensions, the language features are likewise expanded through an MLP network to 20x512 and given position encodings, and the two are then fed together into the multi-modal attention module.
- This module consists of two parts, the inference weight enhancement of visual features on different words in text information and the enhancement of regional features of text information on visual information.
- the text-information part mainly calculates the weight w_n of the words at different positions, using formula (1) of the attention module.
- the initial instant reward score is 1.
- the visual-information part uses the weighted text word vectors, splices the language features and the image features enhanced in the previous stage, and feeds them into the multi-head self-attention layers.
- the number of multi-head self-attention layers is 6, and the number of attention heads is 8.
- the gating function consists of two reward functions and a visual candidate frame. It is only activated when both incentives are equal to 1 and the confidence of the prediction frame is 1.
- the entire training process is end-to-end, using four datasets (RefCOCO, RefCOCO+, RefCOCOg, and ReferReasoning) as the benchmarks for model training and evaluation.
- the batch size is set to 8 and the initial learning rate to 1e-4.
- the model is trained on 8 TitanX GPUs for 100 epochs.
- the learning rate of training is halved every 10 epochs, and the Adam method is used for gradient descent.
- the text corresponding to the leftmost image in Figure 2 is "A gray laptop computer placed on the desktop with the page open", and the text corresponding to the middle image is "A bear in the woods with one cub on a rock and one cub climbing a tree".
- the text corresponding to the image on the right is "A sandwich with lettuce hanging over the upper-left corner of the bread".
- the experimental results show that referring target detection and positioning based on dynamic adaptive reasoning can efficiently give the accurate position in the image of the content described by the sentence.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computer Graphics (AREA)
- Geometry (AREA)
- Image Analysis (AREA)
Abstract
Disclosed in the present invention is a referring target detection and positioning method based on dynamic adaptive reasoning. A DarkNet pre-training model based on a convolutional neural network is used for an image and a BERT pre-training model is used for text, so as to respectively extract picture and language representations; feature fusion is performed on image and text information by using a multi-modal fusion attention mechanism; and finally dynamic adaptive reasoning is performed by using a reinforcement learning reward mechanism algorithm, thus detecting and positioning the location of a referring target in the image. By means of the present invention, a higher accuracy and a faster operation speed are obtained, and compared with previous models, the models in the present invention have made prominent progress in terms of precision and speed.
Description
The invention belongs to the field of multi-modal vision-language technology and specifically relates to a referring target detection and positioning method.
Referring target detection and positioning is the task of locating a target region in an image from a natural language description. That is, given an image and a corresponding textual description, the machine is expected to perform fused reasoning over the multi-modal information of language and image and to automatically determine the image region that the description refers to. This requires the machine to comprehensively understand complex natural-language semantics and visual scene information and to mine the implicit multi-modal semantic coupling in depth through multi-step reasoning. It is one of the basic research problems in realizing machine intelligence within artificial intelligence and has a wide range of application scenarios, such as autonomous robot navigation: a home robot must first automatically find and locate the target region in the visual scene according to textual description information such as a command, and only then can it perform other actions. It can also be applied to other vision-language multi-modal tasks such as visual question answering and visual dialogue. Referring target detection and positioning is therefore a very important basic link in realizing machine intelligence; it has huge practical and commercial value and has attracted much attention in academia and industry in recent years.
Most early referring target detection and positioning approaches adopted a two-stage method: they first rely on an object detector to extract a set of candidate regions and then select the region with the highest probability as the final answer. It was later found that two-stage methods are limited by the first stage: if the target is not recognized in the first stage, the second stage is useless. In addition, the candidate regions bring a large amount of redundant feature computation, making the computational cost considerable. In recent years researchers have proposed one-stage methods that directly extract global image features and then perform multi-step fusion and reasoning with the text information to determine the specific region in the image. However, the textual description varies in length: it can be a word, a phrase or even a long passage, so the number of reasoning steps required is not fixed. In fact, the implicit coupling between text and image differs in strength, and so does the number of reasoning steps required; some complex cases need more than 10 reasoning steps while some are relatively simple and only need 3-5 steps. Yet existing one-stage methods all use a fixed number of reasoning steps, which leads to redundant reasoning steps and increased time complexity for short texts, while for long texts the number of reasoning steps may be insufficient, the final target region cannot be fully determined, and the result is wrong.
Contents of the invention
In order to overcome the shortcomings of the existing technology, the present invention provides a referring target detection and positioning method based on dynamic adaptive reasoning: a convolutional-neural-network-based DarkNet pre-trained model for images and a BERT pre-trained model for text extract the image and language representations respectively, a multi-modal fusion attention mechanism fuses the features of the image and text information, and finally a reinforcement-learning reward-mechanism algorithm performs dynamic adaptive reasoning to detect and locate the position of the referred target in the image. The invention achieves higher accuracy and running speed and makes outstanding progress over previous models in both precision and speed.
The technical solution adopted by the present invention to solve its technical problem includes the following steps:
Step 1: Encode the features of the text and image information;
Step 1-1: The image is encoded by the Darknet-53 convolutional neural network from a 256*256*3 input, and the whole-image feature vector is denoted V, where W and H are the width and height of the image and v_k is the k-th image block region of the image feature V;
Step 1-2: Use the BERT pre-trained model to encode the text features;
For a text description consisting of N words, encoding yields e_n ∈ R^d, where n denotes the word at the n-th position in the sentence, e_n is the word vector of each word, and d is the dimension of the word vector;
Step 2: Multi-modal feature fusion based on the attention mechanism;
Step 2-1: Input E and V into the attention-based multi-modal feature fusion module;
The attention-based multi-modal feature fusion module includes a language-text attention module under visual control and a visual fusion feature enhancement module under text control; at the t-th reasoning step, after the module is updated, it outputs the updated visual features V_t and the updated text word vectors;
Step 2-2: For the language-text attention module under visual control, use the attention mechanism to construct a weight score for each word and introduce a historically accumulated score; the calculation is as follows:
Here i indexes the accumulation over the first t-1 rounds; the weight of the n-th word in the t-th round of reasoning is computed from the average pooling of the visual feature vector V_{t-1} output by the previous (t-1)-th round; · denotes the dot-product operation; the remaining terms are distinct learnable parameters of the model; and the scores take values in the range 0-1;
The word vector after the t-th round is therefore updated as:
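The weight formula itself appears only as an image in the source text; the following PyTorch-style sketch shows one plausible reading of this step, assuming a sigmoid-normalized dot-product score between each word vector and the average-pooled visual features of the previous round, accumulated across rounds. All module and parameter names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class VisuallyGuidedWordAttention(nn.Module):
    """Sketch of Step 2-2: score each word against the previous round's
    visual features and re-weight the word vectors (weights kept in 0-1)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj_text = nn.Linear(dim, dim)    # learnable parameters of the model
        self.proj_visual = nn.Linear(dim, dim)  # (names are illustrative)

    def forward(self, words, visual_prev, score_hist):
        # words:       (N, dim) word vectors of the current round
        # visual_prev: (K, dim) visual features V_{t-1} from the previous round
        # score_hist:  (N,)     scores accumulated over rounds 1..t-1
        v_pool = visual_prev.mean(dim=0)                       # average pooling of V_{t-1}
        score = torch.sigmoid(                                 # keep the score in 0-1
            (self.proj_text(words) * self.proj_visual(v_pool)).sum(dim=-1))
        score_hist = score_hist + score                        # historical accumulation
        weights = torch.sigmoid(score_hist)                    # w_n^t in 0-1 (assumed form)
        return weights.unsqueeze(-1) * words, weights, score_hist
```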
Step 2-3: For the visual fusion feature enhancement module under text control, use a multi-head self-attention module to fuse the language and image features;
The basic Transformer structure is used with a 6-layer, 8-head configuration, as follows:
Here the weighted text word vectors are used; [:] denotes the concat operation; ConvBNReLU denotes the convolution, BatchNormalize and ReLU activation operations; Resize denotes the resizing operation;
Step 2-4: The final prediction output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h give the position of the predicted box in the image and conf is the confidence of the model;
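As a hedged illustration of Steps 2-3 and 2-4, the sketch below concatenates the weighted word vectors with the visual features, runs a 6-layer, 8-head Transformer encoder, refines the visual map with a ConvBNReLU block, and predicts (t_x, t_y, t_w, t_h, conf) per region. The exact wiring in the patent's figures may differ; the dimensions and layer names here are assumptions.

```python
import torch
import torch.nn as nn

class TextControlledVisualFusion(nn.Module):
    """Sketch of Steps 2-3/2-4: multi-head self-attention fusion of language
    and image features, followed by a box/confidence prediction head."""

    def __init__(self, dim: int = 512, grid: int = 16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)   # 6-layer-8-head
        self.conv_bn_relu = nn.Sequential(                          # ConvBNReLU block
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        self.pred = nn.Conv2d(dim, 5, kernel_size=1)                # t_x, t_y, t_w, t_h, conf
        self.grid = grid

    def forward(self, words_t, visual):
        # words_t: (B, N, dim) weighted word vectors;  visual: (B, K, dim), K = grid*grid
        fused = self.encoder(torch.cat([words_t, visual], dim=1))   # concat then self-attention
        v_t = fused[:, words_t.size(1):, :]                         # keep only the visual part V_t
        v_map = v_t.transpose(1, 2).reshape(-1, v_t.size(-1), self.grid, self.grid)
        v_map = self.conv_bn_relu(v_map)                            # refine the visual map
        return v_t, self.pred(v_map)                                # (B, 5, grid, grid) prediction
```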
Step 3: Use a dynamic reward mechanism for reasoning;
Step 3-1: For different texts and images, a dynamic reward module is proposed that decides, from the state of the visual-text vectors in round t, whether to continue reasoning;
The visual and text vectors in round t are calculated as follows:
Here action is the most probable action in actions_prob, i.e. either continue reasoning or stop reasoning; actions_prob is computed from the text vector and the visual vector of the current round and predicts the probability of continuing to reason; e_cls denotes the CLS head vector produced by the BERT encoder;
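The action computation is only described verbally; a minimal sketch, assuming the round-t visual features are average-pooled, concatenated with the BERT CLS vector e_cls, and scored by a small two-way classifier over continue/stop:

```python
import torch
import torch.nn as nn

class DynamicRewardGate(nn.Module):
    """Sketch of Step 3-1: decide from the round-t visual/text state whether
    to run another reasoning round (the most probable action wins)."""

    def __init__(self, dim: int = 512, cls_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim + cls_dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, 2),                 # logits for [continue, stop]
        )

    def forward(self, v_t, e_cls):
        # v_t: (B, K, dim) visual features of round t;  e_cls: (B, cls_dim) BERT CLS vector
        state = torch.cat([v_t.mean(dim=1), e_cls], dim=-1)
        actions_prob = torch.softmax(self.head(state), dim=-1)
        action = actions_prob.argmax(dim=-1)   # 0 = continue reasoning, 1 = stop reasoning
        return action, actions_prob
```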
Two reinforcement learning reward mechanisms are used: the final reward and the instant reward;
Step 3-2: The final reward is the reward value computed from the difference between this round's inference result and the ground-truth box, i.e. it is computed from the candidate box O produced in this round and is defined as follows:
Here IoU is computed during training between the candidate box O of this round's final reasoning and the ground-truth training target box;
Step 3-3: The instant reward computes the reward score under correct association:
Here Score_t is the association score between the visual vector and the text word vectors computed in round t; the final value indicates whether the degree of multi-modal association has improved: if the fusion reasoning at step t yields a further positive influence it is 1, otherwise a penalty mechanism is applied;
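The two reward definitions are likewise reproduced as images in the source; the sketch below shows one consistent reading in which the final reward checks the IoU between the round's candidate box O and the ground-truth box, and the instant reward is positive when the visual-text association score improves over the previous round. The 0.5 IoU threshold and the cosine-similarity score are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def box_iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes given as plain number tuples."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def final_reward(candidate_box, gt_box, iou_thresh=0.5):
    """Step 3-2 (assumed form): positive reward when the candidate box O
    overlaps the ground-truth training box sufficiently, penalty otherwise."""
    return 1.0 if box_iou(candidate_box, gt_box) >= iou_thresh else -1.0

def instant_reward(v_t, words_t, prev_score):
    """Step 3-3 (assumed form): reward the round if the multi-modal
    association score improves over the previous round."""
    score_t = F.cosine_similarity(v_t.mean(dim=0, keepdim=True),
                                  words_t.mean(dim=0, keepdim=True)).item()
    return (1.0 if score_t >= prev_score else -1.0), score_t
```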
Step 3-4: To train the dynamic adaptive reasoning globally, the total score of the dynamic reward module in round t is calculated as follows:
The reward weight weight_t is fed into the attention module of the language-text part of the next round's fusion reasoning module for the next reasoning step;
Step 3-5: CrossEntropyLoss is used as the training loss and is obtained by computing the difference between the predicted box and the ground-truth box for each region of the image; the dynamic reward module decides from the post-reasoning visual features whether to continue, and reasoning stops when both the final reward and the instant reward are positively activated and the confidence of the predicted box is 1.
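Putting Steps 3-1 to 3-5 together, a hedged outline of the dynamic reasoning loop: each round accumulates a CrossEntropy loss against the ground-truth region, the round reward would feed the next round's word attention, and reasoning stops once both rewards are positive and the predicted confidence reaches 1. The fuse_round and gate callables stand for the fusion and decision sketches above; combining the two rewards by a simple sum is an assumption.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def reasoning_rounds(fuse_round, gate, words, visual, e_cls, gt_box, gt_region,
                     max_rounds: int = 12):
    """Hedged outline of Steps 3-4/3-5; fuse_round and gate are placeholders
    for the fusion and decision sketches above."""
    total_loss, prev_score = 0.0, float("-inf")
    for t in range(max_rounds):
        words, visual, box, region_logits, conf = fuse_round(words, visual)
        total_loss = total_loss + criterion(region_logits, gt_region)   # predicted vs. true region

        r_final = final_reward(box, gt_box)                             # Step 3-2 (sketch above)
        r_instant, prev_score = instant_reward(visual, words, prev_score)  # Step 3-3 (sketch above)
        weight_t = r_final + r_instant   # assumed total round score (Step 3-4); it would be fed
                                         # back into the next round's word-attention module (not shown)

        action, _ = gate(visual.unsqueeze(0), e_cls)                    # Step 3-1: continue or stop
        if action.item() == 1 and r_final > 0 and r_instant > 0 and conf >= 1.0:
            break                                                       # Step 3-5 stopping condition
    return total_loss
```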
Preferably, d is 768.
The beneficial effects of the present invention are as follows:
The present invention uses an innovative and efficient dynamic adaptive reasoning method to achieve referring target detection and positioning. Unlike previous models, the model directly uses the image and language information to fuse, reason and predict the bounding box in a dynamic, continuous way; it does not require a two-stage pipeline that first generates a series of candidate boxes for the picture, and it also resolves the insufficient reasoning or redundant computation caused by the fixed number of reasoning steps in existing one-stage methods, thereby achieving higher accuracy and running speed. Experimental results show that the model architecture of the present invention makes outstanding progress over previous models in both accuracy and speed.
Figure 1 is a structural diagram of the method of the present invention.
Figure 2 shows the actual test results on three different pictures according to the embodiment of the present invention, where 1 denotes the ground-truth box, 2 the result obtained by the method of the present invention, and 3 the result obtained by the best existing method.
Figure 3 shows the distribution of image attention under different reasoning steps according to the embodiment of the present invention.
The present invention is further described below in conjunction with the accompanying drawings and embodiments.
As shown in Figure 1, the present invention provides a referring target detection and positioning method based on dynamic adaptive reasoning; the method dynamically determines the number of reasoning steps from the text and image features and is both faster and more accurate.
Technical solution of the present invention: the system consists of three parts. The first part is the feature encoding of the text and image information, the second part is the attention-based multi-modal feature fusion, and the third part is the dynamic reasoning process that automatically determines the number of steps. In the first part, a pretrained Darknet-53 convolutional neural network encodes the image information and a pretrained BERT model encodes the text information. In the second part, an attention-based multi-modal fusion reasoning mechanism lets the visual features reason over the different words in the text to strengthen the weights of key information words, and lets the text information enhance the target-region features of the visual information. In the third part, reinforcement learning provides a dynamic reward mechanism that checks whether each reasoning step is correct and dynamically judges whether the current round of reasoning is sufficient; if it is not, the second part is iterated, and reasoning stops once the image and text features have been fully fused and the correct answer is obtained.
A referring target detection and positioning method based on dynamic adaptive reasoning includes the following steps:
Step 1: Encode the features of the text and image information;
Step 1-1: The image is encoded by the Darknet-53 convolutional neural network from a 256*256*3 input, and the whole-image feature vector is denoted V, where W and H are the width and height of the image and v_k is the k-th image block region of the image feature V;
Step 1-2: Use the BERT pre-trained model to encode the text features;
For a text description consisting of N words, encoding yields e_n ∈ R^d, where n denotes the word at the n-th position in the sentence, e_n is the word vector of each word, and d is the dimension of the word vector;
Step 2: Multi-modal feature fusion based on the attention mechanism;
Step 2-1: Input E and V into the attention-based multi-modal feature fusion module;
The attention-based multi-modal feature fusion module includes a language-text attention module under visual control and a visual fusion feature enhancement module under text control; the fusion module is interleaved with the reasoning steps, and each reasoning step updates the fusion reasoning of the two parts of the module; at the t-th reasoning step, after the module is updated, it outputs the updated visual features V_t and the updated text word vectors;
Step 2-2: For the language-text attention module under visual control, use the attention mechanism to construct a weight score for each word and introduce a historically accumulated score, which avoids the model forgetting historical reasoning scores; the calculation is as follows:
Here i indexes the accumulation over the first t-1 rounds; the weight of the n-th word in the t-th round of reasoning is computed from the average pooling of the visual feature vector V_{t-1} output by the previous (t-1)-th round; · denotes the dot-product operation; the remaining terms are distinct learnable parameters of the model; and the scores take values in the range 0-1;
The word vector after the t-th round is therefore updated as:
Step 2-3: For the visual fusion feature enhancement module under text control, in order to establish a deeper connection between language and image, a multi-head self-attention module is used to fuse the language and image features;
The basic Transformer structure is used with a 6-layer, 8-head configuration, and the visual vector is updated under the fusion control of the text vector; in this way the visual vector output by each round of reasoning is strongly coupled with the text information, which guarantees the validity of the reasoning; the details are as follows:
Here the weighted text word vectors are used and [:] denotes the concat operation;
Step 2-4: The final prediction output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h give the position of the predicted box in the image and conf is the confidence of the model; the features obtained from the cross-modal attention module are spliced, and after cross-modal fusion only the output V_t of the visual part is relied upon;
Step 3: Use a dynamic reward mechanism for reasoning;
Step 3-1: The fusion reasoning steps above must be iterated for multiple rounds. For different texts and images, a dynamic reward module is proposed that decides, from the state of the visual-text vectors in round t, whether to continue reasoning;
The visual and text vectors in round t are calculated as follows:
Here action determines whether to continue reasoning; actions_prob is computed from the text vector and the visual vector of the current round and predicts the probability of continuing to reason; e_cls denotes the CLS head vector produced by the BERT encoder;
Two reinforcement learning reward mechanisms are used: the final reward and the instant reward;
Step 3-2: The final reward is the reward value computed from the difference between this round's inference result and the ground-truth box, i.e. it is computed from the candidate box O produced in this round and is defined as follows:
Here IoU is computed during training between the candidate box O of this round's final reasoning and the ground-truth training target box (the data set does not prescribe a fixed number of reasoning steps per sample); at test time the ground-truth box value is 0, so IoU is fixed at 1;
Step 3-3: The instant reward is meant to encourage positive influences during training-time reasoning, and the fusion module above is meant to tie each word's weight more closely to the features of different visual regions. The instant reward therefore computes the reward score under correct association:
Here Score_t is the association score between the visual vector and the text word vectors computed in round t; the final value indicates whether the degree of multi-modal association has improved: if the fusion reasoning at step t yields a further positive influence it is 1, otherwise a penalty mechanism is applied;
Step 3-4: To train the dynamic adaptive reasoning globally, the total score of the dynamic reward module in round t is calculated as follows:
The reward weight weight_t is fed into the attention module of the language-text part of the next round's fusion reasoning module for the next reasoning step;
Step 3-5: CrossEntropyLoss is used as the training loss and is obtained by computing the difference between the predicted box and the ground-truth box for each region of the image; the dynamic reward module decides from the post-reasoning visual features whether to continue, and reasoning stops when both the final reward and the instant reward are positively activated and the confidence of the predicted box is 1.
Specific embodiment:
1. Image features
Given a picture of a natural scene, the whole picture is resized to 256*256*3 and fed into the pretrained feature extraction network Darknet-53 to encode the image features.
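A small sketch of this image branch: resize to 256x256, normalize, and run a convolutional backbone. Darknet-53 has no off-the-shelf torchvision implementation, so a generic pretrained ResNet-50 stands in here purely as a placeholder.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50
from PIL import Image

# stand-in backbone; the patent uses a pretrained Darknet-53 feature extractor
backbone = torch.nn.Sequential(*list(resnet50(weights="IMAGENET1K_V1").children())[:-2]).eval()

preprocess = T.Compose([
    T.Resize((256, 256)),          # resize the whole image to 256x256x3
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    feature_map = backbone(preprocess(image).unsqueeze(0))   # (1, C, H', W') visual features V
```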
2. Text features
The maximum sentence length is set to 20 words. The position-encoded word vectors of the text description are fed into the BERT network to obtain the feature vectors of the fused sentence, e ∈ R^d, where e is the representation of each word, d is the word-vector dimension (768), and the maximum value of N is 20.
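Correspondingly, a sketch of the text branch, assuming the HuggingFace transformers BERT base model (hidden size 768) with sentences padded or truncated to the stated maximum of 20 words:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

text = "a gray laptop on the desk with the page open"
tokens = tokenizer(text, padding="max_length", truncation=True,
                   max_length=20, return_tensors="pt")       # N is capped at 20
with torch.no_grad():
    out = bert(**tokens)

word_vectors = out.last_hidden_state    # (1, 20, 768): e_n in R^d with d = 768
e_cls = word_vectors[:, 0, :]           # CLS head vector used by the dynamic reward module
```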
3. Multi-modal feature fusion enhancement using the attention mechanism
The image features are expanded to 512 dimensions, the language features are likewise expanded through an MLP network to 20x512, position encodings are added, and the two are then fed together into the multi-modal attention module. The module consists of two parts: the visual features enhance the reasoning weights of the different words in the text information, and the text information enhances the regional features of the visual information. The text part mainly computes the weight w_n of the words at different positions, using formula (1) of the attention module. The initial instant-reward score is 1. The visual part uses the weighted text word vectors, splices the language features and the image features enhanced in the previous stage, and feeds them into the multi-head self-attention layers. The number of multi-head self-attention layers is 6 and the number of attention heads is 8.
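A configuration-level sketch of this fusion stage: both modalities are projected to 512 dimensions, learnable position encodings are added, and the concatenated sequence passes through a 6-layer, 8-head self-attention stack. The input feature widths and layer names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=768, dim=512, n_words=20, n_regions=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, dim)                    # image features -> 512-d
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, dim),     # MLP expanding text to 20x512
                                      nn.ReLU(inplace=True),
                                      nn.Linear(dim, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_words + n_regions, dim))  # position encoding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=6)     # 6 layers, 8 heads

    def forward(self, word_vectors, visual_tokens):
        # word_vectors: (B, 20, txt_dim);  visual_tokens: (B, 256, vis_dim)
        x = torch.cat([self.txt_proj(word_vectors), self.vis_proj(visual_tokens)], dim=1)
        return self.attn(x + self.pos)
```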
4. Dynamic reasoning steps under the reward mechanism
Once the fused features are obtained, the visual feature part and the weighted text part are selected; the instant-reward and final-reward mechanisms compute weight scores that are returned to the text-information weight-reasoning part. The gating function is composed of the two reward functions and the visual candidate box, and it is activated only when both rewards equal 1 and the confidence of the predicted box is 1.
5. Model training
The whole training process is end-to-end, using four datasets (RefCOCO, RefCOCO+, RefCOCOg and ReferReasoning) as the benchmarks for model training and evaluation. The batch size is set to 8 and the initial learning rate to 1e-4. The model is trained for 100 epochs on 8 TitanX GPUs, the learning rate is halved every 10 epochs, and the Adam method is used for gradient descent.
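A short sketch of the stated optimization setup (batch size 8, initial learning rate 1e-4, Adam, learning rate halved every 10 epochs, 100 epochs); the model and train_dataset objects are placeholders.

```python
import torch
from torch.utils.data import DataLoader

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)        # initial learning rate 1e-4
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # halve every 10 epochs
loader = DataLoader(train_dataset, batch_size=8, shuffle=True)    # batch size 8

for epoch in range(100):                                          # 100 epochs, end to end
    for images, texts, gt_boxes in loader:
        loss = model(images, texts, gt_boxes)                     # model returns the CE training loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```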
6. Model application
After the above training process, a model is saved at each checkpoint and the best one (the one that performs best on the test set) is selected for application. For an input image and sentence, the image only needs to be resized to 256x256 and normalized and the sentence tokenized to serve as the model input. The parameters of the whole network are kept fixed; the image data and language data are simply fed in and propagated forward. The model automatically performs dynamic reasoning according to the strength of the implicit coupling between the text and the image and finally produces the prediction after an appropriate number of reasoning rounds. Actual experimental results are shown in Figures 2 and 3. The text corresponding to the leftmost image in Figure 2 is "A gray laptop computer placed on the desktop with the page open", the text corresponding to the middle image is "A bear in the woods with one cub on a rock and one cub climbing a tree", and the text corresponding to the image on the right is "A sandwich with lettuce hanging over the upper-left corner of the bread". The experimental results show that referring target detection and positioning based on dynamic adaptive reasoning can efficiently give the accurate position in the image of the content described by the sentence.
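Finally, a usage sketch of the application step, reusing the preprocess and tokenizer objects from the sketches above; the trained model object is a placeholder. The query image is resized and normalized, the sentence tokenized, and one forward pass is run; the model performs its dynamic number of reasoning rounds internally and returns the predicted box.

```python
import torch
from PIL import Image

model.eval()                                     # parameters stay fixed at inference time
image = preprocess(Image.open("query.jpg").convert("RGB")).unsqueeze(0)   # 256x256, normalized
tokens = tokenizer("sandwich with lettuce on the upper-left of the bread",
                   padding="max_length", truncation=True, max_length=20,
                   return_tensors="pt")

with torch.no_grad():
    box, confidence = model(image, tokens)       # dynamic reasoning runs inside the model
print(box, confidence)
```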
Claims (2)
- A referring target detection and positioning method based on dynamic adaptive reasoning, characterized by comprising the following steps:

  Step 1: feature encoding of the text and image information.

  Step 1-1: The image is encoded by the Darknet-53 convolutional neural network to obtain a 256*256*3-dimensional representation, and the whole image feature vector is denoted V, where W and H are the width and height of the image and v_k denotes the k-th image patch region of V.

  Step 1-2: Text features are encoded with a pre-trained BERT model. A textual description consisting of N words is encoded as e_n ∈ R^d, where n indexes the n-th word position in the sentence, e_n is the word vector of that word, and d is the word-vector dimension.

  Step 2: multi-modal feature fusion based on the attention mechanism.

  Step 2-1: E and V are fed into the attention-based multi-modal feature fusion module, which consists of a language text attention module under visual control and a visual fusion feature enhancement module under text control. In the t-th reasoning round, after this fusion module is updated, it outputs the visual features V_t and the updated text features.

  Step 2-2: In the language text attention module under visual control, the attention mechanism constructs a weight score for each word, and a historically accumulated score is introduced. In this computation, i runs over the accumulation of the previous t-1 rounds, the per-word score is the weight of the n-th word in the t-th reasoning round, the visual term is the average pooling of the visual feature vector V_{t-1} output by round t-1, · denotes the dot product, and the remaining terms are distinct learnable parameters of the model; the weight scores take values in the range 0-1. The word vectors are updated accordingly in the t-th round.

  Step 2-3: In the visual fusion feature enhancement module under text control, a multi-head self-attention module fuses the language and image features, using the basic transformer structure with 6 layers and 8 heads. Here the text input is the word vectors after the weight computation, [:] denotes the concat operation, ConvBNReLU denotes the convolution, BatchNormalize and ReLU activation operations, and Resize denotes the resizing operation.

  Step 2-4: The final prediction output is t_x, t_y, t_w, t_h, conf, where t_x, t_y, t_w, t_h give the position of the predicted box in the image and conf is the confidence of the model.

  Step 3: reasoning with a dynamic reward mechanism.

  Step 3-1: For different texts and images, a dynamic reward module is proposed that decides whether to continue reasoning according to the current visual-text vectors in round t. Here, action is the most probable action in actions_prob, taking the value continue reasoning or stop reasoning; actions_prob is the predicted probability of continuing reasoning, computed from the text vector and visual vector of the current round; and e_cls denotes the head vector CLS of the BERT encoding. Two reinforcement-learning reward mechanisms are used: a final reward and an immediate reward.

  Step 3-2: The final reward is the reward value derived from the difference between the current round's reasoning result and the ground-truth box, i.e. it is computed from the candidate box O produced in this round, where IoU is the intersection over union computed during training between the candidate box O of the current round's final reasoning and the ground-truth training target box.

  Step 3-3: The immediate reward scores whether the association is correct: Score_t is the association score between the visual vector and the text description word vectors in round t, and the reward reflects whether the degree of multi-modal association has improved; if the fusion reasoning of step t produces a further positive effect the reward is 1, otherwise a penalty is applied.

  Step 3-4: To train the dynamic adaptive reasoning globally, the total score of the dynamic reward module in round t is computed from these rewards, and the resulting reward weight weight_t is fed into the attention module of the language text part of the fusion reasoning module for the next round of reasoning.

  Step 3-5: CrossEntropyLoss is used as the training loss, obtained by computing the difference between the predicted box and the ground-truth box for each region of the image. The dynamic reward module judges from the visual features after reasoning whether to continue; reasoning stops when the final reward and the immediate reward are both positive activations and the confidence of the predicted box is 1.

- The referring target detection and positioning method based on dynamic adaptive reasoning according to claim 1, characterized in that d is 768-dimensional.
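The claims above describe several computations (the word-weight scores of Step 2-2, the transformer fusion of Step 2-3, and the rewards of Steps 3-2 to 3-4) whose exact formulas are not reproduced in this text. For illustration only, the sketches below give one plausible reading of each step; the layer shapes, thresholds, and score-accumulation rules are assumptions, not the patent's equations or code. First, a minimal form of the vision-guided word weighting of Step 2-2, where the projection layers, the sigmoid scoring, and the averaging used to keep the accumulated score in the 0-1 range are all assumed:

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    """Illustrative vision-guided word weighting with accumulation over reasoning rounds."""

    def __init__(self, dim=768):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)   # learnable parameters for the text side
        self.vis_proj = nn.Linear(dim, dim)    # learnable parameters for the visual side

    def forward(self, words, visual_prev, history):
        # words: (N, dim) word vectors; visual_prev: (dim,) mean-pooled V_{t-1};
        # history: (N,) accumulated scores from the previous t-1 rounds.
        scores = torch.sigmoid((self.text_proj(words) * self.vis_proj(visual_prev)).sum(-1))
        scores = 0.5 * (scores + history)          # one way to keep the accumulated score in 0-1
        weighted = words * scores.unsqueeze(-1)    # word vectors updated for round t
        return weighted, scores                    # returned scores feed the next round's history

# Example round: 20 words, fresh history of zeros.
att = WordAttention()
weighted, history = att(torch.randn(20, 768), torch.randn(768), torch.zeros(20))
```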
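Step 2-3 specifies a transformer with 6 layers and 8 heads for fusing the weighted word vectors with the visual features. A minimal sketch follows; the fusion dimension, the simplified ConvBNReLU stem, and the plain concatenation layout are assumptions:

```python
import torch
import torch.nn as nn

dim = 256
fusion = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),  # 8 attention heads
    num_layers=6,                                                        # 6 layers
)

# Simplified ConvBNReLU stem mapping the raw image to visual tokens of the fusion dimension.
conv_bn_relu = nn.Sequential(
    nn.Conv2d(3, dim, kernel_size=3, stride=8, padding=1),
    nn.BatchNorm2d(dim),
    nn.ReLU(),
)

image = torch.randn(1, 3, 256, 256)
vis_tokens = conv_bn_relu(image).flatten(2).transpose(1, 2)   # (1, 32*32, dim) visual tokens
txt_tokens = torch.randn(1, 20, dim)                          # (1, N, dim) weighted word vectors
fused = fusion(torch.cat([txt_tokens, vis_tokens], dim=1))    # concat ([:]) then 6-layer-8-head fusion
```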
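Steps 3-2 and 3-3 describe an IoU-based final reward and an association-based immediate reward. One plausible reading, with the +1/-1 reward values and the 0.5 IoU threshold as assumptions:

```python
import torch

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = torch.max(box_a[0], box_b[0]), torch.max(box_a[1], box_b[1])
    x2, y2 = torch.min(box_a[2], box_b[2]), torch.min(box_a[3], box_b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def final_reward(candidate_box, gt_box, thresh=0.5):
    # Reward from the overlap between this round's candidate box O and the ground-truth box.
    return 1.0 if iou(candidate_box, gt_box) >= thresh else -1.0

def immediate_reward(score_t, score_prev):
    # Positive when round t's fusion reasoning improved the visual-text association score, else a penalty.
    return 1.0 if score_t > score_prev else -1.0

r_final = final_reward(torch.tensor([10., 10., 60., 80.]), torch.tensor([12., 8., 58., 82.]))
r_now = immediate_reward(score_t=0.71, score_prev=0.64)
```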
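Finally, Steps 3-1 and 3-5 together define when reasoning halts: an action head predicts continue-versus-stop from the current text and visual vectors, and reasoning stops once both rewards are positive activations and the box confidence saturates. The head dimensions and the cap on the number of rounds below are assumptions for illustration:

```python
import torch
import torch.nn as nn

action_head = nn.Linear(512, 2)   # assumed head: [continue, stop] logits from concat(e_cls, pooled V_t)

def should_stop(e_cls, vis_pooled, final_r, immediate_r, conf, round_t, max_rounds=5):
    actions_prob = torch.softmax(action_head(torch.cat([e_cls, vis_pooled], dim=-1)), dim=-1)
    action_says_stop = actions_prob.argmax(-1).item() == 1              # pick the most likely action
    rewards_satisfied = final_r > 0 and immediate_r > 0 and conf >= 1.0 # Step 3-5 stopping condition
    return action_says_stop or rewards_satisfied or round_t >= max_rounds

stop = should_stop(torch.randn(256), torch.randn(256),
                   final_r=1.0, immediate_r=1.0, conf=1.0, round_t=2)
```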
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211286108.2 | 2022-10-20 | ||
CN202211286108.2A CN115661842A (en) | 2022-10-20 | 2022-10-20 | Dynamic adaptive inference-based nominal target detection and positioning method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024037664A1 true WO2024037664A1 (en) | 2024-02-22 |
Family
ID=84989042
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/123906 WO2024037664A1 (en) | 2022-10-20 | 2023-10-11 | Referring target detection and positioning method based on dynamic adaptive reasoning |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115661842A (en) |
WO (1) | WO2024037664A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115661842A (en) * | 2022-10-20 | 2023-01-31 | 西北工业大学 | Dynamic adaptive inference-based nominal target detection and positioning method |
CN117196546B (en) * | 2023-11-08 | 2024-07-09 | 杭州实在智能科技有限公司 | RPA flow executing system and method based on page state understanding and large model driving |
- 2022
  - 2022-10-20 CN CN202211286108.2A patent/CN115661842A/en active Pending
- 2023
  - 2023-10-11 WO PCT/CN2023/123906 patent/WO2024037664A1/en unknown
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114241191A (en) * | 2021-12-19 | 2022-03-25 | 西北工业大学 | Cross-modal self-attention-based non-candidate-box expression understanding method |
CN115062174A (en) * | 2022-06-16 | 2022-09-16 | 电子科技大学 | End-to-end image subtitle generating method based on semantic prototype tree |
CN115661842A (en) * | 2022-10-20 | 2023-01-31 | 西北工业大学 | Dynamic adaptive inference-based nominal target detection and positioning method |
Non-Patent Citations (1)
Title |
---|
ZHANG ZHIPENG, WEI ZHIMIN, HUANG ZHONGZHEN, NIU RUI, WANG PENG: "One for all: One-stage referring expression comprehension with dynamic reasoning", NEUROCOMPUTING, ELSEVIER, AMSTERDAM, NL, vol. 518, 27 October 2022 (2022-10-27), AMSTERDAM, NL, pages 523 - 532, XP093140789, ISSN: 0925-2312, DOI: 10.1016/j.neucom.2022.10.022 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117875407A (en) * | 2024-03-11 | 2024-04-12 | 中国兵器装备集团自动化研究所有限公司 | Multi-mode continuous learning method, device, equipment and storage medium |
CN117875407B (en) * | 2024-03-11 | 2024-06-04 | 中国兵器装备集团自动化研究所有限公司 | Multi-mode continuous learning method, device, equipment and storage medium |
CN118314169A (en) * | 2024-04-16 | 2024-07-09 | 华东师范大学 | Visual target tracking method based on multi-mode large language model |
CN118379592A (en) * | 2024-06-21 | 2024-07-23 | 浙江核新同花顺网络信息股份有限公司 | Clothing detection method, device and equipment for virtual person and readable storage medium |
CN118429899A (en) * | 2024-07-03 | 2024-08-02 | 杭州梯度安全服务有限公司 | Zero-order learning intelligent early warning method based on multi-mode deep learning drive |
CN118506107A (en) * | 2024-07-17 | 2024-08-16 | 烟台大学 | Robot classification detection method and system based on multi-mode and multi-task learning |
Also Published As
Publication number | Publication date |
---|---|
CN115661842A (en) | 2023-01-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2024037664A1 (en) | Referring target detection and positioning method based on dynamic adaptive reasoning | |
KR102532749B1 (en) | Method and apparatus for hierarchical learning of neural networks based on weak supervised learning | |
CN108416065B (en) | Hierarchical neural network-based image-sentence description generation system and method | |
CN112560432B (en) | Text emotion analysis method based on graph attention network | |
CN113705597B (en) | Image processing method, device, computer equipment and readable storage medium | |
CN109299262A (en) | A kind of text implication relation recognition methods for merging more granular informations | |
CN113158875A (en) | Image-text emotion analysis method and system based on multi-mode interactive fusion network | |
CN112650886B (en) | Cross-modal video time retrieval method based on cross-modal dynamic convolution network | |
CN111444968A (en) | Image description generation method based on attention fusion | |
CN105718890A (en) | Method for detecting specific videos based on convolution neural network | |
CN114882488B (en) | Multisource remote sensing image information processing method based on deep learning and attention mechanism | |
CN112527966A (en) | Network text emotion analysis method based on Bi-GRU neural network and self-attention mechanism | |
WO2024032010A1 (en) | Transfer learning strategy-based real-time few-shot object detection method | |
CN117034961B (en) | BERT-based medium-method inter-translation quality assessment method | |
CN114925232B (en) | Cross-modal time domain video positioning method under text segment question-answering framework | |
CN116186250A (en) | Multi-mode learning level mining method, system and medium under small sample condition | |
CN114780775B (en) | Image description text generation method based on content selection and guiding mechanism | |
CN110727844A (en) | Online commented commodity feature viewpoint extraction method based on generation countermeasure network | |
CN116187349A (en) | Visual question-answering method based on scene graph relation information enhancement | |
CN114241191A (en) | Cross-modal self-attention-based non-candidate-box expression understanding method | |
CN114332288B (en) | Method for generating text generation image of confrontation network based on phrase drive and network | |
CN116662591A (en) | Robust visual question-answering model training method based on contrast learning | |
CN112269876A (en) | Text classification method based on deep learning | |
KR102331803B1 (en) | Vision and language navigation system | |
CN116151226B (en) | Machine learning-based deaf-mute sign language error correction method, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23854559 Country of ref document: EP Kind code of ref document: A1 |