CN115661842A - Dynamic adaptive inference-based nominal target detection and positioning method - Google Patents

Dynamic adaptive inference-based nominal target detection and positioning method

Info

Publication number
CN115661842A
Authority
CN
China
Prior art keywords
reasoning
text
reward
visual
round
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211286108.2A
Other languages
Chinese (zh)
Inventor
Wang Peng
Zhang Zhipeng
Wei Zhimin
Zhang Yanning
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202211286108.2A
Publication of CN115661842A
Priority to PCT/CN2023/123906 (published as WO2024037664A1)
Legal status: Pending

Classifications

    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a referring target detection and positioning method based on dynamic adaptive reasoning. Image features are extracted with a convolutional DarkNet pre-trained model and text features with a BERT pre-trained model; the image and text information is fused by a multi-modal attention mechanism; finally, dynamic adaptive reasoning is performed with a reinforcement-learning reward mechanism to detect and locate the position of the referred target in the image. The invention achieves higher accuracy and running speed, with clear improvements in both precision and speed over prior models.

Description

Referring target detection and positioning method based on dynamic adaptive reasoning
Technical Field
The invention belongs to the technical field of multi-modal vision and language, and particularly relates to a method for detecting and locating a target referred to by a natural-language description.
Background
Referring target detection and positioning is the task of locating a target region in an image based on a natural-language description. Given an image and a corresponding textual description, the machine is expected to perform fusion reasoning over the multi-modal information of language and image and automatically determine the image region that the description refers to. This requires the machine to comprehensively understand complex natural-language semantics and the visual scene, and to mine the implicit multi-modal semantic coupling through multi-step reasoning. The task is one of the basic research problems for realising machine intelligence and has a wide range of application scenarios: a robot can navigate automatically, and a household robot can search for and locate a target region in the visual scene according to textual commands and then perform further operations on that basis. The method can also be applied to other vision-language multi-modal tasks such as visual question answering and visual dialogue. Referring target detection and positioning is therefore an important basic link in realising machine intelligence, has great practical and commercial value, and has attracted much attention from academia and industry in recent years.
Early work mostly adopted two-stage methods: a target detector first extracts a set of candidate regions, and the candidate with the highest probability is then selected as the final answer. It was later found that two-stage methods are limited by the first stage; if the target is not among the candidates of the first stage, the second stage is invalid. In terms of time complexity, the candidate regions also entail a large amount of redundant feature computation, making the cost considerable. In recent years, researchers have proposed one-stage methods that directly extract global image features and then perform multi-step fusion and reasoning with the text information to determine the specific region in the image. However, text descriptions vary in length, from a single word or phrase to a long sentence, so the number of reasoning steps required is not fixed. In fact, the implicit text-image coupling strength differs from sample to sample: some complex cases need more than 10 reasoning steps, while simpler ones need only 3-5. Existing one-stage methods use a fixed number of reasoning steps, which causes redundant steps and increased time complexity for short texts, while for long texts the number of steps is insufficient, the target region cannot be adequately determined, and the result is wrong.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a referring target detection and positioning method based on dynamic adaptive reasoning. A convolutional DarkNet pre-trained model is used to extract image representations and a BERT pre-trained model to extract language representations; a multi-modal fusion attention mechanism is used to fuse the image and text information; finally, dynamic adaptive reasoning is carried out with a reinforcement-learning reward mechanism to detect and locate the position of the referred target in the image. The invention achieves higher accuracy and running speed, with clear improvements in both precision and speed over prior models.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
Step 1: feature encoding of text and image information;
Step 1-1: the image, of size 256 × 256 × 3, is encoded by a Darknet-53 convolutional neural network; the whole-image feature is denoted V = {v_1, v_2, …, v_{W×H}}, where W and H are the width and height of the image, respectively, and v_k is the k-th image-block region of the whole-image feature V;
Step 1-2: text features are encoded with a BERT pre-trained model;
a text description composed of N words is encoded into E = {e_1, e_2, …, e_N}, e_n ∈ R^d, where n indexes the n-th position in the sentence, e_n is the word vector of that word, and d is the word-vector dimension;
Step 2: multi-modal feature fusion based on an attention mechanism;
Step 2-1: E and V are input into the attention-based multi-modal feature fusion module;
the attention-based multi-modal feature fusion module comprises a visually-guided language-text attention module and a text-guided visual fusion feature enhancement module; in the t-th reasoning round, after the multi-modal feature fusion module is updated, it outputs the visual feature V^t and the weighted text feature respectively;
Step 2-2: for the visually-guided language-text attention module, an attention mechanism is used to construct a weight score for each word, and a historical cumulative score is introduced; the two scores are calculated by formulas given in the original filing as images (not reproduced here), in which i indexes the accumulation over the previous t-1 rounds, one term is the weight of the n-th word in the t-th reasoning round, another is the average pooling of the visual feature vector V^{t-1} output by the previous round of reasoning, ⊙ denotes a dot-product operation, and the remaining symbols are different learnable parameters of the model; the weight and cumulative scores take values in the range 0-1;
through the t-th round, the word vectors are updated into the weighted text feature;
Step 2-3: for the text-guided visual fusion feature enhancement module, a multi-head self-attention module is adopted to fuse the language and image features;
the basic Transformer structure is used, with 6 layers and 8 attention heads; the update follows the formulas given in the original filing as images, in which the weighted text word vectors are concatenated with the visual features ([:] denotes the concat operation), ConvBNReLU denotes convolution, BatchNorm and ReLU activation operations, and Resize denotes a resizing operation;
Step 2-4: the final predicted output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h are the position information of the predicted box in the image and conf is the confidence of the model;
Step 3: reasoning with a dynamic reward mechanism;
Step 3-1: for different texts and images, a dynamic reward module is provided, which decides whether to continue reasoning according to the state of the visual-text vector in the t-th round;
the visual and text vectors in the t-th round are combined according to the formula given in the original filing as an image, where action is the most probable action in action_prob, i.e. continue reasoning or stop reasoning, action_prob is the probability of continuing reasoning computed from the text vector and the visual vector of the current reasoning round, and e_cls is the head vector CLS after BERT encoding;
two reinforcement-learning reward mechanisms are used, namely a final reward and an instant reward;
Step 3-2: the final reward is a reward value obtained by comparing the reasoning result of the current round with the ground-truth box; that is, it is calculated from the candidate box O generated in the current round, according to the corresponding formula in the original filing, in which IoU is the overlap between the candidate box O of the round's final inference and the ground-truth training target box, computed during training;
Step 3-3: the instant reward computes a reward score under correct associations, according to the corresponding formula in the original filing, in which score_t is the association score between the visual vector and the text-description word vectors computed in the t-th round; the final term indicates whether the multi-modal correlation has improved: it is 1 if the fusion reasoning of step t produces a positive, gradually improving influence, and otherwise a penalty mechanism is applied;
Step 3-4: for the global training of dynamic adaptive reasoning, the total score of the dynamic reward module in the t-th round combines the two rewards, as given by the corresponding formula in the original filing; the resulting reward weight of round t is input into the language-text attention module of the fusion reasoning module for the next round of reasoning;
Step 3-5: cross-entropy loss is taken as the training loss, computed from the difference between the predicted box and the ground-truth box for each region in the image; the dynamic reward module judges whether to continue reasoning according to the visual features after reasoning, and reasoning stops when the final reward and the instant reward are both positively activated and the confidence of the predicted box is 1.
Preferably, d is 768 dimensions.
The invention has the following beneficial effects:
the invention realizes the detection and positioning of the designated target by using an innovative and efficient dynamic self-adaptive reasoning method. Different from the previous model, the model dynamically and continuously fuses the inference prediction bounding box by directly utilizing the image and the language information, does not need to generate a series of candidate boxes for the picture in two stages, and solves the problems of insufficient inference or calculation redundancy caused by the fact that fixed inference steps are needed in the existing one-stage method, so that higher accuracy and higher operation speed are obtained. The experimental result shows that the model architecture of the invention has outstanding improvement in precision and speed compared with the prior model.
Drawings
FIG. 1 is a diagram showing the structure of the method of the present invention.
FIG. 2 is a diagram of the actual test results of three different pictures according to the embodiment of the present invention, wherein 1 represents a real box, 2 represents the results obtained by the method of the present invention, and 3 represents the results obtained by other optimal methods.
FIG. 3 is a diagram illustrating the distribution of image attention in different inference steps according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in FIG. 1, the invention provides a referring target detection and positioning method based on dynamic adaptive reasoning, which dynamically determines the number of reasoning steps according to the text and image features and achieves both higher speed and higher accuracy.
The technical scheme of the invention comprises the following parts: the first part is the feature encoding of text and image information, the second part is the attention-based multi-modal feature fusion, and the third part is the dynamic reasoning process that automatically determines the number of steps. In the first part, a pre-trained convolutional Darknet-53 network encodes the picture information and a BERT pre-trained model encodes the text information. In the second part, an attention-based multi-modal fusion reasoning mechanism lets the visual features re-weight the different words in the text information, strengthening the key information words, while the text information strengthens the target-region features of the visual information. In the third part, a dynamic reward mechanism based on reinforcement learning controls whether each reasoning step is correct and dynamically judges whether the reasoning is sufficient; if not, the iterative reasoning of the second part continues, and once the image and text features have been sufficiently fused and reasoned to obtain the correct answer, the reasoning stops.
A referring target detection and positioning method based on dynamic adaptive reasoning comprises the following steps:
Step 1: feature encoding of text and image information;
Step 1-1: the image, of size 256 × 256 × 3, is encoded by a Darknet-53 convolutional neural network; the whole-image feature is denoted V = {v_1, v_2, …, v_{W×H}}, where W and H are the width and height of the image, respectively, and v_k is the k-th image-block region of the whole-image feature V;
Step 1-2: text features are encoded with a BERT pre-trained model;
a text description composed of N words is encoded into E = {e_1, e_2, …, e_N}, e_n ∈ R^d, where n indexes the n-th position in the sentence, e_n is the word vector of that word, and d is the word-vector dimension;
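For illustration, a minimal sketch of this encoding step is given below. It assumes a torchvision ResNet backbone standing in for Darknet-53 and the Hugging Face `transformers` BERT model; the class name `ImageTextEncoder`, the 1×1 projection to d = 768 and the printed shapes are illustrative assumptions, not the exact implementation of the method.

```python
# Minimal encoding sketch (assumption: a ResNet backbone stands in for Darknet-53,
# and Hugging Face BERT provides the 768-d word vectors described in the text).
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel, BertTokenizer

class ImageTextEncoder(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        backbone = resnet50(weights=None)                           # stand-in for Darknet-53
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # keep the spatial feature map
        self.proj = nn.Conv2d(2048, d, kernel_size=1)               # project to word-vector dim d
        self.bert = BertModel.from_pretrained("bert-base-uncased")

    def forward(self, image, input_ids, attention_mask):
        # image: (B, 3, 256, 256) -> V: (B, W*H, d), one vector per image block region
        fmap = self.proj(self.cnn(image))                           # (B, d, H', W')
        V = fmap.flatten(2).transpose(1, 2)                         # (B, W*H, d)
        # text: N word tokens -> E: (B, N, d), plus the CLS head vector e_cls
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return V, out.last_hidden_state, out.pooler_output

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = ImageTextEncoder()
toks = tokenizer("the gray laptop on the desk", return_tensors="pt",
                 padding="max_length", max_length=20, truncation=True)
V, E, e_cls = enc(torch.randn(1, 3, 256, 256), toks["input_ids"], toks["attention_mask"])
print(V.shape, E.shape, e_cls.shape)   # e.g. (1, 64, 768), (1, 20, 768), (1, 768)
```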
Step 2: multi-modal feature fusion based on an attention mechanism;
Step 2-1: E and V are input into the attention-based multi-modal feature fusion module;
the attention-based multi-modal feature fusion module comprises a visually-guided language-text attention module and a text-guided visual fusion feature enhancement module; the module is interleaved with the reasoning steps, each reasoning round being a fusion reasoning pass over the two parts of the updated feature fusion module; in the t-th reasoning round, after the module is updated, it outputs the visual feature V^t and the weighted text feature respectively;
Step 2-2: for the visually-guided language-text attention module, an attention mechanism is used to construct a weight score for each word, and a historical cumulative score is introduced to prevent the model from forgetting the scores of earlier reasoning rounds; the two scores are calculated by formulas given in the original filing as images (not reproduced here), in which i indexes the accumulation over the previous t-1 rounds, one term is the weight of the n-th word in the t-th reasoning round, another is the average pooling of the visual feature vector V^{t-1} output by the previous round of reasoning, ⊙ denotes a dot-product operation, and the remaining symbols are different learnable parameters of the model; the weight and cumulative scores take values in the range 0-1;
through the t-th round, the word vectors are updated into the weighted text feature;
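Because the weighting formulas are given only as images, the sketch below is one plausible reading of this step under stated assumptions: each word score comes from a dot product between projected word vectors and the average-pooled visual feature of round t-1, scores are accumulated over rounds so that history is not forgotten, normalised into 0-1, and used to re-weight the word vectors. The projections `W1`/`W2` and the exact accumulation rule are assumptions.

```python
# Hedged sketch of the visually guided word-attention update (round t).
# Assumption: scores come from a dot product between projected word vectors and the
# average-pooled visual feature of round t-1, with a running historical accumulation.
import torch
import torch.nn as nn

class VisualGuidedWordAttention(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        self.W1 = nn.Linear(d, d)   # learnable parameter for word vectors (assumed form)
        self.W2 = nn.Linear(d, d)   # learnable parameter for the pooled visual feature

    def forward(self, E, V_prev, hist_score=None):
        # E: (B, N, d) word vectors; V_prev: (B, K, d) visual feature of round t-1
        v_bar = V_prev.mean(dim=1, keepdim=True)               # average pooling of V^{t-1}
        score = (self.W1(E) * self.W2(v_bar)).sum(-1)          # dot product per word, (B, N)
        score = torch.softmax(score, dim=-1)                   # values in 0-1
        # historical cumulative score over previous rounds (assumed: running, re-normalised sum)
        hist_score = score if hist_score is None else hist_score + score
        w = hist_score / hist_score.sum(dim=-1, keepdim=True)  # word weights, (B, N), in 0-1
        E_t = w.unsqueeze(-1) * E                              # re-weighted word vectors
        return E_t, w, hist_score

attn = VisualGuidedWordAttention()
E_t, w, hist = attn(torch.randn(2, 20, 768), torch.randn(2, 64, 768))
print(E_t.shape, w.shape)   # (2, 20, 768) (2, 20)
```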
Step 2-3: for the text-guided visual fusion feature enhancement module, in order to establish a deeper relationship between language and image, a multi-head self-attention module is adopted to fuse the language and image features;
the basic Transformer structure is used, with 6 layers and 8 attention heads, and the visual vectors are updated under the fusion control of the text vectors; in this way, the visual vector output by each reasoning round is strongly coupled with the text information, which ensures the effectiveness of the reasoning; the update follows the formulas given in the original filing as images, in which the weighted text word vectors are concatenated with the visual features ([:] denotes the concat operation);
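A minimal sketch of this text-guided visual enhancement follows, assuming the concatenated visual-block and weighted word features pass through a ConvBNReLU projection and a standard 6-layer, 8-head Transformer encoder, after which the visual part of the sequence is kept; the module layout and dimensions are illustrative assumptions.

```python
# Sketch of the text-controlled visual fusion block: concat([V ; E]) -> ConvBNReLU ->
# 6-layer / 8-head self-attention -> keep the visual tokens (assumed layout).
import torch
import torch.nn as nn

class TextGuidedVisualFusion(nn.Module):
    def __init__(self, d=512, layers=6, heads=8):
        super().__init__()
        # "ConvBNReLU": 1x1 convolution + BatchNorm + ReLU over the token sequence
        self.conv_bn_relu = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=1), nn.BatchNorm1d(d), nn.ReLU())
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, V, E_t):
        # V: (B, K, d) visual blocks; E_t: (B, N, d) weighted word vectors
        x = torch.cat([V, E_t], dim=1)                               # [V ; E] along the token axis
        x = self.conv_bn_relu(x.transpose(1, 2)).transpose(1, 2)     # ConvBNReLU projection
        x = self.encoder(x)                                          # multi-head self-attention fusion
        return x[:, :V.size(1)]                                      # "resize": keep the visual part

fusion = TextGuidedVisualFusion()
V_t = fusion(torch.randn(2, 64, 512), torch.randn(2, 20, 512))
print(V_t.shape)   # (2, 64, 512)
```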
Step 2-4: the final predicted output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h are the position information of the predicted box in the image and conf is the confidence of the model; the features obtained from the cross-modal attention module are concatenated, and the prediction depends only on the visual output V^t after the cross-modal fusion;
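The prediction head is not specified beyond its outputs, so the sketch below simply maps the fused visual feature to (t_x, t_y, t_w, t_h, conf); the mean pooling and the two-layer MLP are assumptions.

```python
# Sketch of a prediction head producing (t_x, t_y, t_w, t_h, conf) from the fused
# visual output V^t. The pooling and MLP structure are assumed, not specified.
import torch
import torch.nn as nn

class BoxHead(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 5))

    def forward(self, V_t):
        pooled = V_t.mean(dim=1)            # (B, d) pooled fused visual feature
        out = self.mlp(pooled)              # (B, 5)
        box = out[:, :4]                    # t_x, t_y, t_w, t_h
        conf = torch.sigmoid(out[:, 4])     # confidence of the model, in (0, 1)
        return box, conf

box, conf = BoxHead()(torch.randn(2, 64, 512))
print(box.shape, conf.shape)   # (2, 4) (2,)
```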
Step 3: reasoning with a dynamic reward mechanism;
Step 3-1: the fusion reasoning described above is iterative; for different texts and images, a dynamic reward module is provided, which decides whether to continue reasoning according to the state of the visual-text vector in the t-th round;
the visual and text vectors in the t-th round are combined according to the formula given in the original filing as an image, where action decides whether to continue reasoning, action_prob is the probability of continuing reasoning computed from the text vector and the visual vector of the current reasoning round, and e_cls is the head vector CLS after BERT encoding;
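As a sketch of this continue/stop decision, the snippet below computes a two-way action probability from the pooled visual vector of the current round concatenated with the BERT CLS vector e_cls, and takes the most probable action; the concatenate-then-linear form is an assumption.

```python
# Sketch of the dynamic stop/continue decision for round t (assumed form: a small
# classifier over [pooled V^t ; e_cls] yields action_prob, and action = argmax).
import torch
import torch.nn as nn

CONTINUE, STOP = 0, 1

class ActionGate(nn.Module):
    def __init__(self, d_vis=512, d_txt=768):
        super().__init__()
        self.cls = nn.Linear(d_vis + d_txt, 2)

    def forward(self, V_t, e_cls):
        state = torch.cat([V_t.mean(dim=1), e_cls], dim=-1)    # visual-text state of round t
        action_prob = torch.softmax(self.cls(state), dim=-1)   # P(continue), P(stop)
        action = action_prob.argmax(dim=-1)                    # most probable action
        return action, action_prob

action, prob = ActionGate()(torch.randn(1, 64, 512), torch.randn(1, 768))
print("stop" if action.item() == STOP else "continue", prob.tolist())
```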
two reinforcement learning reward mechanisms are used, namely final reward and instant reward;
Step 3-2: the final reward is a reward value obtained by comparing the reasoning result of the current round with the ground-truth box; that is, it is calculated from the candidate box O generated in the current round, according to the corresponding formula in the original filing;
IoU is the overlap between the candidate box O of the round's final inference and the ground-truth training target box, computed during training (the dataset does not fix the number of reasoning steps per sample); during testing no ground-truth box is available, so this term is fixed to 1;
Step 3-3: the instant reward is used to encourage positive influences during the reasoning of training, and the fusion module makes each word weight more closely associated with the features of the different visual regions; the instant reward therefore computes a reward score under correct associations, according to the corresponding formula in the original filing, in which score_t is the association score between the visual vector and the text-description word vectors computed in the t-th round; the final term indicates whether the multi-modal correlation has improved: it is 1 if the fusion reasoning of step t produces a positive, gradually improving influence, and otherwise a penalty mechanism is applied;
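The instant-reward formula is likewise only an image in the filing; the sketch below encodes one plausible reading: a cross-modal association score score_t is computed each round, the reward is +1 when it improves over the previous round, and a penalty of -1 is applied otherwise. The cosine-similarity score and the ±1 values are assumptions.

```python
# Hedged sketch of the instant reward: +1 if the visual-text association score of round t
# improved over round t-1, otherwise a penalty (score definition and values are assumed).
import torch
import torch.nn.functional as F

def association_score(V_t: torch.Tensor, E_t: torch.Tensor) -> torch.Tensor:
    # score_t: similarity between the pooled visual vector and the pooled weighted text vector
    return F.cosine_similarity(V_t.mean(dim=1), E_t.mean(dim=1), dim=-1)

def instant_reward(V_t, E_t, prev_score):
    score_t = association_score(V_t, E_t)
    reward = torch.where(score_t > prev_score,
                         torch.ones_like(score_t), -torch.ones_like(score_t))
    return reward, score_t

prev = torch.tensor([0.2])
r, s = instant_reward(torch.randn(1, 64, 512), torch.randn(1, 20, 512), prev)
print(r.item(), s.item())
```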
Step 3-4: for the global training of dynamic adaptive reasoning, the total score of the dynamic reward module in the t-th round combines the two rewards, as given by the corresponding formula in the original filing; the resulting reward weight of round t is input into the language-text attention module of the fusion reasoning module for the next round of reasoning;
Step 3-5: cross-entropy loss is taken as the training loss, computed from the difference between the predicted box and the ground-truth box for each region in the image; the dynamic reward module judges whether to continue reasoning according to the visual features after reasoning, and reasoning stops when the final reward and the instant reward are both positively activated and the confidence of the predicted box is 1.
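Putting the pieces together, the loop below sketches how the dynamic reward module could gate the number of reasoning rounds: reasoning stops when both rewards are positively activated and the box confidence saturates, or when a maximum round count is reached. The callables passed in correspond to the hypothetical sketches above; none of this is the patent's actual code.

```python
# Hedged sketch of the dynamic adaptive reasoning loop. The callables are assumed to behave
# like the hypothetical sketches above (word attention, fusion, box head, action gate,
# final reward, instant reward), all built with a shared feature dimension.
import torch

def dynamic_reasoning(V, E, e_cls, attn, fusion, head, gate,
                      reward_final, reward_inst, gt_box=None, max_rounds=12):
    V_t, hist, prev_score = V, None, torch.tensor([-1.0])
    box = conf = None
    for t in range(1, max_rounds + 1):
        E_t, w, hist = attn(E, V_t, hist)          # visually guided word re-weighting
        V_t = fusion(V_t, E_t)                     # text-guided visual fusion (6 layers, 8 heads)
        box, conf = head(V_t)                      # predicted (t_x, t_y, t_w, t_h) and conf
        r_final = reward_final(box[0], gt_box)     # +1 / -1 from overlap with the ground truth
        r_inst, prev_score = reward_inst(V_t, E_t, prev_score)
        action, action_prob = gate(V_t, e_cls)     # RL continue/stop decision for round t
        # total reward of round t would feed the next round's text attention (assumed sum form)
        reward_t = r_final + r_inst
        # stopping rule from step 3-5: both rewards positively activated, confidence saturated
        if r_final.item() > 0 and r_inst.item() > 0 and conf.item() >= 0.99:
            break
    return box, conf, t
```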
The specific embodiment is as follows:
1. image features
Given a picture of a natural scene, the whole picture is resized to 256 × 256 × 3 and input into the pre-trained feature extraction network Darknet-53 to encode the image features.
2. Text features
The maximum sentence length is set to 20 words; the position-encoded word vectors of the text description are input into a BERT network to obtain the fused sentence feature E = {e_1, …, e_N}, e_n ∈ R^d, where e_n is the representation of each word, d is the word-vector dimension of 768, and N is at most 20.
3. Multimodal feature fusion enhancement with attention mechanism
The image features are expanded to 512 dimensions and the language features to (20 × 512) through an MLP network; position encodings are added, and the image and language features are then input together into the multi-modal attention module. The module consists of two parts: the visual features strengthen the reasoning weights of the different words in the text information, and the text information strengthens the region features of the visual information. The text-information part mainly computes the word weights w_n at the different positions, using the attention-module formula described in step 2-2. The instant reward score is initialised to a value of 1. The visual-information part uses the weighted text word vectors: the language features and the image features enhanced in the previous stage are concatenated and input into the multi-head self-attention layers. There are 6 multi-head self-attention layers with 8 attention heads.
4. Dynamic reasoning steps under reward mechanism
Once the fused features are obtained, the visual feature part and the weighted text part are selected; the instant reward mechanism and the final reward mechanism compute a weight score, which is returned to the text-information weight reasoning part. The gating function is composed of the two reward functions and the visual candidate box, and it is activated only when both rewards equal 1 and the confidence of the predicted box is 1.
5. Model training
The whole training process is end-to-end. Four datasets, RefCOCO, RefCOCO+, RefCOCOg and Ref-Reasoning, are adopted for model training and evaluation. The batch size is set to 8 and the initial learning rate to 1e-4. The model was trained for 100 epochs on 8 TITAN X GPUs, with the learning rate halved every 10 epochs and gradients optimised with the Adam method.
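A sketch of this training configuration follows (Adam, initial learning rate 1e-4 halved every 10 epochs, batch size 8, 100 epochs); `model` and `train_loader` are placeholders for the network and for a RefCOCO-style data loader, which are not shown here, and the target format is assumed.

```python
# Sketch of the training schedule described in the text: Adam, lr 1e-4, batch size 8,
# 100 epochs, learning rate halved every 10 epochs. `model` / `train_loader` are placeholders.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

def train(model, train_loader, epochs=100, device="cuda"):
    model.to(device).train()
    optimizer = Adam(model.parameters(), lr=1e-4)
    scheduler = StepLR(optimizer, step_size=10, gamma=0.5)   # halve the lr every 10 epochs
    criterion = torch.nn.CrossEntropyLoss()                  # training loss named in the text
    for epoch in range(epochs):
        for images, tokens, targets in train_loader:         # batch size 8 set in the dataloader
            optimizer.zero_grad()
            logits = model(images.to(device), tokens.to(device))
            loss = criterion(logits, targets.to(device))     # predicted vs. ground-truth region (assumed format)
            loss.backward()
            optimizer.step()
        scheduler.step()
```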
6. Model application
After the training process, a model is saved at each checkpoint step; the optimal model (the one with the best results on the test set) is selected for application. For an input image and sentence, the image is resized to 256 × 256 and normalised, and the sentence is tokenised; these serve as the model inputs. The parameters of the whole network model are fixed, and only the image data and language data are input and propagated forward. The model automatically performs dynamic reasoning according to the implicit coupling strength between the text and the image, and finally outputs a prediction after an appropriate amount of reasoning. The actual experimental diagrams are shown in FIG. 2 and FIG. 3. The text corresponding to the leftmost image in FIG. 2 is "a gray laptop placed on a desktop with a page open", the text for the middle image is "a bear in a forest, a young bear on a rock, and a young bear climbing a tree", and the text for the right image is "a sandwich with lettuce hanging out at the upper-left corner of the bread". The experimental results show that referring target detection and positioning based on dynamic adaptive reasoning can efficiently give the accurate position in the image described by the relevant sentence.
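Finally, a sketch of the application-time preprocessing: the image is resized to 256 × 256 and normalised, the sentence is tokenised, and both are passed through the frozen network in a single forward pass. The ImageNet normalisation statistics and the `model` object are assumptions.

```python
# Sketch of inference-time preprocessing: resize the image to 256x256, normalise it,
# tokenise the sentence, and run a frozen forward pass (normalisation stats are assumed).
import torch
from PIL import Image
from torchvision import transforms
from transformers import BertTokenizer

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

@torch.no_grad()
def locate(model, image_path, sentence, device="cuda"):
    model.to(device).eval()                       # parameters fixed, forward pass only
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    tokens = tokenizer(sentence, return_tensors="pt",
                       padding="max_length", max_length=20, truncation=True).to(device)
    box, conf = model(image, tokens["input_ids"], tokens["attention_mask"])
    return box, conf   # predicted (t_x, t_y, t_w, t_h) and confidence
```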

Claims (2)

1. A referring target detection and positioning method based on dynamic adaptive reasoning, characterised by comprising the following steps:
Step 1: feature encoding of text and image information;
Step 1-1: the image, of size 256 × 256 × 3, is encoded by a Darknet-53 convolutional neural network; the whole-image feature is denoted V = {v_1, v_2, …, v_{W×H}}, where W and H are the width and height of the image, respectively, and v_k is the k-th image-block region of the whole-image feature V;
Step 1-2: text features are encoded with a BERT pre-trained model;
a text description composed of N words is encoded into E = {e_1, e_2, …, e_N}, e_n ∈ R^d, where n indexes the n-th position in the sentence, e_n is the word vector of that word, and d is the word-vector dimension;
Step 2: multi-modal feature fusion based on an attention mechanism;
Step 2-1: E and V are input into the attention-based multi-modal feature fusion module;
the attention-based multi-modal feature fusion module comprises a visually-guided language-text attention module and a text-guided visual fusion feature enhancement module; in the t-th reasoning round, after the multi-modal feature fusion module is updated, it outputs the visual feature V^t and the weighted text feature respectively;
Step 2-2: for the visually-guided language-text attention module, an attention mechanism is used to construct a weight score for each word, and a historical cumulative score is introduced; the two scores are calculated by formulas given in the original filing as images (not reproduced here), in which i indexes the accumulation over the previous t-1 rounds, one term is the weight of the n-th word in the t-th reasoning round, another is the average pooling of the visual feature vector V^{t-1} output by the previous round of reasoning, ⊙ denotes a dot-product operation, and the remaining symbols are different learnable parameters of the model; the weight and cumulative scores take values in the range 0-1;
through the t-th round, the word vectors are updated into the weighted text feature;
Step 2-3: for the text-guided visual fusion feature enhancement module, a multi-head self-attention module is adopted to fuse the language and image features;
the basic Transformer structure is used, with 6 layers and 8 attention heads; the update follows the formulas given in the original filing as images, in which the weighted text word vectors are concatenated with the visual features ([:] denotes the concat operation), ConvBNReLU denotes convolution, BatchNorm and ReLU activation operations, and Resize denotes a resizing operation;
Step 2-4: the final predicted output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h are the position information of the predicted box in the image and conf is the confidence of the model;
Step 3: reasoning with a dynamic reward mechanism;
Step 3-1: for different texts and images, a dynamic reward module is provided, which decides whether to continue reasoning according to the state of the visual-text vector in the t-th round;
the visual and text vectors in the t-th round are combined according to the formula given in the original filing as an image, where action is the most probable action in action_prob, i.e. continue reasoning or stop reasoning, action_prob is the probability of continuing reasoning computed from the text vector and the visual vector of the current reasoning round, and e_cls is the head vector CLS after BERT encoding;
two reinforcement-learning reward mechanisms are used, namely a final reward and an instant reward;
Step 3-2: the final reward is a reward value obtained by comparing the reasoning result of the current round with the ground-truth box; that is, it is calculated from the candidate box O generated in the current round, according to the corresponding formula in the original filing, in which IoU is the overlap between the candidate box O of the round's final inference and the ground-truth training target box, computed during training;
Step 3-3: the instant reward computes a reward score under correct associations, according to the corresponding formula in the original filing, in which score_t is the association score between the visual vector and the text-description word vectors computed in the t-th round; the final term indicates whether the multi-modal correlation has improved: it is 1 if the fusion reasoning of step t produces a positive, gradually improving influence, and otherwise a penalty mechanism is applied;
Step 3-4: for the global training of dynamic adaptive reasoning, the total score of the dynamic reward module in the t-th round combines the two rewards, as given by the corresponding formula in the original filing; the resulting reward weight of round t is input into the language-text attention module of the fusion reasoning module for the next round of reasoning;
Step 3-5: cross-entropy loss is taken as the training loss, computed from the difference between the predicted box and the ground-truth box for each region in the image; the dynamic reward module judges whether to continue reasoning according to the visual features after reasoning, and reasoning stops when the final reward and the instant reward are both positively activated and the confidence of the predicted box is 1.
2. The method of claim 1, wherein d is 768 dimensions.
CN202211286108.2A 2022-10-20 2022-10-20 Dynamic adaptive inference-based nominal target detection and positioning method Pending CN115661842A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211286108.2A CN115661842A (en) 2022-10-20 2022-10-20 Dynamic adaptive inference-based nominal target detection and positioning method
PCT/CN2023/123906 WO2024037664A1 (en) 2022-10-20 2023-10-11 Referring target detection and positioning method based on dynamic adaptive reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211286108.2A CN115661842A (en) 2022-10-20 2022-10-20 Dynamic adaptive inference-based nominal target detection and positioning method

Publications (1)

Publication Number Publication Date
CN115661842A true CN115661842A (en) 2023-01-31

Family

ID=84989042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211286108.2A Pending CN115661842A (en) 2022-10-20 2022-10-20 Dynamic adaptive inference-based nominal target detection and positioning method

Country Status (2)

Country Link
CN (1) CN115661842A (en)
WO (1) WO2024037664A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196546A (en) * 2023-11-08 2023-12-08 杭州实在智能科技有限公司 RPA flow executing system and method based on page state understanding and large model driving
WO2024037664A1 (en) * 2022-10-20 2024-02-22 西北工业大学 Referring target detection and positioning method based on dynamic adaptive reasoning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241191A (en) * 2021-12-19 2022-03-25 西北工业大学 Cross-modal self-attention-based non-candidate-box expression understanding method
CN115062174A (en) * 2022-06-16 2022-09-16 电子科技大学 End-to-end image subtitle generating method based on semantic prototype tree
CN115661842A (en) * 2022-10-20 2023-01-31 西北工业大学 Dynamic adaptive inference-based nominal target detection and positioning method


Also Published As

Publication number Publication date
WO2024037664A1 (en) 2024-02-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventors after change: Zhang Yanning; Wang Peng; Zhang Zhipeng; Wei Zhimin
Inventors before change: Wang Peng; Zhang Zhipeng; Wei Zhimin; Zhang Yanning