CN115661842A - Dynamic adaptive inference-based nominal target detection and positioning method - Google Patents
- Publication number
- CN115661842A (application CN202211286108.2A)
- Authority
- CN
- China
- Prior art keywords
- reasoning
- text
- reward
- visual
- round
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Learning methods
- G06V10/77 — Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V30/412 — Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
- G06V30/414 — Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
Abstract
The invention discloses a named target detection and positioning method based on dynamic self-adaptive reasoning. Image features are extracted with a DarkNet pre-trained model based on a convolutional neural network, and text features with a BERT pre-trained model; a multi-modal fusion attention mechanism fuses the image and text information; finally, a reinforcement-learning reward-mechanism algorithm performs dynamic self-adaptive reasoning to detect and locate the position of the referred target in the image. The invention achieves higher accuracy and running speed, with marked improvements in both precision and speed over prior models.
Description
Technical Field
The invention belongs to the technical field of multi-modal visual languages, and particularly relates to a method for detecting and positioning a designated target.
Background
Referring target detection and positioning is a method for locating a target region in an image based on a natural language description. That is, given an image and a corresponding textual description, the machine should perform fusion reasoning over the multi-modal information of language and image to automatically determine the image region that the description refers to. The machine needs to comprehensively understand complex natural-language semantics and visual scene information, and to mine the implicit multi-modal semantic coupling through multi-step reasoning. This is one of the basic research problems in realizing machine intelligence within artificial intelligence, and it has wide application scenarios: for example, in robot autonomous navigation, a household robot can automatically search for and locate a target region in a visual scene according to textual command descriptions, and then carry out further operations on that basis. The method can also support other visual-language multi-modal tasks such as visual question answering and visual dialogue. Referring target detection and positioning is therefore a very important basic link in realizing machine intelligence, has great practical and commercial value, and has attracted wide attention in academia and industry in recent years.
Early work mostly adopted a two-stage method for referring target detection and positioning: a target detector extracts a group of candidate regions, and the target region with the highest probability is then selected from them as the final answer. It was later found that the two-stage method is limited by its first stage: if the target cannot be identified in the first stage, the second stage is invalid. In addition, in time complexity, the candidate regions entail a large number of redundant feature calculations, making the computational cost quite large. In recent years, researchers have proposed one-stage methods that directly extract global image features and then perform multi-step fusion and reasoning over the text information to determine the specific region in the image. However, text descriptions vary in length (a word, a phrase, or even a long text), so the number of reasoning steps required is not fixed. In fact, the implicit text-image coupling strength differs from sample to sample and so does the number of inference steps needed: some complex cases need more than 10 reasoning steps, while simpler ones need only 3-5. Existing one-stage methods use a fixed number of reasoning steps, which for short texts leads to redundant reasoning and increased time complexity, while for long texts the number of steps is insufficient to determine the final target region, producing wrong results.
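The fixed-step versus adaptive-step contrast described above can be sketched as a small loop. This is illustrative Python only; `fuse_step` and `should_stop` are hypothetical stand-ins for the fusion module and the reward-gated stop decision that the patent develops later:

```python
def adaptive_inference(fuse_step, should_stop, max_steps=12):
    """Run fusion-reasoning rounds until a stop test fires.

    fuse_step:   callable(state, t) -> new state, one round of fusion reasoning
    should_stop: callable(state) -> bool, stands in for the reward-gated decision
    """
    state = None
    for t in range(1, max_steps + 1):
        state = fuse_step(state, t)
        if should_stop(state):
            return state, t            # simple text: stop early, no redundant rounds
    return state, max_steps            # hard case: use the full step budget

# toy run: a "confidence" grows by 0.25 each round; stop once it reaches 0.7
final, steps = adaptive_inference(
    fuse_step=lambda s, t: (s or 0.0) + 0.25,
    should_stop=lambda s: s >= 0.7,
)
```

With a fixed-step method the loop would always run `max_steps` rounds; here a simple sample stops after 3.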
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a named target detection and positioning method based on dynamic self-adaptive reasoning. A DarkNet pre-trained model based on a convolutional neural network extracts the picture representation and a BERT pre-trained model extracts the language representation; a multi-modal fusion attention mechanism performs feature fusion on the image and text information; finally, a reinforcement-learning reward-mechanism algorithm performs dynamic self-adaptive reasoning to detect and locate the position of the referred target in the image. The invention achieves higher accuracy and running speed, with marked improvements in both precision and speed over prior models.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1: encoding characteristics of text and image information;
step 1-1: the image, resized to 256 × 256 × 3, is encoded by the Darknet-53 convolutional neural network; the whole image feature is recorded as V = {v_k}, k = 1, …, W × H, where W and H are the width and height of the image feature, and v_k refers to the k-th image block region of the whole image feature V;
step 1-2: performing text feature coding by using a BERT pre-training model;
for a text language description composed of N words, encoding produces E = {e_1, …, e_N}, e_n ∈ R^d, where n indexes the word at the n-th position in the sentence, e_n is the word vector of that word, and d is the dimension of the word vector;
step 2: multi-modal feature fusion based on attention mechanism;
step 2-1: inputting E and V into a multi-modal feature fusion module based on an attention mechanism;
the multi-modal feature fusion module based on the attention mechanism comprises a language text attention module under visual control and a visual fusion feature strengthening module under text control; during the t-th round of reasoning, after the multi-modal feature fusion module is updated, it outputs the updated visual features V^t and the updated text features E^t respectively;
step 2-2: for the language text attention module under visual control, an attention mechanism constructs a weight score w_n^t for each word and introduces a historical cumulative score, accumulated over the previous t−1 rounds (indexed by i); w_n^t, the weight of the n-th word under the t-th round of reasoning, is computed via a dot-product operation between the word vector and the average pooling of the visual feature V^{t−1} output by the previous round of reasoning, using different learnable parameters of the model, and its values lie in the range 0-1; the word vectors are then updated by these weights in the t-th round to give the weighted text representation;
step 2-3: for the visual fusion feature strengthening module under text control, a multi-head self-attention module is adopted to fuse the language and image features;
the Transformer basic structure is used, with 6 layers and 8 attention heads; the weighted text word vectors produced by the weight calculation are concatenated with the visual features, where [:] refers to the concat operation, ConvBNReLU refers to convolution, BatchNormalize, and ReLU activation function operations, and Resize refers to a change-of-size operation;
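The data flow of step 2-3 (concatenate the weighted text vector onto the visual features, then project) can be sketched at shape level. The toy "ConvBNReLU" below is only a placeholder, just an average followed by a ReLU clamp; the real module is a 6-layer, 8-head Transformer, so this illustrates concat-then-project, not the attention itself:

```python
def fuse(visual, text_weighted):
    """Shape-level sketch of the step 2-3 fusion: the weighted text vector is
    concatenated onto every visual position (the [:] concat), then passed
    through a toy stand-in for ConvBNReLU (mean + ReLU clamp)."""
    fused = [v + text_weighted for v in visual]                # concat per position
    mixed = [max(0.0, sum(row) / len(row)) for row in fused]   # toy ConvBNReLU
    return fused, mixed

visual = [[1.0, -2.0], [0.5, 0.5]]          # two visual positions, 2-dim each
fused, mixed = fuse(visual, [0.25, 0.25])   # one weighted 2-dim text vector
```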
step 2-4: the final predicted output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h are the position information of the prediction box in the image and conf is the confidence of the model;
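The five outputs (t_x, t_y, t_w, t_h, conf) can be turned into a pixel-space box in the usual Darknet/YOLO style. The patent only names the outputs; the sigmoid/exp decoding and the grid-cell and anchor parameters below are assumptions borrowed from the Darknet family:

```python
import math

def decode_box(tx, ty, tw, th, cell_x, cell_y, stride, anchor_w, anchor_h):
    """Assumed YOLO-style decoding of (t_x, t_y, t_w, t_h) into an
    (x1, y1, x2, y2) pixel box; cell/stride/anchor values are hypothetical."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    cx = (cell_x + sig(tx)) * stride       # box centre x, pixels
    cy = (cell_y + sig(ty)) * stride       # box centre y, pixels
    w = anchor_w * math.exp(tw)            # box width, pixels
    h = anchor_h * math.exp(th)            # box height, pixels
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

box = decode_box(0.0, 0.0, 0.0, 0.0,
                 cell_x=4, cell_y=4, stride=32, anchor_w=64, anchor_h=64)
```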
step 3: reasoning by adopting a dynamic reward mechanism;
step 3-1: aiming at different texts and images, a dynamic reward module is provided, and whether inference is continued or not is determined according to the current situation of a visual-text vector in the tth round;
the visual and text vectors in the t-th round are calculated as follows:
the action taken is the most likely action in action_prob (continue or stop reasoning), where action_prob is the predicted probability of continuing reasoning, computed from the text vector and visual vector of the current round of reasoning; e_cls refers to the head vector CLS after BERT encoding;
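The greedy action choice of step 3-1 (take the most likely entry of action_prob, continue or stop) reduces to an argmax; the dict representation below is an illustrative assumption:

```python
def choose_action(action_prob):
    """Greedy action choice from step 3-1: take the most likely action in
    action_prob, here a two-way continue/stop distribution."""
    return max(action_prob, key=action_prob.get)

a_continue = choose_action({"continue": 0.8, "stop": 0.2})
a_stop = choose_action({"continue": 0.1, "stop": 0.9})
```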
two reinforcement learning reward mechanisms are used, namely final reward and instant reward;
step 3-2: the final reward is a reward value obtained by calculating the difference between the reasoning result of the current round and the real box, namely, the final reward is calculated according to the candidate box O generated in the current round and is defined as follows:
wherein IoU is the overlap computed during training between the candidate box O of this round's final inference and the real training target box;
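The final reward of step 3-2 depends on the IoU between the round's candidate box O and the real training box. A standard IoU plus an assumed thresholded reward (the patent's exact definition is given only as an image; the 0.5 threshold and the +/-1 values are illustrative assumptions) looks like:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def final_reward(candidate, truth, thresh=0.5):
    """Assumed form of the final reward: +1 when the candidate box overlaps
    the real box well enough, -1 otherwise."""
    return 1.0 if iou(candidate, truth) >= thresh else -1.0

r_hit = final_reward((0, 0, 10, 10), (0, 0, 10, 10))
r_miss = final_reward((0, 0, 1, 1), (5, 5, 6, 6))
```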
step 3-3: the instant rewards calculate the reward points under the correct associations:
where score_t is the association score between the visual vector and the text descriptor vectors computed under the t-th round; the final value indicates whether the cross-modal association degree has improved: it is 1 if the step-t fusion reasoning brought a gradual positive improvement, and otherwise a penalty mechanism is generated;
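The instant reward of step 3-3 can be sketched as a comparison of association scores across rounds. The penalty value of -1 is an assumption; the text only says a penalty mechanism is generated:

```python
def instant_reward(score_t, score_prev, penalty=-1.0):
    """Sketch of the instant reward: 1 when the round-t association score
    between the visual and text vectors improves on the previous round,
    otherwise an (assumed) penalty."""
    return 1.0 if score_t > score_prev else penalty

scores = [0.2, 0.5, 0.4]   # toy association score per round
rewards = [instant_reward(scores[t], scores[t - 1]) for t in range(1, len(scores))]
```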
step 3-4: for global training of dynamic adaptive reasoning, the total score of the t round dynamic reward module is calculated as follows:
the reward weight of round t is input into the language text attention module of the fusion reasoning module to carry out the next round of reasoning;
step 3-5: CrossEntropyLoss is taken as the training loss, obtained by calculating the difference between the prediction box and the real box for each region in the image; the dynamic reward module judges whether to continue reasoning according to the post-reasoning visual features, and stops reasoning when the final reward and the instant reward are both positively activated and the confidence of the prediction box is 1.
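The stop condition of step 3-5 (both rewards positively activated and prediction-box confidence at 1) reduces to a small gate:

```python
def should_stop(final_r, instant_r, conf, conf_thresh=1.0):
    """Stop gate from step 3-5: reasoning halts only when both reward signals
    are positively activated AND the prediction box's confidence reaches the
    threshold (taken literally as 1 here, per the text)."""
    return final_r > 0 and instant_r > 0 and conf >= conf_thresh

gate = [should_stop(1, 1, 1.0),    # both rewards positive, conf at 1 -> stop
        should_stop(1, -1, 1.0),   # instant reward penalized -> keep going
        should_stop(1, 1, 0.7)]    # confidence too low -> keep going
```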
Preferably, d is 768 dimensions.
The invention has the following beneficial effects:
the invention realizes the detection and positioning of the designated target by using an innovative and efficient dynamic self-adaptive reasoning method. Different from the previous model, the model dynamically and continuously fuses the inference prediction bounding box by directly utilizing the image and the language information, does not need to generate a series of candidate boxes for the picture in two stages, and solves the problems of insufficient inference or calculation redundancy caused by the fact that fixed inference steps are needed in the existing one-stage method, so that higher accuracy and higher operation speed are obtained. The experimental result shows that the model architecture of the invention has outstanding improvement in precision and speed compared with the prior model.
Drawings
FIG. 1 is a diagram showing the structure of the method of the present invention.
FIG. 2 is a diagram of the actual test results of three different pictures according to the embodiment of the present invention, wherein 1 represents a real box, 2 represents the results obtained by the method of the present invention, and 3 represents the results obtained by other optimal methods.
FIG. 3 is a diagram illustrating the distribution of image attention in different inference steps according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in FIG. 1, the invention provides a method for detecting and positioning a nominal target based on dynamic adaptive reasoning, which can dynamically determine the reasoning step number according to the text and image characteristics, and has higher speed and higher accuracy.
The technical scheme of the invention comprises three parts: the first is the feature coding process for text and image information, the second is the attention-based multi-modal feature fusion process, and the third is the dynamic reasoning process that automatically determines the number of steps. In the first part, the pre-trained convolutional neural network Darknet-53 performs feature coding of the picture information and the BERT pre-trained model performs feature coding of the text information. In the second part, a multi-modal fusion reasoning mechanism based on attention reasons over the different words of the text with respect to the visual features, strengthening the weights of key information words, while the text information enhances the target-region features of the visual information. In the third part, a dynamic reward mechanism built on reinforcement learning controls whether each reasoning step is correct and dynamically judges whether the reasoning is sufficient; if not, the second part iterates, and once the image and text features have been sufficiently fused and reasoned to yield a correct answer, the reasoning stops.
A nominal target detection positioning method based on dynamic self-adaptive reasoning comprises the following steps:
step 1: encoding characteristics of text and image information;
step 1-1: the image, resized to 256 × 256 × 3, is encoded by the Darknet-53 convolutional neural network; the whole image feature is recorded as V = {v_k}, k = 1, …, W × H, where W and H are the width and height of the image feature, and v_k refers to the k-th image block region of the whole image feature V;
step 1-2: performing text feature coding by using a BERT pre-training model;
for a text language description composed of N words, encoding produces E = {e_1, …, e_N}, e_n ∈ R^d, where n indexes the word at the n-th position in the sentence, e_n is the word vector of that word, and d is the dimension of the word vector;
step 2: multi-modal feature fusion based on attention mechanism;
step 2-1: inputting E and V into a multi-modal feature fusion module based on an attention mechanism;
the multi-modal feature fusion module based on the attention mechanism comprises a language text attention module under visual control and a visual fusion feature strengthening module under text control; the multi-modal feature fusion module is interleaved with the reasoning steps, each round of reasoning being one fusion-reasoning pass over the two parts of the updated feature fusion module; during the t-th round, after the module is updated, it outputs the updated visual features V^t and the updated text features E^t respectively;
step 2-2: for the language text attention module under visual control, an attention mechanism constructs a weight score w_n^t for each word and introduces a historical cumulative score, accumulated over the previous t−1 rounds (indexed by i), to avoid the model forgetting historical inference scores; w_n^t, the weight of the n-th word under the t-th round of reasoning, is computed via a dot-product operation between the word vector and the average pooling of the visual feature V^{t−1} output by the previous round of reasoning, using different learnable parameters of the model, and its values lie in the range 0-1; the word vectors are then updated by these weights in the t-th round to give the weighted text representation;
step 2-3: for the visual fusion feature strengthening module under text control, in order to establish the relationship between language and images more deeply, a multi-head self-attention module is adopted to fuse the language and image features;
the visual vector is updated under the fusion control of the text vectors, using the Transformer basic structure with 6 layers and 8 attention heads; in this way, the visual vector output by each round of reasoning carries strongly coupled text information, ensuring the effectiveness of the reasoning; specifically, the weighted text word vectors produced by the weight calculation are concatenated with the visual features, where [:] refers to the concat operation;
step 2-4: the final predicted output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h are the position information of the prediction box in the image and conf is the confidence of the model; the features obtained from the cross-modal attention module are spliced, depending only on the output V^t of the visual part after cross-modal fusion;
step 3: reasoning by adopting a dynamic reward mechanism;
step 3-1: the above steps of fusion reasoning are iterative steps. Aiming at different texts and images, a dynamic reward module is provided, and whether to continue reasoning is determined according to the current situation of the visual-text vector in the t-th round;
the visual and text vectors in the t-th round are calculated as follows:
wherein action decides whether to continue reasoning, and action_prob is the predicted probability of continuing reasoning, computed from the text vector and visual vector of the current round of reasoning; e_cls refers to the head vector CLS after BERT encoding;
two reinforcement learning reward mechanisms are used, namely final reward and instant reward;
step 3-2: the final reward is a reward value obtained by calculating the difference between the reasoning result of the current round and the real box, namely, the final reward is calculated according to the candidate box O generated in the current round and is defined as follows:
the IoU is a difference value between a candidate frame O in the final inference of the round and a real training target frame (because a data set does not have fixed inference steps of each sample) calculated during training, and the IoU is fixed to be 1 because the real frame value is 0 during testing;
step 3-3: the instant reward is used to stimulate positive influences during training and reasoning, so that the fusion module associates each word weight more closely with the features of different visual regions. Thus, the instant reward calculates the reward score under the correct association:
where score_t is the association score between the visual vector and the text descriptor vectors computed under the t-th round; the final value indicates whether the cross-modal association degree has improved: it is 1 if the step-t fusion reasoning brought a gradual positive improvement, and otherwise a penalty mechanism is generated;
step 3-4: for global training of dynamic adaptive reasoning, the total score of the dynamic reward module in the t round is calculated as follows:
the reward weight of round t is input into the language text attention module of the fusion reasoning module to carry out the next round of reasoning;
step 3-5: CrossEntropyLoss is taken as the training loss, obtained by calculating the difference between the prediction box and the real box for each region in the image; the dynamic reward module judges whether to continue reasoning according to the post-reasoning visual features, and stops reasoning when the final reward and the instant reward are both positively activated and the confidence of the prediction box is 1.
The specific embodiment is as follows:
1. image features
Given a picture of a natural scene, the whole picture is resized and input into the pre-trained feature-extraction network Darknet-53 to encode the image features; the input picture size is 256 × 256 × 3.
2. Text features
The maximum sentence length is specified as 20 words. The position-encoded word vectors of the text description are input into a BERT network to obtain the fused sentence feature vectors E = {e_1, …, e_N}, e_n ∈ R^d, where e_n is the representation of each word, the word-vector dimension d is 768, and N is at most 20.
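At shape level, the 20-word cap and per-word vectors can be sketched with toy placeholders standing in for BERT:

```python
def pad_tokens(words, max_len=20, d=768):
    """Toy stand-in for the text pipeline: truncate/pad a sentence to max_len
    word positions, each mapped to a d-dimensional vector. Real words get a
    constant placeholder vector here; BERT would of course produce contextual
    768-dimensional embeddings instead."""
    vecs = [[1.0] * d for _ in words[:max_len]]               # word positions
    vecs += [[0.0] * d for _ in range(max_len - len(vecs))]   # padding positions
    return vecs

E = pad_tokens("a gray laptop on the desk".split())
```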
3. Multimodal feature fusion enhancement with attention mechanism
The image features are expanded to 512 dimensions and the language features to 20 × 512 through an MLP network; position codes are added, and both are then input together into the multi-modal attention module. The module consists of two parts: the visual features strengthen the reasoning weights of different words in the text information, and the text information enhances the regional features of the visual information. The text-information part mainly computes the word weights w_n at different positions, with the model calculation carried out by formula (1) of the attention module. The instant reward score is initialized with a value of 1. The visual-information part splices the weighted text word vectors with the feature-enhanced image features of the previous stage and inputs them into the multi-head self-attention layers. There are 6 multi-head self-attention layers with 8 attention heads.
4. Dynamic reasoning steps under reward mechanism
Once the fusion features are obtained, the visual-feature part and the weighted-text part are selected; the instant and final reward mechanisms calculate a weight score and return it to the text-information weight-reasoning part. The gating function is composed of the two reward functions and the visual candidate box, and is activated only when both rewards equal 1 and the confidence of the prediction box is 1.
5. Model training
The whole training process is end-to-end. Four datasets, RefCOCO, RefCOCO+, RefCOCOg, and Ref-Reasoning, are adopted for model training and evaluation. The batch size is set to 8 and the initial learning rate to 1e-4. The model was trained for 100 epochs on 8 Titan X GPUs, with the learning rate halved every 10 epochs and gradients optimized with the Adam method.
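The halve-every-10-epochs schedule works out to a simple step function. Counting epochs from 0 (so the first halving takes effect at epoch 10) is an assumption; the text does not say:

```python
def lr_at(epoch, base=1e-4, halve_every=10):
    """Step learning-rate schedule from the embodiment: start at 1e-4 and
    halve every 10 epochs (epoch counted from 0, an assumption)."""
    return base * (0.5 ** (epoch // halve_every))

lrs = [lr_at(e) for e in (0, 9, 10, 25)]
```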
6. Model application
After training, multiple model checkpoints are obtained from the per-step saving, and the optimal model (the one with the best results on the test set) is selected for application. For an input image and sentence, the image is resized to 256 × 256 and normalized, and the sentence is tokenized; these serve as the model inputs. The parameters of the whole network are fixed, and only the image and language data are fed in and propagated forward. The model automatically performs dynamic reasoning according to the implicit coupling strength between the text and the image, and finally yields a prediction after the appropriate amount of reasoning. Actual experimental results are shown in FIG. 2 and FIG. 3. In FIG. 2, the text for the leftmost image is "a gray laptop computer placed on a desktop with a page open", the text for the middle image is "a bear in a forest, with one cub on a rock and one cub climbing a tree", and the text for the right image is "a sandwich with lettuce hanging over the upper-left corner of the bread". The experimental results show that referring target detection and positioning based on dynamic adaptive reasoning can efficiently give the accurate position in the image described by the relevant sentence.
Claims (2)
1. A nominal target detection positioning method based on dynamic self-adaptive reasoning is characterized by comprising the following steps:
step 1: encoding characteristics of text and image information;
step 1-1: the image, resized to 256 × 256 × 3, is encoded by the Darknet-53 convolutional neural network; the whole image feature is recorded as V = {v_k}, k = 1, …, W × H, where W and H are the width and height of the image feature, and v_k refers to the k-th image block region of the whole image feature V;
step 1-2: performing text feature coding by using a BERT pre-training model;
for a text language description composed of N words, encoding produces E = {e_1, ..., e_N}, e_n ∈ R^d, where n indexes the word at the n-th position in the sentence, e_n is the word vector of that word, and d is the word-vector dimension;
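The shapes produced by step 1 can be sketched as below. The random arrays are stand-ins for the real Darknet-53 and BERT outputs; the concrete feature-map size is an assumption, while d = 768 follows claim 2.

```python
import numpy as np

# Stand-ins for the real encoders: Darknet-53 yields W*H image-block
# features and BERT yields one d-dimensional vector per word.
# Shapes follow claim 1 (d = 768 per claim 2); the values are random.
rng = np.random.default_rng(0)

W, H, d, N = 8, 8, 768, 6            # feature-map size, vector dim, word count
V = rng.standard_normal((W * H, d))  # whole-image feature: one row per block v_k
E = rng.standard_normal((N, d))      # text feature: one row per word vector e_n
```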
step 2: multi-modal feature fusion based on attention mechanism;
step 2-1: inputting E and V into a multi-modal feature fusion module based on an attention mechanism;
the attention-based multi-modal feature fusion module comprises a visually guided language text attention module and a text-guided visual fusion feature enhancement module; during the t-th round of reasoning, after the multi-modal feature fusion module is updated, it outputs V_t and E_t respectively;
step 2-2: in the visually guided language text attention module, an attention mechanism constructs a weight score α_n^t for each word, introducing the scores accumulated over the previous t-1 rounds; α_n^t is the weight of the n-th word in the t-th round of reasoning, computed via a dot-product operation between the word vector and the average pooling of the visual feature V_{t-1} output by the previous round, with different learnable parameters of the model; the values of the weights lie in the range 0 to 1;
the word vector updated by the t-th round is then e_n^t = α_n^t · e_n;
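One way to realize the visually guided word weighting of step 2-2 is sketched below: each word's score is its dot product with the average-pooled visual feature of the previous round, accumulated across rounds and squashed to (0, 1) by a softmax. The claim's learned projection parameters are omitted, so this is an assumption-laden simplification, not the patented formula.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()

def word_attention(E, V_prev, hist):
    """Visually guided word weights (simplified sketch).
    E: (N, d) word vectors; V_prev: (K, d) visual features from round t-1;
    hist: (N,) raw scores accumulated over rounds 1..t-1."""
    v_bar = V_prev.mean(axis=0)      # average pooling over image blocks
    raw = E @ v_bar                  # dot-product relevance per word
    hist = hist + raw                # historical accumulation
    alpha = softmax(hist)            # weights alpha_n^t in (0, 1)
    E_t = alpha[:, None] * E         # re-weighted word vectors for round t
    return alpha, E_t, hist

rng = np.random.default_rng(1)
E = rng.standard_normal((5, 16))
V_prev = rng.standard_normal((9, 16))
alpha, E_t, hist = word_attention(E, V_prev, np.zeros(5))
```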
step 2-3: for the text-guided visual fusion feature enhancement module, a multi-head self-attention module is adopted to fuse the language and image features;
the basic Transformer structure is used, with 6 layers and 8 attention heads, as follows:
where E_t is the text word vector after the weight calculation and [ : ] denotes the concat operation; ConvBNReLU denotes convolution, BatchNorm, and ReLU activation function operations; Resize denotes a size-change operation;
step 2-4: the final predicted output is t_x, t_y, t_w, t_h, conf, where t_x, t_y, t_w, t_h is the position information of the prediction box in the image and conf is the confidence of the model;
step 3: reasoning with a dynamic reward mechanism;
step 3-1: a dynamic reward module is provided for different texts and images, which decides whether to continue reasoning according to the state of the visual-text vectors in the t-th round;
the visual and text vectors in the t-th round are calculated as follows:
where action is the most likely action in action_prob, taking the value continue reasoning or stop reasoning; action_prob is the predicted probability of continuing reasoning, computed from the text vector and visual vector of the current round of reasoning; e_cls is the head vector CLS after BERT encoding;
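The continue/stop decision of step 3-1 amounts to an argmax over a two-way probability. A hedged sketch follows; in the patent, action_prob comes from a learned head over the round-t text and visual vectors, which is replaced here by explicit inputs.

```python
import numpy as np

CONTINUE, STOP = 0, 1

def choose_action(action_prob: np.ndarray) -> int:
    """Pick the most likely action; action_prob = [p_continue, p_stop]."""
    assert action_prob.shape == (2,) and abs(action_prob.sum() - 1.0) < 1e-6
    return int(np.argmax(action_prob))

a1 = choose_action(np.array([0.7, 0.3]))   # keep reasoning another round
a2 = choose_action(np.array([0.2, 0.8]))   # stop and output the prediction
```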
two reinforcement learning reward mechanisms are used, namely final reward and instant reward;
step 3-2: the final reward is the reward value obtained from the difference between the current round's reasoning result and the ground-truth box, i.e., it is computed from the candidate box O generated in the current round and defined as follows:
where IoU is the overlap, computed during training, between the candidate box O of the current round's final estimate and the ground-truth target box;
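The final reward of step 3-2 is driven by the IoU between the round's candidate box O and the ground-truth box. A standard IoU in (x1, y1, x2, y2) corner form (the patent does not state its box convention, so this layout is an assumption):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)   # intersection corners
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```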
step 3-3: the instant reward computes the reward score under correct associations:
where score_t is the association score between the visual vector and the text description word vector computed in the t-th round; the reward indicates whether the degree of cross-modal association has improved: if the fusion reasoning of step t has a positive, gradually improving influence, the reward is 1; otherwise a penalty mechanism is applied;
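The instant reward of step 3-3 can be sketched as a simple improvement test on the association score. The +1 value follows the claim; the comparison against the previous round's score and the concrete penalty value are assumptions, since the extracted text omits the formula.

```python
def instant_reward(score_t: float, score_prev: float, penalty: float = -1.0) -> float:
    """+1 if the round-t fusion improved the cross-modal association score,
    otherwise a penalty (the penalty value here is illustrative)."""
    return 1.0 if score_t > score_prev else penalty

r_up = instant_reward(0.8, 0.6)    # association improved
r_down = instant_reward(0.5, 0.6)  # association degraded -> penalty
```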
step 3-4: for the global training of dynamic adaptive reasoning, the total score reward_t of the dynamic reward module in the t-th round is computed as follows:
reward_t is then fed into the language text attention module of the fusion reasoning module for the next round of reasoning;
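How the two rewards combine into reward_t is not spelled out in the extracted text; a plain weighted sum is one natural reading, sketched below with assumed weights.

```python
def total_reward(final_r: float, instant_r: float,
                 w_final: float = 1.0, w_instant: float = 1.0) -> float:
    """Round-t score of the dynamic reward module, here a weighted sum of
    the final (IoU-based) reward and the instant (association) reward."""
    return w_final * final_r + w_instant * instant_r

reward_t = total_reward(0.75, 1.0)   # good box overlap, improving association
```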
step 3-5: CrossEntropyLoss is used as the training loss, obtained by computing the difference between the prediction box and the ground-truth box for each region in the image; the dynamic reward module judges whether to continue reasoning according to the visual features after reasoning, and reasoning stops when both the final reward and the instant reward are positively activated and the confidence of the prediction box is 1.
2. The method of claim 1, wherein d is 768 dimensions.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211286108.2A CN115661842A (en) | 2022-10-20 | 2022-10-20 | Dynamic adaptive inference-based nominal target detection and positioning method |
PCT/CN2023/123906 WO2024037664A1 (en) | 2022-10-20 | 2023-10-11 | Referring target detection and positioning method based on dynamic adaptive reasoning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115661842A true CN115661842A (en) | 2023-01-31 |
Family
ID=84989042
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211286108.2A Pending CN115661842A (en) | 2022-10-20 | 2022-10-20 | Dynamic adaptive inference-based nominal target detection and positioning method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115661842A (en) |
WO (1) | WO2024037664A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117196546A (en) * | 2023-11-08 | 2023-12-08 | 杭州实在智能科技有限公司 | RPA flow executing system and method based on page state understanding and large model driving |
WO2024037664A1 (en) * | 2022-10-20 | 2024-02-22 | 西北工业大学 | Referring target detection and positioning method based on dynamic adaptive reasoning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114241191A (en) * | 2021-12-19 | 2022-03-25 | 西北工业大学 | Cross-modal self-attention-based non-candidate-box expression understanding method |
CN115062174A (en) * | 2022-06-16 | 2022-09-16 | 电子科技大学 | End-to-end image subtitle generating method based on semantic prototype tree |
CN115661842A (en) * | 2022-10-20 | 2023-01-31 | 西北工业大学 | Dynamic adaptive inference-based nominal target detection and positioning method |
Also Published As
Publication number | Publication date |
---|---|
WO2024037664A1 (en) | 2024-02-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information |
Inventors after: Zhang Yanning; Wang Peng; Zhang Zhipeng; Wei Zhimin |
Inventors before: Wang Peng; Zhang Zhipeng; Wei Zhimin; Zhang Yanning |