CN115661842A - Dynamic adaptive inference-based nominal target detection and positioning method - Google Patents

Dynamic adaptive inference-based nominal target detection and positioning method

Info

Publication number
CN115661842A
Authority
CN
China
Prior art keywords
reasoning
text
reward
visual
round
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211286108.2A
Other languages
Chinese (zh)
Inventor
Wang Peng
Zhang Zhipeng
Wei Zhimin
Zhang Yanning
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202211286108.2A
Publication of CN115661842A
Priority to PCT/CN2023/123906 (published as WO2024037664A1)
Legal status: Pending

Classifications

    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a referring target detection and positioning method based on dynamic adaptive reasoning. Image features are extracted with a convolutional DarkNet pre-trained model and text features with a BERT pre-trained model; the image and text information is fused by a multi-modal attention mechanism; finally, dynamic adaptive reasoning is performed with a reinforcement-learning reward mechanism to detect and locate the position of the referred target in the image. The invention achieves higher accuracy and running speed, with clear improvements in both precision and speed over prior models.

Description

Referring target detection and positioning method based on dynamic adaptive reasoning
Technical Field
The invention belongs to the technical field of multi-modal vision and language, and particularly relates to a method for detecting and locating a target referred to by a natural-language description.
Background
Referring target detection and positioning is the task of locating a target region in an image based on a natural-language description. Given an image and a corresponding textual description, the machine is expected to perform fusion reasoning over the multi-modal information of language and image and automatically determine the image region that the description refers to. This requires the machine to comprehensively understand complex natural-language semantics and the visual scene, and to mine the implicit multi-modal semantic coupling through multi-step reasoning. The task is one of the basic research problems for realising machine intelligence and has a wide range of application scenarios: a robot can navigate automatically, and a household robot can search for and locate a target region in the visual scene according to textual commands and then perform further operations on that basis. The method can also be applied to other vision-language multi-modal tasks such as visual question answering and visual dialogue. Referring target detection and positioning is therefore an important basic link in realising machine intelligence, has great practical and commercial value, and has attracted much attention from academia and industry in recent years.
Early work mostly adopted two-stage methods: a target detector first extracts a set of candidate regions, and the candidate with the highest probability is then selected as the final answer. It was later found that two-stage methods are limited by the first stage; if the target is not among the candidates of the first stage, the second stage is invalid. In terms of time complexity, the candidate regions also entail a large amount of redundant feature computation, making the cost considerable. In recent years, researchers have proposed one-stage methods that directly extract global image features and then perform multi-step fusion and reasoning with the text information to determine the specific region in the image. However, text descriptions vary in length, from a single word or phrase to a long sentence, so the number of reasoning steps required is not fixed. In fact, the implicit text-image coupling strength differs from sample to sample: some complex cases need more than 10 reasoning steps, while simpler ones need only 3-5. Existing one-stage methods use a fixed number of reasoning steps, which causes redundant steps and increased time complexity for short texts, while for long texts the number of steps is insufficient, the target region cannot be adequately determined, and the result is wrong.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a referring target detection and positioning method based on dynamic adaptive reasoning. A convolutional DarkNet pre-trained model is used to extract image representations and a BERT pre-trained model to extract language representations; a multi-modal fusion attention mechanism is used to fuse the image and text information; finally, dynamic adaptive reasoning is carried out with a reinforcement-learning reward mechanism to detect and locate the position of the referred target in the image. The invention achieves higher accuracy and running speed, with clear improvements in both precision and speed over prior models.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
Step 1: feature encoding of text and image information;
Step 1-1: the image, of size 256 × 256 × 3, is encoded by a Darknet-53 convolutional neural network; the whole-image feature is denoted V = {v_1, v_2, …, v_{W×H}}, where W and H are the width and height of the image, respectively, and v_k is the k-th image-block region of the whole-image feature V;
Step 1-2: text features are encoded with a BERT pre-trained model;
a text description composed of N words is encoded into E = {e_1, e_2, …, e_N}, e_n ∈ R^d, where n indexes the n-th position in the sentence, e_n is the word vector of that word, and d is the word-vector dimension;
Step 2: multi-modal feature fusion based on an attention mechanism;
Step 2-1: E and V are input into the attention-based multi-modal feature fusion module;
the attention-based multi-modal feature fusion module comprises a visually-guided language-text attention module and a text-guided visual fusion feature enhancement module; in the t-th reasoning round, after the multi-modal feature fusion module is updated, it outputs the visual feature V^t and the weighted text feature respectively;
Step 2-2: for the visually-guided language-text attention module, an attention mechanism is used to construct a weight score for each word, and a historical cumulative score is introduced; the two scores are calculated by formulas given in the original filing as images (not reproduced here), in which i indexes the accumulation over the previous t-1 rounds, one term is the weight of the n-th word in the t-th reasoning round, another is the average pooling of the visual feature vector V^{t-1} output by the previous round of reasoning, ⊙ denotes a dot-product operation, and the remaining symbols are different learnable parameters of the model; the weight and cumulative scores take values in the range 0-1;
through the t-th round, the word vectors are updated into the weighted text feature;
Step 2-3: for the text-guided visual fusion feature enhancement module, a multi-head self-attention module is adopted to fuse the language and image features;
the basic Transformer structure is used, with 6 layers and 8 attention heads; the update follows the formulas given in the original filing as images, in which the weighted text word vectors are concatenated with the visual features ([:] denotes the concat operation), ConvBNReLU denotes convolution, BatchNorm and ReLU activation operations, and Resize denotes a resizing operation;
Step 2-4: the final predicted output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h are the position information of the predicted box in the image and conf is the confidence of the model;
Step 3: reasoning with a dynamic reward mechanism;
Step 3-1: for different texts and images, a dynamic reward module is provided, which decides whether to continue reasoning according to the state of the visual-text vector in the t-th round;
the visual and text vectors in the t-th round are combined according to the formula given in the original filing as an image, where action is the most probable action in action_prob, i.e. continue reasoning or stop reasoning, action_prob is the probability of continuing reasoning computed from the text vector and the visual vector of the current reasoning round, and e_cls is the head vector CLS after BERT encoding;
two reinforcement-learning reward mechanisms are used, namely a final reward and an instant reward;
Step 3-2: the final reward is a reward value obtained by comparing the reasoning result of the current round with the ground-truth box; that is, it is calculated from the candidate box O generated in the current round, according to the corresponding formula in the original filing, in which IoU is the overlap between the candidate box O of the round's final inference and the ground-truth training target box, computed during training;
Step 3-3: the instant reward computes a reward score under correct associations, according to the corresponding formula in the original filing, in which score_t is the association score between the visual vector and the text-description word vectors computed in the t-th round; the final term indicates whether the multi-modal correlation has improved: it is 1 if the fusion reasoning of step t produces a positive, gradually improving influence, and otherwise a penalty mechanism is applied;
Step 3-4: for the global training of dynamic adaptive reasoning, the total score of the dynamic reward module in the t-th round combines the two rewards, as given by the corresponding formula in the original filing; the resulting reward weight of round t is input into the language-text attention module of the fusion reasoning module for the next round of reasoning;
Step 3-5: cross-entropy loss is taken as the training loss, computed from the difference between the predicted box and the ground-truth box for each region in the image; the dynamic reward module judges whether to continue reasoning according to the visual features after reasoning, and reasoning stops when the final reward and the instant reward are both positively activated and the confidence of the predicted box is 1.
Preferably, d is 768 dimensions.
The invention has the following beneficial effects:
the invention realizes the detection and positioning of the designated target by using an innovative and efficient dynamic self-adaptive reasoning method. Different from the previous model, the model dynamically and continuously fuses the inference prediction bounding box by directly utilizing the image and the language information, does not need to generate a series of candidate boxes for the picture in two stages, and solves the problems of insufficient inference or calculation redundancy caused by the fact that fixed inference steps are needed in the existing one-stage method, so that higher accuracy and higher operation speed are obtained. The experimental result shows that the model architecture of the invention has outstanding improvement in precision and speed compared with the prior model.
Drawings
FIG. 1 is a diagram showing the structure of the method of the present invention.
FIG. 2 is a diagram of the actual test results of three different pictures according to the embodiment of the present invention, wherein 1 represents a real box, 2 represents the results obtained by the method of the present invention, and 3 represents the results obtained by other optimal methods.
FIG. 3 is a diagram illustrating the distribution of image attention in different inference steps according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in FIG. 1, the invention provides a referring target detection and positioning method based on dynamic adaptive reasoning, which dynamically determines the number of reasoning steps according to the text and image features and achieves both higher speed and higher accuracy.
The technical scheme of the invention comprises the following parts: the first part is the feature encoding of text and image information, the second part is the attention-based multi-modal feature fusion, and the third part is the dynamic reasoning process that automatically determines the number of steps. In the first part, a pre-trained convolutional Darknet-53 network encodes the picture information and a BERT pre-trained model encodes the text information. In the second part, an attention-based multi-modal fusion reasoning mechanism lets the visual features re-weight the different words in the text information, strengthening the key information words, while the text information strengthens the target-region features of the visual information. In the third part, a dynamic reward mechanism based on reinforcement learning controls whether each reasoning step is correct and dynamically judges whether the reasoning is sufficient; if not, the iterative reasoning of the second part continues, and once the image and text features have been sufficiently fused and reasoned to obtain the correct answer, the reasoning stops.
A referring target detection and positioning method based on dynamic adaptive reasoning comprises the following steps:
Step 1: feature encoding of text and image information;
Step 1-1: the image, of size 256 × 256 × 3, is encoded by a Darknet-53 convolutional neural network; the whole-image feature is denoted V = {v_1, v_2, …, v_{W×H}}, where W and H are the width and height of the image, respectively, and v_k is the k-th image-block region of the whole-image feature V;
Step 1-2: text features are encoded with a BERT pre-trained model;
a text description composed of N words is encoded into E = {e_1, e_2, …, e_N}, e_n ∈ R^d, where n indexes the n-th position in the sentence, e_n is the word vector of that word, and d is the word-vector dimension;
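For illustration, a minimal sketch of this encoding step is given below. It assumes a torchvision ResNet backbone standing in for Darknet-53 and the Hugging Face `transformers` BERT model; the class name `ImageTextEncoder`, the 1×1 projection to d = 768 and the printed shapes are illustrative assumptions, not the exact implementation of the method.

```python
# Minimal encoding sketch (assumption: a ResNet backbone stands in for Darknet-53,
# and Hugging Face BERT provides the 768-d word vectors described in the text).
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel, BertTokenizer

class ImageTextEncoder(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        backbone = resnet50(weights=None)                           # stand-in for Darknet-53
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # keep the spatial feature map
        self.proj = nn.Conv2d(2048, d, kernel_size=1)               # project to word-vector dim d
        self.bert = BertModel.from_pretrained("bert-base-uncased")

    def forward(self, image, input_ids, attention_mask):
        # image: (B, 3, 256, 256) -> V: (B, W*H, d), one vector per image block region
        fmap = self.proj(self.cnn(image))                           # (B, d, H', W')
        V = fmap.flatten(2).transpose(1, 2)                         # (B, W*H, d)
        # text: N word tokens -> E: (B, N, d), plus the CLS head vector e_cls
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return V, out.last_hidden_state, out.pooler_output

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = ImageTextEncoder()
toks = tokenizer("the gray laptop on the desk", return_tensors="pt",
                 padding="max_length", max_length=20, truncation=True)
V, E, e_cls = enc(torch.randn(1, 3, 256, 256), toks["input_ids"], toks["attention_mask"])
print(V.shape, E.shape, e_cls.shape)   # e.g. (1, 64, 768), (1, 20, 768), (1, 768)
```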
Step 2: multi-modal feature fusion based on an attention mechanism;
Step 2-1: E and V are input into the attention-based multi-modal feature fusion module;
the attention-based multi-modal feature fusion module comprises a visually-guided language-text attention module and a text-guided visual fusion feature enhancement module; the module is interleaved with the reasoning steps, each reasoning round being a fusion reasoning pass over the two parts of the updated feature fusion module; in the t-th reasoning round, after the module is updated, it outputs the visual feature V^t and the weighted text feature respectively;
Step 2-2: for the visually-guided language-text attention module, an attention mechanism is used to construct a weight score for each word, and a historical cumulative score is introduced to prevent the model from forgetting the scores of earlier reasoning rounds; the two scores are calculated by formulas given in the original filing as images (not reproduced here), in which i indexes the accumulation over the previous t-1 rounds, one term is the weight of the n-th word in the t-th reasoning round, another is the average pooling of the visual feature vector V^{t-1} output by the previous round of reasoning, ⊙ denotes a dot-product operation, and the remaining symbols are different learnable parameters of the model; the weight and cumulative scores take values in the range 0-1;
through the t-th round, the word vectors are updated into the weighted text feature;
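Because the weighting formulas are given only as images, the sketch below is one plausible reading of this step under stated assumptions: each word score comes from a dot product between projected word vectors and the average-pooled visual feature of round t-1, scores are accumulated over rounds so that history is not forgotten, normalised into 0-1, and used to re-weight the word vectors. The projections `W1`/`W2` and the exact accumulation rule are assumptions.

```python
# Hedged sketch of the visually guided word-attention update (round t).
# Assumption: scores come from a dot product between projected word vectors and the
# average-pooled visual feature of round t-1, with a running historical accumulation.
import torch
import torch.nn as nn

class VisualGuidedWordAttention(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        self.W1 = nn.Linear(d, d)   # learnable parameter for word vectors (assumed form)
        self.W2 = nn.Linear(d, d)   # learnable parameter for the pooled visual feature

    def forward(self, E, V_prev, hist_score=None):
        # E: (B, N, d) word vectors; V_prev: (B, K, d) visual feature of round t-1
        v_bar = V_prev.mean(dim=1, keepdim=True)               # average pooling of V^{t-1}
        score = (self.W1(E) * self.W2(v_bar)).sum(-1)          # dot product per word, (B, N)
        score = torch.softmax(score, dim=-1)                   # values in 0-1
        # historical cumulative score over previous rounds (assumed: running, re-normalised sum)
        hist_score = score if hist_score is None else hist_score + score
        w = hist_score / hist_score.sum(dim=-1, keepdim=True)  # word weights, (B, N), in 0-1
        E_t = w.unsqueeze(-1) * E                              # re-weighted word vectors
        return E_t, w, hist_score

attn = VisualGuidedWordAttention()
E_t, w, hist = attn(torch.randn(2, 20, 768), torch.randn(2, 64, 768))
print(E_t.shape, w.shape)   # (2, 20, 768) (2, 20)
```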
Step 2-3: for the text-guided visual fusion feature enhancement module, in order to establish a deeper relationship between language and image, a multi-head self-attention module is adopted to fuse the language and image features;
the basic Transformer structure is used, with 6 layers and 8 attention heads, and the visual vectors are updated under the fusion control of the text vectors; in this way, the visual vector output by each reasoning round is strongly coupled with the text information, which ensures the effectiveness of the reasoning; the update follows the formulas given in the original filing as images, in which the weighted text word vectors are concatenated with the visual features ([:] denotes the concat operation);
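A minimal sketch of this text-guided visual enhancement follows, assuming the concatenated visual-block and weighted word features pass through a ConvBNReLU projection and a standard 6-layer, 8-head Transformer encoder, after which the visual part of the sequence is kept; the module layout and dimensions are illustrative assumptions.

```python
# Sketch of the text-controlled visual fusion block: concat([V ; E]) -> ConvBNReLU ->
# 6-layer / 8-head self-attention -> keep the visual tokens (assumed layout).
import torch
import torch.nn as nn

class TextGuidedVisualFusion(nn.Module):
    def __init__(self, d=512, layers=6, heads=8):
        super().__init__()
        # "ConvBNReLU": 1x1 convolution + BatchNorm + ReLU over the token sequence
        self.conv_bn_relu = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=1), nn.BatchNorm1d(d), nn.ReLU())
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, V, E_t):
        # V: (B, K, d) visual blocks; E_t: (B, N, d) weighted word vectors
        x = torch.cat([V, E_t], dim=1)                               # [V ; E] along the token axis
        x = self.conv_bn_relu(x.transpose(1, 2)).transpose(1, 2)     # ConvBNReLU projection
        x = self.encoder(x)                                          # multi-head self-attention fusion
        return x[:, :V.size(1)]                                      # "resize": keep the visual part

fusion = TextGuidedVisualFusion()
V_t = fusion(torch.randn(2, 64, 512), torch.randn(2, 20, 512))
print(V_t.shape)   # (2, 64, 512)
```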
Step 2-4: the final predicted output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h are the position information of the predicted box in the image and conf is the confidence of the model; the features obtained from the cross-modal attention module are concatenated, and the prediction depends only on the visual output V^t after the cross-modal fusion;
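The prediction head is not specified beyond its outputs, so the sketch below simply maps the fused visual feature to (t_x, t_y, t_w, t_h, conf); the mean pooling and the two-layer MLP are assumptions.

```python
# Sketch of a prediction head producing (t_x, t_y, t_w, t_h, conf) from the fused
# visual output V^t. The pooling and MLP structure are assumed, not specified.
import torch
import torch.nn as nn

class BoxHead(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 5))

    def forward(self, V_t):
        pooled = V_t.mean(dim=1)            # (B, d) pooled fused visual feature
        out = self.mlp(pooled)              # (B, 5)
        box = out[:, :4]                    # t_x, t_y, t_w, t_h
        conf = torch.sigmoid(out[:, 4])     # confidence of the model, in (0, 1)
        return box, conf

box, conf = BoxHead()(torch.randn(2, 64, 512))
print(box.shape, conf.shape)   # (2, 4) (2,)
```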
Step 3: reasoning with a dynamic reward mechanism;
Step 3-1: the fusion reasoning described above is iterative; for different texts and images, a dynamic reward module is provided, which decides whether to continue reasoning according to the state of the visual-text vector in the t-th round;
the visual and text vectors in the t-th round are combined according to the formula given in the original filing as an image, where action decides whether to continue reasoning, action_prob is the probability of continuing reasoning computed from the text vector and the visual vector of the current reasoning round, and e_cls is the head vector CLS after BERT encoding;
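As a sketch of this continue/stop decision, the snippet below computes a two-way action probability from the pooled visual vector of the current round concatenated with the BERT CLS vector e_cls, and takes the most probable action; the concatenate-then-linear form is an assumption.

```python
# Sketch of the dynamic stop/continue decision for round t (assumed form: a small
# classifier over [pooled V^t ; e_cls] yields action_prob, and action = argmax).
import torch
import torch.nn as nn

CONTINUE, STOP = 0, 1

class ActionGate(nn.Module):
    def __init__(self, d_vis=512, d_txt=768):
        super().__init__()
        self.cls = nn.Linear(d_vis + d_txt, 2)

    def forward(self, V_t, e_cls):
        state = torch.cat([V_t.mean(dim=1), e_cls], dim=-1)    # visual-text state of round t
        action_prob = torch.softmax(self.cls(state), dim=-1)   # P(continue), P(stop)
        action = action_prob.argmax(dim=-1)                    # most probable action
        return action, action_prob

action, prob = ActionGate()(torch.randn(1, 64, 512), torch.randn(1, 768))
print("stop" if action.item() == STOP else "continue", prob.tolist())
```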
two reinforcement learning reward mechanisms are used, namely final reward and instant reward;
Step 3-2: the final reward is a reward value obtained by comparing the reasoning result of the current round with the ground-truth box; that is, it is calculated from the candidate box O generated in the current round, according to the corresponding formula in the original filing;
IoU is the overlap between the candidate box O of the round's final inference and the ground-truth training target box, computed during training (the dataset does not fix the number of reasoning steps per sample); during testing no ground-truth box is available, so this term is fixed to 1;
Step 3-3: the instant reward is used to encourage positive influences during the reasoning of training, and the fusion module makes each word weight more closely associated with the features of the different visual regions; the instant reward therefore computes a reward score under correct associations, according to the corresponding formula in the original filing, in which score_t is the association score between the visual vector and the text-description word vectors computed in the t-th round; the final term indicates whether the multi-modal correlation has improved: it is 1 if the fusion reasoning of step t produces a positive, gradually improving influence, and otherwise a penalty mechanism is applied;
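The instant-reward formula is likewise only an image in the filing; the sketch below encodes one plausible reading: a cross-modal association score score_t is computed each round, the reward is +1 when it improves over the previous round, and a penalty of -1 is applied otherwise. The cosine-similarity score and the ±1 values are assumptions.

```python
# Hedged sketch of the instant reward: +1 if the visual-text association score of round t
# improved over round t-1, otherwise a penalty (score definition and values are assumed).
import torch
import torch.nn.functional as F

def association_score(V_t: torch.Tensor, E_t: torch.Tensor) -> torch.Tensor:
    # score_t: similarity between the pooled visual vector and the pooled weighted text vector
    return F.cosine_similarity(V_t.mean(dim=1), E_t.mean(dim=1), dim=-1)

def instant_reward(V_t, E_t, prev_score):
    score_t = association_score(V_t, E_t)
    reward = torch.where(score_t > prev_score,
                         torch.ones_like(score_t), -torch.ones_like(score_t))
    return reward, score_t

prev = torch.tensor([0.2])
r, s = instant_reward(torch.randn(1, 64, 512), torch.randn(1, 20, 512), prev)
print(r.item(), s.item())
```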
Step 3-4: for the global training of dynamic adaptive reasoning, the total score of the dynamic reward module in the t-th round combines the two rewards, as given by the corresponding formula in the original filing; the resulting reward weight of round t is input into the language-text attention module of the fusion reasoning module for the next round of reasoning;
Step 3-5: cross-entropy loss is taken as the training loss, computed from the difference between the predicted box and the ground-truth box for each region in the image; the dynamic reward module judges whether to continue reasoning according to the visual features after reasoning, and reasoning stops when the final reward and the instant reward are both positively activated and the confidence of the predicted box is 1.
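Putting the pieces together, the loop below sketches how the dynamic reward module could gate the number of reasoning rounds: reasoning stops when both rewards are positively activated and the box confidence saturates, or when a maximum round count is reached. The callables passed in correspond to the hypothetical sketches above; none of this is the patent's actual code.

```python
# Hedged sketch of the dynamic adaptive reasoning loop. The callables are assumed to behave
# like the hypothetical sketches above (word attention, fusion, box head, action gate,
# final reward, instant reward), all built with a shared feature dimension.
import torch

def dynamic_reasoning(V, E, e_cls, attn, fusion, head, gate,
                      reward_final, reward_inst, gt_box=None, max_rounds=12):
    V_t, hist, prev_score = V, None, torch.tensor([-1.0])
    box = conf = None
    for t in range(1, max_rounds + 1):
        E_t, w, hist = attn(E, V_t, hist)          # visually guided word re-weighting
        V_t = fusion(V_t, E_t)                     # text-guided visual fusion (6 layers, 8 heads)
        box, conf = head(V_t)                      # predicted (t_x, t_y, t_w, t_h) and conf
        r_final = reward_final(box[0], gt_box)     # +1 / -1 from overlap with the ground truth
        r_inst, prev_score = reward_inst(V_t, E_t, prev_score)
        action, action_prob = gate(V_t, e_cls)     # RL continue/stop decision for round t
        # total reward of round t would feed the next round's text attention (assumed sum form)
        reward_t = r_final + r_inst
        # stopping rule from step 3-5: both rewards positively activated, confidence saturated
        if r_final.item() > 0 and r_inst.item() > 0 and conf.item() >= 0.99:
            break
    return box, conf, t
```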
The specific embodiment is as follows:
1. image features
Given a picture of a natural scene, the whole picture is resized to 256 × 256 × 3 and input into the pre-trained feature extraction network Darknet-53 to encode the image features.
2. Text features
The maximum sentence length is set to 20 words; the position-encoded word vectors of the text description are input into a BERT network to obtain the fused sentence feature E = {e_1, …, e_N}, e_n ∈ R^d, where e_n is the representation of each word, d is the word-vector dimension of 768, and N is at most 20.
3. Multimodal feature fusion enhancement with attention mechanism
The image features are expanded to 512 dimensions and the language features to (20 × 512) through an MLP network; position encodings are added, and the image and language features are then input together into the multi-modal attention module. The module consists of two parts: the visual features strengthen the reasoning weights of the different words in the text information, and the text information strengthens the region features of the visual information. The text-information part mainly computes the word weights w_n at the different positions, using the attention-module formula described in step 2-2. The instant reward score is initialised to a value of 1. The visual-information part uses the weighted text word vectors: the language features and the image features enhanced in the previous stage are concatenated and input into the multi-head self-attention layers. There are 6 multi-head self-attention layers with 8 attention heads.
4. Dynamic reasoning steps under reward mechanism
Once the fused features are obtained, the visual feature part and the weighted text part are selected; the instant reward mechanism and the final reward mechanism compute a weight score, which is returned to the text-information weight reasoning part. The gating function is composed of the two reward functions and the visual candidate box, and it is activated only when both rewards equal 1 and the confidence of the predicted box is 1.
5. Model training
The whole training process is end-to-end. Four datasets, RefCOCO, RefCOCO+, RefCOCOg and Ref-Reasoning, are adopted for model training and evaluation. The batch size is set to 8 and the initial learning rate to 1e-4. The model was trained for 100 epochs on 8 TITAN X GPUs, with the learning rate halved every 10 epochs and gradients optimised with the Adam method.
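A sketch of this training configuration follows (Adam, initial learning rate 1e-4 halved every 10 epochs, batch size 8, 100 epochs); `model` and `train_loader` are placeholders for the network and for a RefCOCO-style data loader, which are not shown here, and the target format is assumed.

```python
# Sketch of the training schedule described in the text: Adam, lr 1e-4, batch size 8,
# 100 epochs, learning rate halved every 10 epochs. `model` / `train_loader` are placeholders.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

def train(model, train_loader, epochs=100, device="cuda"):
    model.to(device).train()
    optimizer = Adam(model.parameters(), lr=1e-4)
    scheduler = StepLR(optimizer, step_size=10, gamma=0.5)   # halve the lr every 10 epochs
    criterion = torch.nn.CrossEntropyLoss()                  # training loss named in the text
    for epoch in range(epochs):
        for images, tokens, targets in train_loader:         # batch size 8 set in the dataloader
            optimizer.zero_grad()
            logits = model(images.to(device), tokens.to(device))
            loss = criterion(logits, targets.to(device))     # predicted vs. ground-truth region (assumed format)
            loss.backward()
            optimizer.step()
        scheduler.step()
```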
6. Model application
After the training process, a model is saved at each checkpoint step; the optimal model (the one with the best results on the test set) is selected for application. For an input image and sentence, the image is resized to 256 × 256 and normalised, and the sentence is tokenised; these serve as the model inputs. The parameters of the whole network model are fixed, and only the image data and language data are input and propagated forward. The model automatically performs dynamic reasoning according to the implicit coupling strength between the text and the image, and finally outputs a prediction after an appropriate amount of reasoning. The actual experimental diagrams are shown in FIG. 2 and FIG. 3. The text corresponding to the leftmost image in FIG. 2 is "a gray laptop placed on a desktop with a page open", the text for the middle image is "a bear in a forest, a young bear on a rock, and a young bear climbing a tree", and the text for the right image is "a sandwich with lettuce hanging out at the upper-left corner of the bread". The experimental results show that referring target detection and positioning based on dynamic adaptive reasoning can efficiently give the accurate position in the image described by the relevant sentence.
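Finally, a sketch of the application-time preprocessing: the image is resized to 256 × 256 and normalised, the sentence is tokenised, and both are passed through the frozen network in a single forward pass. The ImageNet normalisation statistics and the `model` object are assumptions.

```python
# Sketch of inference-time preprocessing: resize the image to 256x256, normalise it,
# tokenise the sentence, and run a frozen forward pass (normalisation stats are assumed).
import torch
from PIL import Image
from torchvision import transforms
from transformers import BertTokenizer

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

@torch.no_grad()
def locate(model, image_path, sentence, device="cuda"):
    model.to(device).eval()                       # parameters fixed, forward pass only
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    tokens = tokenizer(sentence, return_tensors="pt",
                       padding="max_length", max_length=20, truncation=True).to(device)
    box, conf = model(image, tokens["input_ids"], tokens["attention_mask"])
    return box, conf   # predicted (t_x, t_y, t_w, t_h) and confidence
```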

Claims (2)

1. A referring target detection and positioning method based on dynamic adaptive reasoning, characterised by comprising the following steps:
Step 1: feature encoding of text and image information;
Step 1-1: the image, of size 256 × 256 × 3, is encoded by a Darknet-53 convolutional neural network; the whole-image feature is denoted V = {v_1, v_2, …, v_{W×H}}, where W and H are the width and height of the image, respectively, and v_k is the k-th image-block region of the whole-image feature V;
Step 1-2: text features are encoded with a BERT pre-trained model;
a text description composed of N words is encoded into E = {e_1, e_2, …, e_N}, e_n ∈ R^d, where n indexes the n-th position in the sentence, e_n is the word vector of that word, and d is the word-vector dimension;
Step 2: multi-modal feature fusion based on an attention mechanism;
Step 2-1: E and V are input into the attention-based multi-modal feature fusion module;
the attention-based multi-modal feature fusion module comprises a visually-guided language-text attention module and a text-guided visual fusion feature enhancement module; in the t-th reasoning round, after the multi-modal feature fusion module is updated, it outputs the visual feature V^t and the weighted text feature respectively;
Step 2-2: for the visually-guided language-text attention module, an attention mechanism is used to construct a weight score for each word, and a historical cumulative score is introduced; the two scores are calculated by formulas given in the original filing as images (not reproduced here), in which i indexes the accumulation over the previous t-1 rounds, one term is the weight of the n-th word in the t-th reasoning round, another is the average pooling of the visual feature vector V^{t-1} output by the previous round of reasoning, ⊙ denotes a dot-product operation, and the remaining symbols are different learnable parameters of the model; the weight and cumulative scores take values in the range 0-1;
through the t-th round, the word vectors are updated into the weighted text feature;
Step 2-3: for the text-guided visual fusion feature enhancement module, a multi-head self-attention module is adopted to fuse the language and image features;
the basic Transformer structure is used, with 6 layers and 8 attention heads; the update follows the formulas given in the original filing as images, in which the weighted text word vectors are concatenated with the visual features ([:] denotes the concat operation), ConvBNReLU denotes convolution, BatchNorm and ReLU activation operations, and Resize denotes a resizing operation;
Step 2-4: the final predicted output is (t_x, t_y, t_w, t_h, conf), where t_x, t_y, t_w, t_h are the position information of the predicted box in the image and conf is the confidence of the model;
Step 3: reasoning with a dynamic reward mechanism;
Step 3-1: for different texts and images, a dynamic reward module is provided, which decides whether to continue reasoning according to the state of the visual-text vector in the t-th round;
the visual and text vectors in the t-th round are combined according to the formula given in the original filing as an image, where action is the most probable action in action_prob, i.e. continue reasoning or stop reasoning, action_prob is the probability of continuing reasoning computed from the text vector and the visual vector of the current reasoning round, and e_cls is the head vector CLS after BERT encoding;
two reinforcement-learning reward mechanisms are used, namely a final reward and an instant reward;
Step 3-2: the final reward is a reward value obtained by comparing the reasoning result of the current round with the ground-truth box; that is, it is calculated from the candidate box O generated in the current round, according to the corresponding formula in the original filing, in which IoU is the overlap between the candidate box O of the round's final inference and the ground-truth training target box, computed during training;
Step 3-3: the instant reward computes a reward score under correct associations, according to the corresponding formula in the original filing, in which score_t is the association score between the visual vector and the text-description word vectors computed in the t-th round; the final term indicates whether the multi-modal correlation has improved: it is 1 if the fusion reasoning of step t produces a positive, gradually improving influence, and otherwise a penalty mechanism is applied;
Step 3-4: for the global training of dynamic adaptive reasoning, the total score of the dynamic reward module in the t-th round combines the two rewards, as given by the corresponding formula in the original filing; the resulting reward weight of round t is input into the language-text attention module of the fusion reasoning module for the next round of reasoning;
Step 3-5: cross-entropy loss is taken as the training loss, computed from the difference between the predicted box and the ground-truth box for each region in the image; the dynamic reward module judges whether to continue reasoning according to the visual features after reasoning, and reasoning stops when the final reward and the instant reward are both positively activated and the confidence of the predicted box is 1.
2. The method of claim 1, wherein d is 768 dimensions.
CN202211286108.2A 2022-10-20 2022-10-20 Dynamic adaptive inference-based nominal target detection and positioning method Pending CN115661842A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211286108.2A CN115661842A (en) 2022-10-20 2022-10-20 Dynamic adaptive inference-based nominal target detection and positioning method
PCT/CN2023/123906 WO2024037664A1 (en) 2022-10-20 2023-10-11 Referring target detection and positioning method based on dynamic adaptive reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211286108.2A CN115661842A (en) 2022-10-20 2022-10-20 Dynamic adaptive inference-based nominal target detection and positioning method

Publications (1)

Publication Number Publication Date
CN115661842A true CN115661842A (en) 2023-01-31

Family

ID=84989042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211286108.2A Pending CN115661842A (en) 2022-10-20 2022-10-20 Dynamic adaptive inference-based nominal target detection and positioning method

Country Status (2)

Country Link
CN (1) CN115661842A (en)
WO (1) WO2024037664A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196546A (en) * 2023-11-08 2023-12-08 杭州实在智能科技有限公司 RPA flow executing system and method based on page state understanding and large model driving
WO2024037664A1 (en) * 2022-10-20 2024-02-22 西北工业大学 Referring target detection and positioning method based on dynamic adaptive reasoning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241191A (en) * 2021-12-19 2022-03-25 西北工业大学 Cross-modal self-attention-based non-candidate-box expression understanding method
CN115062174A (en) * 2022-06-16 2022-09-16 电子科技大学 End-to-end image subtitle generating method based on semantic prototype tree
CN115661842A (en) * 2022-10-20 2023-01-31 西北工业大学 Dynamic adaptive inference-based nominal target detection and positioning method


Also Published As

Publication number Publication date
WO2024037664A1 (en) 2024-02-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventors after change: Zhang Yanning; Wang Peng; Zhang Zhipeng; Wei Zhimin
Inventors before change: Wang Peng; Zhang Zhipeng; Wei Zhimin; Zhang Yanning