CN114239594A - Natural language visual reasoning method based on attention mechanism - Google Patents
- Publication number
- CN114239594A (application CN202111476196.8A)
- Authority
- CN
- China
- Prior art keywords
- representing
- candidate object
- module
- bounding box
- ith
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/30—Handling natural language data; Semantic analysis
- G06N3/044—Neural network architectures; Recurrent networks, e.g. Hopfield networks
- G06N3/045—Neural network architectures; Combinations of networks
- G06N3/08—Neural networks; Learning methods
- G06N5/041—Inference or reasoning models; Abduction
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a natural language visual reasoning method based on an attention mechanism. First, a language expression is input and processed with one-hot encoding and BiLSTM encoding, and phrase-embedded representations and weights for the three visual processing modules are computed from the expression. Then, target detection is performed on the input image with a Mask R-CNN detector, and the detection results are fed into a subject module, a location module and a relationship module, each of which computes a matching score. Finally, the weighted sum of the three module scores is taken as the overall matching score, the candidate object with the highest overall score is taken as the object described by the language expression, and its bounding box is output, completing the visual reasoning on the image. The invention achieves better comprehension of context information and can process expressions with various structures.
Description
Technical Field
The invention belongs to the technical field of computer vision and natural language processing, and particularly relates to a natural language visual reasoning method based on an attention mechanism.
Background
Referring expression comprehension refers to locating, in an image, the object region described by a natural language expression. That is: given an input picture (containing people or other objects) and a natural language description (called a referring expression) that identifies a specific object in the picture, where the description is an English word, phrase or sentence and may cover attributes such as the object's category, position, color, size and relations to surrounding objects, the task is to locate the region of the described object in the picture (framing the object with a bounding box and segmenting it). Referring expression comprehension is a meaningful task that can be applied to image retrieval, for example finding objects with specific attributes in a picture library. In addition, it is an important technology for machines to understand the real world and communicate with humans in a human-like way, and can be applied to the visual understanding and dialogue systems of modern intelligent devices.
Mao et al., in "J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy, 'Generation and comprehension of unambiguous object descriptions,' Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11-20, 2016", use a long short-term memory (LSTM) network to build the probability model P(r|o) and find the object o that maximizes the probability: a set of candidate regions is first generated and then ranked by probability. Rohrbach et al., in "A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, 'Grounding of textual phrases in images by reconstruction,' Proc. European Conference on Computer Vision (ECCV), pp. 817-834, 2016", use a joint embedding model to compute P(r|o) directly, learning an image-text embedding with a two-branch neural network in which two non-linear layers follow image and text representations obtained from two pre-trained networks and an off-the-shelf feature extraction network. Combining the two approaches, L. Yu et al., in "L. Yu, H. Tan, M. Bansal, and T. L. Berg, 'A joint speaker-listener-reinforcer model for referring expressions,' Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7282-7290, 2017", propose a model coupling CNN-LSTM with an embedding model to achieve better performance. The model jointly learns a CNN-LSTM "speaker" and an embedding-based "listener" for the tasks of referring expression generation and comprehension; in addition, a reward-based "reinforcer" guides sampling toward more discriminative expressions, further improving the system. The components do not work independently: the "speaker", "listener" and "reinforcer" interact, improving the performance of both the generation and comprehension tasks. However, these methods understand the context information of the referring expression insufficiently, so the final localization results are inaccurate.
Disclosure of Invention
To overcome the insufficient or inaccurate understanding of the context information of referring expressions in the prior art, the invention provides a natural language visual reasoning method based on an attention mechanism. First, a language expression is input and processed with one-hot encoding and BiLSTM encoding, and phrase-embedded representations and weights for the three visual processing modules are computed from the expression. Then, target detection is performed on the input image with a Mask R-CNN detector; the detected targets are taken as candidate objects of the image and fed into a subject module, a location module and a relationship module, each of which computes a corresponding matching score. Finally, the weighted sum of the three module scores is taken as the overall matching score, the candidate object with the highest overall score is taken as the object described by the language expression, and its bounding box is output, completing the visual reasoning on the image. The invention adopts an end-to-end modular network in which each module learns which words to attend to; it has better comprehension of context information, adapts to the input referring expression, and can process expressions with various structures.
A natural language visual reasoning method based on an attention mechanism, characterized by comprising the following steps:
Step 1: encode each word in the input language expression into an embedded representation vector e_t using one-hot encoding; then encode the context of each word with a BiLSTM and concatenate the hidden vectors of the forward and backward directions to obtain a hidden representation vector h_t for each word, where t denotes the word index in the expression, t = 1, 2, ..., T, and T denotes the number of words contained in the expression;
Step 2: compute the attention of each module to each word according to the following formula:

a_{m,t} = exp(f_m^T h_t) / Σ_{k=1}^{T} exp(f_m^T h_k)    (1)

where m ∈ {sub, loc, rel}; m = sub denotes the subject module, m = loc the location module, and m = rel the relationship module; a_{m,t} denotes the attention of module m to the t-th word, and f_m denotes a trainable vector of module m;
the weighted sum of the word-embedded representation vectors is calculated as the phrase-embedded representation vector for each module as follows:
wherein q ismPhrase embedding representing module m;
Step 3: concatenate the hidden representation vectors of the first and last words and convert them into the weights of the three modules using a fully connected layer:

[w_sub, w_loc, w_rel] = softmax(W_m [h_1; h_T] + b_m)    (3)

where w_sub denotes the weight of the subject module, w_loc the weight of the location module, and w_rel the weight of the relationship module; softmax(·) denotes the normalized exponential function used to compute the weight of each module; W_m denotes the weight matrix of the fully connected layer; h_1 denotes the hidden representation vector of the first word in the language expression and h_T that of the last word; b_m denotes a bias;
Step 4: perform target detection on the input image with a Mask R-CNN detector, and take the detected targets as candidate objects of the image; a residual network is adopted as the feature extraction network of the Mask R-CNN detector;
Step 5: combine the feature C3 output by the conv3_x module of the residual network with the feature C4 output by the conv4_x module through a 1 × 1 convolution to obtain the subject features, and feed the subject features into the attribute prediction branch of the subject module to obtain predicted attributes;
Divide the subject features into 14 × 14 spatial grids and compute the similarity between the phrase-embedded representation vector of the subject module and each grid:

H_a = tanh(W_v V + W_q q_sub)    (4)
a_v = softmax(w_{h,a}^T H_a)    (5)

where H_a denotes the phrase embedding of the subject module over the spatial grids; tanh(·) denotes the tanh activation function; W_v denotes the weight of the spatial grids; W_q denotes the attention of the subject language module to each word; V denotes the features of the spatial grids; w_{h,a} denotes the weight of each word on the grids; a_v denotes the attention values of the grids;
Compute the weighted sum of the components v_g of the spatial grid features V according to the following formula to obtain the visual representation vector of the candidate:

ṽ_i^{sub} = Σ_{g=1}^{G} a_v^g v_g    (6)

where ṽ_i^{sub} denotes the subject visual representation of candidate i, a_v^g denotes the attention value on the g-th grid, v_g denotes the feature of the g-th grid, and G denotes the number of grids;
Compute the similarity between the visual representation vector ṽ_i^{sub} and the phrase-embedded representation vector q_sub as the matching score of the subject module:

S(o_i|q_sub) = F(ṽ_i^{sub}, q_sub)    (7)

where o_i denotes the i-th candidate, S(o_i|q_sub) denotes the matching score between the subject visual representation of the i-th candidate and the subject phrase embedding, and F(·) denotes a matching function consisting of two multi-layer perceptrons and L2 regularization;
Step 6: feed the location visual representation of the candidate objects and the location phrase embedding into the location module; first encode the top-left position, the bottom-right position, and the relative area of each candidate object with respect to the image using a 5-dimensional vector:

l_i = [x_i^{tl}/W, y_i^{tl}/H, x_i^{br}/W, y_i^{br}/H, (w_i · h_i)/(W · H)]    (8)

where l_i denotes the visual representation of the absolute position of the i-th candidate, i = 1, 2, ..., N, with N the number of candidates detected by the Mask R-CNN detector; x_i^{tl} and y_i^{tl} denote the abscissa and ordinate of the top-left corner of the i-th candidate's bounding box; x_i^{br} and y_i^{br} denote the abscissa and ordinate of its bottom-right corner; w_i and h_i denote the width and height of the i-th candidate's bounding box; and W and H denote the width and height of the input image;
Then encode the relative position representation of the candidate objects by computing offsets and area ratios:

δl_ij = [[Δx_tl]_ij/w_i, [Δy_tl]_ij/h_i, [Δx_br]_ij/w_i, [Δy_br]_ij/h_i, (w_j · h_j)/(w_i · h_i)]    (9)

where δl_ij denotes the relative position of the i-th and j-th candidate objects, i, j = 1, 2, ..., N; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences between the abscissa and ordinate values of the top-left corners of the bounding boxes of the i-th and j-th candidates; [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences for their bottom-right corners; and w_j and h_j denote the width and height of the j-th candidate's bounding box;
Finally, compute the location representation vector of the candidate object and its similarity with the phrase-embedded representation vector q_loc, and take the similarity value as the matching score of the location module:

ṽ_i^{loc} = W_l [l_i; δl_i] + b_l    (10)
S(o_i|q_loc) = F(ṽ_i^{loc}, q_loc)    (11)

where δl_i concatenates the relative position encodings δl_ij of the i-th candidate, W_l and b_l denote the weight and bias of a fully connected layer, and S(o_i|q_loc) denotes the matching score between the location visual representation of the i-th candidate object and the location phrase embedding;
Step 7: feed the relationship visual representation of the candidate objects and the relationship phrase embedding into the relationship module; first encode the relative position representation of the surrounding objects with respect to each candidate object:

δm_ij = [[Δx_tl]_ij/w_i, [Δy_tl]_ij/h_i, [Δx_br]_ij/w_i, [Δy_br]_ij/h_i, (w_j · h_j)/(w_i · h_i)]    (12)

where δm_ij denotes the relative position of the i-th candidate object and its j-th surrounding object; each candidate object has 8 surrounding objects, namely the candidate objects with the smallest Euclidean distance to it; i = 1, 2, ..., N and j = 1, 2, ..., 8; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences between the abscissa and ordinate values of the top-left corners of the bounding boxes of the i-th candidate and its j-th surrounding object; [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences for their bottom-right corners; and w_j and h_j denote the width and height of the j-th surrounding object's bounding box;
Then compute the visual representation of the relationship between each candidate object and its surrounding objects:

ṽ_ij = W_r [v_ij; δm_ij] + b_r    (13)

where ṽ_ij denotes the visual representation of the relationship between the i-th candidate object and its j-th surrounding object, W_r denotes the weight of the relationship module, v_ij denotes the C4 feature of the j-th surrounding object of the i-th candidate object, and b_r denotes the bias of the relationship module;
Finally, compute the similarity between the relationship visual representation of each candidate object with its surrounding objects and the phrase-embedded representation vector q_rel, and take the maximum similarity value as the matching score of the relationship module:

S(o_i|q_rel) = max_j F(ṽ_ij, q_rel)    (14)

where S(o_i|q_rel) denotes the matching score between the relationship visual representation of the i-th candidate object with its surrounding objects and the relationship phrase embedding;
Step 8: compute the overall matching score S_i of each candidate object according to the following formula:

S_i = w_sub × S(o_i|q_sub) + w_loc × S(o_i|q_loc) + w_rel × S(o_i|q_rel)    (15)

where i = 1, 2, ..., N;
Take the candidate object with the highest overall matching score as the object described by the language expression, output its bounding box, and complete the visual reasoning on the image.
The beneficial effects of the invention are: (1) more accurate understanding of context information. The invention adopts an end-to-end modular network with finer-grained word-weight assignment; each module learns which words to focus on, giving the model better comprehension. (2) Reduced dependence on an external language parser. The language attention network designed by the invention adapts to the input referring expression, is less restricted, and can process expressions with various structures.
Drawings
FIG. 1 is a flow chart of the natural language visual reasoning method based on an attention mechanism according to the present invention;
FIG. 2 is an image of the inference results obtained using the method of the present invention;
wherein (a) is the input referring expression, (b) is the input original image, and (c) is the inference result image obtained by the invention.
Detailed Description
The present invention is further described below with reference to the drawings and an embodiment; the invention includes, but is not limited to, the following embodiment.
As shown in FIG. 1, the present invention provides a natural language visual reasoning method based on an attention mechanism, which mainly comprises a language attention network module and three visual processing modules. The specific implementation process is as follows:
1. Language attention network module
(1) Encode each word in the input language expression into an embedded representation vector e_t using one-hot encoding; then encode the context of each word with a BiLSTM and concatenate the hidden vectors of the forward and backward directions to obtain a hidden representation vector h_t for each word, where t denotes the word index in the expression, t = 1, 2, ..., T, and T denotes the number of words contained in the expression;
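To make the encoding step above concrete, here is a minimal numpy sketch of the one-hot stage only; the vocabulary, expression and embedding matrix are illustrative stand-ins, and the BiLSTM context encoding is omitted:

```python
import numpy as np

# Hypothetical vocabulary and referring expression; only the one-hot step is shown.
vocab = {"the": 0, "man": 1, "on": 2, "left": 3}
expression = ["the", "man", "on", "the", "left"]

V = len(vocab)
# Each word index selects a row of the identity matrix: one-hot vectors, shape (T, V).
one_hot = np.eye(V)[[vocab[w] for w in expression]]

# A (random, illustrative) embedding matrix maps each one-hot row to a dense e_t.
rng = np.random.default_rng(0)
E = rng.standard_normal((V, 8))
e = one_hot @ E  # (T, 8): embedded representation vectors e_t
```

In a real system the rows of E would be learned jointly with the rest of the network, and the sequence e_1, ..., e_T would then be fed through the BiLSTM to produce the hidden vectors h_t.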
(2) Compute the attention of each module to each word according to the following formula:

a_{m,t} = exp(f_m^T h_t) / Σ_{k=1}^{T} exp(f_m^T h_k)    (16)

where m ∈ {sub, loc, rel}; m = sub denotes the subject module, m = loc the location module, and m = rel the relationship module; a_{m,t} denotes the attention of module m to the t-th word, and f_m denotes a trainable vector of module m;
the weighted sum of the word-embedded representation vectors is calculated as the phrase-embedded representation vector for each module as follows:
wherein q ismPhrase embedding representing module m;
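The per-module word attention and phrase embedding described above can be sketched in numpy as follows; all dimensions and weight values are random illustrative stand-ins, not trained parameters:

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # numerically stable softmax over a 1-D array
    return np.exp(x) / np.exp(x).sum()

T, d_h, d_e = 5, 6, 8            # words, hidden size, embedding size (illustrative)
rng = np.random.default_rng(1)
h = rng.standard_normal((T, d_h))  # hidden representation vectors h_t
e = rng.standard_normal((T, d_e))  # embedded representation vectors e_t

phrase = {}
for m in ("sub", "loc", "rel"):
    f_m = rng.standard_normal(d_h)  # trainable vector of module m (random stand-in)
    a_m = softmax(h @ f_m)          # attention of module m over the T words
    phrase[m] = a_m @ e             # phrase embedding: sum_t a_{m,t} * e_t
```

Each module thus reads the same hidden states but weights the word embeddings differently, which is how the three modules attend to different parts of the expression.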
(3) Concatenate the hidden representation vectors of the first and last words and convert them into the weights of the three modules using a fully connected layer:

[w_sub, w_loc, w_rel] = softmax(W_m [h_1; h_T] + b_m)    (18)

where w_sub denotes the weight of the subject module, w_loc the weight of the location module, and w_rel the weight of the relationship module; softmax(·) denotes the normalized exponential function used to compute the weight of each module; W_m denotes the weight matrix of the fully connected layer; h_1 denotes the hidden representation vector of the first word in the language expression and h_T that of the last word; b_m denotes a bias;
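The module-weight computation in step (3) reduces to one fully connected layer plus a softmax over three logits; a minimal sketch, with random stand-ins for the trainable parameters:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

d_h = 6
rng = np.random.default_rng(2)
h1 = rng.standard_normal(d_h)  # hidden vector of the first word
hT = rng.standard_normal(d_h)  # hidden vector of the last word

W_m = rng.standard_normal((3, 2 * d_h))  # FC layer weight (random stand-in)
b_m = rng.standard_normal(3)             # FC layer bias (random stand-in)

# [w_sub, w_loc, w_rel] = softmax(W_m [h_1; h_T] + b_m)
w_sub, w_loc, w_rel = softmax(W_m @ np.concatenate([h1, hT]) + b_m)
```

The three outputs are non-negative and sum to one, so they can directly weight the per-module matching scores later on.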
2. Input image target detection
Perform target detection on the input image with a Mask R-CNN detector, and take the detected targets as candidate objects of the image; a residual network is adopted as the feature extraction network of the Mask R-CNN detector.
3. Subject module
Combine the feature C3 output by the conv3_x module of the residual network with the feature C4 output by the conv4_x module through a 1 × 1 convolution to obtain the subject features, and feed the subject features into the attribute prediction branch of the subject module to obtain predicted attributes;
Divide the subject features into 14 × 14 spatial grids and compute the similarity between the phrase-embedded representation vector of the subject module and each grid:

H_a = tanh(W_v V + W_q q_sub)    (19)
a_v = softmax(w_{h,a}^T H_a)    (20)

where H_a denotes the phrase embedding of the subject module over the spatial grids; tanh(·) denotes the tanh activation function; W_v denotes the weight of the spatial grids; W_q denotes the attention of the subject language module to each word; V denotes the features of the spatial grids; w_{h,a} denotes the weight of each word on the grids; a_v denotes the attention values of the grids;
Compute the weighted sum of the components v_g of the spatial grid features V according to the following formula to obtain the visual representation vector of the candidate:

ṽ_i^{sub} = Σ_{g=1}^{G} a_v^g v_g    (21)

where ṽ_i^{sub} denotes the subject visual representation of candidate i, a_v^g denotes the attention value on the g-th grid, v_g denotes the feature of the g-th grid, and G denotes the number of grids;
Compute the similarity between the visual representation vector ṽ_i^{sub} and the phrase-embedded representation vector q_sub as the matching score of the subject module:

S(o_i|q_sub) = F(ṽ_i^{sub}, q_sub)    (22)

where o_i denotes the i-th candidate, S(o_i|q_sub) denotes the matching score between the subject visual representation of the i-th candidate and the subject phrase embedding, and F(·) denotes a matching function consisting of two multi-layer perceptrons and L2 regularization;
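The subject module's grid attention and pooling can be sketched as follows; cosine similarity is used here as a simple stand-in for the matching function F(·), which the patent implements with two multi-layer perceptrons, and all weights are random illustrative values:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

G, d_v, d_q, d_a = 14 * 14, 16, 16, 10  # grids, feature/phrase/attention dims (illustrative)
rng = np.random.default_rng(3)
V = rng.standard_normal((G, d_v))       # spatial grid features, one row per grid cell
q_sub = rng.standard_normal(d_q)        # subject phrase embedding

W_v = rng.standard_normal((d_a, d_v))   # random stand-ins for trainable weights
W_q = rng.standard_normal((d_a, d_q))
w_ha = rng.standard_normal(d_a)

H_a = np.tanh(V @ W_v.T + W_q @ q_sub)  # (G, d_a): joint grid/phrase activation
a_v = softmax(H_a @ w_ha)               # attention value per grid cell
v_sub = a_v @ V                         # attention-pooled subject visual representation

# Cosine similarity as a stand-in for the matching function F(.).
score_sub = float(v_sub @ q_sub / (np.linalg.norm(v_sub) * np.linalg.norm(q_sub)))
```

The attention a_v concentrates the pooled feature on the grid cells that best match the subject phrase, before the pooled vector is scored against q_sub.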
4. Position module
Feed the location visual representation of the candidate objects and the location phrase embedding into the location module; first encode the top-left position, the bottom-right position, and the relative area of each candidate object with respect to the image using a 5-dimensional vector:

l_i = [x_i^{tl}/W, y_i^{tl}/H, x_i^{br}/W, y_i^{br}/H, (w_i · h_i)/(W · H)]    (23)

where l_i denotes the visual representation of the absolute position of the i-th candidate, i = 1, 2, ..., N, with N the number of candidates detected by the Mask R-CNN detector; x_i^{tl} and y_i^{tl} denote the abscissa and ordinate of the top-left corner of the i-th candidate's bounding box; x_i^{br} and y_i^{br} denote the abscissa and ordinate of its bottom-right corner; w_i and h_i denote the width and height of the i-th candidate's bounding box; and W and H denote the width and height of the input image;
Then encode the relative position representation of the candidate objects by computing offsets and area ratios:

δl_ij = [[Δx_tl]_ij/w_i, [Δy_tl]_ij/h_i, [Δx_br]_ij/w_i, [Δy_br]_ij/h_i, (w_j · h_j)/(w_i · h_i)]    (24)

where δl_ij denotes the relative position of the i-th and j-th candidate objects, i, j = 1, 2, ..., N; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences between the abscissa and ordinate values of the top-left corners of the bounding boxes of the i-th and j-th candidates; [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences for their bottom-right corners; and w_j and h_j denote the width and height of the j-th candidate's bounding box;
Finally, compute the location representation vector of the candidate object and its similarity with the phrase-embedded representation vector q_loc, and take the similarity value as the matching score of the location module:

ṽ_i^{loc} = W_l [l_i; δl_i] + b_l    (25)
S(o_i|q_loc) = F(ṽ_i^{loc}, q_loc)    (26)

where δl_i concatenates the relative position encodings δl_ij of the i-th candidate, W_l and b_l denote the weight and bias of a fully connected layer, and S(o_i|q_loc) denotes the matching score between the location visual representation of the i-th candidate object and the location phrase embedding;
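The absolute and relative location encodings above follow directly from the bounding-box coordinates; a small sketch, where the image size and the two boxes are made-up examples:

```python
import numpy as np

W_img, H_img = 640, 480                   # input image size (illustrative)
boxes = np.array([[ 50,  60, 150, 260],   # [x_tl, y_tl, x_br, y_br] per candidate
                  [200, 100, 320, 300]], dtype=float)

def abs_location(b):
    """5-d absolute location: [x_tl/W, y_tl/H, x_br/W, y_br/H, w*h/(W*H)]."""
    x1, y1, x2, y2 = b
    w, h = x2 - x1, y2 - y1
    return np.array([x1 / W_img, y1 / H_img, x2 / W_img, y2 / H_img,
                     (w * h) / (W_img * H_img)])

def rel_location(bi, bj):
    """Corner offsets normalized by box i's size, plus the area ratio."""
    wi, hi = bi[2] - bi[0], bi[3] - bi[1]
    wj, hj = bj[2] - bj[0], bj[3] - bj[1]
    return np.array([abs(bi[0] - bj[0]) / wi, abs(bi[1] - bj[1]) / hi,
                     abs(bi[2] - bj[2]) / wi, abs(bi[3] - bj[3]) / hi,
                     (wj * hj) / (wi * hi)])

l_0 = abs_location(boxes[0])        # absolute location of candidate 0
dl_01 = rel_location(boxes[0], boxes[1])  # relative location of candidate 1 w.r.t. 0
```

For the example boxes, candidate 0 is 100 × 200 and candidate 1 is 120 × 200, so the final area-ratio component of dl_01 is 1.2.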
5. Relationship module
Feed the relationship visual representation of the candidate objects and the relationship phrase embedding into the relationship module; first encode the relative position representation of the surrounding objects with respect to each candidate object:

δm_ij = [[Δx_tl]_ij/w_i, [Δy_tl]_ij/h_i, [Δx_br]_ij/w_i, [Δy_br]_ij/h_i, (w_j · h_j)/(w_i · h_i)]    (27)

where δm_ij denotes the relative position of the i-th candidate object and its j-th surrounding object; each candidate object has 8 surrounding objects, namely the candidate objects with the smallest Euclidean distance to it; i = 1, 2, ..., N and j = 1, 2, ..., 8; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences between the abscissa and ordinate values of the top-left corners of the bounding boxes of the i-th candidate and its j-th surrounding object; [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences for their bottom-right corners; and w_j and h_j denote the width and height of the j-th surrounding object's bounding box;
Then compute the visual representation of the relationship between each candidate object and its surrounding objects:

ṽ_ij = W_r [v_ij; δm_ij] + b_r    (28)

where ṽ_ij denotes the visual representation of the relationship between the i-th candidate object and its j-th surrounding object, W_r denotes the weight of the relationship module, v_ij denotes the C4 feature of the j-th surrounding object of the i-th candidate object, and b_r denotes the bias of the relationship module;
Finally, compute the similarity between the relationship visual representation of each candidate object with its surrounding objects and the phrase-embedded representation vector q_rel, and take the maximum similarity value as the matching score of the relationship module:

S(o_i|q_rel) = max_j F(ṽ_ij, q_rel)    (29)

where S(o_i|q_rel) denotes the matching score between the relationship visual representation of the i-th candidate object with its surrounding objects and the relationship phrase embedding;
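A sketch of the relationship module for one candidate and its 8 surrounding objects; the features and weights are random stand-ins, and cosine similarity again stands in for the matching function F(·):

```python
import numpy as np

K, d_v, d_rel = 8, 16, 12                 # surrounding objects, feature dims (illustrative)
rng = np.random.default_rng(4)
v_ij = rng.standard_normal((K, d_v))      # C4 features of the 8 surrounding objects
dm_ij = rng.standard_normal((K, 5))       # relative-position encodings for each neighbor
q_rel = rng.standard_normal(d_rel)        # relationship phrase embedding

W_r = rng.standard_normal((d_rel, d_v + 5))  # relationship-module weight (stand-in)
b_r = rng.standard_normal(d_rel)             # relationship-module bias (stand-in)

# Relationship representation per neighbor: W_r [v_ij; dm_ij] + b_r
v_tilde = np.concatenate([v_ij, dm_ij], axis=1) @ W_r.T + b_r  # (K, d_rel)

# Cosine similarity as a stand-in for F(.); the module score is the max
# over the surrounding objects.
sims = (v_tilde @ q_rel) / (np.linalg.norm(v_tilde, axis=1) * np.linalg.norm(q_rel))
score_rel = float(sims.max())
```

Taking the maximum means a candidate scores well if any one of its neighbors matches the relationship phrase, which fits expressions like "the man next to the dog".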
6. Computing visual reasoning results
Compute the overall matching score S_i of each candidate object according to the following formula:

S_i = w_sub × S(o_i|q_sub) + w_loc × S(o_i|q_loc) + w_rel × S(o_i|q_rel)    (30)

where i = 1, 2, ..., N;
Take the candidate object with the highest overall matching score as the object described by the language expression, output its bounding box, and complete the visual reasoning on the image.
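The final score combination can be illustrated with made-up per-candidate module scores; the module weights below are the values reported for FIG. 1:

```python
import numpy as np

# Hypothetical per-candidate module scores for N = 3 candidates.
s_sub = np.array([0.9, 0.2, 0.4])
s_loc = np.array([0.3, 0.8, 0.5])
s_rel = np.array([0.1, 0.6, 0.7])

# Module weights from the language attention network (FIG. 1 reports 0.5, 0.29, 0.21).
w_sub, w_loc, w_rel = 0.5, 0.29, 0.21

# Overall score per candidate and the winning candidate index.
S = w_sub * s_sub + w_loc * s_loc + w_rel * s_rel
best = int(np.argmax(S))  # candidate taken as the object described by the expression
```

With these numbers the first candidate wins because its strong subject score dominates under the subject-heavy weighting.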
To verify the effectiveness of the method of the present invention, simulation experiments were performed on an NVIDIA 1080Ti GPU with 11 GB of video memory, under the PyTorch framework and the Ubuntu 18 operating system. The datasets used in the experiments were three published referring expression comprehension datasets collected from the COCO dataset: RefCOCO, RefCOCO+ and RefCOCOg. First, Mask R-CNN was trained using a subset of the COCO dataset and the pre-trained model weights of a ResNet152 feature extraction network; the number of iterations was set to 1,250,000 and the learning rate to 0.001. Then, the trained Mask R-CNN was used to extract features of the images in the RefCOCO, RefCOCO+ and RefCOCOg datasets, and the features were saved to files. Finally, the extracted features were processed by the method of the invention. The computed weights of the three visual modules shown in FIG. 1 are 0.5, 0.29 and 0.21; score_sub denotes the matching score of the subject module, score_loc the matching score of the location module, score_rel the matching score of the relationship module, and score_overall the overall matching score. FIG. 2 shows the reasoning result for one of the images, which demonstrates the accuracy of the invention on the referring expression comprehension task and its effectiveness in understanding contextual semantic information.
Claims (1)
1. A natural language visual reasoning method based on an attention mechanism, characterized by comprising the following steps:
Step 1: encode each word in the input language expression into an embedded representation vector e_t using one-hot encoding; then encode the context of each word with a BiLSTM and concatenate the hidden vectors of the forward and backward directions to obtain a hidden representation vector h_t for each word, where t denotes the word index in the expression, t = 1, 2, ..., T, and T denotes the number of words contained in the expression;
Step 2: calculating the attention paid by each module to each word according to the following formula:

a_{m,t} = exp(f_m^T h_t) / Σ_{k=1}^{T} exp(f_m^T h_k)  (1)

wherein m ∈ {sub, loc, rel}, m = sub denotes the subject module, m = loc denotes the location module, m = rel denotes the relationship module, a_{m,t} denotes the attention of module m to the t-th word, and f_m denotes a trainable vector of module m;
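An illustrative, non-limiting NumPy sketch of the word-attention computation of Step 2, with toy dimensions; `f_sub` stands in for the trainable vector of the subject module:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def word_attention(H, f_m):
    """Attention of module m over the T words: softmax of f_m^T h_t."""
    return softmax(H @ f_m)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))   # hidden vectors h_t for T=4 words, dim 6
f_sub = rng.normal(size=6)    # trainable vector of the subject module
a = word_attention(H, f_sub)  # a[t] = attention paid to word t
```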
calculating the weighted sum of the word embedding vectors as the phrase embedding vector of each module, as follows:

q_m = Σ_{t=1}^{T} a_{m,t} e_t  (2)

wherein q_m denotes the phrase embedding of module m;
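An illustrative, non-limiting sketch of the weighted sum (toy 2-dimensional word embeddings; the attention values are assumed):

```python
import numpy as np

# Phrase embedding of module m: the attention-weighted sum of word embeddings.
a_m = np.array([0.1, 0.6, 0.3])   # attention of module m over T=3 words
E = np.array([[1.0, 0.0],         # word embedding e_1
              [0.0, 1.0],         # e_2
              [1.0, 1.0]])        # e_3
q_m = a_m @ E                     # weighted sum over the words
```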
Step 3: concatenating the hidden representation vectors of the first and last words and converting them into the weights of the three modules using a fully connected layer, as follows:

[w_sub, w_loc, w_rel] = softmax(W_m [h_1; h_T] + b_m)  (3)

wherein w_sub denotes the weight of the subject module, w_loc denotes the weight of the location module, w_rel denotes the weight of the relationship module, softmax(·) denotes the normalized exponential function used to compute the weight of each module, W_m denotes the weight matrix of the fully connected layer, h_1 denotes the hidden representation vector of the first word in the language expression, h_T denotes the hidden representation vector of the last word, and b_m denotes a bias;
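An illustrative, non-limiting sketch of the module-weight computation of Step 3 (random toy parameters; in practice W_m and b_m would be learned):

```python
import numpy as np

def module_weights(h1, hT, W_m, b_m):
    """[w_sub, w_loc, w_rel]: softmax of a fully connected layer applied
    to the concatenation of the first and last hidden vectors."""
    logits = W_m @ np.concatenate([h1, hT]) + b_m
    z = np.exp(logits - logits.max())
    return z / z.sum()

rng = np.random.default_rng(1)
d = 6  # toy hidden dimension
w = module_weights(rng.normal(size=d), rng.normal(size=d),
                   rng.normal(size=(3, 2 * d)), rng.normal(size=3))
w_sub, w_loc, w_rel = w
```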
Step 4: performing target detection on an input image using a Mask R-CNN detector, and taking the detected targets as candidate objects of the image, wherein a residual network is adopted as the feature extraction network of the Mask R-CNN detector;
Step 5: combining the feature C3 output by the conv3_x module of the residual network and the feature C4 output by the conv4_x module through a 1×1 convolution to obtain the subject features, and inputting the subject features into the attribute prediction branch of the subject module to obtain the predicted attributes;
dividing the subject features into 14 × 14 spatial grids, and calculating the similarity between the phrase embedding vector of the subject module and each grid, as follows:

H_a = tanh(W_v V + W_q q_sub)  (4)

a^v = softmax(w_{h,a}^T H_a)  (5)

wherein H_a denotes the attention hidden representation of the subject phrase embedding over the spatial grids, tanh(·) denotes the tanh activation function, W_v denotes the weight applied to the spatial grid features, W_q denotes the weight applied to the subject phrase embedding, V denotes the features of the spatial grids, w_{h,a} denotes the weight used to compute the attention value of each grid, and a^v denotes the attention values of the grids;
calculating the weighted sum of the components v_i of the spatial grid features V according to the following formula to obtain the visual representation vector of the candidate object:

v̂_i^sub = Σ_{i=1}^{G} a_i^v v_i  (6)

wherein v̂_i^sub denotes the subject visual representation of candidate object i, a_i^v denotes the attention value of the i-th grid, v_i denotes the feature of the i-th grid, and G denotes the number of grids;
computing the similarity between the visual representation vector v̂_i^sub and the phrase embedding vector q_sub as the matching score of the subject module, as follows:

S(o_i | q_sub) = F(v̂_i^sub, q_sub)  (7)

wherein o_i denotes the i-th candidate object, S(o_i | q_sub) denotes the matching score between the subject visual representation of the i-th candidate object and the subject phrase embedding, and F(·) denotes a matching function composed of two multi-layer perceptrons and L2 regularization;
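An illustrative, non-limiting sketch of a matching function of the form described — two multi-layer perceptrons whose outputs are L2-normalized before an inner product (interpreting the claim's "L2 regularization" as L2 normalization of the embeddings; all weights and dimensions are toy assumptions):

```python
import numpy as np

def l2norm(x):
    return x / (np.linalg.norm(x) + 1e-8)

def match_score(v, q, Wv1, Wv2, Wq1, Wq2):
    """F(v, q): embed the visual and phrase vectors with small MLPs,
    L2-normalize, and score by inner product (cosine-like similarity)."""
    ev = l2norm(Wv2 @ np.maximum(Wv1 @ v, 0.0))   # visual branch MLP + ReLU
    eq = l2norm(Wq2 @ np.maximum(Wq1 @ q, 0.0))   # phrase branch MLP + ReLU
    return float(ev @ eq)

rng = np.random.default_rng(2)
d, h, e = 8, 16, 10  # toy input, hidden, and embedding dimensions
score = match_score(rng.normal(size=d), rng.normal(size=d),
                    rng.normal(size=(h, d)), rng.normal(size=(e, h)),
                    rng.normal(size=(h, d)), rng.normal(size=(e, h)))
```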
Step 6: inputting the positional visual representation of the candidate object and the location phrase embedding into the location module, first encoding the top-left position, the bottom-right position, and the area of the candidate object relative to the image using a 5-dimensional vector:

l_i = [x_i^tl / W, y_i^tl / H, x_i^br / W, y_i^br / H, (w_i · h_i) / (W · H)]  (8)

wherein l_i denotes the visual representation of the absolute position of the i-th candidate object, i = 1, 2, …, N, N being the number of candidate objects detected by the Mask R-CNN detector; x_i^tl denotes the abscissa of the top-left corner of the bounding box of the i-th candidate object, y_i^tl denotes the ordinate of the top-left corner, x_i^br denotes the abscissa of the bottom-right corner, y_i^br denotes the ordinate of the bottom-right corner, w_i denotes the width of the bounding box of the i-th candidate object, h_i denotes the height of the bounding box of the i-th candidate object, W denotes the width of the input image, and H denotes the height of the input image;
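An illustrative, non-limiting sketch of the 5-dimensional absolute position encoding (toy box and image sizes):

```python
import numpy as np

def encode_position(box, W, H):
    """l_i = [x_tl/W, y_tl/H, x_br/W, y_br/H, (w_i*h_i)/(W*H)]:
    normalized corners plus the area of the box relative to the image."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return np.array([x1 / W, y1 / H, x2 / W, y2 / H, (w * h) / (W * H)])

l = encode_position((50, 40, 150, 120), W=200, H=160)
# l == [0.25, 0.25, 0.75, 0.75, 0.25]
```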
then encoding the relative position representation of the candidate object by calculating offsets and area ratios:

δl_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]  (9)

wherein δl_ij denotes the relative position representation of the i-th candidate object with respect to the j-th candidate object, i, j = 1, 2, …, N; [Δx_tl]_ij denotes the absolute value of the difference between the abscissas of the top-left corners of the bounding boxes of the i-th and j-th candidate objects, [Δy_tl]_ij denotes the absolute value of the difference between the ordinates of the top-left corners, [Δx_br]_ij denotes the absolute value of the difference between the abscissas of the bottom-right corners, [Δy_br]_ij denotes the absolute value of the difference between the ordinates of the bottom-right corners, w_j denotes the width of the bounding box of the j-th candidate object, and h_j denotes the height of the bounding box of the j-th candidate object;
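An illustrative, non-limiting sketch of the relative position encoding (normalizing the corner offsets by box i's width and height is an assumption consistent with common referring-expression formulations; the claim lists the components without the exact scaling):

```python
import numpy as np

def relative_offset(box_i, box_j):
    """Absolute corner offsets between boxes i and j, normalized by box i's
    size (assumed scaling), plus the area ratio of box j to box i."""
    xi1, yi1, xi2, yi2 = box_i
    xj1, yj1, xj2, yj2 = box_j
    wi, hi = xi2 - xi1, yi2 - yi1
    wj, hj = xj2 - xj1, yj2 - yj1
    return np.array([abs(xi1 - xj1) / wi, abs(yi1 - yj1) / hi,
                     abs(xi2 - xj2) / wi, abs(yi2 - yj2) / hi,
                     (wj * hj) / (wi * hi)])

d = relative_offset((0, 0, 10, 10), (5, 5, 15, 15))
# d == [0.5, 0.5, 0.5, 0.5, 1.0]
```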
finally, fusing the absolute and relative position encodings into the position representation vector of the candidate object, and calculating the similarity between this vector and the phrase embedding vector q_loc as the matching score of the location module, as follows:

l̂_i = W_l [l_i; δl_ij] + b_l  (10)

S(o_i | q_loc) = F(l̂_i, q_loc)  (11)

wherein l̂_i denotes the position representation vector of the i-th candidate object, W_l and b_l denote the weight and bias of the fusion layer, and S(o_i | q_loc) denotes the matching score between the positional visual representation of the i-th candidate object and the location phrase embedding;
Step 7: inputting the relational visual representation of the candidate objects and the relationship phrase embedding into the relationship module, first encoding the relative position representation of the surrounding objects with respect to the candidate object:

δm_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]  (12)

wherein δm_ij denotes the relative position representation of the i-th candidate object and its j-th surrounding object; each candidate object has 8 surrounding objects, the surrounding objects being the candidate objects with the smallest Euclidean distance to the candidate object, i = 1, 2, …, N, j = 1, 2, …, 8; [Δx_tl]_ij denotes the absolute value of the difference between the abscissas of the top-left corners of the bounding boxes of the i-th candidate object and its j-th surrounding object, [Δy_tl]_ij denotes the absolute value of the difference between the ordinates of the top-left corners, [Δx_br]_ij denotes the absolute value of the difference between the abscissas of the bottom-right corners, [Δy_br]_ij denotes the absolute value of the difference between the ordinates of the bottom-right corners, w_j denotes the width of the bounding box of the j-th surrounding object, and h_j denotes the height of the bounding box of the j-th surrounding object;
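An illustrative, non-limiting sketch of selecting the 8 surrounding objects by smallest Euclidean distance (measuring distance between bounding-box centers is an assumption; the claim does not specify the reference points):

```python
import numpy as np

def surrounding_objects(boxes, i, k=8):
    """Indices of the k candidates nearest to candidate i in Euclidean
    distance between bounding-box centers, excluding i itself."""
    centers = np.array([[(x1 + x2) / 2.0, (y1 + y2) / 2.0]
                        for x1, y1, x2, y2 in boxes])
    dist = np.linalg.norm(centers - centers[i], axis=1)
    order = [j for j in np.argsort(dist) if j != i]
    return order[:k]

boxes = [(0, 0, 2, 2), (2, 0, 4, 2), (8, 0, 10, 2), (4, 0, 6, 2)]
nearest = surrounding_objects(boxes, 0, k=2)
# nearest == [1, 3]
```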
then calculating the relational visual representation of each candidate object and its surrounding objects as follows:

v̂_ij^rel = W_r [v_ij; δm_ij] + b_r  (13)

wherein v̂_ij^rel denotes the relational visual representation of the i-th candidate object and its j-th surrounding object, W_r denotes the weight of the relationship module, v_ij denotes the feature C4 of the j-th surrounding object of the i-th candidate object, and b_r denotes the bias of the relationship module;
finally, calculating the similarity between the relational visual representation of each candidate object and its surrounding objects and the phrase embedding vector q_rel, and taking the maximum similarity value as the matching score of the relationship module, i.e.:

S(o_i | q_rel) = max_{j = 1, …, 8} F(v̂_ij^rel, q_rel)  (14)

wherein S(o_i | q_rel) denotes the matching score between the relational visual representation of the i-th candidate object and its surrounding objects and the relationship phrase embedding;
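An illustrative, non-limiting sketch of the max-pooling in Step 7's final score (toy pairwise scores; each row holds the match scores of one candidate against its surrounding objects):

```python
import numpy as np

def relation_score(scores_ij):
    """S(o_i|q_rel): for each candidate, keep the score of its
    best-matching surrounding object (row-wise maximum)."""
    return scores_ij.max(axis=1)

# Toy matrix: 2 candidates x 3 surrounding objects (8 in the method).
s = relation_score(np.array([[0.1, 0.9, 0.3],
                             [0.4, 0.2, 0.6]]))
# s == [0.9, 0.6]
```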
Step 8: calculating the overall matching score S_i of each candidate object according to the following formula:

S_i = w_sub × S(o_i | q_sub) + w_loc × S(o_i | q_loc) + w_rel × S(o_i | q_rel)  (15)

wherein i = 1, 2, …, N;
and taking the candidate object with the highest overall matching score as the object described by the language expression, outputting the bounding box of the candidate object, and completing the visual reasoning on the image.
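An illustrative, non-limiting sketch of Step 8 and the final selection, using the module weights 0.5, 0.29, 0.21 reported in the experiments; the per-module scores below are toy values:

```python
def overall_scores(w, s_sub, s_loc, s_rel):
    """S_i = w_sub*S(o_i|q_sub) + w_loc*S(o_i|q_loc) + w_rel*S(o_i|q_rel);
    the index of the highest-scoring candidate is the referred object."""
    w_sub, w_loc, w_rel = w
    S = [w_sub * a + w_loc * b + w_rel * c
         for a, b, c in zip(s_sub, s_loc, s_rel)]
    return S, max(range(len(S)), key=S.__getitem__)

# Module weights as in the experiments; toy per-module scores for 2 candidates.
S, best = overall_scores((0.5, 0.29, 0.21),
                         [0.9, 0.2], [0.1, 0.8], [0.3, 0.6])
# best == 0, so candidate 0 is returned as the described object
```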
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111476196.8A CN114239594B (en) | 2021-12-06 | 2021-12-06 | Natural language visual reasoning method based on attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114239594A true CN114239594A (en) | 2022-03-25 |
CN114239594B CN114239594B (en) | 2024-03-08 |
Family
ID=80753370
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111476196.8A Active CN114239594B (en) | 2021-12-06 | 2021-12-06 | Natural language visual reasoning method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114239594B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170200066A1 (en) * | 2016-01-13 | 2017-07-13 | Adobe Systems Incorporated | Semantic Natural Language Vector Space |
US20180143966A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Spatial Attention Model for Image Captioning |
CN110390289A (en) * | 2019-07-17 | 2019-10-29 | Soochow University | Video surveillance detection method based on referring expression comprehension |
US20210192274A1 (en) * | 2019-12-23 | 2021-06-24 | Tianjin University | Visual relationship detection method and system based on adaptive clustering learning |
Non-Patent Citations (2)
Title |
---|
Hou Xingchen; Wang Jin: "Image Captioning Based on an Adaptive Attention Model", Computer and Modernization, no. 06, 15 June 2020 (2020-06-15) *
Luo Huilan; Yue Liangliang: "Image Captioning with Cross-Layer Multi-Model Feature Fusion and Causal Convolutional Decoding", Journal of Image and Graphics, no. 08, 12 August 2020 (2020-08-12) *
Also Published As
Publication number | Publication date |
---|---|
CN114239594B (en) | 2024-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Aneja et al. | Convolutional image captioning | |
CN110263912B (en) | Image question-answering method based on multi-target association depth reasoning | |
CN109508654B (en) | Face analysis method and system fusing multitask and multi-scale convolutional neural network | |
CN112000818B (en) | Text and image-oriented cross-media retrieval method and electronic device | |
Gupta et al. | Integration of textual cues for fine-grained image captioning using deep CNN and LSTM | |
Mishra et al. | The understanding of deep learning: A comprehensive review | |
Katiyar et al. | Image captioning using deep stacked LSTMs, contextual word embeddings and data augmentation | |
CN115331075A (en) | Adversarial multi-modal pre-training method for knowledge-enhanced multi-modal scene graphs | |
CN114239612A (en) | Multi-modal neural machine translation method, computer equipment and storage medium | |
CN114254645A (en) | Artificial intelligence auxiliary writing system | |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
CN113095405B (en) | Method for constructing image description generation system based on pre-training and double-layer attention | |
CN112269892B (en) | Phrase localization and recognition method based on multi-level multi-modal unified interaction | |
CN112818683A (en) | Chinese character relationship extraction method based on trigger word rule and Attention-BilSTM | |
CN112613451A (en) | Modeling method of cross-modal text picture retrieval model | |
He et al. | Image captioning algorithm based on multi-branch cnn and bi-lstm | |
CN116311518A (en) | Hierarchical character interaction detection method based on human interaction intention information | |
Akman et al. | Lip reading multiclass classification by using dilated CNN with Turkish dataset | |
CN116434058A (en) | Image description generation method and system based on visual text alignment | |
CN114239594B (en) | Natural language visual reasoning method based on attention mechanism | |
CN114066844A (en) | Pneumonia X-ray image analysis model and method based on attention superposition and feature fusion | |
CN114357166A (en) | Text classification method based on deep learning | |
Htwe et al. | Building annotated image dataset for Myanmar text to image synthesis | |
Liu et al. | Multimodal cross-guided attention networks for visual question answering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||