CN114239594A - Natural language visual reasoning method based on attention mechanism - Google Patents

Natural language visual reasoning method based on attention mechanism

Info

Publication number
CN114239594A
Authority
CN
China
Prior art keywords
representing
candidate object
module
bounding box
ith
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111476196.8A
Other languages
Chinese (zh)
Other versions
CN114239594B (en)
Inventor
王琦
许杰
袁媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202111476196.8A priority Critical patent/CN114239594B/en
Publication of CN114239594A publication Critical patent/CN114239594A/en
Application granted granted Critical
Publication of CN114239594B publication Critical patent/CN114239594B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a natural language visual reasoning method based on an attention mechanism. First, a language expression is input and processed with one-hot encoding and BiLSTM encoding, and phrase embeddings and weights are computed for the three visual processing modules from the expression. Then a Mask R-CNN detector performs object detection on the input image, and the detection results are fed into a subject module, a location module and a relation module, each of which computes a matching score. Finally, the weighted sum of the three module scores is taken as the overall matching score, the candidate object with the highest overall score is taken as the object described by the language expression, and its bounding box is output, completing the visual reasoning over the image. The invention understands context information better and can handle expressions with varied structures.

Description

Natural language visual reasoning method based on attention mechanism
Technical Field
The invention belongs to the technical field of computer vision and natural language processing, and particularly relates to a natural language vision reasoning method based on an attention mechanism.
Background
Referring expression comprehension refers to locating, in an image, the object region described by a natural language expression. That is, given a picture (containing people or other objects) and a natural language description (the referring expression) that identifies a specific object in the picture, where the description is an English word, phrase or sentence and may mention attributes such as the object's category, position, color, size and relation to surrounding objects, the task is to locate the region of the described object in the picture (framing the object with a bounding box and segmenting it). Referring expression comprehension is a meaningful task that can be applied to image retrieval, for example finding objects with specific attributes in a picture library. It is also an important technology for machines to understand the real world and communicate with humans in a human-like way, and can be applied to the visual understanding and dialogue systems of modern intelligent devices.
Mao et al., in "J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy, Generation and comprehension of unambiguous object descriptions, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11-20, 2016", use a long short-term memory (LSTM) network to build the probability model P(r|o) and find the object o that maximizes the probability: a set of candidate regions is first generated and then ranked by probability. Rohrbach et al., in "A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, Grounding of textual phrases in images by reconstruction, Proc. European Conference on Computer Vision (ECCV), pp. 817-834, 2016", use a joint embedding model to compute P(r|o) directly, learning an image-text embedding with a two-branch neural network in which the image and text representations, obtained from two pre-trained networks and an off-the-shelf feature extractor, are each passed through two non-linear layers. Combining the two approaches, L. Yu et al., in "L. Yu, H. Tan, M. Bansal, and T. L. Berg, A joint speaker-listener-reinforcer model for referring expressions, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7282-7290, 2017", propose a model combining CNN-LSTM and an embedding model to achieve better performance. The model jointly learns a CNN-LSTM "speaker" and an embedding-based "listener" for the generation and comprehension of referring expressions. In addition, a reward-based "reinforcer" guides the sampling of more discriminative expressions, further improving the system. The modules do not work independently; rather, the speaker, listener and reinforcer interact, which improves performance on both the generation and the comprehension task. However, these methods understand the context information of the referring expression insufficiently, and the final localization result is inaccurate.
Disclosure of Invention
To overcome the insufficient or inaccurate understanding of the context information of referring expressions in the prior art, the invention provides a natural language visual reasoning method based on an attention mechanism. First, a language expression is input and processed with one-hot encoding and BiLSTM encoding, and phrase embeddings and weights are computed for the three visual processing modules from the expression. Then a Mask R-CNN detector performs object detection on the input image, the detected objects are taken as candidate objects of the image and fed into a subject module, a location module and a relation module, and each module computes a corresponding matching score. Finally, the weighted sum of the three module scores is taken as the overall matching score, the candidate object with the highest overall score is taken as the object described by the language expression, and its bounding box is output, completing the visual reasoning over the image. The invention adopts an end-to-end modular network in which each module learns to attend to the words it should attend to; it understands context information well, adapts to the input referring expression, and can handle expressions with varied structures.
A natural language visual reasoning method based on an attention mechanism is characterized by comprising the following steps:
Step 1: encode each word in the input language expression into an embedded representation vector e_t using one-hot encoding; then encode the context of each word with a BiLSTM and concatenate the hidden vectors of the forward and backward directions to obtain the hidden representation vector h_t of each word, where t denotes the index of the word in the expression, t = 1,2,...,T, and T denotes the number of words in the expression;
Step 2: compute the attention of each module to each word according to the following formula:

a_{m,t} = exp(f_m^T h_t) / Σ_{k=1}^{T} exp(f_m^T h_k)   (1)

where m ∈ {sub, loc, rel}, m = sub denotes the subject module, m = loc the location module and m = rel the relation module; a_{m,t} denotes the attention of module m to the t-th word, and f_m denotes a trainable vector of module m;

the weighted sum of the word embedding vectors is then computed as the phrase embedding vector of each module:

q_m = Σ_{t=1}^{T} a_{m,t} e_t   (2)

where q_m denotes the phrase embedding of module m;
Step 3: concatenate the hidden representation vectors of the first and last words and convert them into the weights of the three modules with a fully connected layer:

[w_sub, w_loc, w_rel] = softmax(W_m [h_1; h_T] + b_m)   (3)

where w_sub denotes the weight of the subject module, w_loc the weight of the location module and w_rel the weight of the relation module; softmax(·) denotes the normalized exponential function used to compute the weight of each module; W_m denotes the weight of the fully connected layer; h_1 denotes the hidden representation vector of the first word in the language expression, h_T that of the last word, and b_m denotes a bias;
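For illustration, the following is a minimal PyTorch sketch of the language attention network of steps 1 to 3. It is a sketch under assumptions, not the patented implementation: the vocabulary size, embedding and hidden dimensions, batching, and the ordering of the three modules (subject, location, relation) are choices made only for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAttentionNetwork(nn.Module):
    """Sketch of steps 1-3: word encoding, per-module word attention,
    phrase embeddings q_sub/q_loc/q_rel, and module weights w_sub/w_loc/w_rel."""

    def __init__(self, vocab_size, word_dim=512, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, word_dim)          # one-hot index -> e_t
        self.bilstm = nn.LSTM(word_dim, hidden_dim, batch_first=True,
                              bidirectional=True)                    # context encoding -> h_t
        self.f = nn.Parameter(torch.randn(3, 2 * hidden_dim))        # one trainable vector f_m per module
        self.weight_fc = nn.Linear(4 * hidden_dim, 3)                # [h_1; h_T] -> three module weights

    def forward(self, word_ids):
        # word_ids: (B, T) integer word indices of the expression
        e = self.embedding(word_ids)                                 # (B, T, word_dim)
        h, _ = self.bilstm(e)                                        # (B, T, 2*hidden_dim)
        # Eq. (1): attention a_{m,t} of each module over the words
        att = F.softmax(torch.einsum('md,btd->bmt', self.f, h), dim=-1)
        # Eq. (2): phrase embedding q_m as the attention-weighted sum of the e_t
        q = torch.einsum('bmt,btd->bmd', att, e)                     # (B, 3, word_dim)
        # Eq. (3): module weights from the concatenated first/last hidden states
        w = F.softmax(self.weight_fc(torch.cat([h[:, 0], h[:, -1]], dim=-1)), dim=-1)
        return (q[:, 0], q[:, 1], q[:, 2]), w                        # (q_sub, q_loc, q_rel), (B, 3)

# usage: a batch of 2 expressions of 7 words from a hypothetical 10k-word vocabulary
net = LanguageAttentionNetwork(vocab_size=10000)
(q_sub, q_loc, q_rel), w = net(torch.randint(0, 10000, (2, 7)))
```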
Step 4: perform object detection on the input image with a Mask R-CNN detector and take the detected objects as the candidate objects of the image, where a residual network is used as the feature extraction network of the Mask R-CNN detector;
Step 5: fuse the feature C3 output by the conv3_x block of the residual network with the feature C4 output by the conv4_x block through a 1×1 convolution to obtain the subject feature, and feed the subject feature into the attribute prediction branch of the subject module to obtain the predicted attributes;

divide the subject feature into a 14×14 spatial grid and compute the similarity between the phrase embedding vector of the subject module and each grid cell:

H_a = tanh(W_v V + W_q q_sub)   (4)

a_v = softmax(W_{h,a} H_a)   (5)

where H_a denotes the joint representation of the subject phrase embedding and the spatial grid, tanh(·) denotes the tanh activation function, W_v denotes the weight applied to the spatial grid, W_q denotes the weight applied to the subject phrase embedding, V denotes the features of the spatial grid, W_{h,a} denotes the weight that maps H_a to an attention score for each grid cell, and a_v denotes the attention values over the grid;

compute the weighted sum of the components v_g of the spatial grid feature V to obtain the visual representation vector of the candidate object:

ṽ_i^sub = Σ_{g=1}^{G} a_v^g v_g   (6)

where ṽ_i^sub denotes the subject visual representation of candidate i, a_v^g denotes the attention value on the g-th grid cell, v_g denotes the feature of the g-th grid cell, and G denotes the number of grid cells;

compute the similarity between the visual representation vector ṽ_i^sub and the phrase embedding vector q_sub, and take the similarity value as the matching score of the subject module:

S(o_i | q_sub) = F(ṽ_i^sub, q_sub)   (7)

where o_i denotes the i-th candidate object, S(o_i | q_sub) denotes the matching score between the subject visual representation of the i-th candidate and the subject phrase embedding, and F(·) denotes a matching function consisting of two multi-layer perceptrons and L2 regularization;
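A minimal sketch of the subject module's grid attention and matching score (following Eqs. (4)-(7)) is given below. The feature dimension, the 14×14 grid flattened to G = 196 cells, and the concrete form of the matching function F(·) (two small MLPs whose outputs are L2-normalised before a dot product) are assumptions that go beyond the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubjectModule(nn.Module):
    def __init__(self, vis_dim=1024, phrase_dim=512, att_dim=512):
        super().__init__()
        self.W_v = nn.Linear(vis_dim, att_dim)      # projection of the grid features V
        self.W_q = nn.Linear(phrase_dim, att_dim)   # projection of the subject phrase q_sub
        self.W_ha = nn.Linear(att_dim, 1)           # attention score per grid cell
        # assumed form of the matching function F(.): two MLPs + L2 normalisation
        self.mlp_v = nn.Sequential(nn.Linear(vis_dim, 512), nn.ReLU(), nn.Linear(512, 512))
        self.mlp_q = nn.Sequential(nn.Linear(phrase_dim, 512), nn.ReLU(), nn.Linear(512, 512))

    def forward(self, grid_feats, q_sub):
        # grid_feats: (B, G, vis_dim) subject features of B candidates, G = 14*14 cells
        # q_sub:      (B, phrase_dim) subject phrase embedding
        H_a = torch.tanh(self.W_v(grid_feats) + self.W_q(q_sub).unsqueeze(1))   # Eq. (4)
        a_v = F.softmax(self.W_ha(H_a).squeeze(-1), dim=-1)                     # Eq. (5)
        v_sub = torch.einsum('bg,bgd->bd', a_v, grid_feats)                     # Eq. (6)
        v_emb = F.normalize(self.mlp_v(v_sub), dim=-1)
        q_emb = F.normalize(self.mlp_q(q_sub), dim=-1)
        return (v_emb * q_emb).sum(dim=-1)                                      # Eq. (7): S(o_i | q_sub)

# usage: 4 candidates, each with a 14x14 grid of 1024-d subject features
scores = SubjectModule()(torch.randn(4, 196, 1024), torch.randn(4, 512))
```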
Step 6: feed the position visual representation of the candidate objects and the location phrase embedding into the location module. First, encode the top-left position, the bottom-right position and the relative area of each candidate object with respect to the image as a 5-dimensional vector:

l_i = [x_i^tl / W, y_i^tl / H, x_i^br / W, y_i^br / H, (w_i · h_i) / (W · H)]   (8)

where l_i denotes the visual representation of the absolute position of the i-th candidate object, i = 1,2,...,N, N being the number of candidate objects detected by the Mask R-CNN detector; x_i^tl and y_i^tl denote the abscissa and ordinate of the top-left corner of the bounding box of the i-th candidate object, x_i^br and y_i^br denote the abscissa and ordinate of its bottom-right corner, w_i and h_i denote the width and height of the bounding box of the i-th candidate object, and W and H denote the width and height of the input image;

then encode the relative position representation of the candidate objects by computing offsets and area ratios:

δl_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (9)

where δl_ij denotes the relative position representation of the i-th and j-th candidate objects, i, j = 1,2,...,N; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th and j-th candidate objects, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th candidate object;

the position representation vector l̃_i of the candidate object is then:

l̃_i = W_l [l_i; δl_i] + b_l   (10)

where δl_i collects the relative position representations δl_ij of the i-th candidate object, and W_l and b_l denote the weight and bias of the fully connected layer of the location module;

finally, compute the similarity between the position representation vector of the candidate object and the phrase embedding vector q_loc, and take the similarity value as the matching score of the location module:

S(o_i | q_loc) = F(l̃_i, q_loc)   (11)

where S(o_i | q_loc) denotes the matching score between the position visual representation of the i-th candidate object and the location phrase embedding;
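The location encodings of Eqs. (8)-(11) can be sketched as follows. The number of neighbouring candidates kept for the relative encoding (five here), the layer sizes, and the L2-normalised dot product used as the matching function are assumptions; the text itself only fixes the two 5-dimensional encodings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def absolute_location(boxes, img_w, img_h):
    # Eq. (8): boxes are (N, 4) as [x_tl, y_tl, x_br, y_br]; returns the 5-d vectors l_i
    x1, y1, x2, y2 = boxes.unbind(dim=-1)
    w, h = x2 - x1, y2 - y1
    return torch.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                        (w * h) / (img_w * img_h)], dim=-1)

def relative_location(boxes):
    # Eq. (9): offsets and area ratios between every pair of candidates, shape (N, N, 5)
    x1, y1, x2, y2 = boxes.unbind(dim=-1)
    w, h = x2 - x1, y2 - y1
    dxtl = (x1[:, None] - x1[None, :]).abs() / w[:, None]
    dytl = (y1[:, None] - y1[None, :]).abs() / h[:, None]
    dxbr = (x2[:, None] - x2[None, :]).abs() / w[:, None]
    dybr = (y2[:, None] - y2[None, :]).abs() / h[:, None]
    area = (w * h)[None, :] / (w * h)[:, None]
    return torch.stack([dxtl, dytl, dxbr, dybr, area], dim=-1)

class LocationModule(nn.Module):
    def __init__(self, k=5, emb_dim=512, phrase_dim=512):
        super().__init__()
        self.k = k                                          # neighbours kept (an assumption)
        self.fc_loc = nn.Linear(5 + 5 * k, emb_dim)         # Eq. (10): position representation
        self.fc_phrase = nn.Linear(phrase_dim, emb_dim)

    def forward(self, boxes, img_w, img_h, q_loc):
        l_abs = absolute_location(boxes, img_w, img_h)                        # (N, 5)
        delta = relative_location(boxes)                                      # (N, N, 5)
        # keep the k nearest other candidates of each box (distance proxy: top-left offsets)
        d = delta[..., :2].norm(dim=-1) + torch.eye(boxes.size(0)) * 1e6
        idx = d.topk(self.k, largest=False).indices                           # (N, k)
        delta_k = delta[torch.arange(boxes.size(0)).unsqueeze(1), idx]        # (N, k, 5)
        l_tilde = F.normalize(self.fc_loc(torch.cat([l_abs, delta_k.flatten(1)], dim=-1)), dim=-1)
        q_emb = F.normalize(self.fc_phrase(q_loc), dim=-1)
        return (l_tilde * q_emb).sum(dim=-1)                                  # Eq. (11): S(o_i | q_loc)

# usage: 6 synthetic candidate boxes in a 640x480 image and a 512-d location phrase embedding
xy = torch.rand(6, 2) * 300
boxes = torch.cat([xy, xy + torch.rand(6, 2) * 100 + 20], dim=-1)
scores = LocationModule()(boxes, 640.0, 480.0, torch.randn(512))
```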
Step 7: feed the relation visual representation of the candidate objects and the relation phrase embedding into the relation module. First, encode the relative position representation of the surrounding objects with respect to each candidate object:

δm_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (12)

where δm_ij denotes the relative position representation of the i-th candidate object and its j-th surrounding object; each candidate object has 8 surrounding objects, namely the candidate objects with the smallest Euclidean distance to it, i = 1,2,...,N, j = 1,2,...,8; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th candidate object and its j-th surrounding object, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th surrounding object;

then compute the relation visual representation of each candidate object and its surrounding objects:

ṽ_ij^rel = W_r [v_ij; δm_ij] + b_r   (13)

where ṽ_ij^rel denotes the relation visual representation of the i-th candidate object and its j-th surrounding object, W_r denotes the weight of the relation module, v_ij denotes the C4 feature of the j-th surrounding object of the i-th candidate object, and b_r denotes the bias of the relation module;

finally, compute the similarity between the relation visual representation of each candidate object and its surrounding objects and the phrase embedding vector q_rel, and take the maximum similarity value as the matching score of the relation module:

S(o_i | q_rel) = max_{j=1,...,8} F(ṽ_ij^rel, q_rel)   (14)

where S(o_i | q_rel) denotes the matching score between the relation visual representation of the i-th candidate object and its surrounding objects and the relation phrase embedding;
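A sketch of the relation module of Eqs. (12)-(14) in the same style; the dimension of the C4 context features and the similarity form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationModule(nn.Module):
    def __init__(self, vis_dim=1024, emb_dim=512, phrase_dim=512):
        super().__init__()
        self.W_r = nn.Linear(vis_dim + 5, emb_dim)       # Eq. (13): W_r [v_ij; delta_m_ij] + b_r
        self.fc_phrase = nn.Linear(phrase_dim, emb_dim)

    def forward(self, ctx_feats, delta_m, q_rel):
        # ctx_feats: (N, K, vis_dim) C4 features of the K = 8 surrounding objects per candidate
        # delta_m:   (N, K, 5) relative position encoding of Eq. (12)
        v_rel = F.normalize(self.W_r(torch.cat([ctx_feats, delta_m], dim=-1)), dim=-1)
        q_emb = F.normalize(self.fc_phrase(q_rel), dim=-1)
        sim = (v_rel * q_emb).sum(dim=-1)                # (N, K) similarities
        return sim.max(dim=-1).values                    # Eq. (14): max over the surrounding objects

# usage: 6 candidates, each with its 8 nearest surrounding objects
scores = RelationModule()(torch.randn(6, 8, 1024), torch.randn(6, 8, 5), torch.randn(512))
```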
Step 8: compute the overall matching score S_i of each candidate object according to the following formula:

S_i = w_sub × S(o_i | q_sub) + w_loc × S(o_i | q_loc) + w_rel × S(o_i | q_rel)   (15)

where i = 1,2,...,N;

take the candidate object with the highest overall matching score as the object described by the language expression, output its bounding box, and the visual reasoning over the image is complete.
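A short sketch of the score fusion of Eq. (15) and the final selection; the per-candidate score tensors stand in for the outputs of the three modules above.

```python
import torch

def overall_scores(w, s_sub, s_loc, s_rel):
    # w = (w_sub, w_loc, w_rel) from the language attention network; each s_* is an (N,) tensor
    return w[0] * s_sub + w[1] * s_loc + w[2] * s_rel    # Eq. (15)

def predict_box(boxes, w, s_sub, s_loc, s_rel):
    s = overall_scores(w, s_sub, s_loc, s_rel)
    best = int(torch.argmax(s))                          # candidate with the highest overall score
    return boxes[best], best                             # its bounding box is the reasoning result

# usage with dummy scores for 6 candidates
boxes = torch.tensor([[10.0, 20.0, 110.0, 220.0]] * 6)
box, idx = predict_box(boxes, (0.5, 0.29, 0.21),
                       torch.rand(6), torch.rand(6), torch.rand(6))
```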
The beneficial effects of the invention are: (1) more accurate understanding of context information. The invention adopts an end-to-end modular network with fine-grained word weighting, and each module learns to focus on the words it should focus on, so the model understands the expression better. (2) Little dependence on an external language parser. The language attention network designed by the invention adapts to the input referring expression, is less restricted, and can handle expressions with varied structures.
Drawings
FIG. 1 is a flow chart of a natural language visual inference method based on attention mechanism according to the present invention;
FIG. 2 is an image of the inference results obtained using the method of the present invention;
where (a) is the input referring expression, (b) is the input original image, and (c) is the inference result image obtained by the invention.
Detailed Description
The present invention will be further described with reference to the drawings and an embodiment; the invention includes, but is not limited to, the following embodiment.
As shown in FIG. 1, the present invention provides a natural language visual reasoning method based on an attention mechanism, which mainly comprises a language attention network and three visual processing modules. The specific implementation process is as follows:
1. language attention network module
(1) Encode each word in the input language expression into an embedded representation vector e_t using one-hot encoding; then encode the context of each word with a BiLSTM and concatenate the hidden vectors of the forward and backward directions to obtain the hidden representation vector h_t of each word, where t denotes the index of the word in the expression, t = 1,2,...,T, and T denotes the number of words in the expression;

(2) Compute the attention of each module to each word according to the following formula:

a_{m,t} = exp(f_m^T h_t) / Σ_{k=1}^{T} exp(f_m^T h_k)   (16)

where m ∈ {sub, loc, rel}, m = sub denotes the subject module, m = loc the location module and m = rel the relation module; a_{m,t} denotes the attention of module m to the t-th word, and f_m denotes a trainable vector of module m;

the weighted sum of the word embedding vectors is then computed as the phrase embedding vector of each module:

q_m = Σ_{t=1}^{T} a_{m,t} e_t   (17)

where q_m denotes the phrase embedding of module m;

(3) Concatenate the hidden representation vectors of the first and last words and convert them into the weights of the three modules with a fully connected layer:

[w_sub, w_loc, w_rel] = softmax(W_m [h_1; h_T] + b_m)   (18)

where w_sub denotes the weight of the subject module, w_loc the weight of the location module and w_rel the weight of the relation module; softmax(·) denotes the normalized exponential function used to compute the weight of each module; W_m denotes the weight of the fully connected layer; h_1 denotes the hidden representation vector of the first word in the language expression, h_T that of the last word, and b_m denotes a bias;
2. input image object detection
Perform object detection on the input image with a Mask R-CNN detector and take the detected objects as the candidate objects of the image; a residual network is used as the feature extraction network of the Mask R-CNN detector.
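As an illustration of this step, the sketch below obtains candidate boxes from torchvision's off-the-shelf Mask R-CNN (recent torchvision versions). Note the assumptions: this pretrained model uses a ResNet-50 FPN backbone and the 0.5 score threshold is chosen arbitrarily, whereas the described method trains Mask R-CNN with a deeper residual backbone on a COCO subset (see the experiments below).

```python
import torch
import torchvision

# off-the-shelf Mask R-CNN as a stand-in candidate generator (ResNet-50 FPN backbone)
detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = torch.rand(3, 480, 640)                 # placeholder for the input image tensor
with torch.no_grad():
    out = detector([image])[0]                  # dict with 'boxes', 'labels', 'scores', 'masks'

keep = out["scores"] > 0.5                      # detection score threshold (an assumption)
candidate_boxes = out["boxes"][keep]            # candidate objects o_i passed to the three modules
```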
3. Subject module
Fuse the feature C3 output by the conv3_x block of the residual network with the feature C4 output by the conv4_x block through a 1×1 convolution to obtain the subject feature, and feed the subject feature into the attribute prediction branch of the subject module to obtain the predicted attributes;

divide the subject feature into a 14×14 spatial grid and compute the similarity between the phrase embedding vector of the subject module and each grid cell:

H_a = tanh(W_v V + W_q q_sub)   (19)

a_v = softmax(W_{h,a} H_a)   (20)

where H_a denotes the joint representation of the subject phrase embedding and the spatial grid, tanh(·) denotes the tanh activation function, W_v denotes the weight applied to the spatial grid, W_q denotes the weight applied to the subject phrase embedding, V denotes the features of the spatial grid, W_{h,a} denotes the weight that maps H_a to an attention score for each grid cell, and a_v denotes the attention values over the grid;

compute the weighted sum of the components v_g of the spatial grid feature V to obtain the visual representation vector of the candidate object:

ṽ_i^sub = Σ_{g=1}^{G} a_v^g v_g   (21)

where ṽ_i^sub denotes the subject visual representation of candidate i, a_v^g denotes the attention value on the g-th grid cell, v_g denotes the feature of the g-th grid cell, and G denotes the number of grid cells;

compute the similarity between the visual representation vector ṽ_i^sub and the phrase embedding vector q_sub, and take the similarity value as the matching score of the subject module:

S(o_i | q_sub) = F(ṽ_i^sub, q_sub)   (22)

where o_i denotes the i-th candidate object, S(o_i | q_sub) denotes the matching score between the subject visual representation of the i-th candidate and the subject phrase embedding, and F(·) denotes a matching function consisting of two multi-layer perceptrons and L2 regularization;
4. position module
Feed the position visual representation of the candidate objects and the location phrase embedding into the location module. First, encode the top-left position, the bottom-right position and the relative area of each candidate object with respect to the image as a 5-dimensional vector:

l_i = [x_i^tl / W, y_i^tl / H, x_i^br / W, y_i^br / H, (w_i · h_i) / (W · H)]   (23)

where l_i denotes the visual representation of the absolute position of the i-th candidate object, i = 1,2,...,N, N being the number of candidate objects detected by the Mask R-CNN detector; x_i^tl and y_i^tl denote the abscissa and ordinate of the top-left corner of the bounding box of the i-th candidate object, x_i^br and y_i^br denote the abscissa and ordinate of its bottom-right corner, w_i and h_i denote the width and height of the bounding box of the i-th candidate object, and W and H denote the width and height of the input image;

then encode the relative position representation of the candidate objects by computing offsets and area ratios:

δl_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (24)

where δl_ij denotes the relative position representation of the i-th and j-th candidate objects, i, j = 1,2,...,N; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th and j-th candidate objects, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th candidate object;

the position representation vector l̃_i of the candidate object is then:

l̃_i = W_l [l_i; δl_i] + b_l   (25)

where δl_i collects the relative position representations δl_ij of the i-th candidate object, and W_l and b_l denote the weight and bias of the fully connected layer of the location module;

finally, compute the similarity between the position representation vector of the candidate object and the phrase embedding vector q_loc, and take the similarity value as the matching score of the location module:

S(o_i | q_loc) = F(l̃_i, q_loc)   (26)

where S(o_i | q_loc) denotes the matching score between the position visual representation of the i-th candidate object and the location phrase embedding;
5. relationship module
Feed the relation visual representation of the candidate objects and the relation phrase embedding into the relation module. First, encode the relative position representation of the surrounding objects with respect to each candidate object:

δm_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (27)

where δm_ij denotes the relative position representation of the i-th candidate object and its j-th surrounding object; each candidate object has 8 surrounding objects, namely the candidate objects with the smallest Euclidean distance to it, i = 1,2,...,N, j = 1,2,...,8; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th candidate object and its j-th surrounding object, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th surrounding object;

then compute the relation visual representation of each candidate object and its surrounding objects:

ṽ_ij^rel = W_r [v_ij; δm_ij] + b_r   (28)

where ṽ_ij^rel denotes the relation visual representation of the i-th candidate object and its j-th surrounding object, W_r denotes the weight of the relation module, v_ij denotes the C4 feature of the j-th surrounding object of the i-th candidate object, and b_r denotes the bias of the relation module;

finally, compute the similarity between the relation visual representation of each candidate object and its surrounding objects and the phrase embedding vector q_rel, and take the maximum similarity value as the matching score of the relation module:

S(o_i | q_rel) = max_{j=1,...,8} F(ṽ_ij^rel, q_rel)   (29)

where S(o_i | q_rel) denotes the matching score between the relation visual representation of the i-th candidate object and its surrounding objects and the relation phrase embedding;
6. computing visual reasoning results
Compute the overall matching score S_i of each candidate object according to the following formula:

S_i = w_sub × S(o_i | q_sub) + w_loc × S(o_i | q_loc) + w_rel × S(o_i | q_rel)   (30)

where i = 1,2,...,N;

take the candidate object with the highest overall matching score as the object described by the language expression, output its bounding box, and the visual reasoning over the image is complete.
To verify the effectiveness of the method, simulation experiments were performed with the PyTorch framework on a GTX 1080Ti graphics card with 11 GB of video memory under the Ubuntu 18 operating system. The datasets used in the experiments are three published referring expression comprehension datasets collected from the COCO dataset: RefCOCO, RefCOCO+ and RefCOCOg. First, Mask R-CNN is trained on a subset of the COCO dataset using the pre-trained weights of a ResNet-152 feature extraction network; the number of iterations is set to 1,250,000 and the learning rate to 0.001. Then the trained Mask R-CNN is used to extract features of the images in the RefCOCO, RefCOCO+ and RefCOCOg datasets, and the features are saved to files. Finally, the extracted features are processed by the method of the invention. The weights computed for the three visual modules in the example of FIG. 1 are 0.5, 0.29 and 0.21; score_sub denotes the matching score of the subject module, score_loc the matching score of the location module, score_rel the matching score of the relation module, and score_overall the overall matching score. FIG. 2 shows the inference result on one of the images, which demonstrates the accuracy of the invention on the referring expression comprehension task and its effectiveness in understanding contextual semantic information.
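For illustration, the weighted fusion with the module weights reported above (0.5, 0.29, 0.21) can be reproduced numerically; the per-candidate module scores below are made-up placeholders, not experimental values.

```python
# weights of the subject, location and relation modules from the example in FIG. 1
w_sub, w_loc, w_rel = 0.5, 0.29, 0.21

# hypothetical (score_sub, score_loc, score_rel) for two candidates
candidates = {
    "candidate_1": (0.9, 0.4, 0.3),
    "candidate_2": (0.6, 0.8, 0.7),
}

# Eq. (15)/(30): overall score as the weighted sum of the three module scores
overall = {name: w_sub * s + w_loc * l + w_rel * r
           for name, (s, l, r) in candidates.items()}
best = max(overall, key=overall.get)
# overall is about {'candidate_1': 0.629, 'candidate_2': 0.679}, so candidate_2 is selected
```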

Claims (1)

1. A natural language visual reasoning method based on an attention mechanism is characterized by comprising the following steps:
Step 1: encode each word in the input language expression into an embedded representation vector e_t using one-hot encoding; then encode the context of each word with a BiLSTM and concatenate the hidden vectors of the forward and backward directions to obtain the hidden representation vector h_t of each word, where t denotes the index of the word in the expression, t = 1,2,...,T, and T denotes the number of words in the expression;
Step 2: compute the attention of each module to each word according to the following formula:

a_{m,t} = exp(f_m^T h_t) / Σ_{k=1}^{T} exp(f_m^T h_k)   (1)

where m ∈ {sub, loc, rel}, m = sub denotes the subject module, m = loc the location module and m = rel the relation module; a_{m,t} denotes the attention of module m to the t-th word, and f_m denotes a trainable vector of module m;

the weighted sum of the word embedding vectors is then computed as the phrase embedding vector of each module:

q_m = Σ_{t=1}^{T} a_{m,t} e_t   (2)

where q_m denotes the phrase embedding of module m;
Step 3: concatenate the hidden representation vectors of the first and last words and convert them into the weights of the three modules with a fully connected layer:

[w_sub, w_loc, w_rel] = softmax(W_m [h_1; h_T] + b_m)   (3)

where w_sub denotes the weight of the subject module, w_loc the weight of the location module and w_rel the weight of the relation module; softmax(·) denotes the normalized exponential function used to compute the weight of each module; W_m denotes the weight of the fully connected layer; h_1 denotes the hidden representation vector of the first word in the language expression, h_T that of the last word, and b_m denotes a bias;
Step 4: perform object detection on the input image with a Mask R-CNN detector and take the detected objects as the candidate objects of the image, where a residual network is used as the feature extraction network of the Mask R-CNN detector;
Step 5: fuse the feature C3 output by the conv3_x block of the residual network with the feature C4 output by the conv4_x block through a 1×1 convolution to obtain the subject feature, and feed the subject feature into the attribute prediction branch of the subject module to obtain the predicted attributes;

divide the subject feature into a 14×14 spatial grid and compute the similarity between the phrase embedding vector of the subject module and each grid cell:

H_a = tanh(W_v V + W_q q_sub)   (4)

a_v = softmax(W_{h,a} H_a)   (5)

where H_a denotes the joint representation of the subject phrase embedding and the spatial grid, tanh(·) denotes the tanh activation function, W_v denotes the weight applied to the spatial grid, W_q denotes the weight applied to the subject phrase embedding, V denotes the features of the spatial grid, W_{h,a} denotes the weight that maps H_a to an attention score for each grid cell, and a_v denotes the attention values over the grid;

compute the weighted sum of the components v_g of the spatial grid feature V to obtain the visual representation vector of the candidate object:

ṽ_i^sub = Σ_{g=1}^{G} a_v^g v_g   (6)

where ṽ_i^sub denotes the subject visual representation of candidate i, a_v^g denotes the attention value on the g-th grid cell, v_g denotes the feature of the g-th grid cell, and G denotes the number of grid cells;

compute the similarity between the visual representation vector ṽ_i^sub and the phrase embedding vector q_sub, and take the similarity value as the matching score of the subject module:

S(o_i | q_sub) = F(ṽ_i^sub, q_sub)   (7)

where o_i denotes the i-th candidate object, S(o_i | q_sub) denotes the matching score between the subject visual representation of the i-th candidate and the subject phrase embedding, and F(·) denotes a matching function consisting of two multi-layer perceptrons and L2 regularization;
Step 6: feed the position visual representation of the candidate objects and the location phrase embedding into the location module. First, encode the top-left position, the bottom-right position and the relative area of each candidate object with respect to the image as a 5-dimensional vector:

l_i = [x_i^tl / W, y_i^tl / H, x_i^br / W, y_i^br / H, (w_i · h_i) / (W · H)]   (8)

where l_i denotes the visual representation of the absolute position of the i-th candidate object, i = 1,2,...,N, N being the number of candidate objects detected by the Mask R-CNN detector; x_i^tl and y_i^tl denote the abscissa and ordinate of the top-left corner of the bounding box of the i-th candidate object, x_i^br and y_i^br denote the abscissa and ordinate of its bottom-right corner, w_i and h_i denote the width and height of the bounding box of the i-th candidate object, and W and H denote the width and height of the input image;

then encode the relative position representation of the candidate objects by computing offsets and area ratios:

δl_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (9)

where δl_ij denotes the relative position representation of the i-th and j-th candidate objects, i, j = 1,2,...,N; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th and j-th candidate objects, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th candidate object;

the position representation vector l̃_i of the candidate object is then:

l̃_i = W_l [l_i; δl_i] + b_l   (10)

where δl_i collects the relative position representations δl_ij of the i-th candidate object, and W_l and b_l denote the weight and bias of the fully connected layer of the location module;

finally, compute the similarity between the position representation vector of the candidate object and the phrase embedding vector q_loc, and take the similarity value as the matching score of the location module:

S(o_i | q_loc) = F(l̃_i, q_loc)   (11)

where S(o_i | q_loc) denotes the matching score between the position visual representation of the i-th candidate object and the location phrase embedding;
Step 7: feed the relation visual representation of the candidate objects and the relation phrase embedding into the relation module. First, encode the relative position representation of the surrounding objects with respect to each candidate object:

δm_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (12)

where δm_ij denotes the relative position representation of the i-th candidate object and its j-th surrounding object; each candidate object has 8 surrounding objects, namely the candidate objects with the smallest Euclidean distance to it, i = 1,2,...,N, j = 1,2,...,8; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th candidate object and its j-th surrounding object, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th surrounding object;

then compute the relation visual representation of each candidate object and its surrounding objects:

ṽ_ij^rel = W_r [v_ij; δm_ij] + b_r   (13)

where ṽ_ij^rel denotes the relation visual representation of the i-th candidate object and its j-th surrounding object, W_r denotes the weight of the relation module, v_ij denotes the C4 feature of the j-th surrounding object of the i-th candidate object, and b_r denotes the bias of the relation module;

finally, compute the similarity between the relation visual representation of each candidate object and its surrounding objects and the phrase embedding vector q_rel, and take the maximum similarity value as the matching score of the relation module:

S(o_i | q_rel) = max_{j=1,...,8} F(ṽ_ij^rel, q_rel)   (14)

where S(o_i | q_rel) denotes the matching score between the relation visual representation of the i-th candidate object and its surrounding objects and the relation phrase embedding;
Step 8: compute the overall matching score S_i of each candidate object according to the following formula:

S_i = w_sub × S(o_i | q_sub) + w_loc × S(o_i | q_loc) + w_rel × S(o_i | q_rel)   (15)

where i = 1,2,...,N;

take the candidate object with the highest overall matching score as the object described by the language expression, output its bounding box, and the visual reasoning over the image is complete.
CN202111476196.8A 2021-12-06 2021-12-06 Natural language visual reasoning method based on attention mechanism Active CN114239594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111476196.8A CN114239594B (en) 2021-12-06 2021-12-06 Natural language visual reasoning method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111476196.8A CN114239594B (en) 2021-12-06 2021-12-06 Natural language visual reasoning method based on attention mechanism

Publications (2)

Publication Number Publication Date
CN114239594A true CN114239594A (en) 2022-03-25
CN114239594B CN114239594B (en) 2024-03-08

Family

ID=80753370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111476196.8A Active CN114239594B (en) 2021-12-06 2021-12-06 Natural language visual reasoning method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN114239594B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170200066A1 (en) * 2016-01-13 2017-07-13 Adobe Systems Incorporated Semantic Natural Language Vector Space
US20180143966A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial Attention Model for Image Captioning
CN110390289A (en) * 2019-07-17 2019-10-29 苏州大学 Based on the video security protection detection method for censuring understanding
US20210192274A1 (en) * 2019-12-23 2021-06-24 Tianjin University Visual relationship detection method and system based on adaptive clustering learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hou Xingchen; Wang Jin: "Image captioning based on an adaptive attention model", Computer and Modernization, no. 06, 15 June 2020 (2020-06-15) *
Luo Huilan; Yue Liangliang: "Image captioning with cross-layer multi-model feature fusion and causal convolution decoding", Journal of Image and Graphics, no. 08, 12 August 2020 (2020-08-12) *

Also Published As

Publication number Publication date
CN114239594B (en) 2024-03-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant