CN114239594A - Natural language visual reasoning method based on attention mechanism - Google Patents

Natural language visual reasoning method based on attention mechanism

Info

Publication number
CN114239594A
Authority
CN
China
Prior art keywords
representing
candidate object
module
bounding box
ith
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111476196.8A
Other languages
Chinese (zh)
Other versions
CN114239594B (en)
Inventor
王琦
许杰
袁媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202111476196.8A priority Critical patent/CN114239594B/en
Publication of CN114239594A publication Critical patent/CN114239594A/en
Application granted granted Critical
Publication of CN114239594B publication Critical patent/CN114239594B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a natural language visual reasoning method based on an attention mechanism. First, a language expression is input and processed with one-hot encoding and BiLSTM encoding, and phrase embeddings and weights are computed for the three visual processing modules from the expression. Then a Mask R-CNN detector performs object detection on the input image, and the detection results are fed into a subject module, a location module and a relation module, each of which computes a matching score. Finally, the weighted sum of the three module scores is taken as the overall matching score, the candidate object with the highest overall score is taken as the object described by the language expression, and its bounding box is output, completing the visual reasoning over the image. The invention understands context information better and can handle expressions with varied structures.

Description

Natural language visual reasoning method based on attention mechanism
Technical Field
The invention belongs to the technical field of computer vision and natural language processing, and particularly relates to a natural language vision reasoning method based on an attention mechanism.
Background
Referring expression comprehension refers to locating, in an image, the object region described by a natural language expression. That is, given a picture (containing people or other objects) and a natural language description (the referring expression) that identifies a specific object in the picture, where the description is an English word, phrase or sentence and may mention attributes such as the object's category, position, color, size and relation to surrounding objects, the task is to locate the region of the described object in the picture (framing the object with a bounding box and segmenting it). Referring expression comprehension is a meaningful task that can be applied to image retrieval, for example finding objects with specific attributes in a picture library. It is also an important technology for machines to understand the real world and communicate with humans in a human-like way, and can be applied to the visual understanding and dialogue systems of modern intelligent devices.
Mao et al., in "J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy, Generation and comprehension of unambiguous object descriptions, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11-20, 2016", use a long short-term memory (LSTM) network to build the probability model P(r|o) and find the object o that maximizes the probability: a set of candidate regions is first generated and then ranked by probability. Rohrbach et al., in "A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, Grounding of textual phrases in images by reconstruction, Proc. European Conference on Computer Vision (ECCV), pp. 817-834, 2016", use a joint embedding model to compute P(r|o) directly, learning an image-text embedding with a two-branch neural network in which the image and text representations, obtained from two pre-trained networks and an off-the-shelf feature extractor, are each passed through two non-linear layers. Combining the two approaches, L. Yu et al., in "L. Yu, H. Tan, M. Bansal, and T. L. Berg, A joint speaker-listener-reinforcer model for referring expressions, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7282-7290, 2017", propose a model combining CNN-LSTM and an embedding model to achieve better performance. The model jointly learns a CNN-LSTM "speaker" and an embedding-based "listener" for the generation and comprehension of referring expressions. In addition, a reward-based "reinforcer" guides the sampling of more discriminative expressions, further improving the system. The modules do not work independently; rather, the speaker, listener and reinforcer interact, which improves performance on both the generation and the comprehension task. However, these methods understand the context information of the referring expression insufficiently, and the final localization result is inaccurate.
Disclosure of Invention
To overcome the insufficient or inaccurate understanding of the context information of referring expressions in the prior art, the invention provides a natural language visual reasoning method based on an attention mechanism. First, a language expression is input and processed with one-hot encoding and BiLSTM encoding, and phrase embeddings and weights are computed for the three visual processing modules from the expression. Then a Mask R-CNN detector performs object detection on the input image, the detected objects are taken as candidate objects of the image and fed into a subject module, a location module and a relation module, and each module computes a corresponding matching score. Finally, the weighted sum of the three module scores is taken as the overall matching score, the candidate object with the highest overall score is taken as the object described by the language expression, and its bounding box is output, completing the visual reasoning over the image. The invention adopts an end-to-end modular network in which each module learns to attend to the words it should attend to; it understands context information well, adapts to the input referring expression, and can handle expressions with varied structures.
A natural language visual reasoning method based on an attention mechanism is characterized by comprising the following steps:
Step 1: encode each word in the input language expression into an embedded representation vector e_t using one-hot encoding; then encode the context of each word with a BiLSTM and concatenate the hidden vectors of the forward and backward directions to obtain the hidden representation vector h_t of each word, where t denotes the index of the word in the expression, t = 1,2,...,T, and T denotes the number of words in the expression;
Step 2: compute the attention of each module to each word according to the following formula:

a_{m,t} = exp(f_m^T h_t) / Σ_{k=1}^{T} exp(f_m^T h_k)   (1)

where m ∈ {sub, loc, rel}, m = sub denotes the subject module, m = loc the location module and m = rel the relation module; a_{m,t} denotes the attention of module m to the t-th word, and f_m denotes a trainable vector of module m;

the weighted sum of the word embedding vectors is then computed as the phrase embedding vector of each module:

q_m = Σ_{t=1}^{T} a_{m,t} e_t   (2)

where q_m denotes the phrase embedding of module m;
Step 3: concatenate the hidden representation vectors of the first and last words and convert them into the weights of the three modules with a fully connected layer:

[w_sub, w_loc, w_rel] = softmax(W_m [h_1; h_T] + b_m)   (3)

where w_sub denotes the weight of the subject module, w_loc the weight of the location module and w_rel the weight of the relation module; softmax(·) denotes the normalized exponential function used to compute the weight of each module; W_m denotes the weight of the fully connected layer; h_1 denotes the hidden representation vector of the first word in the language expression, h_T that of the last word, and b_m denotes a bias;
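For illustration, the following is a minimal PyTorch sketch of the language attention network of steps 1 to 3. It is a sketch under assumptions, not the patented implementation: the vocabulary size, embedding and hidden dimensions, batching, and the ordering of the three modules (subject, location, relation) are choices made only for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAttentionNetwork(nn.Module):
    """Sketch of steps 1-3: word encoding, per-module word attention,
    phrase embeddings q_sub/q_loc/q_rel, and module weights w_sub/w_loc/w_rel."""

    def __init__(self, vocab_size, word_dim=512, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, word_dim)          # one-hot index -> e_t
        self.bilstm = nn.LSTM(word_dim, hidden_dim, batch_first=True,
                              bidirectional=True)                    # context encoding -> h_t
        self.f = nn.Parameter(torch.randn(3, 2 * hidden_dim))        # one trainable vector f_m per module
        self.weight_fc = nn.Linear(4 * hidden_dim, 3)                # [h_1; h_T] -> three module weights

    def forward(self, word_ids):
        # word_ids: (B, T) integer word indices of the expression
        e = self.embedding(word_ids)                                 # (B, T, word_dim)
        h, _ = self.bilstm(e)                                        # (B, T, 2*hidden_dim)
        # Eq. (1): attention a_{m,t} of each module over the words
        att = F.softmax(torch.einsum('md,btd->bmt', self.f, h), dim=-1)
        # Eq. (2): phrase embedding q_m as the attention-weighted sum of the e_t
        q = torch.einsum('bmt,btd->bmd', att, e)                     # (B, 3, word_dim)
        # Eq. (3): module weights from the concatenated first/last hidden states
        w = F.softmax(self.weight_fc(torch.cat([h[:, 0], h[:, -1]], dim=-1)), dim=-1)
        return (q[:, 0], q[:, 1], q[:, 2]), w                        # (q_sub, q_loc, q_rel), (B, 3)

# usage: a batch of 2 expressions of 7 words from a hypothetical 10k-word vocabulary
net = LanguageAttentionNetwork(vocab_size=10000)
(q_sub, q_loc, q_rel), w = net(torch.randint(0, 10000, (2, 7)))
```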
Step 4: perform object detection on the input image with a Mask R-CNN detector and take the detected objects as the candidate objects of the image, where a residual network is used as the feature extraction network of the Mask R-CNN detector;
Step 5: fuse the feature C3 output by the conv3_x block of the residual network with the feature C4 output by the conv4_x block through a 1×1 convolution to obtain the subject feature, and feed the subject feature into the attribute prediction branch of the subject module to obtain the predicted attributes;

divide the subject feature into a 14×14 spatial grid and compute the similarity between the phrase embedding vector of the subject module and each grid cell:

H_a = tanh(W_v V + W_q q_sub)   (4)

a_v = softmax(W_{h,a} H_a)   (5)

where H_a denotes the joint representation of the subject phrase embedding and the spatial grid, tanh(·) denotes the tanh activation function, W_v denotes the weight applied to the spatial grid, W_q denotes the weight applied to the subject phrase embedding, V denotes the features of the spatial grid, W_{h,a} denotes the weight that maps H_a to an attention score for each grid cell, and a_v denotes the attention values over the grid;

compute the weighted sum of the components v_g of the spatial grid feature V to obtain the visual representation vector of the candidate object:

ṽ_i^sub = Σ_{g=1}^{G} a_v^g v_g   (6)

where ṽ_i^sub denotes the subject visual representation of candidate i, a_v^g denotes the attention value on the g-th grid cell, v_g denotes the feature of the g-th grid cell, and G denotes the number of grid cells;

compute the similarity between the visual representation vector ṽ_i^sub and the phrase embedding vector q_sub, and take the similarity value as the matching score of the subject module:

S(o_i | q_sub) = F(ṽ_i^sub, q_sub)   (7)

where o_i denotes the i-th candidate object, S(o_i | q_sub) denotes the matching score between the subject visual representation of the i-th candidate and the subject phrase embedding, and F(·) denotes a matching function consisting of two multi-layer perceptrons and L2 regularization;
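A minimal sketch of the subject module's grid attention and matching score (following Eqs. (4)-(7)) is given below. The feature dimension, the 14×14 grid flattened to G = 196 cells, and the concrete form of the matching function F(·) (two small MLPs whose outputs are L2-normalised before a dot product) are assumptions that go beyond the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubjectModule(nn.Module):
    def __init__(self, vis_dim=1024, phrase_dim=512, att_dim=512):
        super().__init__()
        self.W_v = nn.Linear(vis_dim, att_dim)      # projection of the grid features V
        self.W_q = nn.Linear(phrase_dim, att_dim)   # projection of the subject phrase q_sub
        self.W_ha = nn.Linear(att_dim, 1)           # attention score per grid cell
        # assumed form of the matching function F(.): two MLPs + L2 normalisation
        self.mlp_v = nn.Sequential(nn.Linear(vis_dim, 512), nn.ReLU(), nn.Linear(512, 512))
        self.mlp_q = nn.Sequential(nn.Linear(phrase_dim, 512), nn.ReLU(), nn.Linear(512, 512))

    def forward(self, grid_feats, q_sub):
        # grid_feats: (B, G, vis_dim) subject features of B candidates, G = 14*14 cells
        # q_sub:      (B, phrase_dim) subject phrase embedding
        H_a = torch.tanh(self.W_v(grid_feats) + self.W_q(q_sub).unsqueeze(1))   # Eq. (4)
        a_v = F.softmax(self.W_ha(H_a).squeeze(-1), dim=-1)                     # Eq. (5)
        v_sub = torch.einsum('bg,bgd->bd', a_v, grid_feats)                     # Eq. (6)
        v_emb = F.normalize(self.mlp_v(v_sub), dim=-1)
        q_emb = F.normalize(self.mlp_q(q_sub), dim=-1)
        return (v_emb * q_emb).sum(dim=-1)                                      # Eq. (7): S(o_i | q_sub)

# usage: 4 candidates, each with a 14x14 grid of 1024-d subject features
scores = SubjectModule()(torch.randn(4, 196, 1024), torch.randn(4, 512))
```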
Step 6: feed the position visual representation of the candidate objects and the location phrase embedding into the location module. First, encode the top-left position, the bottom-right position and the relative area of each candidate object with respect to the image as a 5-dimensional vector:

l_i = [x_i^tl / W, y_i^tl / H, x_i^br / W, y_i^br / H, (w_i · h_i) / (W · H)]   (8)

where l_i denotes the visual representation of the absolute position of the i-th candidate object, i = 1,2,...,N, N being the number of candidate objects detected by the Mask R-CNN detector; x_i^tl and y_i^tl denote the abscissa and ordinate of the top-left corner of the bounding box of the i-th candidate object, x_i^br and y_i^br denote the abscissa and ordinate of its bottom-right corner, w_i and h_i denote the width and height of the bounding box of the i-th candidate object, and W and H denote the width and height of the input image;

then encode the relative position representation of the candidate objects by computing offsets and area ratios:

δl_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (9)

where δl_ij denotes the relative position representation of the i-th and j-th candidate objects, i, j = 1,2,...,N; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th and j-th candidate objects, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th candidate object;

the position representation vector l̃_i of the candidate object is then:

l̃_i = W_l [l_i; δl_i] + b_l   (10)

where δl_i collects the relative position representations δl_ij of the i-th candidate object, and W_l and b_l denote the weight and bias of the fully connected layer of the location module;

finally, compute the similarity between the position representation vector of the candidate object and the phrase embedding vector q_loc, and take the similarity value as the matching score of the location module:

S(o_i | q_loc) = F(l̃_i, q_loc)   (11)

where S(o_i | q_loc) denotes the matching score between the position visual representation of the i-th candidate object and the location phrase embedding;
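The location encodings of Eqs. (8)-(11) can be sketched as follows. The number of neighbouring candidates kept for the relative encoding (five here), the layer sizes, and the L2-normalised dot product used as the matching function are assumptions; the text itself only fixes the two 5-dimensional encodings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def absolute_location(boxes, img_w, img_h):
    # Eq. (8): boxes are (N, 4) as [x_tl, y_tl, x_br, y_br]; returns the 5-d vectors l_i
    x1, y1, x2, y2 = boxes.unbind(dim=-1)
    w, h = x2 - x1, y2 - y1
    return torch.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                        (w * h) / (img_w * img_h)], dim=-1)

def relative_location(boxes):
    # Eq. (9): offsets and area ratios between every pair of candidates, shape (N, N, 5)
    x1, y1, x2, y2 = boxes.unbind(dim=-1)
    w, h = x2 - x1, y2 - y1
    dxtl = (x1[:, None] - x1[None, :]).abs() / w[:, None]
    dytl = (y1[:, None] - y1[None, :]).abs() / h[:, None]
    dxbr = (x2[:, None] - x2[None, :]).abs() / w[:, None]
    dybr = (y2[:, None] - y2[None, :]).abs() / h[:, None]
    area = (w * h)[None, :] / (w * h)[:, None]
    return torch.stack([dxtl, dytl, dxbr, dybr, area], dim=-1)

class LocationModule(nn.Module):
    def __init__(self, k=5, emb_dim=512, phrase_dim=512):
        super().__init__()
        self.k = k                                          # neighbours kept (an assumption)
        self.fc_loc = nn.Linear(5 + 5 * k, emb_dim)         # Eq. (10): position representation
        self.fc_phrase = nn.Linear(phrase_dim, emb_dim)

    def forward(self, boxes, img_w, img_h, q_loc):
        l_abs = absolute_location(boxes, img_w, img_h)                        # (N, 5)
        delta = relative_location(boxes)                                      # (N, N, 5)
        # keep the k nearest other candidates of each box (distance proxy: top-left offsets)
        d = delta[..., :2].norm(dim=-1) + torch.eye(boxes.size(0)) * 1e6
        idx = d.topk(self.k, largest=False).indices                           # (N, k)
        delta_k = delta[torch.arange(boxes.size(0)).unsqueeze(1), idx]        # (N, k, 5)
        l_tilde = F.normalize(self.fc_loc(torch.cat([l_abs, delta_k.flatten(1)], dim=-1)), dim=-1)
        q_emb = F.normalize(self.fc_phrase(q_loc), dim=-1)
        return (l_tilde * q_emb).sum(dim=-1)                                  # Eq. (11): S(o_i | q_loc)

# usage: 6 synthetic candidate boxes in a 640x480 image and a 512-d location phrase embedding
xy = torch.rand(6, 2) * 300
boxes = torch.cat([xy, xy + torch.rand(6, 2) * 100 + 20], dim=-1)
scores = LocationModule()(boxes, 640.0, 480.0, torch.randn(512))
```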
Step 7: feed the relation visual representation of the candidate objects and the relation phrase embedding into the relation module. First, encode the relative position representation of the surrounding objects with respect to each candidate object:

δm_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (12)

where δm_ij denotes the relative position representation of the i-th candidate object and its j-th surrounding object; each candidate object has 8 surrounding objects, namely the candidate objects with the smallest Euclidean distance to it, i = 1,2,...,N, j = 1,2,...,8; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th candidate object and its j-th surrounding object, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th surrounding object;

then compute the relation visual representation of each candidate object and its surrounding objects:

ṽ_ij^rel = W_r [v_ij; δm_ij] + b_r   (13)

where ṽ_ij^rel denotes the relation visual representation of the i-th candidate object and its j-th surrounding object, W_r denotes the weight of the relation module, v_ij denotes the C4 feature of the j-th surrounding object of the i-th candidate object, and b_r denotes the bias of the relation module;

finally, compute the similarity between the relation visual representation of each candidate object and its surrounding objects and the phrase embedding vector q_rel, and take the maximum similarity value as the matching score of the relation module:

S(o_i | q_rel) = max_{j=1,...,8} F(ṽ_ij^rel, q_rel)   (14)

where S(o_i | q_rel) denotes the matching score between the relation visual representation of the i-th candidate object and its surrounding objects and the relation phrase embedding;
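A sketch of the relation module of Eqs. (12)-(14) in the same style; the dimension of the C4 context features and the similarity form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationModule(nn.Module):
    def __init__(self, vis_dim=1024, emb_dim=512, phrase_dim=512):
        super().__init__()
        self.W_r = nn.Linear(vis_dim + 5, emb_dim)       # Eq. (13): W_r [v_ij; delta_m_ij] + b_r
        self.fc_phrase = nn.Linear(phrase_dim, emb_dim)

    def forward(self, ctx_feats, delta_m, q_rel):
        # ctx_feats: (N, K, vis_dim) C4 features of the K = 8 surrounding objects per candidate
        # delta_m:   (N, K, 5) relative position encoding of Eq. (12)
        v_rel = F.normalize(self.W_r(torch.cat([ctx_feats, delta_m], dim=-1)), dim=-1)
        q_emb = F.normalize(self.fc_phrase(q_rel), dim=-1)
        sim = (v_rel * q_emb).sum(dim=-1)                # (N, K) similarities
        return sim.max(dim=-1).values                    # Eq. (14): max over the surrounding objects

# usage: 6 candidates, each with its 8 nearest surrounding objects
scores = RelationModule()(torch.randn(6, 8, 1024), torch.randn(6, 8, 5), torch.randn(512))
```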
Step 8: compute the overall matching score S_i of each candidate object according to the following formula:

S_i = w_sub × S(o_i | q_sub) + w_loc × S(o_i | q_loc) + w_rel × S(o_i | q_rel)   (15)

where i = 1,2,...,N;

take the candidate object with the highest overall matching score as the object described by the language expression, output its bounding box, and the visual reasoning over the image is complete.
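A short sketch of the score fusion of Eq. (15) and the final selection; the per-candidate score tensors stand in for the outputs of the three modules above.

```python
import torch

def overall_scores(w, s_sub, s_loc, s_rel):
    # w = (w_sub, w_loc, w_rel) from the language attention network; each s_* is an (N,) tensor
    return w[0] * s_sub + w[1] * s_loc + w[2] * s_rel    # Eq. (15)

def predict_box(boxes, w, s_sub, s_loc, s_rel):
    s = overall_scores(w, s_sub, s_loc, s_rel)
    best = int(torch.argmax(s))                          # candidate with the highest overall score
    return boxes[best], best                             # its bounding box is the reasoning result

# usage with dummy scores for 6 candidates
boxes = torch.tensor([[10.0, 20.0, 110.0, 220.0]] * 6)
box, idx = predict_box(boxes, (0.5, 0.29, 0.21),
                       torch.rand(6), torch.rand(6), torch.rand(6))
```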
The beneficial effects of the invention are: (1) more accurate understanding of context information. The invention adopts an end-to-end modular network with fine-grained word weighting, and each module learns to focus on the words it should focus on, so the model understands the expression better. (2) Little dependence on an external language parser. The language attention network designed by the invention adapts to the input referring expression, is less restricted, and can handle expressions with varied structures.
Drawings
FIG. 1 is a flow chart of a natural language visual inference method based on attention mechanism according to the present invention;
FIG. 2 is an image of the inference results obtained using the method of the present invention;
where (a) is the input referring expression, (b) is the input original image, and (c) is the inference result image obtained by the invention.
Detailed Description
The present invention will be further described with reference to the drawings and an embodiment; the invention includes, but is not limited to, the following embodiment.
As shown in FIG. 1, the present invention provides a natural language visual reasoning method based on an attention mechanism, which mainly comprises a language attention network and three visual processing modules. The specific implementation process is as follows:
1. language attention network module
(1) Encode each word in the input language expression into an embedded representation vector e_t using one-hot encoding; then encode the context of each word with a BiLSTM and concatenate the hidden vectors of the forward and backward directions to obtain the hidden representation vector h_t of each word, where t denotes the index of the word in the expression, t = 1,2,...,T, and T denotes the number of words in the expression;

(2) Compute the attention of each module to each word according to the following formula:

a_{m,t} = exp(f_m^T h_t) / Σ_{k=1}^{T} exp(f_m^T h_k)   (16)

where m ∈ {sub, loc, rel}, m = sub denotes the subject module, m = loc the location module and m = rel the relation module; a_{m,t} denotes the attention of module m to the t-th word, and f_m denotes a trainable vector of module m;

the weighted sum of the word embedding vectors is then computed as the phrase embedding vector of each module:

q_m = Σ_{t=1}^{T} a_{m,t} e_t   (17)

where q_m denotes the phrase embedding of module m;

(3) Concatenate the hidden representation vectors of the first and last words and convert them into the weights of the three modules with a fully connected layer:

[w_sub, w_loc, w_rel] = softmax(W_m [h_1; h_T] + b_m)   (18)

where w_sub denotes the weight of the subject module, w_loc the weight of the location module and w_rel the weight of the relation module; softmax(·) denotes the normalized exponential function used to compute the weight of each module; W_m denotes the weight of the fully connected layer; h_1 denotes the hidden representation vector of the first word in the language expression, h_T that of the last word, and b_m denotes a bias;
2. input image object detection
Perform object detection on the input image with a Mask R-CNN detector and take the detected objects as the candidate objects of the image; a residual network is used as the feature extraction network of the Mask R-CNN detector.
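As an illustration of this step, the sketch below obtains candidate boxes from torchvision's off-the-shelf Mask R-CNN (recent torchvision versions). Note the assumptions: this pretrained model uses a ResNet-50 FPN backbone and the 0.5 score threshold is chosen arbitrarily, whereas the described method trains Mask R-CNN with a deeper residual backbone on a COCO subset (see the experiments below).

```python
import torch
import torchvision

# off-the-shelf Mask R-CNN as a stand-in candidate generator (ResNet-50 FPN backbone)
detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = torch.rand(3, 480, 640)                 # placeholder for the input image tensor
with torch.no_grad():
    out = detector([image])[0]                  # dict with 'boxes', 'labels', 'scores', 'masks'

keep = out["scores"] > 0.5                      # detection score threshold (an assumption)
candidate_boxes = out["boxes"][keep]            # candidate objects o_i passed to the three modules
```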
3. Subject module
Fuse the feature C3 output by the conv3_x block of the residual network with the feature C4 output by the conv4_x block through a 1×1 convolution to obtain the subject feature, and feed the subject feature into the attribute prediction branch of the subject module to obtain the predicted attributes;

divide the subject feature into a 14×14 spatial grid and compute the similarity between the phrase embedding vector of the subject module and each grid cell:

H_a = tanh(W_v V + W_q q_sub)   (19)

a_v = softmax(W_{h,a} H_a)   (20)

where H_a denotes the joint representation of the subject phrase embedding and the spatial grid, tanh(·) denotes the tanh activation function, W_v denotes the weight applied to the spatial grid, W_q denotes the weight applied to the subject phrase embedding, V denotes the features of the spatial grid, W_{h,a} denotes the weight that maps H_a to an attention score for each grid cell, and a_v denotes the attention values over the grid;

compute the weighted sum of the components v_g of the spatial grid feature V to obtain the visual representation vector of the candidate object:

ṽ_i^sub = Σ_{g=1}^{G} a_v^g v_g   (21)

where ṽ_i^sub denotes the subject visual representation of candidate i, a_v^g denotes the attention value on the g-th grid cell, v_g denotes the feature of the g-th grid cell, and G denotes the number of grid cells;

compute the similarity between the visual representation vector ṽ_i^sub and the phrase embedding vector q_sub, and take the similarity value as the matching score of the subject module:

S(o_i | q_sub) = F(ṽ_i^sub, q_sub)   (22)

where o_i denotes the i-th candidate object, S(o_i | q_sub) denotes the matching score between the subject visual representation of the i-th candidate and the subject phrase embedding, and F(·) denotes a matching function consisting of two multi-layer perceptrons and L2 regularization;
4. position module
Feed the position visual representation of the candidate objects and the location phrase embedding into the location module. First, encode the top-left position, the bottom-right position and the relative area of each candidate object with respect to the image as a 5-dimensional vector:

l_i = [x_i^tl / W, y_i^tl / H, x_i^br / W, y_i^br / H, (w_i · h_i) / (W · H)]   (23)

where l_i denotes the visual representation of the absolute position of the i-th candidate object, i = 1,2,...,N, N being the number of candidate objects detected by the Mask R-CNN detector; x_i^tl and y_i^tl denote the abscissa and ordinate of the top-left corner of the bounding box of the i-th candidate object, x_i^br and y_i^br denote the abscissa and ordinate of its bottom-right corner, w_i and h_i denote the width and height of the bounding box of the i-th candidate object, and W and H denote the width and height of the input image;

then encode the relative position representation of the candidate objects by computing offsets and area ratios:

δl_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (24)

where δl_ij denotes the relative position representation of the i-th and j-th candidate objects, i, j = 1,2,...,N; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th and j-th candidate objects, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th candidate object;

the position representation vector l̃_i of the candidate object is then:

l̃_i = W_l [l_i; δl_i] + b_l   (25)

where δl_i collects the relative position representations δl_ij of the i-th candidate object, and W_l and b_l denote the weight and bias of the fully connected layer of the location module;

finally, compute the similarity between the position representation vector of the candidate object and the phrase embedding vector q_loc, and take the similarity value as the matching score of the location module:

S(o_i | q_loc) = F(l̃_i, q_loc)   (26)

where S(o_i | q_loc) denotes the matching score between the position visual representation of the i-th candidate object and the location phrase embedding;
5. relationship module
Feed the relation visual representation of the candidate objects and the relation phrase embedding into the relation module. First, encode the relative position representation of the surrounding objects with respect to each candidate object:

δm_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (27)

where δm_ij denotes the relative position representation of the i-th candidate object and its j-th surrounding object; each candidate object has 8 surrounding objects, namely the candidate objects with the smallest Euclidean distance to it, i = 1,2,...,N, j = 1,2,...,8; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th candidate object and its j-th surrounding object, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th surrounding object;

then compute the relation visual representation of each candidate object and its surrounding objects:

ṽ_ij^rel = W_r [v_ij; δm_ij] + b_r   (28)

where ṽ_ij^rel denotes the relation visual representation of the i-th candidate object and its j-th surrounding object, W_r denotes the weight of the relation module, v_ij denotes the C4 feature of the j-th surrounding object of the i-th candidate object, and b_r denotes the bias of the relation module;

finally, compute the similarity between the relation visual representation of each candidate object and its surrounding objects and the phrase embedding vector q_rel, and take the maximum similarity value as the matching score of the relation module:

S(o_i | q_rel) = max_{j=1,...,8} F(ṽ_ij^rel, q_rel)   (29)

where S(o_i | q_rel) denotes the matching score between the relation visual representation of the i-th candidate object and its surrounding objects and the relation phrase embedding;
6. computing visual reasoning results
Compute the overall matching score S_i of each candidate object according to the following formula:

S_i = w_sub × S(o_i | q_sub) + w_loc × S(o_i | q_loc) + w_rel × S(o_i | q_rel)   (30)

where i = 1,2,...,N;

take the candidate object with the highest overall matching score as the object described by the language expression, output its bounding box, and the visual reasoning over the image is complete.
To verify the effectiveness of the method, simulation experiments were performed with the PyTorch framework on a GTX 1080Ti graphics card with 11 GB of video memory under the Ubuntu 18 operating system. The datasets used in the experiments are three published referring expression comprehension datasets collected from the COCO dataset: RefCOCO, RefCOCO+ and RefCOCOg. First, Mask R-CNN is trained on a subset of the COCO dataset using the pre-trained weights of a ResNet-152 feature extraction network; the number of iterations is set to 1,250,000 and the learning rate to 0.001. Then the trained Mask R-CNN is used to extract features of the images in the RefCOCO, RefCOCO+ and RefCOCOg datasets, and the features are saved to files. Finally, the extracted features are processed by the method of the invention. The weights computed for the three visual modules in the example of FIG. 1 are 0.5, 0.29 and 0.21; score_sub denotes the matching score of the subject module, score_loc the matching score of the location module, score_rel the matching score of the relation module, and score_overall the overall matching score. FIG. 2 shows the inference result on one of the images, which demonstrates the accuracy of the invention on the referring expression comprehension task and its effectiveness in understanding contextual semantic information.
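For illustration, the weighted fusion with the module weights reported above (0.5, 0.29, 0.21) can be reproduced numerically; the per-candidate module scores below are made-up placeholders, not experimental values.

```python
# weights of the subject, location and relation modules from the example in FIG. 1
w_sub, w_loc, w_rel = 0.5, 0.29, 0.21

# hypothetical (score_sub, score_loc, score_rel) for two candidates
candidates = {
    "candidate_1": (0.9, 0.4, 0.3),
    "candidate_2": (0.6, 0.8, 0.7),
}

# Eq. (15)/(30): overall score as the weighted sum of the three module scores
overall = {name: w_sub * s + w_loc * l + w_rel * r
           for name, (s, l, r) in candidates.items()}
best = max(overall, key=overall.get)
# overall is about {'candidate_1': 0.629, 'candidate_2': 0.679}, so candidate_2 is selected
```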

Claims (1)

1. A natural language visual reasoning method based on an attention mechanism is characterized by comprising the following steps:
Step 1: encode each word in the input language expression into an embedded representation vector e_t using one-hot encoding; then encode the context of each word with a BiLSTM and concatenate the hidden vectors of the forward and backward directions to obtain the hidden representation vector h_t of each word, where t denotes the index of the word in the expression, t = 1,2,...,T, and T denotes the number of words in the expression;
Step 2: compute the attention of each module to each word according to the following formula:

a_{m,t} = exp(f_m^T h_t) / Σ_{k=1}^{T} exp(f_m^T h_k)   (1)

where m ∈ {sub, loc, rel}, m = sub denotes the subject module, m = loc the location module and m = rel the relation module; a_{m,t} denotes the attention of module m to the t-th word, and f_m denotes a trainable vector of module m;

the weighted sum of the word embedding vectors is then computed as the phrase embedding vector of each module:

q_m = Σ_{t=1}^{T} a_{m,t} e_t   (2)

where q_m denotes the phrase embedding of module m;
Step 3: concatenate the hidden representation vectors of the first and last words and convert them into the weights of the three modules with a fully connected layer:

[w_sub, w_loc, w_rel] = softmax(W_m [h_1; h_T] + b_m)   (3)

where w_sub denotes the weight of the subject module, w_loc the weight of the location module and w_rel the weight of the relation module; softmax(·) denotes the normalized exponential function used to compute the weight of each module; W_m denotes the weight of the fully connected layer; h_1 denotes the hidden representation vector of the first word in the language expression, h_T that of the last word, and b_m denotes a bias;
Step 4: perform object detection on the input image with a Mask R-CNN detector and take the detected objects as the candidate objects of the image, where a residual network is used as the feature extraction network of the Mask R-CNN detector;
Step 5: fuse the feature C3 output by the conv3_x block of the residual network with the feature C4 output by the conv4_x block through a 1×1 convolution to obtain the subject feature, and feed the subject feature into the attribute prediction branch of the subject module to obtain the predicted attributes;

divide the subject feature into a 14×14 spatial grid and compute the similarity between the phrase embedding vector of the subject module and each grid cell:

H_a = tanh(W_v V + W_q q_sub)   (4)

a_v = softmax(W_{h,a} H_a)   (5)

where H_a denotes the joint representation of the subject phrase embedding and the spatial grid, tanh(·) denotes the tanh activation function, W_v denotes the weight applied to the spatial grid, W_q denotes the weight applied to the subject phrase embedding, V denotes the features of the spatial grid, W_{h,a} denotes the weight that maps H_a to an attention score for each grid cell, and a_v denotes the attention values over the grid;

compute the weighted sum of the components v_g of the spatial grid feature V to obtain the visual representation vector of the candidate object:

ṽ_i^sub = Σ_{g=1}^{G} a_v^g v_g   (6)

where ṽ_i^sub denotes the subject visual representation of candidate i, a_v^g denotes the attention value on the g-th grid cell, v_g denotes the feature of the g-th grid cell, and G denotes the number of grid cells;

compute the similarity between the visual representation vector ṽ_i^sub and the phrase embedding vector q_sub, and take the similarity value as the matching score of the subject module:

S(o_i | q_sub) = F(ṽ_i^sub, q_sub)   (7)

where o_i denotes the i-th candidate object, S(o_i | q_sub) denotes the matching score between the subject visual representation of the i-th candidate and the subject phrase embedding, and F(·) denotes a matching function consisting of two multi-layer perceptrons and L2 regularization;
Step 6: feed the position visual representation of the candidate objects and the location phrase embedding into the location module. First, encode the top-left position, the bottom-right position and the relative area of each candidate object with respect to the image as a 5-dimensional vector:

l_i = [x_i^tl / W, y_i^tl / H, x_i^br / W, y_i^br / H, (w_i · h_i) / (W · H)]   (8)

where l_i denotes the visual representation of the absolute position of the i-th candidate object, i = 1,2,...,N, N being the number of candidate objects detected by the Mask R-CNN detector; x_i^tl and y_i^tl denote the abscissa and ordinate of the top-left corner of the bounding box of the i-th candidate object, x_i^br and y_i^br denote the abscissa and ordinate of its bottom-right corner, w_i and h_i denote the width and height of the bounding box of the i-th candidate object, and W and H denote the width and height of the input image;

then encode the relative position representation of the candidate objects by computing offsets and area ratios:

δl_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (9)

where δl_ij denotes the relative position representation of the i-th and j-th candidate objects, i, j = 1,2,...,N; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th and j-th candidate objects, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th candidate object;

the position representation vector l̃_i of the candidate object is then:

l̃_i = W_l [l_i; δl_i] + b_l   (10)

where δl_i collects the relative position representations δl_ij of the i-th candidate object, and W_l and b_l denote the weight and bias of the fully connected layer of the location module;

finally, compute the similarity between the position representation vector of the candidate object and the phrase embedding vector q_loc, and take the similarity value as the matching score of the location module:

S(o_i | q_loc) = F(l̃_i, q_loc)   (11)

where S(o_i | q_loc) denotes the matching score between the position visual representation of the i-th candidate object and the location phrase embedding;
Step 7: feed the relation visual representation of the candidate objects and the relation phrase embedding into the relation module. First, encode the relative position representation of the surrounding objects with respect to each candidate object:

δm_ij = [[Δx_tl]_ij / w_i, [Δy_tl]_ij / h_i, [Δx_br]_ij / w_i, [Δy_br]_ij / h_i, (w_j · h_j) / (w_i · h_i)]   (12)

where δm_ij denotes the relative position representation of the i-th candidate object and its j-th surrounding object; each candidate object has 8 surrounding objects, namely the candidate objects with the smallest Euclidean distance to it, i = 1,2,...,N, j = 1,2,...,8; [Δx_tl]_ij and [Δy_tl]_ij denote the absolute values of the differences of the abscissa and ordinate of the top-left corners of the bounding boxes of the i-th candidate object and its j-th surrounding object, [Δx_br]_ij and [Δy_br]_ij denote the absolute values of the differences of the abscissa and ordinate of their bottom-right corners, and w_j and h_j denote the width and height of the bounding box of the j-th surrounding object;

then compute the relation visual representation of each candidate object and its surrounding objects:

ṽ_ij^rel = W_r [v_ij; δm_ij] + b_r   (13)

where ṽ_ij^rel denotes the relation visual representation of the i-th candidate object and its j-th surrounding object, W_r denotes the weight of the relation module, v_ij denotes the C4 feature of the j-th surrounding object of the i-th candidate object, and b_r denotes the bias of the relation module;

finally, compute the similarity between the relation visual representation of each candidate object and its surrounding objects and the phrase embedding vector q_rel, and take the maximum similarity value as the matching score of the relation module:

S(o_i | q_rel) = max_{j=1,...,8} F(ṽ_ij^rel, q_rel)   (14)

where S(o_i | q_rel) denotes the matching score between the relation visual representation of the i-th candidate object and its surrounding objects and the relation phrase embedding;
Step 8: compute the overall matching score S_i of each candidate object according to the following formula:

S_i = w_sub × S(o_i | q_sub) + w_loc × S(o_i | q_loc) + w_rel × S(o_i | q_rel)   (15)

where i = 1,2,...,N;

take the candidate object with the highest overall matching score as the object described by the language expression, output its bounding box, and the visual reasoning over the image is complete.
CN202111476196.8A 2021-12-06 2021-12-06 Natural language visual reasoning method based on attention mechanism Active CN114239594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111476196.8A CN114239594B (en) 2021-12-06 2021-12-06 Natural language visual reasoning method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111476196.8A CN114239594B (en) 2021-12-06 2021-12-06 Natural language visual reasoning method based on attention mechanism

Publications (2)

Publication Number Publication Date
CN114239594A true CN114239594A (en) 2022-03-25
CN114239594B CN114239594B (en) 2024-03-08

Family

ID=80753370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111476196.8A Active CN114239594B (en) 2021-12-06 2021-12-06 Natural language visual reasoning method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN114239594B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170200066A1 (en) * 2016-01-13 2017-07-13 Adobe Systems Incorporated Semantic Natural Language Vector Space
US20180143966A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial Attention Model for Image Captioning
CN110390289A (en) * 2019-07-17 2019-10-29 苏州大学 Based on the video security protection detection method for censuring understanding
US20210192274A1 (en) * 2019-12-23 2021-06-24 Tianjin University Visual relationship detection method and system based on adaptive clustering learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hou Xingchen; Wang Jin: "Image captioning based on an adaptive attention model", Computer and Modernization, no. 06, 15 June 2020 (2020-06-15) *
Luo Huilan; Yue Liangliang: "Image captioning with cross-layer multi-model feature fusion and causal convolution decoding", Journal of Image and Graphics, no. 08, 12 August 2020 (2020-08-12) *

Also Published As

Publication number Publication date
CN114239594B (en) 2024-03-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant