CN112966519A - Method, system and storage medium for locating a referring phrase - Google Patents

Method, system and storage medium for locating a referring phrase

Info

Publication number
CN112966519A
CN112966519A (application number CN202110138354.2A)
Authority
CN
China
Prior art keywords: phrase, offset, visual, anchor, prediction
Prior art date
Legal status
Granted
Application number
CN202110138354.2A
Other languages
Chinese (zh)
Other versions
CN112966519B (en)
Inventor
杨超
汪国庆
蒋斌
Current Assignee
Hunan University
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202110138354.2A
Publication of CN112966519A
Application granted
Publication of CN112966519B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/30 Semantic analysis (G Physics; G06 Computing; G06F Electric digital data processing; G06F40/00 Handling natural language data)
    • G06N3/048 Activation functions (G06N Computing arrangements based on specific computational models; G06N3/00 Based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture)
    • G06N3/08 Learning methods

Abstract

The invention discloses a method, a system and a storage medium for locating a referring phrase. Visual features modulated by phrase information are used to generate region proposals through regression, and two loss functions are designed to separately control the anchor-box offsets and the confidence, so that the generated proposals are both highly accurate and highly discriminative, alleviating the performance bottleneck caused by low-quality proposals in the prior art.

Description

Method, system and storage medium for locating a referring phrase
Technical Field
The invention relates to the technical field of human-computer interaction, and in particular to a method and a system for locating a referring phrase based on phrase-guided proposal generation.
Background
Referring phrase localization, also called referring expression comprehension (REC), is one of the core tasks of human-computer interaction. The task can be broadly described as locating, with a bounding box, the entity referred to by a natural-language phrase (e.g., "a girl wearing a pink shirt, holding a red umbrella"). The task is very challenging because it requires not only understanding the fine-grained semantic information of both images and natural-language phrases, but also aligning and associating the two modalities for accurate localization.
The basic pipeline of existing phrase localization methods is as follows: given an input image, a proposal generator produces a certain number of region proposals, and the visual features of each proposal are extracted. Meanwhile, a language model (such as BERT) encodes the corresponding referring phrase into a language feature. The visual features and the language features are then fed to a multimodal fusion module to produce fused features. Finally, a ranking module scores the fused features of all proposals, and the proposal region with the highest score is output as the result.
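For orientation, the following is a minimal sketch of this generic two-stage pipeline (Python); all component callables are hypothetical placeholders, not part of the patent.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

def locate_phrase(
    image,
    phrase: str,
    generate_proposals: Callable,   # image -> List[Box]
    visual_encoder: Callable,       # (image, box) -> feature vector
    language_encoder: Callable,     # phrase -> feature vector
    fuse: Callable,                 # (visual_feat, lang_feat) -> fused feature
    score: Callable,                # fused feature -> float
) -> Box:
    """Two-stage REC pipeline: propose, encode, fuse, rank."""
    proposals: List[Box] = generate_proposals(image)
    lang_feat = language_encoder(phrase)
    scored = [
        (score(fuse(visual_encoder(image, box), lang_feat)), box)
        for box in proposals
    ]
    # The highest-scoring proposal is returned as the localization result.
    return max(scored, key=lambda s: s[0])[1]
```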
Proposal generation is a very important step in the phrase localization process; if this step goes wrong, all subsequent processing goes wrong. Conventional proposal generation methods fall into two groups: training-free methods (e.g., Selective Search, EdgeBoxes) and class-aware object detectors (e.g., Faster R-CNN trained on the 80-class COCO dataset). Training-free methods typically exploit certain features of the image itself, such as edge information, to generate region proposals. However, the region proposals generated by such methods lack discriminativeness, so it is difficult to hit the true region within a limited number of proposals. A class-aware object detector can accurately generate highly discriminative proposals only when the referred object belongs to a preset category; the vocabulary of referring phrases, however, is usually open, and if the referred object is outside the preset categories (e.g., "window", which is not among the preset COCO categories), the correct region proposal cannot be generated. Furthermore, both kinds of methods share a fatal drawback: they generate region proposals from the image information alone, without considering the referring phrase, which can produce a large number of redundant proposals unrelated to it.
Disclosure of Invention
The technical problem to be solved by the invention is, in view of the defects of the prior art, to provide a method and a system for locating a referring phrase that improve the accuracy and discriminativeness of the generated proposals.
In order to solve the above technical problem, the technical scheme adopted by the invention is as follows: a method for locating a referring phrase based on phrase-guided proposal generation, comprising the following steps:
S1, extracting visual features $V_k$ of the input image at different scales, and normalizing the visual features $V_k$;
S2, obtaining the text information $Q$, and modulating the normalized visual features with the text information $Q$ to obtain the text-modulated visual features $\hat{V}_k$;
S3, placing N anchor boxes of different shapes at each position of the text-modulated visual features $\hat{V}_k$;
S4, for each anchor box, obtaining an M-dimensional prediction vector corresponding to the anchor box through a regression model, wherein the first M-1 values of the prediction vector represent at least the predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point, and the last value, obtained through a sigmoid function, represents the confidence of the offset box corresponding to the anchor box;
S5, computing the forward-propagation error of the regression model with an overall loss function composed of the loss function controlling the predicted offsets and the loss function controlling the confidence, computing the gradient of the forward-propagation error, updating the regression model with the gradient for model training, and finally obtaining a trained proposal generation model.
The method predicts region proposals by regression from visual features modulated by text features, and the whole process is trained only on a referring-phrase localization dataset, so proposal generation is not restricted to preset categories; moreover, the two loss functions that separately control the anchor-box offsets and the confidence make the generated proposals both more accurate and more discriminative.
The method of the present invention further comprises:
S6, obtaining the predicted offsets and confidences of all anchor boxes with the trained proposal generation model, then outputting K anchor boxes in descending order of their predicted confidence, shifting each output anchor box according to its predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point, and finally obtaining the proposals for phrase localization. The invention outputs proposals in order of confidence, which guarantees their accuracy, while shifting the anchor boxes greatly improves their discriminativeness.
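As an illustration of step S6, the following is a minimal decoding sketch (Python/NumPy). The exact decoding equations are not given in the text, so the YOLO-style convention of sigmoid-transformed center offsets and exponentiated width/height is an assumption, as are the array layouts.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_proposals(anchors: np.ndarray, preds: np.ndarray, k: int = 7) -> np.ndarray:
    """Shift the k most confident anchors by their predicted offsets.

    anchors: (N, 4) array of (cx, cy, w, h) anchor boxes.
    preds:   (N, 5) array of (tx, ty, tw, th, conf_logit) predictions.
    Returns a (k, 4) array of proposal boxes (cx, cy, w, h).
    """
    conf = sigmoid(preds[:, 4])
    order = np.argsort(-conf)[:k]               # top-k anchors by confidence
    a, t = anchors[order], preds[order]
    cx = a[:, 0] + sigmoid(t[:, 0]) * a[:, 2]   # center offsets pass through sigmoid
    cy = a[:, 1] + sigmoid(t[:, 1]) * a[:, 3]
    w = a[:, 2] * np.exp(t[:, 2])               # width/height decoded multiplicatively
    h = a[:, 3] * np.exp(t[:, 3])
    return np.stack([cx, cy, w, h], axis=1)
```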
In step S1, the normalization of the visual features $V_k$ is implemented as follows: a convolution operation is applied to the visual features $V_k$, the result of the convolution is normalized, and the normalized result is passed through a ReLU function to obtain the normalized visual features.
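A minimal sketch of such a conv-norm-ReLU block (PyTorch). The 1x1 kernel, batch normalization, and the output dimension of 512 follow the detailed description later in this document; treat the block as an illustration rather than the exact network.

```python
import torch.nn as nn

def make_norm_block(in_channels: int, out_channels: int = 512) -> nn.Sequential:
    """1x1 convolution, batch normalization, then ReLU, as in step S1."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )
```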
In step S2, the text-modulated visual features $\hat{V}_k$ are given by:

$\hat{V}_k(i,j) = f_2\big(p \odot f_1(V_k(i,j)) + q\big)$

where $\odot$ is the Hadamard product, $(i,j)$ are the spatial coordinates of the normalized visual features, $\hat{V}_k(i,j)$ and $V_k(i,j)$ respectively denote the feature components of the text-modulated visual features and the original visual features at spatial location $(i,j)$, and $f_1(\cdot)$ and $f_2(\cdot)$ are learnable mapping layers: $f_1(\cdot)$ denotes instance normalization followed by a convolution, and $f_2(\cdot)$ denotes normalization, convolution, and a ReLU activation applied in sequence. Further, $p = \tanh(W_p Q + b_p)$ and $q = \tanh(W_q Q + b_q)$, where $W_p$, $b_p$, $W_q$, $b_q$ are all learnable parameters. The invention thereby fuses the visual features with the text information; the learnable mapping layers and learnable parameters give the fused multimodal features stronger expressive power than direct feature concatenation.
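The following is a minimal PyTorch sketch of this modulation under stated assumptions: 512-dimensional features and text vectors (as in the detailed description below), and single linear layers standing in for the two multi-layer perceptrons that produce $p$ and $q$.

```python
import torch
import torch.nn as nn

class TextModulation(nn.Module):
    """FiLM-style modulation: scale and shift visual features by text vectors."""

    def __init__(self, channels: int = 512, text_dim: int = 512):
        super().__init__()
        self.to_p = nn.Linear(text_dim, channels)   # scaling vector p
        self.to_q = nn.Linear(text_dim, channels)   # translation vector q
        # f1: instance norm + 1x1 conv; f2: batch norm + 3x3 conv + ReLU
        self.f1 = nn.Sequential(
            nn.InstanceNorm2d(channels), nn.Conv2d(channels, channels, 1))
        self.f2 = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, v: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        p = torch.tanh(self.to_p(text))[:, :, None, None]  # broadcast over (i, j)
        q = torch.tanh(self.to_q(text))[:, :, None, None]
        return self.f2(p * self.f1(v) + q)
```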
The loss function controlling the predicted offsets is:

$L_{coord} = \sum_{n=1}^{N} \mathbb{1}_n \big[ (\sigma((t_x)_n) - \hat{t}_{x,n})^2 + (\sigma((t_y)_n) - \hat{t}_{y,n})^2 + ((t_w)_n - \hat{t}_{w,n})^2 + ((t_h)_n - \hat{t}_{h,n})^2 \big]$

where $\mathbb{1}_n$ is 1 when the intersection-over-union IoU (i.e., the ratio of the area of the intersection of two boxes to the area of their union) between the $n$-th anchor box and the ground-truth box in the dataset exceeds the threshold $\eta$, and 0 otherwise. The loss function $L_{coord}$ drives the model's predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point of the anchor box as close as possible to the label values.
In step S5, the loss function controlling the confidence is:

$L_{conf} = \sum_{i=1}^{N} s_i^{*} \log \frac{s_i^{*}}{s_i}$

where $N$ is the number of anchor boxes, $s_i$ is the predicted confidence, and $s_i^{*}$ is the confidence label. The loss function $L_{conf}$ drives the model's predicted anchor-box confidence as close as possible to the label value.
The invention also provides a referring phrase localization system based on phrase-guided proposal generation, comprising a computer device; the computer device is configured or programmed to carry out the steps of the above method.
The present invention also provides a computer-readable storage medium storing a program; the program is configured for performing the steps of the above-described method.
Compared with the prior art, the invention has the beneficial effects that:
the method of the present invention solves the problem of the prior art of ignoring phrase information during the proposal generation process. The invention utilizes the visual characteristics modified by phrase information, generates the proposal through regression, and designs two loss functions to respectively control the offset and the confidence coefficient of the anchor frame, so that the generated proposal has the advantages of high accuracy and high identifiability, and the performance bottleneck caused by low-quality proposal in the prior art is relieved.
Drawings
FIG. 1 is a visual comparison between the proposals generated by the present invention and those generated by two prior proposal methods, where solid boxes are ground-truth regions and dashed boxes are proposals; (a) Selective Search results; (b) Faster R-CNN detector results; (c) results of the present method;
FIG. 2 is a flowchart of the phrase-localization proposal generation process according to an embodiment of the present invention.
Detailed Description
An embodiment of the invention is implemented through the following specific steps:
First, image visual features $V_k$ ($k = 1, 2, 3$) at three different resolutions are extracted using a Darknet53 feature pyramid network pre-trained on the COCO object detection dataset (Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014), while a language model extracts the text feature $Q$ of the referring phrase. To introduce the text information of the phrase into the proposal generation process, the invention modulates all visual features with the text information to obtain the text-modulated visual features $\hat{V}_k$. The model is then trained with two loss functions that control the anchor-box offsets and the confidence, respectively. Finally, the model performs regression on $\hat{V}_k$ to predict the position coordinates of the proposal regions.
To obtain visual features containing phrase information, the invention modulates the visual features at the three different resolutions with the text feature, using the conditional normalization method employed in image-level tasks (H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. Courville, "Modulating early visual processing by language," in NIPS, 2017). First, the text feature $Q$ is mapped into a scaling vector $p$ and a translation vector $q$ by two multi-layer perceptrons:

$p = \tanh(W_p Q + b_p), \quad q = \tanh(W_q Q + b_q)$

where $W_p$, $b_p$, $W_q$, $b_q$ are learnable parameters. Then, the text-modulated visual features $\hat{V}_k$ are obtained via equation (1):

$\hat{V}_k(i,j) = f_2\big(p \odot f_1(V_k(i,j)) + q\big) \qquad (1)$

where $\odot$ is the Hadamard product, $(i,j)$ are the spatial coordinates of the visual feature, and $f_1$ and $f_2$ are the learnable mapping layers shown in FIG. 2: $f_1$ consists of an instance normalization layer followed by a 1x1 convolutional layer, and $f_2$ consists of a batch normalization layer followed by a 3x3 convolutional layer and a ReLU activation function. The same operation is applied at every coordinate position $(i,j)$ of the visual feature.
As shown in stage 1 of FIG. 2, the localization module takes the text-modulated visual features $\hat{V}_k$ as input and outputs the position coordinates of multiple image regions as proposals. There are 8x8 + 16x16 + 32x32 = 1344 different positions in total, each corresponding to a 512-dimensional vector of the text-modulated visual features. Following YOLOv3 (J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018), three anchor boxes of different shapes are placed at each position; the anchor sizes are obtained by K-means clustering, with (1 - IoU) as the distance, over the widths and heights of the ground-truth boxes in the training set (which can be any phrase localization dataset). The invention therefore has $N = 3 \times 1344 = 4032$ anchor boxes in total. For each anchor box, 4 values are predicted by regression, namely the offsets of the width, the height, and the horizontal and vertical coordinates of the anchor's center point, and a fifth value, passed through a sigmoid function, represents the confidence of the corresponding shifted box.
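A minimal NumPy sketch of this K-means clustering with the (1 - IoU) distance follows. The initialization and stopping rule are assumptions; the text specifies only the distance measure and the use of ground-truth box widths and heights.

```python
import numpy as np

def iou_wh(boxes: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """IoU between (N, 2) width/height boxes and (K, 2) centers, anchored at the origin."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh: np.ndarray, k: int = 3, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster ground-truth box sizes with distance d = 1 - IoU."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, centers), axis=1)  # min (1 - IoU) == max IoU
        new = np.array([
            wh[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return centers
```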
For the prediction of the confidence value, the key is how to design the loss function so that the predicted confidence $S$ is consistent with the confidence label $S^{*}$. Most existing methods (Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo, "A fast and accurate one-stage approach to visual grounding," in ICCV, 2019; X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li, "Improving referring expression grounding with cross-modal attention-guided erasing," in CVPR, 2019) define the confidence label $S^{*}$ as a one-hot vector with a single element equal to 1: the label is set to 1 for the anchor box whose IoU with the ground-truth box is largest, and 0 otherwise. The model is then trained with a cross-entropy loss on this one-hot label.
The present invention improves on this scheme. Specifically, a threshold $\eta$ is set; the IoU score between each anchor box and the ground-truth box is computed, and if the score exceeds the threshold $\eta$ (set to 0.7), it is used as the confidence label, otherwise the label is set to 0. This yields a smooth confidence label $\hat{s}^{*}_n$:

$\hat{s}^{*}_n = \begin{cases} \mathrm{IoU}_n, & \mathrm{IoU}_n > \eta \\ 0, & \text{otherwise} \end{cases}$

An L1 normalization is then applied to ensure $\sum S^{*} = 1$. The invention accordingly uses the Kullback-Leibler divergence (KLD) as its loss function, making the predicted confidence score $S$ approximate the smooth confidence label $S^{*}$, as in equation (3):

$L_{conf} = \sum_{i=1}^{N} s_i^{*} \log \frac{s_i^{*}}{s_i} \qquad (3)$

Note that $S$ is also L1-normalized so that it represents a probability distribution like $S^{*}$. The benefits of smooth labels are twofold: 1) besides the anchor box with the largest IoU against the ground-truth box, other anchor boxes can also provide useful context information; 2) smooth labels effectively regularize the model, mitigating overfitting.
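A minimal PyTorch sketch of the smooth labels and the KLD loss of equation (3) follows; the small epsilon guards against log(0) and division by zero are added assumptions.

```python
import torch

def smooth_labels(ious: torch.Tensor, eta: float = 0.7) -> torch.Tensor:
    """Keep IoU scores above eta as labels, zero the rest, then L1-normalize."""
    s = torch.where(ious > eta, ious, torch.zeros_like(ious))
    return s / s.sum().clamp_min(1e-8)

def kld_conf_loss(pred_conf: torch.Tensor, ious: torch.Tensor, eta: float = 0.7) -> torch.Tensor:
    """KL(S* || S) between smooth labels S* and L1-normalized predictions S."""
    s_star = smooth_labels(ious, eta)
    s = pred_conf / pred_conf.sum().clamp_min(1e-8)
    mask = s_star > 0                       # terms with s*_i = 0 contribute nothing
    return (s_star[mask] * (s_star[mask] / s[mask].clamp_min(1e-8)).log()).sum()
```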
For the prediction of the anchor-box offsets, the invention uses the following MSE loss:

$L_{coord} = \sum_{n=1}^{N} \mathbb{1}_n \big[ (\sigma((t_x)_n) - \hat{t}_{x,n})^2 + (\sigma((t_y)_n) - \hat{t}_{y,n})^2 + ((t_w)_n - \hat{t}_{w,n})^2 + ((t_h)_n - \hat{t}_{h,n})^2 \big]$

where $\mathbb{1}_n$ indicates whether the IoU between the $n$-th anchor box and the ground-truth box exceeds the threshold $\eta$, and $\sigma$ denotes the sigmoid function. Note that this loss acts only on the selected anchor boxes, i.e., those that will become proposal boxes after being shifted.
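A minimal PyTorch sketch of this selective MSE loss; the (tx, ty, tw, th) tensor layout is an assumption.

```python
import torch

def coord_loss(pred: torch.Tensor, target: torch.Tensor,
               ious: torch.Tensor, eta: float = 0.7) -> torch.Tensor:
    """MSE over (sigmoid(tx), sigmoid(ty), tw, th) for anchors with IoU > eta.

    pred, target: (N, 4) tensors ordered as (tx, ty, tw, th).
    ious: (N,) IoU of each anchor with the ground-truth box.
    """
    xy = torch.sigmoid(pred[:, :2])         # center offsets pass through sigmoid
    wh = pred[:, 2:]
    err = torch.cat([(xy - target[:, :2]) ** 2, (wh - target[:, 2:]) ** 2], dim=1)
    selected = (ious > eta).float()         # indicator 1_n
    return (err.sum(dim=1) * selected).sum()
```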
The overall loss function of the proposal generation model is defined as:

$L = L_{conf} + \gamma L_{coord}$

where $\gamma$ is a trade-off parameter, set to 1.
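Putting the two losses together, one training step might look like the following sketch (PyTorch), reusing kld_conf_loss and coord_loss from the sketches above; the model interface is hypothetical, and the RMSProp setting follows the training details given later.

```python
import torch

def train_step(model, optimizer, image, phrase_feat, targets, ious,
               gamma: float = 1.0, eta: float = 0.7) -> float:
    """Forward pass, combined loss L = L_conf + gamma * L_coord, backprop, update."""
    pred = model(image, phrase_feat)        # (N, 5): tx, ty, tw, th, conf logit
    conf = torch.sigmoid(pred[:, 4])
    loss = kld_conf_loss(conf, ious, eta) + gamma * coord_loss(pred[:, :4], targets, ious, eta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Matching the paper's stated optimizer setting:
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
```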
In the testing stage, the number of proposals is set to $K$, and the shifted anchor boxes are output as proposals in descending order of confidence. To avoid outliers, the invention eliminates boxes whose height or width is extremely small. Note that although the method of the present invention resembles recent single-stage REC methods that also employ deep regression networks (X. Chen, L. Ma, J. Chen, Z. Jie, W. Liu, and J. Luo, "Real-time referring expression comprehension by single-stage grounding network," arXiv preprint arXiv:1812.03426, 2018; A. Sadhu, K. Chen, and R. Nevatia, "Zero-shot grounding of objects from natural language queries," in ICCV, 2019), its motivation is different: a single-stage phrase localization method directly outputs the single region with the highest confidence as the final result, whereas the model of the invention outputs multiple regions as proposals.
The proposal generation model of the invention is an end-to-end network: it takes an image and the corresponding referring phrase as input and outputs a series of image regions as proposals. For images, the invention first resizes the original image to 256x256, and then extracts visual features using a Darknet53 (J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018) feature pyramid network pre-trained on the COCO object detection dataset. The extracted visual features have three spatial resolutions, 8x8xD1, 16x16xD2, and 32x32xD3, where D1 = 1024, D2 = 512, and D3 = 256 are the channel counts of the features at the respective resolutions. Finally, they are mapped to a uniform dimension D = 512 by a 1x1 convolutional layer followed by batch normalization and a ReLU function. For the referring phrase, the invention embeds it into a 768-dimensional vector with the uncased version of BERT (J. Devlin, M. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018), and then obtains a 512-dimensional text feature through two fully connected layers of 512 neurons. Since spatial position information can be better encoded in the proposal representation and ranking module, the invention does not repeat that operation in the proposal generation model.
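A minimal sketch of this text branch (PyTorch with the Hugging Face transformers library). Pooling the BERT output at the [CLS] token and the ReLU between the two fully connected layers are assumptions not specified in the text.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class PhraseEncoder(nn.Module):
    """BERT embedding (768-d) followed by two 512-neuron fully connected layers."""

    def __init__(self):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.fc = nn.Sequential(
            nn.Linear(768, 512), nn.ReLU(inplace=True), nn.Linear(512, 512))

    def forward(self, phrase: str) -> torch.Tensor:
        tokens = self.tokenizer(phrase, return_tensors="pt")
        out = self.bert(**tokens).last_hidden_state[:, 0]  # [CLS] token, (1, 768)
        return self.fc(out)                                # (1, 512) text feature
```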
To implement a complete phrase localization model, the invention follows MAttNet (L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. Berg, "Mattnet: Modular attention network for referring expression comprehension," in CVPR, 2018) and uses a modular network for proposal representation and ranking. As shown in the second stage of FIG. 2, MAttNet uses three sub-modules for appearance, location, and relationship. The appearance sub-module handles attributes such as category and color, the location sub-module handles absolute and relative positions, and the relationship sub-module handles subject-object relationships. Each module has a different structure, with its own learnable parameters kept in its module space so the modules do not affect each other. The module does not use an external language parser; instead, it automatically learns to parse phrases through a soft attention mechanism. The model measures the matching degree between an object and the referring phrase by combining the matching scores of the three sub-modules.
When resizing the input image, the invention preserves the aspect ratio of the original image, resizes its long side to 256 pixels, and then pads along the short side with the mean of the image pixels. The model is trained with the RMSProp optimization method, with an initial learning rate of 10^-4 and a polynomial learning-rate decay strategy with power 1. Since Darknet has been pre-trained, the learning rate of the Darknet part of the model is reduced by 90%. The IoU threshold $\eta$ is set to 0.7 and the trade-off parameter $\gamma$ of the loss function to 1. K = 7 is selected as the default number of generated proposals; the relevant ablation studies can be found in the quantitative results analysis below. The batch size of all experiments is 32, and training is done on a single 1080Ti GPU. The training setup of the proposal representation and ranking module is the same as that of MAttNet (L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. Berg, "Mattnet: Modular attention network for referring expression comprehension," in CVPR, 2018).
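A minimal sketch of this aspect-preserving resize-and-pad preprocessing (Pillow and NumPy); pasting the resized image at the top-left corner is an assumption, since the text does not specify the placement.

```python
import numpy as np
from PIL import Image

def resize_and_pad(img: Image.Image, size: int = 256) -> np.ndarray:
    """Resize the long side to `size` and pad the short side with the mean pixel value."""
    img = img.convert("RGB")
    w, h = img.size
    scale = size / max(w, h)
    resized = img.resize((round(w * scale), round(h * scale)))
    arr = np.asarray(resized, dtype=np.float32)
    canvas = np.full((size, size, 3), arr.mean(axis=(0, 1)), dtype=np.float32)
    canvas[:arr.shape[0], :arr.shape[1]] = arr   # paste at the top-left corner
    return canvas
```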
In the present invention, for a given referring phrase, a predicted box is considered correct if the IoU between the predicted box and the ground-truth box is not less than 0.5.
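For reference, a small IoU helper implementing this criterion, assuming boxes in (x, y, w, h) format:

```python
def iou(a, b) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# A prediction is counted as correct when iou(pred_box, gt_box) >= 0.5.
```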
The simulation experiments of the invention use four classic phrase localization datasets: ReferIt (S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, "Referitgame: Referring to objects in photographs of natural scenes," in EMNLP, 2014), RefCOCO (L. Yu, P. Poirson, S. Yang, A. Berg, and T. Berg, "Modeling context in referring expressions," in ECCV, 2016), RefCOCO+ (same reference), and RefCOCOg (J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy, "Generation and comprehension of unambiguous object descriptions," in CVPR, 2016). ReferIt contains 20,000 images from SAIAPR-12 [27]; the invention adopts a cleaned split with training, validation, and test sets of 9,000, 1,000, and 10,000 images, respectively (K. Chen, R. Kovvuri, and R. Nevatia, "Query-guided regression network with context policy for phrase grounding," in ICCV, 2017). RefCOCO annotates 50,000 target entities in 19,994 images, and RefCOCO+ annotates 49,856 target entities in 19,992 images; both datasets are divided into four splits: train, val, testA, and testB. RefCOCOg contains 49,822 target entities from 25,799 images, divided into three splits: train, val, and test.
Table 1 Performance comparison (acc@0.5%) of different phrase localization methods on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Methods marked with (*) are single-stage methods. [Table values are provided as an image in the original and are not reproduced.]
Table 2 Performance comparison (acc@0.5%) of different phrase localization methods on the ReferIt dataset. Methods marked with (*) are single-stage methods. [Table image not reproduced.]
Table 3 Performance comparison of proposal generators (acc@0.5%). [Table image not reproduced.]
Table 4 Ablation study on the number of proposals K (acc@0.5%). [Table image not reproduced.]
Referring phrase localization results: the method of the invention is compared with other SOTA methods. Tables 1 and 2 show the phrase localization results on the COCO-series datasets and the ReferIt dataset, respectively. The results show that the model of the invention outperforms both the existing two-stage SOTA methods and the single-stage SOTA methods. In particular, on the ReferIt dataset, which is not a COCO-series dataset, the method far exceeds the existing two-stage methods. The tables also show the performance of models trained under different loss functions: training with the KLD loss improves performance by 0.6 to 1.8 points over the conventional cross-entropy loss.
Performance comparison of proposal generators: to further study the performance of the proposal generator in the invention, the proposal representation and ranking method of MAttNet (L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. Berg, "Mattnet: Modular attention network for referring expression comprehension," in CVPR, 2018) is kept fixed, and only the proposal generator is changed in the comparative experiments. The invention compares three mainstream proposal generators, namely EdgeBoxes (C. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in ECCV, 2014), Selective Search (J. Uijlings, K. Van De Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," IJCV, 2013), and the FRCN object detector (S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," TPAMI, 2017); the final results are shown in Table 3. The method proposed by the invention outperforms the other methods on all datasets. FRCN trained on COCO also performs well on the COCO-series datasets (RefCOCO, RefCOCO+): for example, FRCN performs as well as the proposal generation model of the invention on val and testA of RefCOCO+, although the model of the invention is significantly superior to FRCN on testB of RefCOCO and RefCOCO+. On the ReferIt dataset, however, the performance of FRCN drops significantly. This is because RefCOCO/RefCOCO+ are subsets of COCO and share its images and target entities, so a detector trained on COCO can generate almost perfect region proposals on COCO-series datasets; when applied to other datasets, such as ReferIt, its performance degrades markedly. The proposal generator of the invention, in contrast, works well on all datasets.
Ablation experiments: the invention performs ablation studies on the ReferIt dataset over different numbers of generated proposals K, with results shown in Table 4. It is observed that increasing the number of generated proposals beyond a certain threshold no longer improves the accuracy of the model (e.g., K >= 7 on ReferIt); therefore, K = 7 is chosen as the default value in the experiments. The analysis is that, since the method generates proposals in descending order of anchor-box confidence, the hit rate on the ground-truth box is already quite high at K = 7; further increasing K yields a hit-rate gain too small to offset the negative effect of proposal redundancy.

Claims (10)

1. A method for locating a referring phrase, comprising the following steps:
S1, extracting visual features $V_k$ of the input image at different scales, and normalizing the visual features $V_k$;
S2, obtaining the text information $Q$, and modulating the normalized visual features with the text information $Q$ to obtain the text-modulated visual features $\hat{V}_k$;
S3, placing N anchor boxes of different shapes at each position of the text-modulated visual features $\hat{V}_k$;
S4, for each anchor box, obtaining a 5-dimensional prediction vector corresponding to the anchor box through a regression model, wherein the first 4 values of the prediction vector represent at least the predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point, and the last value, obtained through a sigmoid function, represents the confidence of the offset box corresponding to the anchor box;
S5, computing the forward-propagation error of the regression model with an overall loss function $L$ composed of a loss function $L_{coord}$ controlling the predicted offsets and a loss function $L_{conf}$ controlling the confidence, computing the gradient of the forward-propagation error, updating the regression model with the gradient for model training, and finally obtaining a trained proposal generation model.
2. The phrase localization method as recited in claim 1, further comprising:
S6, obtaining the predicted offsets and confidences of all anchor boxes with the trained proposal generation model, then outputting K anchor boxes in descending order of their predicted confidence, shifting each output anchor box according to its predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point, and finally obtaining the proposals for phrase localization.
3. The method for locating a phrase according to claim 1, wherein in step S1, the normalization of the visual features $V_k$ is implemented as follows: a convolution operation is applied to the visual features $V_k$, the result of the convolution is normalized, and the normalized result is passed through a ReLU function to obtain the normalized visual features.
4. The method for locating a phrase according to claim 1, wherein in step S2, the text-modulated visual features $\hat{V}_k$ are given by:

$\hat{V}_k(i,j) = f_2\big(p \odot f_1(V_k(i,j)) + q\big)$

wherein $\odot$ is the Hadamard product, $(i,j)$ are the spatial coordinates of the normalized visual features, $\hat{V}_k(i,j)$ and $V_k(i,j)$ respectively denote the feature components of the text-modulated visual features and the original visual features at spatial location $(i,j)$, $f_1(\cdot)$ and $f_2(\cdot)$ are learnable mapping layers, $f_1(\cdot)$ denoting instance normalization followed by a convolution and $f_2(\cdot)$ denoting normalization, convolution, and a ReLU activation applied in sequence, and $p = \tanh(W_p Q + b_p)$, $q = \tanh(W_q Q + b_q)$, where $W_p$, $b_p$, $W_q$, $b_q$ are all learnable parameters.
5. The method of claim 4, wherein $f_1(\cdot)$ comprises an instance normalization layer and a 1x1 convolutional layer connected in sequence, and $f_2(\cdot)$ comprises a batch normalization layer, a 3x3 convolutional layer, and a ReLU activation function connected in sequence.
6. The method for locating a phrase according to claim 1, wherein in step S5, the loss function controlling the predicted offsets is:

$L_{coord} = \sum_{n=1}^{N} \mathbb{1}_n \big[ (\sigma((t_x)_n) - \hat{t}_{x,n})^2 + (\sigma((t_y)_n) - \hat{t}_{y,n})^2 + ((t_w)_n - \hat{t}_{w,n})^2 + ((t_h)_n - \hat{t}_{h,n})^2 \big]$

wherein $\mathbb{1}_n$ is 1 when the intersection-over-union IoU between the $n$-th anchor box and the ground-truth box in the dataset exceeds the threshold $\eta$, and 0 otherwise; $\sigma$ denotes the sigmoid function; $(t_w)_n$, $(t_h)_n$, $(t_x)_n$, $(t_y)_n$ are the predicted offsets of the width, the height, and the abscissa and ordinate of the center point of the $n$-th anchor box; and $\hat{t}_{w,n}$, $\hat{t}_{h,n}$, $\hat{t}_{x,n}$, $\hat{t}_{y,n}$ are the corresponding offset labels of the width, the height, and the abscissa and ordinate of the center point of the $n$-th anchor box, $n = 1, 2, \ldots, N$.
7. The method for locating a phrase according to claim 1, wherein in step S5, the loss function controlling the confidence is:

$L_{conf} = \sum_{i=1}^{N} s_i^{*} \log \frac{s_i^{*}}{s_i}$

wherein $N$ is the number of anchor boxes, $s_i$ is the predicted confidence, and $s_i^{*}$ is the confidence label.
8. The method as claimed in any one of claims 1 to 7, wherein in step S5, $L = \gamma L_{coord} + L_{conf}$, where $\gamma$ is a trade-off parameter.
9. A phrase localization system comprising a computer device; the computer device is configured or programmed to carry out the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium characterized by storing a program; the program is configured for carrying out the steps of the method according to one of claims 1 to 8.
CN202110138354.2A 2021-02-01 2021-02-01 Phrase positioning method, system and storage medium Active CN112966519B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110138354.2A CN112966519B (en) 2021-02-01 2021-02-01 Phrase positioning method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110138354.2A CN112966519B (en) 2021-02-01 2021-02-01 Phrase positioning method, system and storage medium

Publications (2)

Publication Number Publication Date
CN112966519A true CN112966519A (en) 2021-06-15
CN112966519B CN112966519B (en) 2023-10-10

Family

ID=76272908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110138354.2A Active CN112966519B (en) 2021-02-01 2021-02-01 Phrase positioning method, system and storage medium

Country Status (1)

Country Link
CN (1) CN112966519B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190043479A1 (en) * 2018-05-07 2019-02-07 Intel Corporation Wake on voice key phrase segmentation
CN111898566A (en) * 2020-08-04 2020-11-06 成都井之丽科技有限公司 Attitude estimation method, attitude estimation device, electronic equipment and storage medium
CN111931915A (en) * 2020-08-06 2020-11-13 中国科学院重庆绿色智能技术研究院 Method for training network based on DIOU loss function

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李志欣, 魏海洋, 黄飞成, 张灿龙, 马慧芳, 史忠植: "Image caption generation combining visual features and scene semantics" (结合视觉特征和场景语义的图像描述生成), 计算机学报 (Chinese Journal of Computers), no. 09

Also Published As

Publication number Publication date
CN112966519B (en) 2023-10-10

Similar Documents

Publication Publication Date Title
CN110021051B (en) Human image generation method based on generation of confrontation network through text guidance
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN110651276A (en) Tagging and online incremental real-time learning of data streams for deep neural networks and neural network applications
CN108416065A (en) Image based on level neural network-sentence description generates system and method
CN109670576B (en) Multi-scale visual attention image description method
CN105718952A (en) Method for focus classification of sectional medical images by employing deep learning network
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
Dandıl et al. Real-time facial emotion classification using deep learning
CN113343705A (en) Text semantic based detail preservation image generation method and system
He et al. Open-vocabulary multi-label classification via multi-modal knowledge transfer
CN110347857A (en) The semanteme marking method of remote sensing image based on intensified learning
CN114998220A (en) Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment
Gomez et al. Selective style transfer for text
Strudel et al. Weakly-supervised segmentation of referring expressions
CN113283336A (en) Text recognition method and system
Wang et al. Head: Hetero-assists distillation for heterogeneous object detectors
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
Choi CNN output optimization for more balanced classification
Lu et al. Prediction calibration for generalized few-shot semantic segmentation
Zhu et al. Unpaired image captioning by image-level weakly-supervised visual concept recognition
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Chen et al. Code generation from a graphical user interface via attention-based encoder–decoder model
Zhu et al. Dual-decoder transformer network for answer grounding in visual question answering
CN108108652A (en) A kind of across visual angle Human bodys' response method and device based on dictionary learning
CN112966519B (en) Phrase positioning method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant