CN112966519A - Method, system and storage medium for positioning reference phrase - Google Patents
- Publication number: CN112966519A
- Application number: CN202110138354.2A
- Authority
- CN
- China
- Prior art keywords
- phrase
- offset
- visual
- anchor
- prediction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
- G06N3/08—Learning methods
Abstract
The invention discloses a method, a system and a storage medium for locating a referring phrase. Visual features modulated by phrase information are used to generate proposals through regression, and two loss functions are designed to control the offsets and the confidence of the anchor boxes respectively, so that the generated proposals are both highly accurate and highly discriminative, relieving the performance bottleneck caused by low-quality proposals in the prior art.
Description
Technical Field
The invention relates to the technical field of human-computer interaction, and in particular to a method and a system for referring phrase localization based on phrase-guided proposal generation.
Background
Referring phrase localization, also known as referring expression comprehension (REC), is one of the core tasks of human-computer interaction. The task can be broadly described as locating the entity referred to by a given phrase (e.g., "a girl wearing a pink shirt, holding a red umbrella") with a bounding box. It is very challenging because it requires not only understanding the fine-grained semantic information of both the image and the natural-language phrase, but also aligning and associating the two for accurate localization.
The basic pipeline of phrase localization methods is as follows: given an input image, a proposal generator produces a certain number of region proposals, and the visual features of each proposal are then extracted. Meanwhile, a language model (such as BERT) encodes the corresponding referring phrase into a language feature. The visual and language features are then fed into a multimodal fusion module to produce fused features. Finally, a proposal ranking module ranks the fused features of all proposals, and the proposal region with the highest ranking score is output as the result.
Proposal generation is a crucial step in the phrase localization pipeline: if this step goes wrong, every subsequent processing stage fails. Conventional proposal generation methods fall into training-free methods (e.g., Selective Search, EdgeBox) and class-aware object detectors (e.g., Faster R-CNN trained on the 80-class COCO dataset). Training-free methods typically use low-level properties of the image itself, such as edge information, to generate region proposals. However, the region proposals generated this way lack discriminability, so it is difficult to hit the true region within a limited number of proposals. A class-aware object detector can accurately generate highly discriminative proposals only when the referred object belongs to a preset category; however, the vocabulary of referring phrases is usually open, and if the referred object is not in a preset category (e.g., "window", which is not among COCO's preset categories), the correct region proposal cannot be generated. Furthermore, both kinds of method share a fatal flaw: they generate region proposals from the image information alone, without considering the referring phrase, which can produce a large number of redundant proposals unrelated to it.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a method and a system for locating a referring phrase, so as to improve the accuracy and the discriminability of the generated proposals.
In order to solve the above technical problem, the technical scheme adopted by the invention is as follows: a method for locating a referring phrase based on phrase-guided proposal generation, comprising the following steps:
S1, extracting visual features V_k of different scales from the input image, and normalizing the visual features V_k;
S2, obtaining the text feature Q, and modulating the normalized visual features with Q to obtain the text-modified visual features Ṽ_k;
S3, placing N anchor boxes of different shapes at each spatial position of the text-modified visual features Ṽ_k;
S4, for each anchor box, obtaining an M-dimensional prediction vector through a regression model, where the first M−1 values represent at least the predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point, and the last value, obtained through a sigmoid function, represents the confidence of the shifted box corresponding to the anchor box;
S5, computing the forward-propagation error of the regression model with an overall loss function composed of a loss function controlling the predicted offsets and a loss function controlling the confidence, computing the gradient of this error, updating the regression model with the gradient for training, and finally obtaining a trained proposal generation model.
The method uses visual features modulated by text features to predict region proposals through regression. The whole process is trained only on a referring phrase localization dataset, so proposal generation is not limited by preset categories; and through the two loss functions that respectively control the anchor-box offsets and the confidence, the proposals generated by the invention have both higher accuracy and higher discriminability.
The method of the present invention further comprises:
S6, obtaining the predicted offsets and confidences of all anchor boxes with the trained proposal generation model, then outputting K anchor boxes in order of predicted confidence from high to low, and shifting each output anchor box according to its predicted offsets of width, height, and center-point horizontal and vertical coordinates, finally obtaining the proposals for phrase localization. Outputting proposals by confidence guarantees their accuracy, while shifting the anchor boxes greatly improves their discriminability.
In step S1, the specific process of normalizing the visual features V_k is as follows: perform a convolution operation on the visual features V_k, normalize the result of the convolution operation, and feed the normalized result into a ReLU function to obtain the normalized visual features.
In step S2, the text-modified visual features are given by:
Ṽ_k(i,j) = f_2(p ⊙ f_1(V_k(i,j)) + q),
where ⊙ denotes the Hadamard product, (i,j) are the spatial coordinates of the normalized visual features, Ṽ_k(i,j) and V_k(i,j) are the feature components of the text-modified and the original visual features at spatial location (i,j), f_1(·) and f_2(·) are learnable mapping layers — f_1(·) performs instance normalization and convolution, while f_2(·) performs normalization, convolution, and ReLU activation in sequence — and p = tanh(W_p·Q + b_p), q = tanh(W_q·Q + b_q), where W_p, b_p, W_q, b_q are all learnable parameters. The invention thus fuses visual features and text information; the learnable mapping layers and parameters give the fused multimodal features stronger expressive power than direct feature concatenation.
The expression of the loss function controlling the predicted offsets is:
L_coord = Σ_{n=1}^{N} 1_n · Σ_{t∈{w,h,x,y}} (σ(t̂_n) − t*_n)²,
where 1_n equals 1 if the intersection-over-union IOU (i.e., the ratio of the area of the intersection of two boxes to that of their union) between the n-th anchor box and the ground-truth box in the dataset exceeds the threshold η, and 0 otherwise; t̂_n are the raw predicted offsets, t*_n the corresponding label values, and σ denotes the sigmoid function. The loss function L_coord drives the model's predicted offsets of each anchor box's width, height, and center-point coordinates as close as possible to the label values.
In step S5, the expression of the loss function controlling the confidence is:
L_conf = KL(S* ‖ S) = Σ_{i=1}^{N} s*_i · log(s*_i / s_i),
where N is the number of anchor boxes, s_i is the predicted confidence, and s*_i is the confidence label. The loss function L_conf drives the model's predicted anchor-box confidences as close as possible to the label values.
The invention also provides a referring phrase localization system based on phrase-guided proposal generation, comprising a computer device; the computer device is configured or programmed to perform the steps of the above method.
The present invention also provides a computer-readable storage medium storing a program; the program is configured for performing the steps of the above-described method.
Compared with the prior art, the invention has the beneficial effects that:
the method of the present invention solves the problem of the prior art of ignoring phrase information during the proposal generation process. The invention utilizes the visual characteristics modified by phrase information, generates the proposal through regression, and designs two loss functions to respectively control the offset and the confidence coefficient of the anchor frame, so that the generated proposal has the advantages of high accuracy and high identifiability, and the performance bottleneck caused by low-quality proposal in the prior art is relieved.
Drawings
FIG. 1 is a visual comparison between the proposals produced by the method of the invention and those of two existing proposal generation methods, where solid boxes are annotated regions and dashed boxes are proposals; (a) result of the Selective Search method; (b) result of the Faster R-CNN object detector; (c) result of the method of the invention;
FIG. 2 is a flowchart illustrating the steps of a phrase location offer generation process according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention is realized by the following specific steps:
First, image visual features V_k (k = 1, 2, 3) at three different resolutions are extracted using a Darknet53 feature pyramid network pre-trained on the COCO object detection dataset (Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.), while a language model extracts the text feature Q of the referring phrase. In order to introduce the phrase's text information into the proposal generation process, the invention modulates all visual features with the text information to obtain the text-modified visual features Ṽ_k. The model is then trained with two loss functions that control the anchor-box offsets and confidence respectively. Finally, the model performs regression prediction of the position coordinates of the proposal regions.
In order to obtain visual features containing phrase information, the invention modulates the visual features at the three different resolutions with the text feature, using the conditional normalization method employed in image-level tasks (H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. Courville, "Modulating early visual processing by language," in NIPS, 2017.). First, the text feature Q is mapped to a scaling vector p and a translation vector q by two multi-layer perceptrons:
p = tanh(W_p·Q + b_p), q = tanh(W_q·Q + b_q);
where W_p, b_p, W_q, b_q are learnable parameters. Then the text-modified visual features are obtained through formula (1):
Ṽ_k(i,j) = f_2(p ⊙ f_1(V_k(i,j)) + q),   (1)
where ⊙ denotes the Hadamard product, (i,j) are the spatial coordinates of the visual features, and f_1 and f_2 are the learnable mapping layers shown in FIG. 2: f_1 consists of an instance normalization layer followed by a 1×1 convolution layer, and f_2 consists of a batch normalization layer followed by a 3×3 convolution layer and the ReLU activation function. The same operation is performed at every coordinate position (i,j) of the visual features.
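As an illustrative sketch (not the invention's exact implementation), the per-location modulation of formula (1) can be reproduced numerically; here f_1 is reduced to plain instance normalization and f_2 to a bare ReLU, omitting the convolution layers, and the vectors p and q are hypothetical toy values standing in for the text-derived scaling and translation vectors:

```python
import math

def instance_norm(v, eps=1e-5):
    # Stand-in for f1: instance normalization over one feature vector
    # (the 1x1 convolution of the real f1 is omitted in this sketch).
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def modulate(v, p, q):
    # Formula (1) at one spatial location (i, j):
    # V~(i,j) = f2(p Hadamard f1(V(i,j)) + q), with f2 reduced to ReLU.
    fused = [pi * xi + qi for pi, xi, qi in zip(p, instance_norm(v), q)]
    return [max(0.0, x) for x in fused]

# Hypothetical 4-dimensional feature and text-derived vectors p, q.
v = [1.0, 2.0, 3.0, 4.0]
p = [math.tanh(0.5)] * 4   # scaling vector from the text feature
q = [math.tanh(-0.2)] * 4  # translation vector from the text feature
out = modulate(v, p, q)
```

With p all ones and q all zeros the modulation reduces to ReLU over the normalized feature, which makes the role of the text-conditioned scaling and shifting easy to see.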
As shown in stage 1 of FIG. 2, the module takes the text-modified visual features Ṽ_k as input and outputs the position coordinates of several image regions as proposals. There are in total 8×8 + 16×16 + 32×32 = 1344 different positions, each corresponding to a 512-dimensional vector of the text-modified visual features. Following YOLOv3 (J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.), three anchor boxes (anchors) are placed at each position; their sizes are obtained by K-means clustering over the widths and heights of the ground-truth boxes in the training set (which may be any phrase localization dataset), using (1 − IOU) as the distance. The invention therefore has N = 3 × 1344 = 4032 anchor boxes in total. For each anchor box, 4 values are predicted by regression, namely the offsets of the width, the height, and the horizontal and vertical coordinates of the anchor box's center point, and a fifth value, representing the confidence of the corresponding shifted box, is obtained through a sigmoid function.
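The anchor-size selection described above can be sketched as K-means over ground-truth (width, height) pairs with d = 1 − IOU as the distance, where the IOU is computed with both boxes anchored at a common corner so only width and height matter (function names and the toy data below are illustrative, not from the patent):

```python
import random

def wh_iou(a, b):
    # IOU of two boxes aligned at a common corner, so only (w, h) matter.
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=50, seed=0):
    # Cluster (w, h) pairs using d = 1 - IOU; assigning each box to the
    # center of maximal IOU is the same as choosing the minimal distance.
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda c: wh_iou(b, centers[c]))
            clusters[best].append(b)
        centers = [
            (sum(b[0] for b in cl) / len(cl), sum(b[1] for b in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers

# Toy ground-truth box sizes with two obvious size groups.
sizes = [(10, 10), (11, 9), (9, 11), (100, 100), (95, 105), (105, 95)]
anchors = kmeans_anchors(sizes, k=2)
```

On the toy data the two recovered centers separate the small and large size groups, which is exactly the behavior wanted from shape-based anchor priors.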
For the prediction of the confidence value, the key is how to design the loss function so that the predicted confidence S agrees with the confidence label S*. Most existing methods (Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo, "A fast and accurate one-stage approach to visual grounding," in ICCV, 2019.; X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li, "Improving referring expression grounding with cross-level multi-modal fusion," in CVPR, 2019.) define the confidence label as a one-hot vector with a single element equal to 1: the label is set to 1 for the anchor box whose IOU with the ground-truth box is the largest, and to 0 otherwise. The model is then trained with this one-hot label using a cross-entropy loss function.
The invention improves on the above scheme. Specifically, the invention sets a threshold η and computes the IOU score between each anchor box and the ground-truth box; if the score exceeds the threshold η, the IOU score is set as that anchor's confidence label, and otherwise the label is set to 0. The invention thus obtains a smooth confidence label S*, which is then L1-normalized so that Σ_i s*_i = 1. The invention uses the Kullback-Leibler divergence (KLD) as its loss function, driving the predicted confidence scores S to approximate the smooth confidence labels S*, as shown in formula (3):
L_conf = KL(S* ‖ S) = Σ_{i=1}^{N} s*_i · log(s*_i / s_i).   (3)
Note that S is also L1-normalized so that it forms the same kind of probability distribution as S*.
The benefits of the smooth label are as follows: 1) besides the anchor box with the largest IOU score with the ground-truth box, other anchor boxes can also provide useful context information; 2) the smooth labels in effect regularize the model, mitigating overfitting.
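The smooth-label construction and the KLD loss of formula (3) can be sketched as follows (a minimal stand-alone illustration; the variable names are not from the patent):

```python
import math

def smooth_labels(ious, eta=0.7):
    # IOU scores above the threshold eta become soft confidence labels;
    # L1 normalization then makes the labels sum to 1.
    raw = [iou if iou > eta else 0.0 for iou in ious]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw

def kld_loss(s_star, s, eps=1e-12):
    # L_conf = KL(S* || S) = sum_i s*_i * log(s*_i / s_i)
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(s_star, s))

# Toy IOU scores for four anchors: three exceed eta, one does not.
labels = smooth_labels([0.9, 0.8, 0.3, 0.75])
```

Unlike a one-hot label, every anchor overlapping the ground truth above the threshold receives a non-zero target, and the KLD vanishes exactly when the (normalized) predictions match the labels.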
For the prediction of the anchor-box offsets, the invention uses the following MSE loss function:
L_coord = Σ_{n=1}^{N} 1_n · Σ_{t∈{w,h,x,y}} (σ(t̂_n) − t*_n)²,
where 1_n indicates whether the IOU between the n-th anchor box and the ground-truth box exceeds the threshold η (1 if it does, 0 otherwise), t̂_n are the raw predicted offsets, t*_n the corresponding label values, and σ denotes the sigmoid function. Note that this loss only acts on the selected anchor boxes, which become proposal boxes after being shifted.
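In the same illustrative spirit, the offset loss can be written out directly (the exact offset parameterization is only specified in the patent up to the sigmoid squashing, so this is a sketch under that assumption):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def coord_loss(raw_preds, targets, ious, eta=0.7):
    # MSE over the 4 offsets (w, h, cx, cy), counted only for anchors
    # whose IOU with the ground-truth box exceeds eta (the indicator 1_n).
    loss = 0.0
    for raw, tgt, iou in zip(raw_preds, targets, ious):
        if iou > eta:
            loss += sum((sigmoid(r) - t) ** 2 for r, t in zip(raw, tgt))
    return loss
```

An anchor below the IOU threshold contributes nothing, so gradient signal flows only through the anchors that will actually be shifted into proposals.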
The overall loss function of the proposal generation model is defined as:
L = L_conf + γ·L_coord,
where γ = 1 is a trade-off parameter.
In the testing stage, the number of proposals is set to K, and the shifted anchor boxes are output as proposals in order of anchor-box confidence from high to low. To avoid outliers, the invention eliminates boxes whose height or width is extremely small. It is worth noting that although the method of the invention resembles recent single-stage REC methods (X. Chen, L. Ma, J. Chen, Z. Jie, W. Liu, and J. Luo, "Real-time referring expression comprehension by single-stage grounding network," arXiv preprint arXiv:1812.03426, 2018.; A. Sadhu, K. Chen, and R. Nevatia, "Zero-shot grounding of objects from natural language queries," in ICCV, 2019.), which also employ deep regression networks, the motivation of the invention is different: a single-stage phrase localization method directly outputs the single region with the highest confidence as the final result, whereas the model of the invention outputs multiple regions as proposals.
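The test-stage selection can be sketched as follows; the additive offset decoding here is an assumption made for illustration (the patent only specifies that anchors are shifted by the predicted width/height/center offsets and ranked by confidence):

```python
def top_k_proposals(anchors, preds, k=7, min_size=2.0):
    # anchors: (cx, cy, w, h); preds: (dx, dy, dw, dh, conf).
    # Shift each anchor by its predicted offsets, drop degenerate boxes,
    # and output the K shifted boxes with the highest confidence.
    shifted = []
    for (cx, cy, w, h), (dx, dy, dw, dh, conf) in zip(anchors, preds):
        bw, bh = w + dw, h + dh
        if bw < min_size or bh < min_size:
            continue  # eliminate boxes with extremely small width/height
        shifted.append((conf, (cx + dx, cy + dy, bw, bh)))
    shifted.sort(key=lambda t: t[0], reverse=True)
    return [box for _, box in shifted[:k]]

# Three toy anchors; the middle prediction collapses its box's width,
# so it is filtered out despite having the highest confidence.
anchors = [(10, 10, 8, 8), (20, 20, 8, 8), (30, 30, 8, 8)]
preds = [(1, 1, 0, 0, 0.9), (0, 0, -7, 0, 0.99), (0, 0, 0, 0, 0.5)]
props = top_k_proposals(anchors, preds, k=2)
```

The size filter runs before the confidence ranking, matching the order described above: outlier boxes never compete for a top-K slot.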
In the invention, the proposal generation model is an end-to-end network: an image and the corresponding referring phrase are input, and a series of image regions are output as proposals. For the image, the invention first resizes the original image to 256×256 and then extracts visual features using a Darknet53 (J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.) feature pyramid network pre-trained on the COCO object detection dataset. The extracted visual features have 3 spatial resolutions, 8×8×D1, 16×16×D2, and 32×32×D3, where D1 = 1024, D2 = 512, and D3 = 256 are the numbers of channels of the features at each resolution. Finally, they are mapped to the uniform dimension D = 512 by a 1×1 convolution layer with batch normalization and a ReLU function. For the referring phrase, the invention embeds it into a 768-dimensional vector with the uncased version of BERT (J. Devlin, M. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.), and then obtains a 512-dimensional text feature through two fully connected layers of 512 neurons each. Since spatial position information may be better encoded in the proposal representation and ranking module, the invention does not repeat that operation in the proposal generation model.
To implement a complete phrase localization model, the invention uses a modular network for proposal representation and ranking, following MAttNet (L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. Berg, "Mattnet: Modular attention network for referring expression comprehension," in CVPR, 2018.). As shown in the second stage of FIG. 2, MAttNet uses three sub-modules concerning appearance, location, and relationship. The appearance sub-module handles attributes such as category and color, the location sub-module handles absolute and relative position, and the relationship sub-module handles subject-object relationships. Each sub-module has a different structure and its own learnable parameters within its module space, so the modules do not interfere with each other. Instead of an external language parser, the model automatically learns to parse phrases through a soft attention mechanism, and it measures the matching degree between an object and the referring phrase by computing the matching scores of the three sub-modules.
When resizing the input image, the invention preserves the aspect ratio of the original image: the long side is resized to 256 pixels, and the short side is padded with the mean of the image pixels. The model is trained with the RMSProp optimization method. The initial learning rate is 10^-4, with a polynomial learning-rate decay strategy with power 1. Since Darknet has been pre-trained, the invention reduces the learning rate of the Darknet part of the model by 90%. The IOU threshold η is set to 0.7 and the trade-off parameter γ of the loss function to 1. The invention selects K = 7 as the default number of generated proposals; the relevant ablation study can be found in the quantitative results analysis section. The batch size of all experiments is 32, and training is done on a single 1080Ti GPU. The training setup of the proposal representation and ranking module is the same as that of MAttNet (L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. Berg, "Mattnet: Modular attention network for referring expression comprehension," in CVPR, 2018.).
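The learning-rate schedule just described can be sketched as follows (the function name and per-step granularity are illustrative assumptions; a polynomial decay with power 1 is simply linear decay):

```python
def poly_lr(step, total_steps, base_lr=1e-4, power=1.0, darknet=False):
    # Polynomial decay: lr = base_lr * (1 - step/total_steps) ** power.
    # With power = 1 this is linear decay to zero; the pre-trained
    # Darknet backbone runs at a 90%-reduced rate.
    lr = base_lr * (1.0 - step / total_steps) ** power
    return 0.1 * lr if darknet else lr
```

For example, halfway through training the rate has fallen to half of 10^-4, and the Darknet part always receives one tenth of the current rate.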
In the invention, for a given referring phrase, a predicted box is considered correct if the IOU between the predicted box and the ground-truth box is not less than 0.5.
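This evaluation criterion can be made concrete with a standard IOU computation (a sketch using (x1, y1, x2, y2) corner coordinates; the coordinate convention is an assumption, as the patent does not fix one):

```python
def box_iou(a, b):
    # IOU of two boxes in (x1, y1, x2, y2) form.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_correct(pred, gt, thresh=0.5):
    # A predicted box counts as correct when IOU(pred, gt) >= 0.5.
    return box_iou(pred, gt) >= thresh
```

A perfectly matching box scores IOU 1.0, while two unit-overlap 2×2 boxes score 1/7 and are counted as misses under the 0.5 threshold.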
The simulation experiments of the invention use 4 classic phrase localization datasets: ReferIt (S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, "Referitgame: Referring to objects in photographs of natural scenes," in EMNLP, 2014.), RefCOCO (L. Yu, P. Poirson, S. Yang, A. Berg, and T. Berg, "Modeling context in referring expressions," in ECCV, 2016.), RefCOCO+ (same source), and RefCOCOg (J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy, "Generation and comprehension of unambiguous object descriptions," in CVPR, 2016.). ReferIt contains 20,000 images from SAIAPR-12 [27]; the invention uses a cleaned split with training, validation, and test sets of 9,000, 1,000, and 10,000 images respectively (K. Chen, R. Kovvuri, and R. Nevatia, "Query-guided regression network with context policy for phrase grounding," in ICCV, 2017.). RefCOCO annotates 50,000 target entities in 19,994 images, and RefCOCO+ annotates 49,856 target entities in 19,992 images; both datasets are divided into four parts: train, val, testA, and testB. RefCOCOg contains 49,822 target entities from 25,799 images, divided into three parts: train, val, and test.
Table 1: Performance comparison of different phrase localization methods on the RefCOCO, RefCOCO+, and RefCOCOg datasets (Acc@0.5, %). Methods marked with (×) are single-stage methods.
Table 2: Performance comparison of different phrase localization methods on the ReferIt dataset (Acc@0.5, %). Methods marked with (×) are single-stage methods.
Table 3: Performance comparison of proposal generators (Acc@0.5, %).
Table 4: Ablation study results on the number of generated proposals (Acc@0.5, %).
Referring phrase localization results: the method of the invention is compared with other SOTA methods. Tables 1 and 2 show the referring phrase localization results on the COCO-series datasets and on the ReferIt dataset, respectively. The results show that the model of the invention outperforms both the existing two-stage SOTA methods and the single-stage SOTA methods. In particular, on the ReferIt dataset, which is not a COCO-series dataset, the method of the invention far outperforms the existing two-stage methods. In addition, the tables also show the performance of the model trained under different loss functions: training with the KLD loss improves performance by 0.6 to 1.8 points over the conventional scheme using a cross-entropy loss.
Performance comparison of proposal generators: to further study the performance of the proposal generator in the invention, the invention uses the proposal representation and ranking method of MAttNet (L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. Berg, "Mattnet: Modular attention network for referring expression comprehension," in CVPR, 2018.) and changes only the proposal generator in comparative experiments. The invention compares three mainstream proposal generators, namely EdgeBox (C. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in ECCV, 2014.), Selective Search (J. Uijlings, K. Van De Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," IJCV, 2013.), and the FRCN object detector (S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," TPAMI, 2017.); the final results are shown in Table 3. The method proposed by the invention outperforms the other methods on all datasets. FRCN trained on COCO also performs well on the COCO-series datasets (RefCOCO, RefCOCO+); for example, FRCN performs as well on val and testA of RefCOCO+ as the proposal generation model of the invention, although the model of the invention is clearly superior to FRCN on testB of RefCOCO and RefCOCO+. On the ReferIt dataset, however, the performance of FRCN drops significantly. This is because RefCOCO/RefCOCO+ are subsets of COCO and share its images and target entities, so an object detector trained on COCO can generate almost perfect region proposals on the COCO-series datasets; when applied to other datasets, such as ReferIt, its performance degrades significantly. The proposal generator of the invention, in contrast, performs well on all datasets.
Ablation experiments: the invention performs ablation studies on the ReferIt dataset with different numbers of generated proposals K; the results are shown in Table 4. The invention observes that, beyond a certain threshold, increasing the number of generated proposals no longer improves the accuracy of the model (e.g., K ≥ 7 on ReferIt). Therefore, K = 7 is chosen as the default value in the experiments of the invention. The reason, per the invention's analysis, is that since the method generates proposals in order of anchor-box confidence from high to low, the hit rate on the ground-truth box is already quite high when K = 7; if K is increased further, the gain in hit rate is not enough to offset the negative effect of proposal redundancy.
Claims (10)
1. A method for locating a referring phrase, comprising the following steps:
S1, extracting visual features V_k of different scales from the input image, and normalizing the visual features V_k;
S2, obtaining the text feature Q, and modulating the normalized visual features with Q to obtain the text-modified visual features Ṽ_k;
S3, placing N anchor boxes of different shapes at each spatial position of the text-modified visual features Ṽ_k;
S4, for each anchor box, obtaining a 5-dimensional prediction vector through a regression model, where the first 4 values represent at least the predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point, and the last value, obtained through a sigmoid function, represents the confidence of the shifted box corresponding to the anchor box;
S5, computing the forward-propagation error of the regression model with an overall loss function L composed of a loss function L_coord controlling the predicted offsets and a loss function L_conf controlling the confidence, computing the gradient of this error, updating the regression model with the gradient for training, and finally obtaining a trained proposal generation model.
2. The phrase positioning method as recited in claim 1, further comprising:
S6, obtaining the predicted offsets and confidences of all anchor boxes with the trained proposal generation model, then outputting K anchor boxes in descending order of predicted confidence, shifting each output anchor box according to its width, its height, and the predicted offsets of the abscissa and ordinate of its center point, and finally obtaining the proposals for phrase positioning.
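The shifting step of S6 can be sketched as follows. The claim does not spell out the offset parameterization, so this sketch assumes the common YOLO-style decoding (center shift scaled by anchor size, exponential width/height scaling) purely as an illustration:

```python
import math


def decode_anchor(anchor, offsets):
    """Shift an anchor box (cx, cy, w, h) by predicted offsets.

    Assumed YOLO-style parameterization (not stated in the claim):
    the center moves by the offset times the anchor size, and the
    width/height are rescaled exponentially.
    """
    cx, cy, w, h = anchor
    tw, th, tx, ty = offsets
    return (cx + tx * w,          # shifted center abscissa
            cy + ty * h,          # shifted center ordinate
            w * math.exp(tw),     # rescaled width
            h * math.exp(th))     # rescaled height
```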
3. The phrase positioning method according to claim 1, wherein in step S1, the normalization processing of the visual features V_k is implemented as follows: performing a convolution operation on the visual features V_k, normalizing the result of the convolution operation, and inputting the normalized result into a ReLU function to obtain the normalized visual features.
4. The phrase positioning method according to claim 1, wherein in step S2, the expression of the text-modified visual features is:
wherein ⊙ denotes the Hadamard product; (i, j) are the spatial coordinates of the normalized visual features; the text-modified feature component and V_k(i, j) denote the features of the text-modified visual features and the original visual features at spatial location (i, j), respectively; f_1(·) and f_2(·) are learnable mapping layers, f_1(·) denoting instance normalization followed by a convolution calculation, and f_2(·) denoting normalization, convolution, and ReLU activation processing performed in sequence; p = tanh(W_p·Q + b_p), q = tanh(W_q·Q + b_q); W_p, b_p, W_q and b_q are all learnable parameters.
5. The method of claim 4, wherein f_1(·) comprises an instance normalization layer and a 1×1 convolution layer connected in sequence, and f_2(·) comprises a batch normalization layer, a 3×3 convolution layer and a ReLU activation function connected in sequence.
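The core of the text modification in claims 4 and 5 can be sketched as below. The vector p scales the visual feature element-wise (Hadamard product) and q shifts it, a FiLM-style modulation; how exactly f_1/f_2 wrap this step is not fully specified by the claim text, so this shows only the assumed central operation:

```python
import math


def tanh_affine(W, Q, b):
    """Compute tanh(W·Q + b) row by row, as for p and q in claim 4."""
    return [math.tanh(sum(wij * qj for wij, qj in zip(row, Q)) + bi)
            for row, bi in zip(W, b)]


def modulate(feature, p, q):
    """Text modulation at one spatial location: Hadamard product with
    p, then an additive shift by q (a sketch of the assumed operation,
    not the patent's exact expression)."""
    return [pi * fi + qi for pi, fi, qi in zip(p, feature, q)]
```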
6. The phrase positioning method according to claim 1, wherein in step S5, the expression of the loss function controlling the prediction offsets is:
wherein the IoU term denotes the intersection-over-union (IoU) between the n-th anchor box and the label box in the dataset, and the indicator takes the value 1 when this IoU exceeds the threshold η and 0 otherwise; σ denotes the sigmoid function; (t_w)_n, (t_h)_n, (t_x)_n and (t_y)_n are the predicted offsets of the width, the height, and the abscissa and ordinate of the center point of the n-th anchor box, and the corresponding label terms are the offset labels of the width, the height, and the abscissa and ordinate of the center point of the n-th anchor box, where n = 1, 2, …, N.
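The IoU-gated structure of this loss can be sketched as follows. The claim's equation image is not reproduced in this text, so the squared-error form between predicted and label offsets is an illustrative assumption; only the indicator (1 if IoU > η, else 0) is taken from the claim:

```python
def coord_loss(pred_offsets, label_offsets, ious, eta=0.5):
    """IoU-gated coordinate regression loss in the spirit of claim 6.

    Only anchors whose IoU with the label box exceeds eta contribute;
    the per-anchor squared-error term is an assumption, since the
    claim's exact expression is not available here.
    """
    total = 0.0
    for p, t, iou in zip(pred_offsets, label_offsets, ious):
        if iou > eta:  # indicator from the claim: 1 if IoU > eta else 0
            total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total
```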
8. The method as claimed in any one of claims 1 to 7, wherein in step S5, L = γ·L_coord + L_conf, where γ is a trade-off parameter.
9. A phrase positioning system, comprising a computer device, wherein the computer device is configured or programmed to carry out the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that it stores a program, wherein the program is configured to carry out the steps of the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110138354.2A CN112966519B (en) | 2021-02-01 | 2021-02-01 | Phrase positioning method, system and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112966519A true CN112966519A (en) | 2021-06-15 |
CN112966519B CN112966519B (en) | 2023-10-10 |
Family
ID=76272908
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110138354.2A Active CN112966519B (en) | 2021-02-01 | 2021-02-01 | Phrase positioning method, system and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112966519B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190043479A1 (en) * | 2018-05-07 | 2019-02-07 | Intel Corporation | Wake on voice key phrase segmentation |
CN111898566A (en) * | 2020-08-04 | 2020-11-06 | 成都井之丽科技有限公司 | Attitude estimation method, attitude estimation device, electronic equipment and storage medium |
CN111931915A (en) * | 2020-08-06 | 2020-11-13 | 中国科学院重庆绿色智能技术研究院 | Method for training network based on DIOU loss function |
Non-Patent Citations (1)
Title |
---|
Li Zhixin; Wei Haiyang; Huang Feicheng; Zhang Canlong; Ma Huifang; Shi Zhongzhi: "Image Caption Generation Combining Visual Features and Scene Semantics", Chinese Journal of Computers, no. 09 *
Also Published As
Publication number | Publication date |
---|---|
CN112966519B (en) | 2023-10-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110021051B (en) | Human image generation method based on text-guided generative adversarial network | |
CN113158862B (en) | Multitasking-based lightweight real-time face detection method | |
CN110651276A (en) | Tagging and online incremental real-time learning of data streams for deep neural networks and neural network applications | |
CN108416065A (en) | Hierarchical neural network-based image-sentence description generation system and method | |
CN109670576B (en) | Multi-scale visual attention image description method | |
CN105718952A (en) | Method for focus classification of sectional medical images by employing deep learning network | |
CN113255457A (en) | Animation character facial expression generation method and system based on facial expression recognition | |
Dandıl et al. | Real-time facial emotion classification using deep learning | |
CN113343705A (en) | Text semantic based detail preservation image generation method and system | |
He et al. | Open-vocabulary multi-label classification via multi-modal knowledge transfer | |
CN110347857A (en) | The semanteme marking method of remote sensing image based on intensified learning | |
CN114998220A (en) | Tongue image detection and positioning method in natural environments based on improved Tiny-YOLO v4 | |
Gomez et al. | Selective style transfer for text | |
Strudel et al. | Weakly-supervised segmentation of referring expressions | |
CN113283336A (en) | Text recognition method and system | |
Wang et al. | Head: Hetero-assists distillation for heterogeneous object detectors | |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target | |
Choi | CNN output optimization for more balanced classification | |
Lu et al. | Prediction calibration for generalized few-shot semantic segmentation | |
Zhu et al. | Unpaired image captioning by image-level weakly-supervised visual concept recognition | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
Chen et al. | Code generation from a graphical user interface via attention-based encoder–decoder model | |
Zhu et al. | Dual-decoder transformer network for answer grounding in visual question answering | |
CN108108652A (en) | A kind of across visual angle Human bodys' response method and device based on dictionary learning | |
CN112966519B (en) | Phrase positioning method, system and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||