CN112966519A - Method, system and storage medium for locating a referring phrase - Google Patents

Method, system and storage medium for locating a referring phrase

Info

Publication number
CN112966519A
CN112966519A (application number CN202110138354.2A)
Authority
CN
China
Prior art keywords: phrase, offset, visual, anchor, prediction
Prior art date
Legal status
Granted
Application number
CN202110138354.2A
Other languages
Chinese (zh)
Other versions
CN112966519B (en)
Inventor
杨超
汪国庆
蒋斌
Current Assignee
Hunan University
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202110138354.2A
Publication of CN112966519A
Application granted
Publication of CN112966519B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/30 Semantic analysis (G Physics; G06 Computing; G06F Electric digital data processing; G06F40/00 Handling natural language data)
    • G06N3/048 Activation functions (G06N Computing arrangements based on specific computational models; G06N3/00 Based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture)
    • G06N3/08 Learning methods

Abstract

The invention discloses a method, a system and a storage medium for locating a referring phrase. Visual features modulated by phrase information are used to generate region proposals through regression, and two loss functions are designed to separately control the anchor-box offsets and the confidence, so that the generated proposals are both highly accurate and highly discriminative, alleviating the performance bottleneck caused by low-quality proposals in the prior art.

Description

Method, system and storage medium for locating a referring phrase
Technical Field
The invention relates to the technical field of human-computer interaction, and in particular to a method and a system for locating a referring phrase based on phrase-guided proposal generation.
Background
Referring phrase localization, also called referring expression comprehension (REC), is one of the core tasks of human-computer interaction. The task can be broadly described as locating, with a bounding box, the entity referred to by a natural-language phrase (e.g., "a girl wearing a pink shirt, holding a red umbrella"). The task is very challenging because it requires not only understanding the fine-grained semantic information of both images and natural-language phrases, but also aligning and associating the two modalities for accurate localization.
The basic pipeline of existing phrase localization methods is as follows: given an input image, a proposal generator produces a certain number of region proposals, and the visual features of each proposal are extracted. Meanwhile, a language model (such as BERT) encodes the corresponding referring phrase into a language feature. The visual features and the language features are then fed to a multimodal fusion module to produce fused features. Finally, a ranking module scores the fused features of all proposals, and the proposal region with the highest score is output as the result.
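For orientation, the following is a minimal sketch of this generic two-stage pipeline (Python); all component callables are hypothetical placeholders, not part of the patent.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

def locate_phrase(
    image,
    phrase: str,
    generate_proposals: Callable,   # image -> List[Box]
    visual_encoder: Callable,       # (image, box) -> feature vector
    language_encoder: Callable,     # phrase -> feature vector
    fuse: Callable,                 # (visual_feat, lang_feat) -> fused feature
    score: Callable,                # fused feature -> float
) -> Box:
    """Two-stage REC pipeline: propose, encode, fuse, rank."""
    proposals: List[Box] = generate_proposals(image)
    lang_feat = language_encoder(phrase)
    scored = [
        (score(fuse(visual_encoder(image, box), lang_feat)), box)
        for box in proposals
    ]
    # The highest-scoring proposal is returned as the localization result.
    return max(scored, key=lambda s: s[0])[1]
```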
Proposal generation is a very important step in the phrase localization process; if this step goes wrong, all subsequent processing goes wrong. Conventional proposal generation methods fall into two groups: training-free methods (e.g., Selective Search, EdgeBoxes) and class-aware object detectors (e.g., Faster R-CNN trained on the 80-class COCO dataset). Training-free methods typically exploit certain features of the image itself, such as edge information, to generate region proposals. However, the region proposals generated by such methods lack discriminativeness, so it is difficult to hit the true region within a limited number of proposals. A class-aware object detector can accurately generate highly discriminative proposals only when the referred object belongs to a preset category; the vocabulary of referring phrases, however, is usually open, and if the referred object is outside the preset categories (e.g., "window", which is not among the preset COCO categories), the correct region proposal cannot be generated. Furthermore, both kinds of methods share a fatal drawback: they generate region proposals from the image information alone, without considering the referring phrase, which can produce a large number of redundant proposals unrelated to it.
Disclosure of Invention
The technical problem to be solved by the invention is, in view of the defects of the prior art, to provide a method and a system for locating a referring phrase that improve the accuracy and discriminativeness of the generated proposals.
In order to solve the above technical problem, the technical scheme adopted by the invention is as follows: a method for locating a referring phrase based on phrase-guided proposal generation, comprising the following steps:
S1, extracting visual features $V_k$ of the input image at different scales, and normalizing the visual features $V_k$;
S2, obtaining the text information $Q$, and modulating the normalized visual features with the text information $Q$ to obtain the text-modulated visual features $\hat{V}_k$;
S3, placing N anchor boxes of different shapes at each position of the text-modulated visual features $\hat{V}_k$;
S4, for each anchor box, obtaining an M-dimensional prediction vector corresponding to the anchor box through a regression model, wherein the first M-1 values of the prediction vector represent at least the predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point, and the last value, obtained through a sigmoid function, represents the confidence of the offset box corresponding to the anchor box;
S5, computing the forward-propagation error of the regression model with an overall loss function composed of the loss function controlling the predicted offsets and the loss function controlling the confidence, computing the gradient of the forward-propagation error, updating the regression model with the gradient for model training, and finally obtaining a trained proposal generation model.
The method predicts region proposals by regression from visual features modulated by text features, and the whole process is trained only on a referring-phrase localization dataset, so proposal generation is not restricted to preset categories; moreover, the two loss functions that separately control the anchor-box offsets and the confidence make the generated proposals both more accurate and more discriminative.
The method of the present invention further comprises:
S6, obtaining the predicted offsets and confidences of all anchor boxes with the trained proposal generation model, then outputting K anchor boxes in descending order of their predicted confidence, shifting each output anchor box according to its predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point, and finally obtaining the proposals for phrase localization. The invention outputs proposals in order of confidence, which guarantees their accuracy, while shifting the anchor boxes greatly improves their discriminativeness.
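As an illustration of step S6, the following is a minimal decoding sketch (Python/NumPy). The exact decoding equations are not given in the text, so the YOLO-style convention of sigmoid-transformed center offsets and exponentiated width/height is an assumption, as are the array layouts.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_proposals(anchors: np.ndarray, preds: np.ndarray, k: int = 7) -> np.ndarray:
    """Shift the k most confident anchors by their predicted offsets.

    anchors: (N, 4) array of (cx, cy, w, h) anchor boxes.
    preds:   (N, 5) array of (tx, ty, tw, th, conf_logit) predictions.
    Returns a (k, 4) array of proposal boxes (cx, cy, w, h).
    """
    conf = sigmoid(preds[:, 4])
    order = np.argsort(-conf)[:k]               # top-k anchors by confidence
    a, t = anchors[order], preds[order]
    cx = a[:, 0] + sigmoid(t[:, 0]) * a[:, 2]   # center offsets pass through sigmoid
    cy = a[:, 1] + sigmoid(t[:, 1]) * a[:, 3]
    w = a[:, 2] * np.exp(t[:, 2])               # width/height decoded multiplicatively
    h = a[:, 3] * np.exp(t[:, 3])
    return np.stack([cx, cy, w, h], axis=1)
```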
In step S1, the normalization of the visual features $V_k$ is implemented as follows: a convolution operation is applied to the visual features $V_k$, the result of the convolution is normalized, and the normalized result is passed through a ReLU function to obtain the normalized visual features.
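A minimal sketch of such a conv-norm-ReLU block (PyTorch). The 1x1 kernel, batch normalization, and the output dimension of 512 follow the detailed description later in this document; treat the block as an illustration rather than the exact network.

```python
import torch.nn as nn

def make_norm_block(in_channels: int, out_channels: int = 512) -> nn.Sequential:
    """1x1 convolution, batch normalization, then ReLU, as in step S1."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )
```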
In step S2, the text-modulated visual features $\hat{V}_k$ are given by:

$\hat{V}_k(i,j) = f_2\big(p \odot f_1(V_k(i,j)) + q\big)$

where $\odot$ is the Hadamard product, $(i,j)$ are the spatial coordinates of the normalized visual features, $\hat{V}_k(i,j)$ and $V_k(i,j)$ respectively denote the feature components of the text-modulated visual features and the original visual features at spatial location $(i,j)$, and $f_1(\cdot)$ and $f_2(\cdot)$ are learnable mapping layers: $f_1(\cdot)$ denotes instance normalization followed by a convolution, and $f_2(\cdot)$ denotes normalization, convolution, and a ReLU activation applied in sequence. Further, $p = \tanh(W_p Q + b_p)$ and $q = \tanh(W_q Q + b_q)$, where $W_p$, $b_p$, $W_q$, $b_q$ are all learnable parameters. The invention thereby fuses the visual features with the text information; the learnable mapping layers and learnable parameters give the fused multimodal features stronger expressive power than direct feature concatenation.
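The following is a minimal PyTorch sketch of this modulation under stated assumptions: 512-dimensional features and text vectors (as in the detailed description below), and single linear layers standing in for the two multi-layer perceptrons that produce $p$ and $q$.

```python
import torch
import torch.nn as nn

class TextModulation(nn.Module):
    """FiLM-style modulation: scale and shift visual features by text vectors."""

    def __init__(self, channels: int = 512, text_dim: int = 512):
        super().__init__()
        self.to_p = nn.Linear(text_dim, channels)   # scaling vector p
        self.to_q = nn.Linear(text_dim, channels)   # translation vector q
        # f1: instance norm + 1x1 conv; f2: batch norm + 3x3 conv + ReLU
        self.f1 = nn.Sequential(
            nn.InstanceNorm2d(channels), nn.Conv2d(channels, channels, 1))
        self.f2 = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, v: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        p = torch.tanh(self.to_p(text))[:, :, None, None]  # broadcast over (i, j)
        q = torch.tanh(self.to_q(text))[:, :, None, None]
        return self.f2(p * self.f1(v) + q)
```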
The loss function controlling the predicted offsets is:

$L_{coord} = \sum_{n=1}^{N} \mathbb{1}_n \big[ (\sigma((t_x)_n) - \hat{t}_{x,n})^2 + (\sigma((t_y)_n) - \hat{t}_{y,n})^2 + ((t_w)_n - \hat{t}_{w,n})^2 + ((t_h)_n - \hat{t}_{h,n})^2 \big]$

where $\mathbb{1}_n$ is 1 when the intersection-over-union IoU (i.e., the ratio of the area of the intersection of two boxes to the area of their union) between the $n$-th anchor box and the ground-truth box in the dataset exceeds the threshold $\eta$, and 0 otherwise. The loss function $L_{coord}$ drives the model's predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point of the anchor box as close as possible to the label values.
In step S5, the loss function controlling the confidence is:

$L_{conf} = \sum_{i=1}^{N} s_i^{*} \log \frac{s_i^{*}}{s_i}$

where $N$ is the number of anchor boxes, $s_i$ is the predicted confidence, and $s_i^{*}$ is the confidence label. The loss function $L_{conf}$ drives the model's predicted anchor-box confidence as close as possible to the label value.
The invention also provides a referring phrase localization system based on phrase-guided proposal generation, comprising a computer device; the computer device is configured or programmed to carry out the steps of the above method.
The present invention also provides a computer-readable storage medium storing a program; the program is configured for performing the steps of the above-described method.
Compared with the prior art, the invention has the beneficial effects that:
the method of the present invention solves the problem of the prior art of ignoring phrase information during the proposal generation process. The invention utilizes the visual characteristics modified by phrase information, generates the proposal through regression, and designs two loss functions to respectively control the offset and the confidence coefficient of the anchor frame, so that the generated proposal has the advantages of high accuracy and high identifiability, and the performance bottleneck caused by low-quality proposal in the prior art is relieved.
Drawings
FIG. 1 is a visual comparison between the proposals generated by the present invention and those generated by two prior proposal methods, where solid boxes are ground-truth regions and dashed boxes are proposals; (a) Selective Search results; (b) Faster R-CNN detector results; (c) results of the present method;
FIG. 2 is a flowchart of the phrase-localization proposal generation process according to an embodiment of the present invention.
Detailed Description
An embodiment of the invention is implemented through the following specific steps:
First, image visual features $V_k$ ($k = 1, 2, 3$) at three different resolutions are extracted using a Darknet53 feature pyramid network pre-trained on the COCO object detection dataset (Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014), while a language model extracts the text feature $Q$ of the referring phrase. To introduce the text information of the phrase into the proposal generation process, the invention modulates all visual features with the text information to obtain the text-modulated visual features $\hat{V}_k$. The model is then trained with two loss functions that control the anchor-box offsets and the confidence, respectively. Finally, the model performs regression on $\hat{V}_k$ to predict the position coordinates of the proposal regions.
To obtain visual features containing phrase information, the invention modulates the visual features at the three different resolutions with the text feature, using the conditional normalization method employed in image-level tasks (H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. Courville, "Modulating early visual processing by language," in NIPS, 2017). First, the text feature $Q$ is mapped into a scaling vector $p$ and a translation vector $q$ by two multi-layer perceptrons:

$p = \tanh(W_p Q + b_p), \quad q = \tanh(W_q Q + b_q)$

where $W_p$, $b_p$, $W_q$, $b_q$ are learnable parameters. Then, the text-modulated visual features $\hat{V}_k$ are obtained via equation (1):

$\hat{V}_k(i,j) = f_2\big(p \odot f_1(V_k(i,j)) + q\big) \qquad (1)$

where $\odot$ is the Hadamard product, $(i,j)$ are the spatial coordinates of the visual feature, and $f_1$ and $f_2$ are the learnable mapping layers shown in FIG. 2: $f_1$ consists of an instance normalization layer followed by a 1x1 convolutional layer, and $f_2$ consists of a batch normalization layer followed by a 3x3 convolutional layer and a ReLU activation function. The same operation is applied at every coordinate position $(i,j)$ of the visual feature.
As shown in stage 1 of FIG. 2, the localization module takes the text-modulated visual features $\hat{V}_k$ as input and outputs the position coordinates of multiple image regions as proposals. There are 8x8 + 16x16 + 32x32 = 1344 different positions in total, each corresponding to a 512-dimensional vector of the text-modulated visual features. Following YOLOv3 (J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018), three anchor boxes of different shapes are placed at each position; the anchor sizes are obtained by K-means clustering, with (1 - IoU) as the distance, over the widths and heights of the ground-truth boxes in the training set (which can be any phrase localization dataset). The invention therefore has $N = 3 \times 1344 = 4032$ anchor boxes in total. For each anchor box, 4 values are predicted by regression, namely the offsets of the width, the height, and the horizontal and vertical coordinates of the anchor's center point, and a fifth value, passed through a sigmoid function, represents the confidence of the corresponding shifted box.
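A minimal NumPy sketch of this K-means clustering with the (1 - IoU) distance follows. The initialization and stopping rule are assumptions; the text specifies only the distance measure and the use of ground-truth box widths and heights.

```python
import numpy as np

def iou_wh(boxes: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """IoU between (N, 2) width/height boxes and (K, 2) centers, anchored at the origin."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh: np.ndarray, k: int = 3, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster ground-truth box sizes with distance d = 1 - IoU."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, centers), axis=1)  # min (1 - IoU) == max IoU
        new = np.array([
            wh[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return centers
```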
For the prediction of the confidence value, the key is how to design the loss function so that the predicted confidence $S$ is consistent with the confidence label $S^{*}$. Most existing methods (Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo, "A fast and accurate one-stage approach to visual grounding," in ICCV, 2019; X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li, "Improving referring expression grounding with cross-modal attention-guided erasing," in CVPR, 2019) define the confidence label $S^{*}$ as a one-hot vector with a single element equal to 1: the label is set to 1 for the anchor box whose IoU with the ground-truth box is largest, and 0 otherwise. The model is then trained with a cross-entropy loss on this one-hot label.
The present invention improves on this scheme. Specifically, a threshold $\eta$ is set; the IoU score between each anchor box and the ground-truth box is computed, and if the score exceeds the threshold $\eta$ (set to 0.7), it is used as the confidence label, otherwise the label is set to 0. This yields a smooth confidence label $\hat{s}^{*}_n$:

$\hat{s}^{*}_n = \begin{cases} \mathrm{IoU}_n, & \mathrm{IoU}_n > \eta \\ 0, & \text{otherwise} \end{cases}$

An L1 normalization is then applied to ensure $\sum S^{*} = 1$. The invention accordingly uses the Kullback-Leibler divergence (KLD) as its loss function, making the predicted confidence score $S$ approximate the smooth confidence label $S^{*}$, as in equation (3):

$L_{conf} = \sum_{i=1}^{N} s_i^{*} \log \frac{s_i^{*}}{s_i} \qquad (3)$

Note that $S$ is also L1-normalized so that it represents a probability distribution like $S^{*}$. The benefits of smooth labels are twofold: 1) besides the anchor box with the largest IoU against the ground-truth box, other anchor boxes can also provide useful context information; 2) smooth labels effectively regularize the model, mitigating overfitting.
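A minimal PyTorch sketch of the smooth labels and the KLD loss of equation (3) follows; the small epsilon guards against log(0) and division by zero are added assumptions.

```python
import torch

def smooth_labels(ious: torch.Tensor, eta: float = 0.7) -> torch.Tensor:
    """Keep IoU scores above eta as labels, zero the rest, then L1-normalize."""
    s = torch.where(ious > eta, ious, torch.zeros_like(ious))
    return s / s.sum().clamp_min(1e-8)

def kld_conf_loss(pred_conf: torch.Tensor, ious: torch.Tensor, eta: float = 0.7) -> torch.Tensor:
    """KL(S* || S) between smooth labels S* and L1-normalized predictions S."""
    s_star = smooth_labels(ious, eta)
    s = pred_conf / pred_conf.sum().clamp_min(1e-8)
    mask = s_star > 0                       # terms with s*_i = 0 contribute nothing
    return (s_star[mask] * (s_star[mask] / s[mask].clamp_min(1e-8)).log()).sum()
```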
For the prediction of the anchor-box offsets, the invention uses the following MSE loss:

$L_{coord} = \sum_{n=1}^{N} \mathbb{1}_n \big[ (\sigma((t_x)_n) - \hat{t}_{x,n})^2 + (\sigma((t_y)_n) - \hat{t}_{y,n})^2 + ((t_w)_n - \hat{t}_{w,n})^2 + ((t_h)_n - \hat{t}_{h,n})^2 \big]$

where $\mathbb{1}_n$ indicates whether the IoU between the $n$-th anchor box and the ground-truth box exceeds the threshold $\eta$, and $\sigma$ denotes the sigmoid function. Note that this loss acts only on the selected anchor boxes, i.e., those that will become proposal boxes after being shifted.
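A minimal PyTorch sketch of this selective MSE loss; the (tx, ty, tw, th) tensor layout is an assumption.

```python
import torch

def coord_loss(pred: torch.Tensor, target: torch.Tensor,
               ious: torch.Tensor, eta: float = 0.7) -> torch.Tensor:
    """MSE over (sigmoid(tx), sigmoid(ty), tw, th) for anchors with IoU > eta.

    pred, target: (N, 4) tensors ordered as (tx, ty, tw, th).
    ious: (N,) IoU of each anchor with the ground-truth box.
    """
    xy = torch.sigmoid(pred[:, :2])         # center offsets pass through sigmoid
    wh = pred[:, 2:]
    err = torch.cat([(xy - target[:, :2]) ** 2, (wh - target[:, 2:]) ** 2], dim=1)
    selected = (ious > eta).float()         # indicator 1_n
    return (err.sum(dim=1) * selected).sum()
```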
The overall loss function of the proposal generation model is defined as:

$L = L_{conf} + \gamma L_{coord}$

where $\gamma$ is a trade-off parameter, set to 1.
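Putting the two losses together, one training step might look like the following sketch (PyTorch), reusing kld_conf_loss and coord_loss from the sketches above; the model interface is hypothetical, and the RMSProp setting follows the training details given later.

```python
import torch

def train_step(model, optimizer, image, phrase_feat, targets, ious,
               gamma: float = 1.0, eta: float = 0.7) -> float:
    """Forward pass, combined loss L = L_conf + gamma * L_coord, backprop, update."""
    pred = model(image, phrase_feat)        # (N, 5): tx, ty, tw, th, conf logit
    conf = torch.sigmoid(pred[:, 4])
    loss = kld_conf_loss(conf, ious, eta) + gamma * coord_loss(pred[:, :4], targets, ious, eta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Matching the paper's stated optimizer setting:
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
```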
In the testing stage, the number of proposals is set to $K$, and the shifted anchor boxes are output as proposals in descending order of confidence. To avoid outliers, the invention eliminates boxes whose height or width is extremely small. Note that although the method of the present invention resembles recent single-stage REC methods that also employ deep regression networks (X. Chen, L. Ma, J. Chen, Z. Jie, W. Liu, and J. Luo, "Real-time referring expression comprehension by single-stage grounding network," arXiv preprint arXiv:1812.03426, 2018; A. Sadhu, K. Chen, and R. Nevatia, "Zero-shot grounding of objects from natural language queries," in ICCV, 2019), its motivation is different: a single-stage phrase localization method directly outputs the single region with the highest confidence as the final result, whereas the model of the invention outputs multiple regions as proposals.
The proposal generation model of the invention is an end-to-end network: it takes an image and the corresponding referring phrase as input and outputs a series of image regions as proposals. For images, the invention first resizes the original image to 256x256, and then extracts visual features using a Darknet53 (J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018) feature pyramid network pre-trained on the COCO object detection dataset. The extracted visual features have three spatial resolutions, 8x8xD1, 16x16xD2, and 32x32xD3, where D1 = 1024, D2 = 512, and D3 = 256 are the channel counts of the features at the respective resolutions. Finally, they are mapped to a uniform dimension D = 512 by a 1x1 convolutional layer followed by batch normalization and a ReLU function. For the referring phrase, the invention embeds it into a 768-dimensional vector with the uncased version of BERT (J. Devlin, M. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018), and then obtains a 512-dimensional text feature through two fully connected layers of 512 neurons. Since spatial position information can be better encoded in the proposal representation and ranking module, the invention does not repeat that operation in the proposal generation model.
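A minimal sketch of this text branch (PyTorch with the Hugging Face transformers library). Pooling the BERT output at the [CLS] token and the ReLU between the two fully connected layers are assumptions not specified in the text.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class PhraseEncoder(nn.Module):
    """BERT embedding (768-d) followed by two 512-neuron fully connected layers."""

    def __init__(self):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.fc = nn.Sequential(
            nn.Linear(768, 512), nn.ReLU(inplace=True), nn.Linear(512, 512))

    def forward(self, phrase: str) -> torch.Tensor:
        tokens = self.tokenizer(phrase, return_tensors="pt")
        out = self.bert(**tokens).last_hidden_state[:, 0]  # [CLS] token, (1, 768)
        return self.fc(out)                                # (1, 512) text feature
```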
To implement a complete phrase localization model, the invention follows MAttNet (L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. Berg, "Mattnet: Modular attention network for referring expression comprehension," in CVPR, 2018) and uses a modular network for proposal representation and ranking. As shown in the second stage of FIG. 2, MAttNet uses three sub-modules for appearance, location, and relationship. The appearance sub-module handles attributes such as category and color, the location sub-module handles absolute and relative positions, and the relationship sub-module handles subject-object relationships. Each module has a different structure, with its own learnable parameters kept in its module space so the modules do not affect each other. The module does not use an external language parser; instead, it automatically learns to parse phrases through a soft attention mechanism. The model measures the matching degree between an object and the referring phrase by combining the matching scores of the three sub-modules.
When resizing the input image, the invention preserves the aspect ratio of the original image, resizes its long side to 256 pixels, and then pads along the short side with the mean of the image pixels. The model is trained with the RMSProp optimization method, with an initial learning rate of 10^-4 and a polynomial learning-rate decay strategy with power 1. Since Darknet has been pre-trained, the learning rate of the Darknet part of the model is reduced by 90%. The IoU threshold $\eta$ is set to 0.7 and the trade-off parameter $\gamma$ of the loss function to 1. K = 7 is selected as the default number of generated proposals; the relevant ablation studies can be found in the quantitative results analysis below. The batch size of all experiments is 32, and training is done on a single 1080Ti GPU. The training setup of the proposal representation and ranking module is the same as that of MAttNet (L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. Berg, "Mattnet: Modular attention network for referring expression comprehension," in CVPR, 2018).
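A minimal sketch of this aspect-preserving resize-and-pad preprocessing (Pillow and NumPy); pasting the resized image at the top-left corner is an assumption, since the text does not specify the placement.

```python
import numpy as np
from PIL import Image

def resize_and_pad(img: Image.Image, size: int = 256) -> np.ndarray:
    """Resize the long side to `size` and pad the short side with the mean pixel value."""
    img = img.convert("RGB")
    w, h = img.size
    scale = size / max(w, h)
    resized = img.resize((round(w * scale), round(h * scale)))
    arr = np.asarray(resized, dtype=np.float32)
    canvas = np.full((size, size, 3), arr.mean(axis=(0, 1)), dtype=np.float32)
    canvas[:arr.shape[0], :arr.shape[1]] = arr   # paste at the top-left corner
    return canvas
```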
In the present invention, for a given referring phrase, a predicted box is considered correct if the IoU between the predicted box and the ground-truth box is not less than 0.5.
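For reference, a small IoU helper implementing this criterion, assuming boxes in (x, y, w, h) format:

```python
def iou(a, b) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# A prediction is counted as correct when iou(pred_box, gt_box) >= 0.5.
```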
The simulation experiments of the invention use four classic phrase localization datasets: ReferIt (S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, "Referitgame: Referring to objects in photographs of natural scenes," in EMNLP, 2014), RefCOCO (L. Yu, P. Poirson, S. Yang, A. Berg, and T. Berg, "Modeling context in referring expressions," in ECCV, 2016), RefCOCO+ (same reference), and RefCOCOg (J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy, "Generation and comprehension of unambiguous object descriptions," in CVPR, 2016). ReferIt contains 20,000 images from SAIAPR-12 [27]; the invention adopts a cleaned split with training, validation, and test sets of 9,000, 1,000, and 10,000 images, respectively (K. Chen, R. Kovvuri, and R. Nevatia, "Query-guided regression network with context policy for phrase grounding," in ICCV, 2017). RefCOCO annotates 50,000 target entities in 19,994 images, and RefCOCO+ annotates 49,856 target entities in 19,992 images; both datasets are divided into four splits: train, val, testA, and testB. RefCOCOg contains 49,822 target entities from 25,799 images, divided into three splits: train, val, and test.
Table 1 Performance comparison (acc@0.5%) of different phrase localization methods on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Methods marked with (*) are single-stage methods. [Table values are provided as an image in the original and are not reproduced.]
Table 2 Performance comparison (acc@0.5%) of different phrase localization methods on the ReferIt dataset. Methods marked with (*) are single-stage methods. [Table image not reproduced.]
Table 3 Performance comparison of proposal generators (acc@0.5%). [Table image not reproduced.]
Table 4 Ablation study on the number of proposals K (acc@0.5%). [Table image not reproduced.]
Referring phrase localization results: the method of the invention is compared with other SOTA methods. Tables 1 and 2 show the phrase localization results on the COCO-series datasets and the ReferIt dataset, respectively. The results show that the model of the invention outperforms both the existing two-stage SOTA methods and the single-stage SOTA methods. In particular, on the ReferIt dataset, which is not a COCO-series dataset, the method far exceeds the existing two-stage methods. The tables also show the performance of models trained under different loss functions: training with the KLD loss improves performance by 0.6 to 1.8 points over the conventional cross-entropy loss.
Performance comparison of proposal generators: to further study the performance of the proposal generator in the invention, the proposal representation and ranking method of MAttNet (L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. Berg, "Mattnet: Modular attention network for referring expression comprehension," in CVPR, 2018) is kept fixed, and only the proposal generator is changed in the comparative experiments. The invention compares three mainstream proposal generators, namely EdgeBoxes (C. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in ECCV, 2014), Selective Search (J. Uijlings, K. Van De Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," IJCV, 2013), and the FRCN object detector (S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," TPAMI, 2017); the final results are shown in Table 3. The method proposed by the invention outperforms the other methods on all datasets. FRCN trained on COCO also performs well on the COCO-series datasets (RefCOCO, RefCOCO+): for example, FRCN performs as well as the proposal generation model of the invention on val and testA of RefCOCO+, although the model of the invention is significantly superior to FRCN on testB of RefCOCO and RefCOCO+. On the ReferIt dataset, however, the performance of FRCN drops significantly. This is because RefCOCO/RefCOCO+ are subsets of COCO and share its images and target entities, so a detector trained on COCO can generate almost perfect region proposals on COCO-series datasets; when applied to other datasets, such as ReferIt, its performance degrades markedly. The proposal generator of the invention, in contrast, works well on all datasets.
Ablation experiments: the invention performs ablation studies on the ReferIt dataset over different numbers of generated proposals K, with results shown in Table 4. It is observed that increasing the number of generated proposals beyond a certain threshold no longer improves the accuracy of the model (e.g., K >= 7 on ReferIt); therefore, K = 7 is chosen as the default value in the experiments. The analysis is that, since the method generates proposals in descending order of anchor-box confidence, the hit rate on the ground-truth box is already quite high at K = 7; further increasing K yields a hit-rate gain too small to offset the negative effect of proposal redundancy.

Claims (10)

1. A method for locating a referring phrase, comprising the following steps:
S1, extracting visual features $V_k$ of the input image at different scales, and normalizing the visual features $V_k$;
S2, obtaining the text information $Q$, and modulating the normalized visual features with the text information $Q$ to obtain the text-modulated visual features $\hat{V}_k$;
S3, placing N anchor boxes of different shapes at each position of the text-modulated visual features $\hat{V}_k$;
S4, for each anchor box, obtaining a 5-dimensional prediction vector corresponding to the anchor box through a regression model, wherein the first 4 values of the prediction vector represent at least the predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point, and the last value, obtained through a sigmoid function, represents the confidence of the offset box corresponding to the anchor box;
S5, computing the forward-propagation error of the regression model with an overall loss function $L$ composed of a loss function $L_{coord}$ controlling the predicted offsets and a loss function $L_{conf}$ controlling the confidence, computing the gradient of the forward-propagation error, updating the regression model with the gradient for model training, and finally obtaining a trained proposal generation model.
2. The phrase localization method as recited in claim 1, further comprising:
S6, obtaining the predicted offsets and confidences of all anchor boxes with the trained proposal generation model, then outputting K anchor boxes in descending order of their predicted confidence, shifting each output anchor box according to its predicted offsets of the width, the height, and the horizontal and vertical coordinates of the center point, and finally obtaining the proposals for phrase localization.
3. The method for locating a phrase according to claim 1, wherein in step S1, the normalization of the visual features $V_k$ is implemented as follows: a convolution operation is applied to the visual features $V_k$, the result of the convolution is normalized, and the normalized result is passed through a ReLU function to obtain the normalized visual features.
4. The method for locating a phrase according to claim 1, wherein in step S2, the text-modulated visual features $\hat{V}_k$ are given by:

$\hat{V}_k(i,j) = f_2\big(p \odot f_1(V_k(i,j)) + q\big)$

wherein $\odot$ is the Hadamard product, $(i,j)$ are the spatial coordinates of the normalized visual features, $\hat{V}_k(i,j)$ and $V_k(i,j)$ respectively denote the feature components of the text-modulated visual features and the original visual features at spatial location $(i,j)$, $f_1(\cdot)$ and $f_2(\cdot)$ are learnable mapping layers, $f_1(\cdot)$ denoting instance normalization followed by a convolution and $f_2(\cdot)$ denoting normalization, convolution, and a ReLU activation applied in sequence, and $p = \tanh(W_p Q + b_p)$, $q = \tanh(W_q Q + b_q)$, where $W_p$, $b_p$, $W_q$, $b_q$ are all learnable parameters.
5. The method of claim 4, wherein $f_1(\cdot)$ comprises an instance normalization layer and a 1x1 convolutional layer connected in sequence, and $f_2(\cdot)$ comprises a batch normalization layer, a 3x3 convolutional layer, and a ReLU activation function connected in sequence.
6. The method for locating a phrase according to claim 1, wherein in step S5, the loss function controlling the predicted offsets is:

$L_{coord} = \sum_{n=1}^{N} \mathbb{1}_n \big[ (\sigma((t_x)_n) - \hat{t}_{x,n})^2 + (\sigma((t_y)_n) - \hat{t}_{y,n})^2 + ((t_w)_n - \hat{t}_{w,n})^2 + ((t_h)_n - \hat{t}_{h,n})^2 \big]$

wherein $\mathbb{1}_n$ is 1 when the intersection-over-union IoU between the $n$-th anchor box and the ground-truth box in the dataset exceeds the threshold $\eta$, and 0 otherwise; $\sigma$ denotes the sigmoid function; $(t_w)_n$, $(t_h)_n$, $(t_x)_n$, $(t_y)_n$ are the predicted offsets of the width, the height, and the abscissa and ordinate of the center point of the $n$-th anchor box; and $\hat{t}_{w,n}$, $\hat{t}_{h,n}$, $\hat{t}_{x,n}$, $\hat{t}_{y,n}$ are the corresponding offset labels of the width, the height, and the abscissa and ordinate of the center point of the $n$-th anchor box, $n = 1, 2, \ldots, N$.
7. The method for locating a phrase according to claim 1, wherein in step S5, the loss function controlling the confidence is:

$L_{conf} = \sum_{i=1}^{N} s_i^{*} \log \frac{s_i^{*}}{s_i}$

wherein $N$ is the number of anchor boxes, $s_i$ is the predicted confidence, and $s_i^{*}$ is the confidence label.
8. The method as claimed in any one of claims 1 to 7, wherein in step S5, $L = \gamma L_{coord} + L_{conf}$, where $\gamma$ is a trade-off parameter.
9. A phrase localization system comprising a computer device; the computer device is configured or programmed to carry out the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium characterized by storing a program; the program is configured for carrying out the steps of the method according to one of claims 1 to 8.
CN202110138354.2A 2021-02-01 2021-02-01 Phrase positioning method, system and storage medium Active CN112966519B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110138354.2A CN112966519B (en) 2021-02-01 2021-02-01 Phrase positioning method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110138354.2A CN112966519B (en) 2021-02-01 2021-02-01 Phrase positioning method, system and storage medium

Publications (2)

Publication Number Publication Date
CN112966519A true CN112966519A (en) 2021-06-15
CN112966519B CN112966519B (en) 2023-10-10

Family

ID=76272908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110138354.2A Active CN112966519B (en) 2021-02-01 2021-02-01 Phrase positioning method, system and storage medium

Country Status (1)

Country Link
CN (1) CN112966519B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190043479A1 (en) * 2018-05-07 2019-02-07 Intel Corporation Wake on voice key phrase segmentation
CN111898566A (en) * 2020-08-04 2020-11-06 成都井之丽科技有限公司 Attitude estimation method, attitude estimation device, electronic equipment and storage medium
CN111931915A (en) * 2020-08-06 2020-11-13 中国科学院重庆绿色智能技术研究院 Method for training network based on DIOU loss function

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李志欣, 魏海洋, 黄飞成, 张灿龙, 马慧芳, 史忠植: "Image caption generation combining visual features and scene semantics" (结合视觉特征和场景语义的图像描述生成), 计算机学报 (Chinese Journal of Computers), no. 09

Also Published As

Publication number Publication date
CN112966519B (en) 2023-10-10

Similar Documents

Publication Publication Date Title
CN110021051B (en) Human image generation method based on generation of confrontation network through text guidance
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN110651276A (en) Tagging and online incremental real-time learning of data streams for deep neural networks and neural network applications
CN108416065A (en) Image based on level neural network-sentence description generates system and method
CN109670576B (en) Multi-scale visual attention image description method
CN105718952A (en) Method for focus classification of sectional medical images by employing deep learning network
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
Dandıl et al. Real-time facial emotion classification using deep learning
CN113343705A (en) Text semantic based detail preservation image generation method and system
He et al. Open-vocabulary multi-label classification via multi-modal knowledge transfer
CN110347857A (en) The semanteme marking method of remote sensing image based on intensified learning
CN114998220A (en) Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment
Gomez et al. Selective style transfer for text
Strudel et al. Weakly-supervised segmentation of referring expressions
CN113283336A (en) Text recognition method and system
Wang et al. Head: Hetero-assists distillation for heterogeneous object detectors
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
Choi CNN output optimization for more balanced classification
Lu et al. Prediction calibration for generalized few-shot semantic segmentation
Zhu et al. Unpaired image captioning by image-level weakly-supervised visual concept recognition
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Chen et al. Code generation from a graphical user interface via attention-based encoder–decoder model
Zhu et al. Dual-decoder transformer network for answer grounding in visual question answering
CN108108652A (en) A kind of across visual angle Human bodys' response method and device based on dictionary learning
CN112966519B (en) Phrase positioning method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant