CN111428801A - Image-text matching method for improving alternate updating of fusion layer and loss function - Google Patents


Info

Publication number
CN111428801A
Authority
CN
China
Prior art keywords
features
fusion
text
image
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010236904.XA
Other languages
Chinese (zh)
Other versions
CN111428801B (en)
Inventor
程述立
汪烈军
杜安钰
王德鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinjiang University
Original Assignee
Xinjiang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinjiang University filed Critical Xinjiang University
Priority to CN202010236904.XA priority Critical patent/CN111428801B/en
Publication of CN111428801A publication Critical patent/CN111428801A/en
Application granted granted Critical
Publication of CN111428801B publication Critical patent/CN111428801B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention provides an image-text matching method for improving alternate updating of a fusion layer and a loss function, which comprises the following steps: establishing a fusion layer, and extracting fused features between the image feature data and the text feature data through the fusion layer; embedding and encoding the image features and the text features based on the similarity of the fused features to the image features and the text features respectively; and optimizing the original loss and the fusion loss through an alternating update strategy based on an original loss function composed of the original features and a fusion loss function composed of the fused features. The method considers robust feature extraction and model optimization simultaneously, introduces a fusion layer to obtain robust features, and optimizes the network parameters by alternately updating the loss functions defined before and after fusion.

Description

Image-text matching method for improving alternate updating of fusion layer and loss function
Technical Field
The invention relates mainly to the intersection of computer vision and natural language processing, is applicable to cross-modal retrieval tasks in major search engines, and in particular relates to an image-text matching method for improving alternate updating of a fusion layer and a loss function.
Background
With the explosive growth of multimedia data from heterogeneous search engines and social media, image-text matching has become a dominant approach to cross-modal retrieval in recent years.
Unlike single-modality retrieval tasks such as image retrieval and text retrieval, the image-text matching task focuses on data of two modalities simultaneously and tries to find the matching relationship between image data and the corresponding text data. The ultimate goal of image-text matching is to build a bridge connecting the image and text modalities. Through this bridge, images can retrieve the corresponding texts (I2T) and texts can retrieve the corresponding images (T2I). Image-text matching requires finding the relationship between image features and text semantics. Since image and text data are two different data forms, designing a compact, robust and efficient image-text matching method is a great challenge.
To solve this problem, existing methods can be divided into two types according to the modeling manner. The first is based on the idea of classification: the matching problem is solved by optimizing a logistic regression loss, a match is labeled +1 and a mismatch is labeled -1, so that matching is converted into a binary classification problem. However, this approach is not sufficient for the complex multimodal problem and it is difficult to obtain good results; the idea does not meet the essential requirements of the image-text matching problem. The second method is based on the idea of embedding, i.e. the data of both modalities are embedded into a common representation space, and the degree of matching between an image and a text is then described by the Euclidean or cosine distance. Specifically, image features and text features are first encoded, and then a triplet ranking loss is optimized such that the distance between mutually matching image and text features is smaller than the distance between unmatched image and text features (a minimal sketch of such a loss is given below). However, finding a suitable common space for both image and text modality data is not easy. The complexity of this approach is typically high and a large amount of computational resources is required for training. Moreover, current embedding methods often ignore the relationship between image and text features and therefore do not construct the common space well.
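For illustration only, the following is a minimal sketch of the kind of bidirectional triplet ranking loss referred to above, written in PyTorch. It is not part of the claimed method; the margin value, the use of all in-batch mismatched pairs as negatives, and the tensor names are assumptions.

```python
import torch


def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based bidirectional ranking loss over a batch.

    img_emb, txt_emb: (B, D) L2-normalized embeddings where the i-th image
    matches the i-th sentence; every other pair in the batch is a negative.
    """
    scores = img_emb @ txt_emb.t()            # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)           # similarities of matching pairs

    # image as query, mismatched sentences as negatives
    cost_i2t = (margin + scores - pos).clamp(min=0)
    # sentence as query, mismatched images as negatives
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)

    # do not penalize the matching pairs themselves
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.sum() + cost_t2i.sum()
```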
Disclosure of Invention
In order to overcome the defects of the prior art, the invention, proceeding from practical application and building on the prior art, provides an image-text matching method for improving alternate updating of a fusion layer and a loss function.
The technical scheme of the invention is as follows:
the image-text matching method for improving alternate updating of a fusion layer and a loss function is characterized by comprising the following steps:
establishing a fusion layer, and extracting fused features between the image feature data and the text feature data through the fusion layer;
embedding and encoding the image features and the text features based on the similarity of the fused features to the image features and the text features respectively;
and optimizing the original loss and the fusion loss through an alternating update strategy based on an original loss function composed of the original features and a fusion loss function composed of the fused features.
Further, the extracting, through the fusion layer, of fused features between the image feature data and the text feature data includes:
extracting image features based on a Faster R-CNN model and a ResNet-101 model, and extracting text features based on a bi-GRU model.
Further, the embedding and encoding of the image features and the text features based on the similarity of the fused features to the image features and the text features respectively includes:
re-encoding the image features or the text features based on the distance of the image features and the text features from the fused features, such that in the encoding stage the distance between matched features is smaller than the distance between unmatched features.
Further, the extraction of the fusion features and the embedding of the features specifically include:
characterizing the image features as
X = {x_1, x_2, ..., x_n}
wherein x_i represents the encoding of a region within the image; representing the text features as
Y = {y_1, y_2, ..., y_m}
wherein y_i represents the encoding of a word in the sentence; and concatenating the image features and the text features to generate the fused feature:
F = X || Y
determining the effect of the fused feature on the image features and the text features based on cosine evaluation scores:
Score_im = cos(F, X)    (1)
Score_txt = cos(F, Y)    (2)
wherein Score_im represents the cosine evaluation score of the fused feature with respect to the image features, and Score_txt represents the cosine evaluation score of the fused feature with respect to the text features;
multiplying the fusion features by the cosine evaluation scores to respectively obtain preliminary fusion features related to the image and the text features, wherein the preliminary fusion features form final fusion features after passing through a normalization layer and a full-connection layer:
Fusion_to_image=dn(In(F*Score_im)) (3)
Fusion_to_text=dn(Ln(F*Score_txt)) (4)
wherein Fusion_to_image represents the fused feature for the image, and Fusion_to_text represents the fused feature for the text.
Further, the fused feature for the image and the fused feature for the text are concatenated with the image features and the text features respectively, and the embedded features are obtained through one fully connected layer and one normalization layer:
EmbeddingText=Ln(dn(Fusion_to_text||Y)) (5)
EmbeddingImage=In(dn(Fusion_to_image||X)) (6)
wherein EmbeddingText represents the text embedding feature, and EmbeddingImage represents the image embedding feature.
Further, after the feature embedding is completed, model optimization is performed, and the gradient is updated alternately using the fusion loss function and the original loss function, so that the fusion loss can descend continuously and effectively, specifically comprising:
the fusion loss function is denoted FL(x_i, y_i) and the original loss function is denoted OL(x_i, y_i); the gradient used at training step t is expressed as:
g_t = ∇FL(x_i, y_i), when t is even
g_t = ∇OL(x_i, y_i), when t is odd
wherein ∇FL(x_i, y_i) represents the gradient of the fusion loss, ∇OL(x_i, y_i) represents the gradient of the original loss, and t is the number of iteration steps during training; when the step number is even, the fusion loss function is used; when the step number is odd, the original loss function is used; the original loss is composed of image features and text features directly or indirectly related to the common space, the fusion loss is composed of the finally embedded features, and the update driven by the fusion loss is expressed as a gradient-descent step:
θ_{t+1} = θ_t - η·∇FL(x_i, y_i)
wherein θ_t denotes the network parameters at step t and η denotes the learning rate.
the invention has the beneficial effects that:
the present invention uses a fusion layer to reduce the difference between the two modality data of image and text and respect the respective characteristics of the image feature and the text feature. Extracting the relation between image features and text features based on a fusion layer, emphasizing the difference between matched features and unmatched features, extracting the image features by using a Faster R-CNN and ResNet-101 model, and extracting the text features by using a bi-GRU; then, the image features and the text features are input into the fusion layer to extract the fusion features and embed the image features and the text features, finally, a unique gradient updating method is designed to optimize the original triple loss function and the triple loss function after fusion, the mode belongs to an alternative updating strategy, and experimental cases prove the effectiveness of the method.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of the matching system of the present invention;
FIG. 3 is a schematic view of the structure of the fusion layer of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings and specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and these equivalents also fall within the scope of the present application.
FIGS. 1, 2 and 3 show the flow and system structure of the image-text matching method for improving alternate updating of a fusion layer and a loss function provided by the invention.
In the image-text matching method, the essence of constructing a reasonable common space is to construct a multimodal feature representation space in which the distance between mutually matching features is small and the distance between unmatched features is large. In order to establish such a common space, the invention designs a fusion layer. The fusion layer first extracts fused features between the image data and the text data, which include the relationship features between the image and the text. The image or text features are then re-encoded according to their distance from the fused features, so that in the encoding stage the distance between matched features is smaller than the distance between unmatched features.
In order to address the computational complexity of the embedding approach, the invention designs a new and simple loss-function update method, which makes the network model easier to train and improves the robustness of the algorithm. Specifically, the invention retains both the original embedded features obtained without the fusion layer and the fused embedded features produced after the fusion layer. Although the fused embedded features describe the data more accurately than the original embedded features, they depend on three kinds of features, whereas the original features depend on only two; from the neural-network perspective, the triplet ranking loss composed of the fused features (the fusion loss) is therefore more complicated than the triplet ranking loss composed of the original features (the original loss). The invention therefore updates alternately with the fusion loss and the original loss, so that the complex problem is better solved with the help of the simpler one.
The fusion layer and loss function optimization method of the present invention will be described in detail below.
The present invention fuses the text features and the image features using a fusion layer to form the fused features, and then embeds and encodes the image features and the text features according to the similarity of the fused features to the image features and the text features respectively. Through the fused features, the gap between the multimodal data is reduced and a reasonable common space is established. Finally, the invention optimizes the original loss and the fusion loss through an alternating update strategy; the model structure of the invention is shown in FIG. 2.
FIG. 2 presents an overview of the model. The invention embeds image features and text features using a fusion layer and alternately updates the network with two loss functions.
In the feature extraction stage, the invention adopts the Faster R-CNN and ResNet-101 models to extract image features and uses a bi-GRU to extract text features (an illustrative text-encoder sketch is given below). The invention then inputs the image features and the text features into the fusion layer to extract the fused features and to embed the image features and the text features. The detailed structure of the fusion layer is shown in FIG. 3.
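The patent does not give the extraction code; purely as an illustration, a minimal bi-GRU sentence encoder of the kind referred to above might look as follows. The vocabulary size, embedding dimension, and the averaging of the forward and backward GRU states are assumptions, not details fixed by the invention.

```python
import torch
import torch.nn as nn


class BiGRUTextEncoder(nn.Module):
    """Encodes a tokenized sentence into per-word features Y = {y_1, ..., y_m}."""

    def __init__(self, vocab_size, word_dim=300, feat_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        # bidirectional GRU; forward and backward states are averaged so that
        # each word feature keeps dimension feat_dim
        self.gru = nn.GRU(word_dim, feat_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):               # tokens: (B, m) word indices
        x = self.embed(tokens)               # (B, m, word_dim)
        out, _ = self.gru(x)                 # (B, m, 2 * feat_dim)
        fwd, bwd = out.chunk(2, dim=-1)
        return (fwd + bwd) / 2               # (B, m, feat_dim) word encodings
```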
FIG. 3 shows the structure of the fusion layer. This layer performs both the extraction of the fused features and the embedding of the features. Concat denotes the feature concatenation operation and Sim denotes the cosine similarity.
The present invention represents the image feature set as
X = {x_1, x_2, ..., x_n}
wherein x_i represents the encoding of a region within the image, and the text feature set as
Y = {y_1, y_2, ..., y_m}
wherein y_i represents the encoding of a word in the sentence. The present invention concatenates the image features and the text features to generate the fused feature
F = X || Y
The invention then needs to embed the image features and the text features based on the fused feature. The aim is not only to narrow the gap between the multimodal data but also to preserve the expressive characteristics of the image features and the text features. Therefore, the present invention determines the effect of the fused feature on the image and text features through the Cosine Evaluation Score (CES):
Score_im = cos(F, X)    (1)
Score_txt = cos(F, Y)    (2)
score im represents CES with fused features to image features and Score txt represents CES with fused features to text features. The fused features are then multiplied by CES to obtain preliminary fused features for the image and text features, respectively. The preliminary fused features form the final fused features of the invention after passing through the normalization layer and the full link layer:
Fusion_to_image=dn(In(F*Score_im)) (3)
Fusion_to_text=dn(Ln(F*Score_txt)) (4)
the present invention represents the fused features of the image as Fusion _ to _ image and the text as Fusion _ to _ text to prevent the gradient from disappearing, the present invention uses normalization layers, i.e., L ayeernorm (L n) and instancenorm (in)
EmbeddingText=Ln(dn(Fusion_to_text||Y)) (5)
EmbeddingImage=In(dn(Fusion_to_image||X)) (6)
EmbeddingText denotes the embedded text features and EmbeddingImage denotes the embedded image features.
Finally, Fusion_to_text and Fusion_to_image are concatenated with the text feature (Y) and the image feature (X) respectively, and the embedded features of the invention are obtained through one fully connected layer and one normalization layer, as in equations (5) and (6). A minimal sketch of the full fusion layer is given below.
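The following sketch is illustrative only. It assumes the image and text features have already been pooled to single vectors of a common dimension d, that the concatenated feature F is projected back to d by a linear layer before the cosine evaluation scores are computed (the patent does not spell out these dimensionalities), and it uses LayerNorm for both branches instead of InstanceNorm for the image branch, purely to keep the example simple.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionLayer(nn.Module):
    """Sketch of the fusion layer: concatenate, score, re-weight, re-embed."""

    def __init__(self, dim=1024):
        super().__init__()
        self.reduce = nn.Linear(2 * dim, dim)      # project F = X || Y back to dim (assumption)
        self.fc_img = nn.Linear(dim, dim)          # dn(...) for the image branch
        self.fc_txt = nn.Linear(dim, dim)          # dn(...) for the text branch
        self.fc_emb_img = nn.Linear(2 * dim, dim)  # embedding after re-concatenation
        self.fc_emb_txt = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)              # stands in for Ln / In

    def forward(self, x, y):                       # x, y: (B, dim) pooled features
        f = self.reduce(torch.cat([x, y], dim=-1))               # fused feature F

        score_im = F.cosine_similarity(f, x, dim=-1).unsqueeze(-1)   # eq. (1)
        score_txt = F.cosine_similarity(f, y, dim=-1).unsqueeze(-1)  # eq. (2)

        fusion_to_image = self.fc_img(self.norm(f * score_im))   # eq. (3)
        fusion_to_text = self.fc_txt(self.norm(f * score_txt))   # eq. (4)

        emb_text = self.norm(self.fc_emb_txt(torch.cat([fusion_to_text, y], -1)))    # eq. (5)
        emb_image = self.norm(self.fc_emb_img(torch.cat([fusion_to_image, x], -1)))  # eq. (6)
        return emb_image, emb_text
```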
In the optimization stage, the optimization problem of the neural network often becomes very difficult due to the presence of saddle points and local minima. The fusion loss is denoted FL(x_i, y_i) and the original loss is denoted OL(x_i, y_i). The gradient used by the invention at training step t can then be expressed as:
g_t = ∇FL(x_i, y_i), when t is even
g_t = ∇OL(x_i, y_i), when t is odd
where ∇FL(x_i, y_i) represents the gradient of the fusion loss, ∇OL(x_i, y_i) represents the gradient of the original loss, and t is the number of iteration steps during training. When the step number is even, the invention uses the fusion loss; when the step number is odd, the invention uses the original loss. The fusion loss is composed of the features that the invention finally embeds and is directly related to the composition of the common space of the invention. The original loss is composed of image features and text features that are directly or indirectly related to the common space of the invention. In effect there are two loss functions describing the composition of the same common space, but from different angles. The update driven by the fusion loss can be expressed as a gradient-descent step:
θ_{t+1} = θ_t - η·∇FL(x_i, y_i)
where θ_t denotes the network parameters at step t and η denotes the learning rate.
the loss function optimization method of the invention can enable the fusion loss to use the gradient information of the original loss when gradient descent is executed. Since both penalty functions describe the same problem, their optimization directions (i.e. gradient descent directions) are the same and the weight parameters can also be shared. Therefore, the performance of the model of the invention can be improved more stably during the training process.
The method provided by the invention is verified on Flickr30K and MSCOCO data sets.
The Flickr30K and MSCOCO datasets are widely used for image text matching and image retrieval tasks.
Flickr30K contains 31,000 images collected from the Flickr website, each with five corresponding sentences. The invention uses 1,000 images for validation, 1,000 images for testing, and the rest for training. MS-COCO contains 123,287 images, each corresponding to five textual descriptions; 113,287 images are used as the training set, 5,000 as the validation set and 5,000 as the test set. Experiments demonstrate that the method of the present invention has certain advantages over the traditional methods.
The task of image-text matching concerns the similarity between matching samples and the difference between non-matching samples, so the present invention focuses on the relationship between image and text features. The present invention uses a fusion layer to reduce the difference between the image and text modality data while respecting the respective characteristics of the image features and the text features. The relationship between the image features and the text features is extracted through the fusion layer, and the difference between matched and unmatched features is emphasized; the image features are extracted with the Faster R-CNN and ResNet-101 models and the text features with a bi-GRU. The image features and the text features are then input into the fusion layer to extract the fused features and to embed the image features and the text features. Finally, the invention designs a dedicated gradient update method to optimize the original triplet loss function and the post-fusion triplet loss function, which constitutes an alternating update strategy. Results on the Flickr30K and MSCOCO datasets show the superiority of the invention.

Claims (6)

1. An image-text matching method for improving alternate updating of a fusion layer and a loss function, characterized by comprising the following steps:
establishing a fusion layer, and extracting fused features between the image feature data and the text feature data through the fusion layer;
embedding and encoding the image features and the text features based on the similarity of the fused features to the image features and the text features respectively;
and optimizing the original loss and the fusion loss through an alternating update strategy based on an original loss function composed of the original features and a fusion loss function composed of the fused features.
2. The image-text matching method for improving alternate updating of a fusion layer and a loss function according to claim 1, wherein the extracting, through the fusion layer, of fused features between the image feature data and the text feature data comprises:
extracting image features based on a Faster R-CNN model and a ResNet-101 model, and extracting text features based on a bi-GRU model.
3. The image-text matching method for improving alternate updating of a fusion layer and a loss function according to claim 1, wherein the embedding and encoding of the image features and the text features based on the similarity of the fused features to the image features and the text features respectively comprises:
re-encoding the image features or the text features based on the distance of the image features and the text features from the fused features, such that in the encoding stage the distance between matched features is smaller than the distance between unmatched features.
4. The image-text matching method for improving alternate updating of a fusion layer and a loss function according to claim 1, wherein the extraction of the fused features and the embedding of the features specifically comprise:
characterizing the image features as
X = {x_1, x_2, ..., x_n}
wherein x_i represents the encoding of a region within the image; representing the text features as
Y = {y_1, y_2, ..., y_m}
wherein y_i represents the encoding of a word in the sentence; and concatenating the image features and the text features to generate the fused feature:
F = X || Y
determining the effect of the fused feature on the image features and the text features based on cosine evaluation scores:
Score_im = cos(F, X)    (1)
Score_txt = cos(F, Y)    (2)
wherein Score_im represents the cosine evaluation score of the fused feature with respect to the image features, and Score_txt represents the cosine evaluation score of the fused feature with respect to the text features;
multiplying the fusion features by the cosine evaluation scores to respectively obtain preliminary fusion features related to the image and the text features, wherein the preliminary fusion features form final fusion features after passing through a normalization layer and a full-connection layer:
Fusion_to_image=dn(In(F*Score_im)) (3)
Fusion_to_text=dn(Ln(F*Score_txt)) (4)
wherein Fusion_to_image represents the fused feature for the image, and Fusion_to_text represents the fused feature for the text.
5. The image-text matching method for improving alternate updating of a fusion layer and a loss function according to claim 4, wherein the fused feature for the image and the fused feature for the text are concatenated with the image features and the text features respectively, and the embedded features are obtained through one fully connected layer and one normalization layer:
EmbeddingText=Ln(dn(Fusion_to_text||Y)) (5)
EmbeddingImage=In(dn(Fusion_to_image||X)) (6)
wherein EmbeddingText represents the text embedding feature, and EmbeddingImage represents the image embedding feature.
6. The image-text matching method for improving alternate updating of a fusion layer and a loss function according to claim 5, wherein after the feature embedding is completed, model optimization is performed, and the gradient is updated alternately using the fusion loss function and the original loss function, so that the fusion loss can descend continuously and effectively, specifically comprising:
the fusion loss function is denoted FL(x_i, y_i) and the original loss function is denoted OL(x_i, y_i); the gradient used at training step t is expressed as:
g_t = ∇FL(x_i, y_i), when t is even
g_t = ∇OL(x_i, y_i), when t is odd
wherein ∇FL(x_i, y_i) represents the gradient of the fusion loss, ∇OL(x_i, y_i) represents the gradient of the original loss, and t is the number of iteration steps during training; when the step number is even, the fusion loss function is used; when the step number is odd, the original loss function is used; the original loss is composed of image features and text features directly or indirectly related to the common space, the fusion loss is composed of the finally embedded features, and the update driven by the fusion loss is expressed as a gradient-descent step:
θ_{t+1} = θ_t - η·∇FL(x_i, y_i)
wherein θ_t denotes the network parameters at step t and η denotes the learning rate.
CN202010236904.XA 2020-03-30 2020-03-30 Image-text matching method for improving alternate updating of fusion layer and loss function Active CN111428801B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010236904.XA CN111428801B (en) 2020-03-30 2020-03-30 Image-text matching method for improving alternate updating of fusion layer and loss function


Publications (2)

Publication Number Publication Date
CN111428801A (en) 2020-07-17
CN111428801B CN111428801B (en) 2022-09-27

Family

ID=71551678

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010236904.XA Active CN111428801B (en) 2020-03-30 2020-03-30 Image-text matching method for improving alternate updating of fusion layer and loss function

Country Status (1)

Country Link
CN (1) CN111428801B (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080002893A1 (en) * 2006-06-29 2008-01-03 Luc Vincent Recognizing text in images
US20160342895A1 (en) * 2015-05-21 2016-11-24 Baidu Usa Llc Multilingual image question answering
CN107871014A (en) * 2017-11-23 2018-04-03 清华大学 A kind of big data cross-module state search method and system based on depth integration Hash
CN109215097A (en) * 2018-08-08 2019-01-15 深圳市唯特视科技有限公司 A kind of single image text condition embedding grammar based on end to end joint study
CN109711529A (en) * 2018-11-13 2019-05-03 中山大学 A kind of cross-cutting federal learning model and method based on value iterative network
CN110147457A (en) * 2019-02-28 2019-08-20 腾讯科技(深圳)有限公司 Picture and text matching process, device, storage medium and equipment
CN110222560A (en) * 2019-04-25 2019-09-10 西北大学 A kind of text people search's method being embedded in similitude loss function
CN110298395A (en) * 2019-06-18 2019-10-01 天津大学 A kind of picture and text matching process based on three mode confrontation network
CN110298338A (en) * 2019-06-20 2019-10-01 北京易道博识科技有限公司 A kind of file and picture classification method and device
CN110472002A (en) * 2019-08-14 2019-11-19 腾讯科技(深圳)有限公司 A kind of text similarity acquisition methods and device
CN110889003A (en) * 2019-11-20 2020-03-17 中山大学 Vehicle image fine-grained retrieval system based on text
CN110909673A (en) * 2019-11-21 2020-03-24 河北工业大学 Pedestrian re-identification method based on natural language description

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
DEPENG WANG et al.: "Fusion layer attention for image-text matching", Neurocomputing *
KUANG-HUEI LEE et al.: "Stacked Cross Attention for Image-Text Matching", European Conference on Computer Vision *
NIDHI GOEL et al.: "Weighted semantic fusion of text and content for image retrieval", IEEE *
WANG, T. et al.: "Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking", Proceedings of the 27th ACM International Conference on Multimedia *
YU Liyan (於利艳): "Research on image-text matching methods based on deep learning", China Masters' Theses Full-text Database, Information Science and Technology *
WANG Depeng (王德鹏): "Research on image-text matching algorithms", China Masters' Theses Full-text Database, Information Science and Technology *
HAO Zhifeng et al. (郝志峰等): "Multi-level image feature fusion algorithm for image-text matching tasks", Application Research of Computers *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861882A (en) * 2021-03-10 2021-05-28 齐鲁工业大学 Image-text matching method and system based on frequency self-adaption
CN113342168A (en) * 2021-06-10 2021-09-03 中国水利水电第七工程局有限公司 Multi-mode intelligent large-scale equipment mounting and dismounting training system
CN113342168B (en) * 2021-06-10 2023-09-22 中国水利水电第七工程局有限公司 Multi-mode intelligent large-scale equipment installation and disassembly training system

Also Published As

Publication number Publication date
CN111428801B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
WO2021114745A1 (en) Named entity recognition method employing affix perception for use in social media
CN111160264B (en) Cartoon character identity recognition method based on generation countermeasure network
CN113065577A (en) Multi-modal emotion classification method for targets
CN110298395B (en) Image-text matching method based on three-modal confrontation network
CN111177366A (en) Method, device and system for automatically generating extraction type document abstract based on query mechanism
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN111428801B (en) Image-text matching method for improving alternate updating of fusion layer and loss function
CN115982350A (en) False news detection method based on multi-mode Transformer
CN115116066A (en) Scene text recognition method based on character distance perception
CN114791958B (en) Zero sample cross-modal retrieval method based on variational self-encoder
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN115687571A (en) Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
CN116450834A (en) Archive knowledge graph construction method based on multi-mode semantic features
CN115238690A (en) Military field composite named entity identification method based on BERT
CN114969458A (en) Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance
CN114817596A (en) Cross-modal image-text retrieval method integrating semantic similarity embedding and metric learning
CN113807307A (en) Multi-mode joint learning method for video multi-behavior recognition
CN116561592B (en) Training method of text emotion recognition model, text emotion recognition method and device
CN115828931B (en) Chinese and English semantic similarity calculation method for paragraph level text
CN114416914B (en) Processing method based on picture question and answer
CN115984842A (en) Multi-mode-based video open tag extraction method
CN113902764A (en) Semantic-based image-text cross-modal retrieval method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant