CN114743029A - Image text matching method - Google Patents

Image text matching method Download PDF

Info

Publication number
CN114743029A
CN114743029A (application CN202210394752.5A)
Authority
CN
China
Prior art keywords
image
text
similarity
matching
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210394752.5A
Other languages
Chinese (zh)
Inventor
乔壮
金福生
袁野
王国仁
马波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210394752.5A priority Critical patent/CN114743029A/en
Publication of CN114743029A publication Critical patent/CN114743029A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an image-text matching method comprising the following steps: extracting multi-scale image features and text features with deep neural networks; computing global similarity at multiple scales using a self-attention mechanism; performing node-level matching and structure-level matching, and computing multi-scale local similarity from the structure-level matching result; computing the final total similarity from the computed multi-scale global and local similarities; training the model with supervised learning and updating the model parameters according to a loss function to obtain a trained model; and, with the trained model, computing the final total similarity of an input image and text and obtaining the matched text or image accordingly. Once the network is trained, the method finds matching images or texts more quickly, and obtains more accurate results by exploiting the multi-scale characteristics of the raw data and a more comprehensive similarity measure.

Description

Image text matching method
Technical Field
The invention relates to the technical field of intelligent matching, in particular to a method for matching one modality to the other in multimodal image-text data.
Background
In daily life, and especially when accessing Internet resources, images and text are two very important modalities. Image search can rarely do without a textual description, and matching an image to text with similar semantics also helps guide the search for similar images, so image-text matching — a method for computing the similarity between an image and a text — is very important. Traditional methods must store large amounts of relational data in addition to the data itself, cope poorly with dynamically changing data, and are slow at each computation, which makes them unsuitable for scenarios with real-time requirements. Deep-learning-based image-text matching extracts features from the image and the text separately, aligns them, and then computes their similarity; after supervised training on a large dataset, the resulting model can quickly and accurately find the text matching an image, or the image matching a text, in the data to be processed. Many deep-learning techniques for image-text matching already exist, but they generally fail to account for both global and local matching, and pay little attention to the case where the key targets in an image differ greatly in size.
Therefore, how to provide an image-text matching method with multi-scale features and a more comprehensive similarity measure is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a method, based on deep learning and multi-modal learning, that extracts image features and text features and performs both global and local similarity matching.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for matching image texts comprises a model building step and a model matching step:
the step of constructing the model comprises:
s1, extracting multi-scale image features and text features respectively by using a deep neural network, wherein the image and the text are data with matching labels;
s2, calculating the global similarity of the image multi-scale features and the text multi-scale features in multiple scales by using a self-attention mechanism method;
s3, sequentially performing node level matching and structure level matching between the image multi-scale features and the text multi-scale features, and calculating local similarity of multiple scales by using a structure level matching result;
s4, calculating the final total similarity according to the calculated global similarity and the local similarity of the scales;
s5, repeating S2-S4 to train the supervised learning model, and updating model parameters according to the loss function to obtain a trained model;
model matching:
and S6, calculating the final total similarity of the input image and the text by using the trained model, and obtaining the matched text or image according to the final total similarity.
Preferably, the S1 includes the image multi-scale feature extraction step:
inputting the image into a Faster R-CNN model, and identifying to obtain detection frames of a plurality of image areas;
coding the image region corresponding to the detection frame by using a ResNet model to obtain the extracted visual features;
and passing the visual features through fully connected layers with output dimensionalities of N scales, respectively; the output matrices are the multi-scale image features, N ∈ ℕ+.
Preferably, the S1 includes a text multi-scale feature extraction step:
performing word-level tokenization on the sentence by assigning each word a unique index, to obtain a sentence vector;
embedding the sentence vectors into a multidimensional vector space and then obtaining text feature vectors through bidirectional GRUs; there are N bidirectional GRUs, which output text feature vectors at N scales.
Preferably, the node-level matching in S3 is a step of matching the image with a word in a sentence, and includes:
fusing the image features T with the text features V and computing a softmax function along the visual axis to obtain the similarity matrix W = softmax(λTVᵀ), where λ is a temperature parameter;
integrating all image features into weighted image features by taking the similarity matrix as a weight parameter;
and calculating the similarity of the weighted image features and the text features by using a cosine similarity measurement method to obtain a node-level matching result.
Preferably, the structure-level matching in S3 is a step of matching the image with a phrase in a sentence, and includes:
analyzing semantic dependency among words in a text sentence, constructing text characteristics into an undirected sparse graph, and recording word nodes with dependency as adjacent nodes;
calculating to obtain a similarity matrix between word representations, and obtaining the weight of edges in the undirected sparse graph;
and integrating the neighborhood matching vectors represented by the similarity matrix by using a graph convolution neural network, and updating the nodes in the similarity matrix.
Preferably, in S3, calculating the local similarity using the structure-level matching result specifically comprises: feeding the output of the graph convolutional neural network into a fully connected layer and taking the ℓ2 norm to obtain the local similarity.
Preferably, the S4 includes:
splicing the global similarity and the local similarity into a matrix, and sending the matrix into a full-connection layer to obtain the total similarity under one scale;
and carrying out weighted summation on the total similarity under multiple scales to obtain the final total similarity.
Preferably, the model output of S6 is: a matrix whose number of rows equals the number of images (or texts) and whose number of columns equals the number of texts (or images); each value in the matrix is the computed similarity between the image/text of its row and the text/image of its column.
Through the technical scheme, compared with the prior art, the invention has the beneficial effects that:
the method can more quickly find out the matched image or text on the premise of finishing the network training; compared with the existing deep learning technology, the method can obtain more accurate results by utilizing the multi-scale features of the original data and a more comprehensive similarity measurement method.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort;
fig. 1 is a flowchart of a method for matching image texts according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of training data provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of image region target detection according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the results of matching texts to an image, provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of the results of matching images to a text, provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to the attached figure 1, the invention provides an image text matching method, which is based on multi-scale feature extraction and multi-scale similarity calculation and specifically comprises a model building step and a model matching step:
the model building step comprises:
and S1, extracting multi-scale image features and text features respectively by using the deep neural network, wherein the image and the text are data with matching labels.
In a specific implementation of this step, images and texts are labeled as matching in order to build a training dataset: the dataset pairs each training image with training texts that are semantically related to it and is used for the supervised training of the model. Each image has 5 descriptive sentences; an example of corresponding images and texts is shown in FIG. 2.
In one embodiment, the image multi-scale feature extraction step is as follows:
inputting the image into a Faster R-CNN model, and identifying to obtain a plurality of detection frames of the image area;
coding the image area corresponding to the detection frame by using a ResNet model to obtain the extracted visual characteristics;
and passing the visual features through fully connected layers with output dimensionalities of N scales, respectively; the output matrices are the multi-scale image features, N ∈ ℕ+.
In a specific execution: for image features, 36 image regions containing entities and attributes are identified with a Faster R-CNN model pre-trained on Visual Genome. As shown in FIG. 3, the upper two sub-images are the original image and all identified image regions, and the lower six sub-images are six samples selected from them; the identified regions include entities such as shadows and hair, and attributes such as colors. The image regions corresponding to the 36 detection frames are then encoded with a ResNet model to obtain the extracted visual features. To obtain multi-scale features, the visual features are passed through fully connected layers with output dimensions of 128, 256, 512 and 1024 (i.e., four scales), respectively; the output matrices are the multi-scale image features.
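The following is a minimal PyTorch sketch of this multi-scale projection step; it assumes the 36 region features from the detector backbone are already available as a tensor, and the class name, variable names and the 2048-dimensional input are illustrative assumptions rather than values specified by the patent.

```python
import torch
import torch.nn as nn

class MultiScaleImageProjector(nn.Module):
    """Project per-region visual features (e.g. 2048-d ResNet codes for 36
    detected regions) into several scales, one fully connected layer per scale."""

    def __init__(self, in_dim=2048, scales=(128, 256, 512, 1024)):
        super().__init__()
        self.projections = nn.ModuleList([nn.Linear(in_dim, d) for d in scales])

    def forward(self, region_feats):
        # region_feats: (batch, 36, in_dim) features from Faster R-CNN + ResNet
        return [proj(region_feats) for proj in self.projections]

# usage: four tensors of shape (batch, 36, 128 / 256 / 512 / 1024)
multi_scale_feats = MultiScaleImageProjector()(torch.randn(2, 36, 2048))
```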
In one embodiment, the text multi-scale feature extraction step is as follows:
performing word-level tokenization on the sentence by assigning each word a unique index, to obtain a sentence vector;
embedding the sentence vectors into a multidimensional vector space and then obtaining text feature vectors through a bidirectional GRU; the number of the bidirectional GRUs is N, and text feature vectors of N scales are output.
In a specific execution: for text features, the sentence first needs to be tokenized at the word level. All words in the dataset are counted, each word is given a unique index, and a word-index table is built. A sentence can then be encoded as a matrix of indices by replacing its words with their indices; because an RNN is used later, start and end tokens must be added to the beginning and end of each sentence. Each sentence vector is embedded into a 300-dimensional vector space and passed through a bidirectional GRU (gated recurrent unit), which produces two encoded vectors; the text feature vector is obtained by averaging the two. To match the multi-scale image features, there are 4 bidirectional GRU networks, again with output dimensions of 128, 256, 512 and 1024.
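A minimal PyTorch sketch of this text-encoding step follows; the class name and the toy vocabulary size are illustrative assumptions, and the averaging of the two GRU directions follows the description above.

```python
import torch
import torch.nn as nn

class MultiScaleTextEncoder(nn.Module):
    """Embed word indices into a 300-d space, run one bidirectional GRU per
    scale, and average the forward and backward outputs."""

    def __init__(self, vocab_size, embed_dim=300, scales=(128, 256, 512, 1024)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.grus = nn.ModuleList(
            [nn.GRU(embed_dim, d, batch_first=True, bidirectional=True) for d in scales]
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) word indices, including start/end tokens
        emb = self.embedding(token_ids)
        feats = []
        for gru in self.grus:
            out, _ = gru(emb)                 # (batch, seq_len, 2 * d)
            fwd, bwd = out.chunk(2, dim=-1)   # split the two directions
            feats.append((fwd + bwd) / 2)     # average them -> (batch, seq_len, d)
        return feats

# usage: four tensors of shape (batch, seq_len, 128 / 256 / 512 / 1024)
feats = MultiScaleTextEncoder(vocab_size=10000)(torch.randint(0, 10000, (2, 12)))
```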
And S2, calculating the global similarity of the image multi-scale features and the text multi-scale features in multiple scales by using a self-attention mechanism method.
In one embodiment, for the visual and textual features, global feature vectors are first computed separately using the self-attention mechanism. The most common way to compute the similarity of two vectors is the cosine similarity measure, but to let the model learn more complex matches a parameterized similarity is used instead:

s(x, y) = W|x − y|² / ‖W|x − y|²‖₂

where x and y represent the visual and textual features respectively, W is a learnable parameter, |·|² denotes the element-wise square, and ‖·‖₂ denotes the ℓ2 norm; the result is the global similarity.
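A sketch of this parameterized similarity is given below, under the assumption that it is vector-valued (consistent with the (1, n) global-similarity vector spliced in step S4); the output dimension sim_dim and the class name are illustrative choices, not values from the patent.

```python
import torch
import torch.nn as nn

class ParamSimilarity(nn.Module):
    """Learnable similarity s(x, y) = W|x - y|^2 / ||W|x - y|^2||_2:
    element-wise squared difference, a linear map W, then l2 normalization."""

    def __init__(self, feat_dim, sim_dim=256):
        super().__init__()
        self.W = nn.Linear(feat_dim, sim_dim, bias=False)

    def forward(self, x, y):
        diff_sq = (x - y).pow(2)      # element-wise square of the difference
        s = self.W(diff_sq)
        return s / s.norm(p=2, dim=-1, keepdim=True).clamp(min=1e-8)

# usage: global similarity vector for one visual / textual feature pair
sim = ParamSimilarity(feat_dim=1024)(torch.randn(1024), torch.randn(1024))
```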
And S3, sequentially carrying out node level matching and structure level matching between the image multi-scale features and the text multi-scale features, and calculating the local similarity of multiple scales by using the structure level matching result.
The local similarity calculation can be divided into two parts of node level matching and structure level matching.
In one embodiment, the node level matching is a step of matching an image with a word in a sentence, including:
fusing the image features T with the text features V and computing a softmax function along the visual axis to obtain the similarity matrix W = softmax(λTVᵀ), where λ is a temperature parameter; the resulting similarity matrix measures how similar each image feature is to each text feature;
integrating all image features into weighted image features, using the similarity matrix as the weights;
and calculating the similarity of the weighted image features and the text features with a cosine similarity measure to obtain the node-level matching result (see the sketch below).
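A minimal sketch of the node-level matching described above; the temperature value lam is only a placeholder, and the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def node_level_match(img_feats, txt_feats, lam=9.0):
    """Attention over image regions for each word, then cosine similarity
    between the attended (weighted) image features and the word features."""
    # img_feats: (n_regions, dim), txt_feats: (n_words, dim)
    attn = F.softmax(lam * txt_feats @ img_feats.t(), dim=-1)    # softmax along the visual axis
    weighted_img = attn @ img_feats                              # (n_words, dim) weighted image features
    return F.cosine_similarity(weighted_img, txt_feats, dim=-1)  # node-level matching per word

# usage with 36 region features and a 12-word sentence
scores = node_level_match(torch.randn(36, 1024), torch.randn(12, 1024))
```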
In one embodiment, the structure-level matching is a step of matching the image with phrases in the sentence, and the structure-level matching takes the result of the node-level matching as input to learn fine-grained phrase correspondence. The method comprises the following steps:
Semantic dependencies among the words of a sentence are analyzed with the Stanford CoreNLP toolkit, the text features are constructed into an undirected sparse graph, and word nodes with a dependency relation are recorded as adjacent nodes; the adjacency is represented with an adjacency matrix in which each node is also adjacent to itself.
The similarity matrix between word representations is computed with the softmax function and denoted s_ij = softmax(λu_iu_jᵀ), where u_i and u_j are the representations of words i and j; this yields the weights of the edges in the undirected sparse graph, which are obtained via the ℓ2 norm;
and integrating the neighborhood matching vectors represented by the similarity matrix by using a graph convolution neural network, and updating the nodes in the similarity matrix.
In one embodiment, calculating the local similarity from the structure-level matching result specifically comprises: feeding the output of the graph convolutional neural network into a fully connected layer and taking the ℓ2 norm to obtain the local similarity (see the sketch below).
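A sketch of the structure-level matching and local-similarity computation; it assumes the dependency adjacency matrix (with self-loops) has already been built from the parse, that a single GCN layer is used, and that the final ℓ2 operation is a normalization — all of which are assumptions where the patent text leaves details open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureLevelMatch(nn.Module):
    """One graph-convolution step over the word dependency graph, then a fully
    connected layer and an l2 operation to produce the local similarity."""

    def __init__(self, dim, sim_dim=256):
        super().__init__()
        self.gcn = nn.Linear(dim, dim)    # weight of a single GCN layer
        self.fc = nn.Linear(dim, sim_dim)

    def forward(self, node_vecs, adj, lam=9.0):
        # node_vecs: (n_words, dim) node-level matching vectors
        # adj: (n_words, n_words) 0/1 dependency adjacency with self-loops
        edge = F.softmax(lam * node_vecs @ node_vecs.t(), dim=-1) * adj  # edge weights s_ij on the sparse graph
        updated = F.relu(self.gcn(edge @ node_vecs))                     # integrate neighbourhood matching vectors
        local = self.fc(updated)                                         # fully connected layer
        return local / local.norm(p=2, dim=-1, keepdim=True).clamp(min=1e-8)

# usage: 12-word sentence with a chain-plus-self-loop adjacency as a toy graph
adj = torch.eye(12) + torch.diag(torch.ones(11), 1) + torch.diag(torch.ones(11), -1)
local_sim = StructureLevelMatch(dim=1024)(torch.randn(12, 1024), adj)
```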
And S4, calculating the final total similarity according to the calculated global similarity and the local similarity of the multiple scales.
In one embodiment, the global similarity and the local similarity are concatenated into one matrix and fed into a fully connected layer to obtain the total similarity at one scale; the splicing rule is that a (1, n)-dimensional vector appended to an (n, n)-dimensional matrix yields an (n+1, n)-dimensional matrix. The total similarities at the multiple scales are then combined by weighted summation to obtain the final total similarity. In this embodiment the weights assigned to the four scales are 0.125, 0.25, 0.5 and 1, respectively.
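A sketch of how the per-scale total similarity and the final weighted sum might be computed; the reduction of the fully connected layer's output to a single score (a mean here) is an assumption, since the text only states that the spliced matrix is fed into a fully connected layer.

```python
import torch
import torch.nn as nn

class TotalSimilarity(nn.Module):
    """Concatenate the (1, n) global similarity with the (n, n) local similarity,
    feed the (n+1, n) matrix through a fully connected layer to get one score
    per scale, then take the weighted sum over the four scales."""

    def __init__(self, n, scale_weights=(0.125, 0.25, 0.5, 1.0)):
        super().__init__()
        self.fc = nn.Linear(n, 1)
        self.scale_weights = scale_weights

    def forward(self, per_scale):
        # per_scale: list of (global_sim (1, n), local_sim (n, n)) tuples, one per scale
        scores = []
        for g, l in per_scale:
            spliced = torch.cat([g, l], dim=0)      # (n + 1, n)
            scores.append(self.fc(spliced).mean())  # per-scale total similarity (reduction is an assumption)
        return sum(w * s for w, s in zip(self.scale_weights, scores))

# usage with n = 256 similarity dimensions and four scales
pairs = [(torch.randn(1, 256), torch.randn(256, 256)) for _ in range(4)]
final_sim = TotalSimilarity(n=256)(pairs)
```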
And S5, repeating S2-S4 to train the supervised learning model, and updating model parameters according to the loss function to obtain the trained model.
The loss function employed in this embodiment is a triplet loss. When the query is an image I, a matching text T is sampled from the matched texts and a non-matching text T' from the unmatched texts; symmetrically, when the query is a text T, a matching image I and a non-matching image I' are sampled. Minimizing the loss drives matched image-text pairs to high similarity scores and unmatched pairs to low scores; the loss function is designed as follows:
L = [α − S(I, T) + S(I, T′)]₊ + [α − S(I, T) + S(I′, T)]₊
S(·,·) denotes the similarity calculation model established by this embodiment of the invention, [·]₊ denotes the max(·, 0) operation, and α is a margin value of 0.2. The method is supervised learning, so a large amount of supervised data — images and texts with matching labels — is required during training. Each repetition of steps two to four processes a batch of data sized to fit the GPU memory of the training machine, and the model parameters are updated by back-propagating the loss each time; one pass of all data through the network is recorded as one epoch, and training stops when the specified number of epochs is reached. In this embodiment the number of images per training step is set to 64, training runs for 40 epochs in total, and a dynamic learning rate is used to prevent unstable learning: the learning rate is set to 0.002 for the first 30 epochs and 0.0002 for the last 10. After each epoch the model is validated on a small labeled dataset (the validation set), and when training finishes, the model that performs best on the validation set is selected as the final training result.
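A sketch of the triplet loss above; s_pos, s_neg_text and s_neg_image are assumed to be precomputed similarity scores for the matched pair (I, T), the pair (I, T') and the pair (I', T), and the batch averaging is an illustrative choice.

```python
import torch

def triplet_loss(s_pos, s_neg_text, s_neg_image, alpha=0.2):
    """[alpha - S(I,T) + S(I,T')]_+  +  [alpha - S(I,T) + S(I',T)]_+,
    averaged over the batch; alpha is the margin (0.2 in this embodiment)."""
    zeros = torch.zeros_like(s_pos)
    loss = (torch.maximum(alpha - s_pos + s_neg_text, zeros)
            + torch.maximum(alpha - s_pos + s_neg_image, zeros))
    return loss.mean()

# usage with toy batch similarity scores
loss = triplet_loss(torch.tensor([0.8, 0.6]),
                    torch.tensor([0.3, 0.5]),
                    torch.tensor([0.2, 0.4]))
```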
Model matching:
and S6, calculating the final total similarity of the input image and the text by using the trained model, and obtaining the matched text or image according to the final total similarity.
In one embodiment, the model is tested to simulate its behaviour in actual use. Images and texts outside the training set are selected as model input, and the model outputs a matrix whose number of rows equals the number of images (or texts) and whose number of columns equals the number of texts (or images); each value in the matrix is the computed similarity between the image/text of its row and the text/image of its column. To find the text that matches a given image, the row of the matrix corresponding to that image is sorted, and the text represented by the column with the largest value is the best match; the image matching a given text can be found in the same way.
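A small sketch of how the output similarity matrix could be used for retrieval; the function and variable names are illustrative.

```python
import torch

def retrieve_texts(sim_matrix, image_idx, top_k=5):
    """sim_matrix: (n_images, n_texts) model output; returns the indices of the
    top-k texts for one image. Text-to-image retrieval sorts a column instead."""
    return torch.topk(sim_matrix[image_idx], k=top_k).indices.tolist()

# usage with a toy 3-image x 6-text similarity matrix
sims = torch.rand(3, 6)
best_texts = retrieve_texts(sims, image_idx=0, top_k=2)
```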
In order to demonstrate the superiority of the matching effect of the invention, the following comparison is made with concrete image-text matching results:
FIG. 4 shows two examples of matching texts to an image; the five sentences are the texts computed to be most similar to the image. With the matching results of the invention, the correct text obtains the highest similarity for both images, and even the incorrect fifth sentence of the second image shows that the invention successfully captures the entity "dog".
FIG. 5 shows two examples of matching images to a text. Because each sentence in the dataset labels corresponds to only one image, only the image corresponding to the label is marked correct and the remaining four are marked incorrect; nevertheless, "toilet", "bat", "dirty" or "wire bottom" in the remaining images still correspond to the text.
The image-text matching method provided by the present invention is described in detail above. A specific example is used to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help understand the method and its core idea; meanwhile, for a person skilled in the art there may be variations in the specific embodiments and the scope of application according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined in this embodiment may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A method for matching image texts is characterized by comprising a model building step and a model matching step:
the step of constructing the model comprises:
s1, extracting multi-scale image features and text features respectively by using a deep neural network, wherein the image and the text are data with matching labels;
s2, calculating the global similarity of the image multi-scale features and the text multi-scale features in multiple scales by using a self-attention mechanism method;
s3, sequentially performing node level matching and structure level matching between the image multi-scale features and the text multi-scale features, and calculating local similarity of multiple scales by using a structure level matching result;
s4, calculating the final total similarity according to the calculated global similarity and the local similarity of the scales;
s5, repeating S2-S4 to train the supervised learning model, and updating model parameters according to the loss function to obtain a trained model;
model matching:
and S6, calculating the final total similarity of the input image and the text by using the trained model, and obtaining the matched text or image according to the final total similarity.
2. The method for image text matching according to claim 1, wherein said S1 includes the steps of image multi-scale feature extraction:
inputting the image into a Faster R-CNN model, and identifying to obtain detection frames of a plurality of image areas;
coding the image region corresponding to the detection frame by using a ResNet model to obtain the extracted visual features;
and passing the visual features through fully connected layers with output dimensionalities of N scales, respectively; the output matrices are the multi-scale image features, N ∈ ℕ+.
3. The method for image text matching according to claim 1, wherein said S1 includes a text multi-scale feature extraction step:
performing word-level tokenization on the sentence by assigning each word a unique index, to obtain a sentence vector;
embedding the sentence vectors into a multidimensional vector space and then obtaining text feature vectors through bidirectional GRUs; there are N bidirectional GRUs, which output text feature vectors at N scales.
4. The image text matching method according to claim 1, wherein the node level matching in S3 is a step of matching an image with a word in a sentence, and comprises:
fusing the image features T with the text features V and computing a softmax function along the visual axis to obtain the similarity matrix W = softmax(λTVᵀ), where λ is a temperature parameter;
integrating all image features into weighted image features by taking the similarity matrix as a weight parameter;
and calculating the similarity of the weighted image features and the text features by using a cosine similarity measurement method to obtain a node-level matching result.
5. The method for matching image texts according to claim 1, wherein the structure level matching in S3 is a step of matching an image with a phrase in a sentence, and includes:
analyzing semantic dependency among words in a text sentence, constructing text characteristics into an undirected sparse graph, and recording word nodes with dependency as adjacent nodes;
calculating to obtain a similarity matrix between word representations, and obtaining the weight of an edge in an undirected sparse graph;
and integrating the neighborhood matching vectors represented by the similarity matrix by using a graph convolution neural network, and updating the nodes in the similarity matrix.
6. The image text matching method according to claim 1, wherein calculating the local similarity using the structure-level matching result in S3 specifically comprises: feeding the output of the graph convolutional neural network into a fully connected layer and taking the ℓ2 norm to obtain the local similarity.
7. The method for image text matching according to claim 1, wherein said S4 includes:
splicing the global similarity and the local similarity into a matrix, and sending the matrix to a full connection layer to obtain the total similarity under one scale;
and carrying out weighted summation on the total similarity under multiple scales to obtain the final total similarity.
8. The method for image text matching according to claim 1, wherein the model output of S6 is: a matrix whose number of rows equals the number of images (or texts) and whose number of columns equals the number of texts (or images); each value in the matrix is the computed similarity between the image/text of its row and the text/image of its column.
CN202210394752.5A 2022-04-14 2022-04-14 Image text matching method Pending CN114743029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210394752.5A CN114743029A (en) 2022-04-14 2022-04-14 Image text matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210394752.5A CN114743029A (en) 2022-04-14 2022-04-14 Image text matching method

Publications (1)

Publication Number Publication Date
CN114743029A true CN114743029A (en) 2022-07-12

Family

ID=82281802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210394752.5A Pending CN114743029A (en) 2022-04-14 2022-04-14 Image text matching method

Country Status (1)

Country Link
CN (1) CN114743029A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115730878A (en) * 2022-12-15 2023-03-03 广东省电子口岸管理有限公司 Goods import and export inspection management method based on data identification
WO2024045866A1 (en) * 2022-08-31 2024-03-07 Huawei Technologies Co., Ltd. System and method for cross-modal interaction based on pre-trained model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061250A1 (en) * 2015-08-28 2017-03-02 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
US20210012150A1 (en) * 2019-07-11 2021-01-14 Xidian University Bidirectional attention-based image-text cross-modal retrieval method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061250A1 (en) * 2015-08-28 2017-03-02 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text
US20210012150A1 (en) * 2019-07-11 2021-01-14 Xidian University Bidirectional attention-based image-text cross-modal retrieval method
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024045866A1 (en) * 2022-08-31 2024-03-07 Huawei Technologies Co., Ltd. System and method for cross-modal interaction based on pre-trained model
CN115730878A (en) * 2022-12-15 2023-03-03 广东省电子口岸管理有限公司 Goods import and export inspection management method based on data identification
CN115730878B (en) * 2022-12-15 2024-01-12 广东省电子口岸管理有限公司 Cargo import and export checking management method based on data identification

Similar Documents

Publication Publication Date Title
WO2023093574A1 (en) News event search method and system based on multi-level image-text semantic alignment model
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN111401928B (en) Method and device for determining semantic similarity of text based on graph data
CN108628828A (en) A kind of joint abstracting method of viewpoint and its holder based on from attention
CN112487822A (en) Cross-modal retrieval method based on deep learning
CN112800292A (en) Cross-modal retrieval method based on modal specificity and shared feature learning
CN114743029A (en) Image text matching method
CN115080766B (en) Multi-modal knowledge graph characterization system and method based on pre-training model
US12032605B2 (en) Searchable data structure for electronic documents
CN109933792A (en) Viewpoint type problem based on multi-layer biaxially oriented LSTM and verifying model reads understanding method
CN113761890A (en) BERT context sensing-based multi-level semantic information retrieval method
CN112632250A (en) Question and answer method and system under multi-document scene
CN114254093A (en) Multi-space knowledge enhanced knowledge graph question-answering method and system
CN115422939A (en) Fine-grained commodity named entity identification method based on big data
CN116186256A (en) New concept sense original prediction method based on network embedding and pre-training model
CN115730232A (en) Topic-correlation-based heterogeneous graph neural network cross-language text classification method
CN113051886B (en) Test question duplicate checking method, device, storage medium and equipment
CN114036246A (en) Commodity map vectorization method and device, electronic equipment and storage medium
CN118093860A (en) Multi-level scientific research topic mining method based on text embedded vector clustering
CN116720519B (en) Seedling medicine named entity identification method
CN110334204B (en) Exercise similarity calculation recommendation method based on user records
CN116386895B (en) Epidemic public opinion entity identification method and device based on heterogeneous graph neural network
Xu et al. Research on depression tendency detection based on image and text fusion
CN116680407A (en) Knowledge graph construction method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination