CN110889505B - Cross-media comprehensive reasoning method and system for image-text sequence matching - Google Patents
- Publication number
- CN110889505B (application CN201911127259.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- cross
- reasoning
- media
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a cross-media comprehensive reasoning method and system for image-text sequence matching. The method comprises the following steps: 1. Establish a cross-media implication reasoning data set in which the premise comprises two media types, image and text, and the conclusion is text. 2. Train the image-text sequence matching depth network on the cross-media implication reasoning data set, using a cross-task, cross-media migration training method during training. 3. Perform reasoning with the trained image-text sequence matching network and judge the implication relation between the given premise and conclusion. Compared with the prior art, the method can perform implication reasoning from image and text premises and migrates the association knowledge implied in the retrieval task to the reasoning task network, thereby improving reasoning accuracy.
Description
Technical Field
The invention relates to the field of multimedia analysis, in particular to a cross-media comprehensive reasoning method and system for image-text sequence matching.
Background
Humans can infer new conclusions from known facts, an ability that is critical for perceiving and understanding the outside world. Implication reasoning is a basic and important form of reasoning: it judges whether a conclusion H follows from a premise P, that is, whether the logical expression "P → H" holds. It is a supporting technology for intelligent human-machine interaction and has wide application value in intelligent question answering, knowledge extraction and other applications. Existing methods mainly study text implication reasoning, in which both the conclusion H and the premise P are pieces of text. However, human reasoning is a cross-media process involving senses such as vision and language, and the prior art cannot handle reasoning whose premises span multiple media (such as images and text). Breaking through the limitation of a single media type and realizing comprehensive reasoning with premises from multiple media has therefore become an important problem in extending the breadth and depth of reasoning.
Existing techniques for text implication reasoning mainly fall into two categories. The first is rule-based reasoning, whose idea is to transform the premise text into the conclusion text using a given set of rules. Rule sets typically encode relations such as inclusion and cause-and-effect, e.g., dog → animal, buy → own. For example, the method proposed by Mirkin et al. in "Source-Language Entailment Modeling for Translating Unknown Terms" applies rule transformations to words and phrases in the premise and the conclusion to see whether a match can be completed. Since rules are mostly defined manually, they hardly cover the reasoning phenomena of the real world, so rule-based methods generalize poorly. The second category is based on feature extraction and is the mainstream approach in existing research. Its basic idea is to extract feature vectors from the premise and the conclusion and then judge the relation between the two vectors with a classifier. For example, Bowman et al. in "A Large Annotated Corpus for Learning Natural Language Inference" propose to encode the premise and the conclusion into feature vectors with a recurrent neural network and then classify the implication relation with fully connected layers. Because sequence information is very important for judging the text implication relation, the idea of text sequence matching has also been applied in some existing methods and improves accuracy. However, the above methods are still limited to text reasoning and cannot handle implication reasoning whose premises involve multiple media (such as images and text).
In cross-media analysis, the main challenge is the "heterogeneity gap": different media such as images and text have inconsistent representations, so cross-media similarity is difficult to measure and semantic associations are difficult to establish. The existing mainstream solution is cross-media unified representation, i.e., mapping different media into the same semantic space. For example, the high-level semantic mapping method proposed by Rasiwasia et al. in "A New Approach to Cross-Modal Multimedia Retrieval" first maps images and text into the same space with canonical correlation analysis according to their paired relations, and then obtains semantic representations with logistic regression using the class labels of the training set. The bimodal deep autoencoder (Ngiam et al., "Multimodal Deep Learning") is a representative deep cross-media analysis model: based on the deep Boltzmann machine, it builds one autoencoder for each of the audio and video media, and the two autoencoders share the same encoding output layer, which captures their common semantics. However, these methods focus on learning cross-media association relations in order to compute cross-media similarity, and cannot support implication reasoning tasks.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cross-media comprehensive reasoning method and system based on image-text sequence matching. By modeling the implication reasoning task as a sequence matching problem and combining it with a cross-task, cross-media migration training method, implication reasoning over image and text premises can be realized; the association knowledge implied in the retrieval task is migrated to the implication reasoning network, improving reasoning accuracy.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a cross-media comprehensive reasoning method for image-text sequence matching is used for carrying out sequence coding on images and preconditions and conclusions of the images and carrying out sequence matching of fine-granularity reasoning clues so as to realize implication reasoning based on the preconditions of the images and the texts. The method comprises the following steps:
(1) Establishing a cross-media implication reasoning data set, in which the premise comprises two media types, image and text, and the conclusion is text;
(2) Training an image-text sequence matching depth network with the cross-media implication reasoning data set, using a cross-task, cross-media migration training method during training;
(3) Reasoning with the trained image-text sequence matching depth network, judging the implication relation between the given premise and conclusion.
In step (2), within the image-text sequence matching depth network, the cross-media implication reasoning task is modeled as a comprehensive sequence matching problem, and comprehensive reasoning is realized by exploiting the logical relations possibly contained in the three combinations text-text, text-image and text-image-text. In the cross-task, cross-media migration training method, a triplet metric learning loss is used as the retrieval-task loss, and the association knowledge contained in the retrieval task is migrated to the implication reasoning network (i.e., the image-text sequence matching depth network), improving reasoning accuracy.
Furthermore, in the above cross-media comprehensive reasoning method for image-text sequence matching, the implication relation in step (3) falls into three classes: implication, contradiction and irrelevance. Reasoning is performed as follows: the image premise, the text premise and the text conclusion are input simultaneously, the image-text sequence matching depth network outputs prediction scores for the three implication relations, and the relation with the highest score is taken as the output result.
Based on the same inventive concept, the invention also provides a cross-media comprehensive reasoning system for image-text sequence matching, which comprises:
the data set establishing module, responsible for establishing a cross-media implication reasoning data set in which the premise comprises two media types, image and text, and the conclusion is text;
the network training module, responsible for training the image-text sequence matching depth network with the cross-media implication reasoning data set, using the cross-task, cross-media migration training method during training;
and the reasoning module, responsible for reasoning with the trained image-text sequence matching depth network and judging the implication relation between the given premise and conclusion.
The invention has the following beneficial effects: compared with existing methods, the invention can comprehensively use reasoning clues in both images and text to infer the implication relation. In addition, the method can use the association annotations of cross-media retrieval tasks to support model training for implication reasoning, alleviating the difficulty of annotating implication relations.
The invention achieves the above effects because the image-text sequence matching depth network structure extracts fine-grained information from the image premise, the text premise and the conclusion, and fully mines complementary cross-media local clues through sequence matching of the three combinations text-text, text-image and text-image-text. The cross-task, cross-media migration training method effectively exploits the training information of the cross-media retrieval task and improves the accuracy of implication reasoning.
Drawings
FIG. 1 is a flow chart of a cross-media comprehensive reasoning method of image-text sequence matching in the invention.
FIG. 2 is a schematic diagram of the image-text sequence matching depth network structure of the present invention.
FIG. 3 is a schematic diagram of a training method of cross-task and cross-media migration of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific embodiments.
The invention relates to a cross-media comprehensive reasoning method for matching image-text sequences, which is shown in a flow chart in figure 1 and comprises the following steps:
(1) Establish a cross-media implication reasoning data set, in which the premise comprises two media types, image and text, and the conclusion is text.
In this embodiment, the established cross-media implication reasoning data set is represented as D = {(i_n, t_n, h_n, e_n)}, n = 1, ..., N, where (i_n, t_n) and h_n form the n-th premise-conclusion pair and N is the total number of such pairs. The image premise is denoted i_n, the text premise t_n, the text conclusion h_n, and the implication label e_n. i_n and t_n have a pairwise association relation. e_n is a 1×3 vector with exactly one dimension equal to 1, indicating one of three implication relations: implication, contradiction or irrelevance; the other dimensions are all 0.
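To make the data-set structure concrete, the following is a minimal sketch (not part of the patent) of one premise-conclusion entry in Python; the field names, relation strings and dictionary layout are illustrative assumptions, while the (i_n, t_n, h_n, e_n) structure and the 1×3 one-hot label come from the description above.

```python
import numpy as np

# Illustrative relation names; the patent speaks of implication, contradiction and irrelevance.
RELATIONS = ["implication", "contradiction", "irrelevance"]

def make_entry(image_path, premise_text, conclusion_text, relation):
    """Build one premise-conclusion pair (i_n, t_n, h_n) with a 1x3 one-hot label e_n."""
    e_n = np.zeros(3, dtype=np.float32)
    e_n[RELATIONS.index(relation)] = 1.0          # exactly one dimension is 1
    return {"i_n": image_path, "t_n": premise_text, "h_n": conclusion_text, "e_n": e_n}

# Example usage with a hypothetical Flickr30k image path
entry = make_entry("flickr30k/123.jpg",
                   "A man rides a bicycle down a hill.",
                   "A person is outdoors.",
                   "implication")
```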
(2) Training the image-text sequence matching depth network by using the cross-media implication reasoning data set. The training method of cross-task and cross-media migration is used in network training.
FIG. 2 shows the image-text sequence matching depth network structure. In this embodiment, the input part comprises three branches: the image premise, the text conclusion and the text premise. For each image, the pool5-layer features of the convolutional neural network VGG19 are extracted as local image features; the pool5 layer has V = 7×7 = 49 regions in total, each represented by a 512-dimensional feature. For each piece of input text, every English word is represented by a 350-dimensional concatenated feature, consisting of a 300-dimensional GloVe feature, a 49-dimensional character-composed embedding feature and a 1-dimensional exact-match feature. The local features of the three branches are all mapped by fully connected layers (fc-I, fc-H and fc-T in FIG. 2) into local unified representations of the same dimension, which serve as the inputs of the respective sequence encoders. Specifically, the local unified representation of i_n is denoted {I_n^1, ..., I_n^V}, where I_n^1 is the 1st local unified representation; similarly, the local unified representations of t_n and h_n are denoted {T_n^1, T_n^2, ...} and {H_n^1, H_n^2, ...}, one vector per word.
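A minimal sketch of the image side of this step, assuming PyTorch/torchvision and a 224×224 input; the unified dimension of 300 and the function names are illustrative assumptions, while the 49 pool5 regions of 512 dimensions, the 350-dimensional word features and the fc-I/fc-T/fc-H projections come from the description above.

```python
import torch
import torch.nn as nn
from torchvision import models

d_unified = 300                                   # assumed common dimension of the unified space

vgg = models.vgg19().features.eval()              # ends at pool5; pretrained weights would be loaded in practice
fc_I = nn.Linear(512, d_unified)                  # fc-I: image region features -> unified space
fc_T = nn.Linear(350, d_unified)                  # fc-T: 350-d premise word features -> unified space
fc_H = nn.Linear(350, d_unified)                  # fc-H: 350-d conclusion word features -> unified space

def image_local_unified(image_batch):
    """image_batch: (B, 3, 224, 224) -> (B, 49, d_unified) local unified representation."""
    with torch.no_grad():
        fmap = vgg(image_batch)                   # (B, 512, 7, 7) pool5 feature map
    regions = fmap.flatten(2).transpose(1, 2)     # (B, 49, 512), one vector per region
    return fc_I(regions)
```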
In the sequence encoding part, each branch encodes its local unified representation with a bidirectional long short-term memory network (BiLSTM). For the image, the viewing order of the human eye is simulated by arranging the 7×7 = 49 regions from left to right and from top to bottom; for text, the words are arranged in their original order. The BiLSTM converts each local unified representation into a local sequence representation. For the image premise, the text premise and the text conclusion, the resulting local sequence representations are recorded separately and passed to the matching stage.
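A minimal sketch of the sequence-encoding step, again assuming PyTorch; the hidden size of 256 is an illustrative assumption, and one such module would be instantiated per branch.

```python
import torch
import torch.nn as nn

d_unified, hidden = 300, 256                       # assumed dimensions
bilstm = nn.LSTM(input_size=d_unified, hidden_size=hidden,
                 bidirectional=True, batch_first=True)

def sequence_encode(local_unified):
    """local_unified: (B, L, d_unified), regions ordered left-to-right/top-to-bottom for
    images and words in order for text -> (B, L, 2*hidden) local sequence representation."""
    out, _ = bilstm(local_unified)
    return out

# e.g. for the image premise branch: seq_I = sequence_encode(image_local_unified(images))
```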
As shown in FIG. 2, the local sequence representations of the three branches are concatenated in the three combinations text-text, text-image and text-image-text and fed into DMAN (Delta Memory Attention Network) models ("Memory Fusion Network for Multi-view Sequential Learning", A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria and L.-P. Morency, AAAI Conference on Artificial Intelligence 2018) to obtain context codes. In each of the three branches, the context code passes through a softmax classifier, and the classification probabilities are added to obtain the final implication relation prediction score. In this embodiment, model training uses the cross-task, cross-media migration training method shown in FIG. 3. Cross-task migration takes place between the cross-media retrieval task and the implication reasoning task, and aims to migrate the association knowledge implied in the retrieval task to the implication reasoning network. Since (i_n, t_n) is a paired image-text association, it can support training of an association retrieval model. Taking i_n as an example, a global representation G(i_n) is computed from all of its local sequence representations by max pooling; the global representation G(t_n) of t_n is obtained in the same way. The triplet metric learning loss of the retrieval task is as follows:
L_Sim = max(0, α + sim(G(i_n)^+, G(t_n)^+) − sim(G(i_n)^+, G(t_n)^-)) + max(0, α + sim(G(i_n)^+, G(t_n)^+) − sim(G(t_n)^+, G(i_n)^-))

where G(i_n)^+, G(t_n)^+ are image and text premises that have a pairwise association, while G(i_n)^+, G(t_n)^- and G(t_n)^+, G(i_n)^- are image and text premises without a pairwise association. max is the maximum operation, sim is a distance measure such as the Euclidean distance, and α is a margin parameter. After this part of the network is trained, its parameters are shared with the reasoning network.
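A minimal sketch of this retrieval-task loss, assuming PyTorch, Euclidean distance as sim and an illustrative margin α = 0.2; the global representations G(·) are assumed to have already been obtained by max pooling as described above.

```python
import torch

def triplet_retrieval_loss(G_i_pos, G_t_pos, G_t_neg, G_i_neg, alpha=0.2):
    """All inputs: (B, d) global representations. Implements
    L_Sim = max(0, a + sim(G(i)+, G(t)+) - sim(G(i)+, G(t)-))
          + max(0, a + sim(G(i)+, G(t)+) - sim(G(t)+, G(i)-)), with sim = Euclidean distance."""
    dist = lambda a, b: torch.norm(a - b, dim=1)
    pos = dist(G_i_pos, G_t_pos)                                  # distance of the paired premises
    term1 = torch.clamp(alpha + pos - dist(G_i_pos, G_t_neg), min=0.0)  # image anchor, unpaired text
    term2 = torch.clamp(alpha + pos - dist(G_t_pos, G_i_neg), min=0.0)  # text anchor, unpaired image
    return (term1 + term2).mean()
```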
Cross-media migration takes place between the image premise branch and the text premise branch of the implication reasoning network, and aims to keep cross-media consistency during model training. Specifically, the maximum mean discrepancy (MMD) is used as the migration loss. Let a and b be the distributions of the local unified representations of images and text respectively; the squared MMD between them in the reproducing kernel Hilbert space H_k is defined as:
MMD_k^2(a, b) = || E_a[φ(I_n^x)] − E_b[φ(T_n^j)] ||_{H_k}^2

where E_a and E_b denote expectations over the local unified representations of images and text, x and j index image regions and text words respectively, and φ is the feature mapping of H_k. By minimizing this MMD, knowledge migration between the image and text premises is achieved.
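A minimal sketch of the squared-MMD migration loss, assuming PyTorch and a single RBF kernel with an illustrative bandwidth; the patent specifies only that the squared MMD between the image and text local-representation distributions in an RKHS is minimized, not the kernel choice.

```python
import torch

def mmd2_rbf(x, y, sigma=1.0):
    """x: (n, d) image local unified reps, y: (m, d) text local unified reps.
    Biased empirical estimate of MMD_k^2(a, b) with an RBF kernel k."""
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()
```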
(3) Reason with the trained image-text sequence matching depth network, judging the implication relation between the given premise and conclusion.
After the depth network has been trained, the implication relation between premise and conclusion can be judged from the input image premise, text premise and text conclusion. Specifically, the network takes the image premise, the text premise and the text conclusion as inputs, and each branch produces a probability value for each of the three cases: implication, contradiction and irrelevance. The probability values of the three branches are added to obtain the implication relation prediction score, and the relation with the largest score is taken as the reasoning result.
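A minimal sketch of this prediction step, under the assumption that the three trained matching branches are available as callables, each returning a softmax probability vector over the three relations; the branch names are hypothetical.

```python
import torch

RELATIONS = ["implication", "contradiction", "irrelevance"]

def infer(branch_tt, branch_ti, branch_tit, image_premise, text_premise, conclusion):
    """Sum the class probabilities of the three branches and return the top relation."""
    scores = (branch_tt(text_premise, conclusion)                        # text-text
              + branch_ti(image_premise, conclusion)                     # text-image
              + branch_tit(image_premise, text_premise, conclusion))     # text-image-text
    return RELATIONS[int(torch.argmax(scores))]
```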
The following experimental results show that, compared with existing methods, the cross-media comprehensive reasoning method for image-text sequence matching of the invention achieves higher implication reasoning accuracy.
In this embodiment, the extended text implication reasoning data set SNLI is used for experimental verification. The original SNLI data set comes from Stanford University and was proposed in "A large annotated corpus for learning natural language inference" (Samuel R. Bowman, Gabor Angeli, Christopher Potts and Christopher D. Manning, Conference on Empirical Methods in Natural Language Processing, 2015). It contains 570,000 premise-conclusion pairs, in which each premise and each conclusion is a piece of text. Each text in the SNLI data set describes one image in the Flickr30k data set, so the SNLI data set is extended with these images such that each premise contains both text and image media. The following four methods are compared in the experiments:
the existing method comprises the following steps: the 100-d LSTM encoders method in document "A large annotated corpus for learning natural language inference" (authors: samuel R. Bowman, gabor Angeli, christopher Potts, and Christopher D. Manning). The method extracts feature vectors from preconditions and conclusions by using an independent cyclic neural network respectively, and analyzes implication relations through a plurality of fully-connected network layers to obtain reasoning results.
Existing method 2: the BiMPM method from "Bilateral Multi-Perspective Matching for Natural Language Sentences" (Zhiguo Wang, Wael Hamza and Radu Florian). This method introduces the idea of sequence matching, matches the premise and the conclusion in both the forward and the backward direction, and classifies the implication relation.
Existing method 3: the DIIN method from "Natural Language Inference over Interaction Space" (Yichen Gong, Heng Luo and Jian Zhang). This method first extracts features from the text premise and the conclusion, builds an interaction tensor space, and models implication reasoning as vector interactions in that tensor space.
The invention: the method of this embodiment.
Accuracy is used as the evaluation metric for implication reasoning: it is the ratio of the number of premise-conclusion pairs whose implication relation is judged correctly to the total number of pairs in the data set. The higher the accuracy, the better the reasoning effect.
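As a small illustration of the metric (plain Python, names hypothetical): 450 correctly judged pairs out of 500 would give an accuracy of 0.90.

```python
def accuracy(predicted_relations, true_relations):
    """Fraction of premise-conclusion pairs whose implication relation is judged correctly."""
    correct = sum(p == t for p, t in zip(predicted_relations, true_relations))
    return correct / len(true_relations)
```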
TABLE 1. Experimental results of the invention

Method | Accuracy |
---|---|
Existing method 1 | 77.6% |
Existing method 2 | 86.9% |
Existing method 3 | 88.0% |
The invention | 90.0% |
As can be seen from Table 1, the invention achieves higher reasoning accuracy than the existing methods. Existing method 1 is an early method that uses only simple structures such as recurrent neural networks, which makes it difficult to fully model complex reasoning clues. Existing method 2 adopts the sequence matching idea and therefore surpasses method 1 in accuracy, but it uses only text information, so further improvement is difficult. Existing method 3 fully models the logical relation between premise and conclusion in a tensor interaction space, but again uses only text information and cannot exploit the complementary clues provided by images. The invention realizes sequence matching of the three combinations text-text, text-image and text-image-text through the image-text sequence matching depth network structure, fully mining complementary cross-media local clues; through the cross-task, cross-media migration training method, it effectively exploits the training information of the cross-media retrieval task and improves the accuracy of implication reasoning.
Based on the same inventive concept, another embodiment of the present invention provides a cross-media comprehensive reasoning system for image-text sequence matching, which comprises:
the data set establishing module, responsible for establishing a cross-media implication reasoning data set in which the premise comprises two media types, image and text, and the conclusion is text;
the network training module, responsible for training the image-text sequence matching depth network with the cross-media implication reasoning data set, using the cross-task, cross-media migration training method during training;
and the reasoning module, responsible for reasoning with the trained image-text sequence matching depth network and judging the implication relation between the given premise and conclusion.
Based on the same inventive concept, another embodiment of the present invention provides a computer/server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
Other embodiments: in the previous embodiment, VGG19 network features are used for images, and GloVe features, character-composed embedding features and exact-match features are used for text. The method also supports other features as inputs, such as word embedding features of text. In the previous embodiment, the BiLSTM model is used for sequence encoding of images and text; the method also supports LSTM and other sequence encoding models.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (10)
1. A cross-media comprehensive reasoning method for image-text sequence matching comprises the following steps:
establishing a cross-media implication reasoning data set, in which the premise comprises two media types, image and text, and the conclusion is text;
training an image-text sequence matching depth network with the cross-media implication reasoning data set, using a cross-task, cross-media migration training method during training; the image-text sequence matching depth network first extracts local features from the image premise, the text premise and the text conclusion, then encodes the representations of the three combinations text-text, text-image and text-image-text, and obtains an implication relation prediction score through adaptive threshold aggregation and an implication relation classifier; in the cross-task, cross-media migration training method, each group of image premises and text premises has a pairwise association relation, a triplet metric learning loss is used to train the retrieval task, the network parameters are shared with the implication reasoning network part, and the association knowledge of the retrieval task is migrated to the implication reasoning task network; at the same time, the maximum mean discrepancy is used as the migration loss to minimize the difference between the distributions of the local unified representations of image and text, realizing knowledge migration between the image and text premises;
and reasoning with the trained image-text sequence matching depth network, judging the implication relation between the given premise and conclusion.
2. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 1, wherein each group of data in the cross-media implication reasoning data set consists of an image premise, a text premise and a text conclusion and carries an implication relation label.
3. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 1, wherein the image-text sequence matching depth network models the cross-media implication reasoning task as a comprehensive sequence matching problem and realizes comprehensive reasoning by exploiting the logical relations possibly contained in the three combinations text-text, text-image and text-image-text.
4. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 3, wherein in the image-text sequence matching depth network, local features are extracted from the image premise, the text premise and the text conclusion and converted through fully connected layers into local unified representations of the same dimension; the local unified representations are converted into local sequence representations by a sequence encoding model; for the image premise, the text premise and the text conclusion, the local sequence representations are concatenated in the three combinations text-text, text-image and text-image-text and input into DMAN models respectively to obtain context codes; the context codes pass through adaptive threshold aggregation and the implication relation classifier, and the classification probabilities are added to obtain the final implication relation prediction score.
5. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 1, wherein in the cross-task, cross-media migration training method, the cross-task migration takes place between the cross-media retrieval task and the implication reasoning task, with the goal of migrating the association knowledge implied in the retrieval task to the implication reasoning task network; in the cross-media implication reasoning data set, each group of image premises and text premises has a pairwise association relation, the paired data are regarded as relevant and the non-paired data as irrelevant, the retrieval task is trained with the triplet metric learning loss, and the network parameters are then shared with the implication reasoning network part, realizing cross-task migration.
6. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 5, wherein the triplet metric learning loss is calculated using the formula:

L_Sim = max(0, α + sim(G(i_n)^+, G(t_n)^+) − sim(G(i_n)^+, G(t_n)^-)) + max(0, α + sim(G(i_n)^+, G(t_n)^+) − sim(G(t_n)^+, G(i_n)^-))

wherein G(i_n)^+, G(t_n)^+ are image and text premises with a pairwise association, G(i_n)^+, G(t_n)^- and G(t_n)^+, G(i_n)^- are image and text premises without a pairwise association, max is the maximum operation, sim is a distance measure, and α is a margin parameter.
7. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 1, wherein in the cross-task, cross-media migration training method, the cross-media migration takes place between the image premise branch and the text premise branch of the implication reasoning network, with the goal of maintaining cross-media consistency during model training; the maximum mean discrepancy is used as the migration loss to minimize the difference between the distributions of the local unified representations of image and text, realizing knowledge migration between the image and text premises.
8. The cross-media comprehensive reasoning method for image-text sequence matching according to claim 1, wherein reasoning with the trained image-text sequence matching depth network structure means that, after the image-text sequence matching depth network is trained, it receives the image premise, the text premise and the text conclusion as inputs simultaneously, and each branch obtains a probability value for each of the three cases of implication, contradiction and irrelevance; the probability values of the three branches are added to obtain the implication relation prediction score, and the relation with the largest score is taken as the reasoning result.
9. A cross-media comprehensive reasoning system for image-text sequence matching using the method of any one of claims 1-8, comprising:
the data set establishing module, responsible for establishing a cross-media implication reasoning data set in which the premise comprises two media types, image and text, and the conclusion is text;
the network training module, responsible for training the image-text sequence matching depth network with the cross-media implication reasoning data set, using the cross-task, cross-media migration training method during training;
and the reasoning module, responsible for reasoning with the trained image-text sequence matching depth network structure and judging the implication relation between the given premise and conclusion.
10. A computer comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911127259.1A CN110889505B (en) | 2019-11-18 | 2019-11-18 | Cross-media comprehensive reasoning method and system for image-text sequence matching |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911127259.1A CN110889505B (en) | 2019-11-18 | 2019-11-18 | Cross-media comprehensive reasoning method and system for image-text sequence matching |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110889505A CN110889505A (en) | 2020-03-17 |
CN110889505B true CN110889505B (en) | 2023-05-02 |
Family
ID=69747841
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911127259.1A Active CN110889505B (en) | 2019-11-18 | 2019-11-18 | Cross-media comprehensive reasoning method and system for image-text sequence matching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110889505B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112905820B (en) * | 2021-03-30 | 2022-11-11 | 山西大学 | Multi-graph retrieval method based on logic learning |
CN113111161B (en) * | 2021-04-09 | 2023-09-08 | 北京语言大学 | Cross-media association analysis method |
CN113568735B (en) * | 2021-06-17 | 2024-01-23 | 阿里巴巴新加坡控股有限公司 | Data processing method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220337A (en) * | 2017-05-25 | 2017-09-29 | 北京大学 | A kind of cross-media retrieval method based on mixing migration network |
CN109241912A (en) * | 2018-09-08 | 2019-01-18 | 河南大学 | The target identification method based on class brain across media intelligent towards unmanned autonomous system |
CN110059217A (en) * | 2019-04-29 | 2019-07-26 | 广西师范大学 | A kind of image text cross-media retrieval method of two-level network |
CN110263912A (en) * | 2019-05-14 | 2019-09-20 | 杭州电子科技大学 | A kind of image answering method based on multiple target association depth reasoning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10375200B2 (en) * | 2016-09-26 | 2019-08-06 | Disney Enterprises, Inc. | Recommender engine and user model for transmedia content data |
-
2019
- 2019-11-18 CN CN201911127259.1A patent/CN110889505B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220337A (en) * | 2017-05-25 | 2017-09-29 | 北京大学 | A kind of cross-media retrieval method based on mixing migration network |
CN109241912A (en) * | 2018-09-08 | 2019-01-18 | 河南大学 | The target identification method based on class brain across media intelligent towards unmanned autonomous system |
CN110059217A (en) * | 2019-04-29 | 2019-07-26 | 广西师范大学 | A kind of image text cross-media retrieval method of two-level network |
CN110263912A (en) * | 2019-05-14 | 2019-09-20 | 杭州电子科技大学 | A kind of image answering method based on multiple target association depth reasoning |
Non-Patent Citations (5)
Title |
---|
A Unified Semantic Model for Cross-Media Events Analysis in Online Social Networks; Mingzhe Fang et al.; IEEE Access; 2019-02-17; full text *
An Overview of Cross-Media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges; Xin Huang et al.; IEEE Transactions on Circuits and Systems for Video Technology; 2017-05-17; full text *
Deep Cross-Media Knowledge Transfer; Xin Huang, Yuxin Peng; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-06-23; full text *
TPCKT: Two-Level Progressive Cross-Media Knowledge Transfer; Xin Huang, Yuxin Peng; IEEE Transactions on Multimedia; 2019-04-14; full text *
Cross-media retrieval based on semantic matching of different modalities; Chen Xiang et al.; Journal of Shandong Normal University (Natural Science Edition); 2017-09-15 (No. 03); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110889505A (en) | 2020-03-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021233112A1 (en) | Multimodal machine learning-based translation method, device, equipment, and storage medium | |
CN111291181B (en) | Representation learning for input classification via topic sparse self-encoder and entity embedding | |
CN110121706B (en) | Providing responses in a conversation | |
CN108416065B (en) | Hierarchical neural network-based image-sentence description generation system and method | |
CN109783666B (en) | Image scene graph generation method based on iterative refinement | |
CN112084331A (en) | Text processing method, text processing device, model training method, model training device, computer equipment and storage medium | |
CN111444709A (en) | Text classification method, device, storage medium and equipment | |
CN111104512B (en) | Game comment processing method and related equipment | |
Zhang et al. | Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition | |
CN110889505B (en) | Cross-media comprehensive reasoning method and system for image-text sequence matching | |
Sun et al. | Dynamic emotion modelling and anomaly detection in conversation based on emotional transition tensor | |
CN113987179A (en) | Knowledge enhancement and backtracking loss-based conversational emotion recognition network model, construction method, electronic device and storage medium | |
Qi et al. | MEDT: Using multimodal encoding-decoding network as in transformer for multimodal sentiment analysis | |
Zhao et al. | Videowhisper: Toward discriminative unsupervised video feature learning with attention-based recurrent neural networks | |
Liu et al. | Dual-feature-embeddings-based semi-supervised learning for cognitive engagement classification in online course discussions | |
Bansal et al. | Multilingual personalized hashtag recommendation for low resource Indic languages using graph-based deep neural network | |
Zheng et al. | Multimodal representations learning based on mutual information maximization and minimization and identity embedding for multimodal sentiment analysis | |
CN117173497B (en) | Image generation method and device, electronic equipment and storage medium | |
CN116958997B (en) | Graphic summary method and system based on heterogeneous graphic neural network | |
Ange et al. | Semi-supervised multimodal deep learning model for polarity detection in arguments | |
CN112560440A (en) | Deep learning-based syntax dependence method for aspect-level emotion analysis | |
Bai et al. | Low-rank multimodal fusion algorithm based on context modeling | |
Ke et al. | CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension | |
CN114330297A (en) | Language model pre-training method, language text processing method and device | |
Dharaniya et al. | Automatic scene generation using sentiment analysis and bidirectional recurrent neural network with multi-head attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |