US11361186B2 - Visual relationship detection method and system based on adaptive clustering learning - Google Patents
- Publication number: US11361186B2 (application US17/007,213)
- Authority: US, United States
- Prior art keywords: visual, visual relationship, relationship, clustering, representations
- Legal status: Active, expires (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V 10/426—Image or video recognition or understanding; global feature extraction by analysis of the whole pattern; graphical representations
- G06F 16/55—Information retrieval of still image data; clustering; classification
- G06F 16/5854—Retrieval of still image data using metadata automatically derived from the content, using shape and object relationship
- G06F 18/214—Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F 18/23211—Non-hierarchical clustering techniques using statistics or function optimisation, with adaptive number of clusters
- G06F 18/23213—Non-hierarchical clustering techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
- G06F 18/251—Fusion techniques of input or preprocessed data
- G06V 10/82—Image or video recognition or understanding using pattern recognition or machine learning using neural networks
- Y02D 10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
- Legacy classification codes: G06K 9/6222, G06K 9/6223, G06K 9/6232, G06K 9/6256, G06K 9/6289
Definitions
- the present disclosure relates to the technical field of visual relationship detection, and in particular to a method and a system for visual relationship detection based on adaptive clustering learning.
- visual relationship detection aims to detect and localize pair-wise related objects appearing in an image and to infer the visual relationship predicates or interaction modes between them [1].
- visual relationships not only capture the spatial and semantic information of “people” and “laptops”, but also require predicting the “look” action between them.
- owing to its structured descriptions and rich semantic space, visual relationship detection can promote the development of high-level visual tasks, such as image retrieval under complex query conditions [2], image content description [3], visual inference [4][5], image generation [6], and visual question answering [7][8].
- a method for using latent semantic prior knowledge includes: using language knowledge obtained from large-scale visual relationship training annotations and public text corpora for visual relationship prediction [10].
- a method for utilizing rich contextual visual information includes: establishing visual representations between visual objects and visual relationship predicates with context modeling based on spatial locations and statistical dependencies [11], proposing contextual message passing mechanisms based on recurrent neural networks applied to contextual visual features [12], and using long short-term memory networks to encode global contextual information for visual relationship prediction [13].
- the existing visual relationship detection has the following deficiencies:
- the present disclosure provides a visual relationship detection method based on adaptive clustering learning, which avoids ignoring latent relatedness information between visual relationships when modeling visual relationships in a unified visual relationship space.
- the present disclosure is capable of fine-grained recognizing visual relationships of different subclasses by mining latent relatedness in-between, which improves the accuracy of visual relationship detection and can be applied to any visual relationship dataset, as described below.
- a visual relationship detection method based on adaptive clustering learning including:
- the method of the present disclosure further includes:
- the method of the present disclosure further includes:
- the step of obtaining the visual relationship sharing representation is specifically:
- obtaining a first product of a joint subject mapping matrix and the context representation of the visual object of the subject; obtaining a second product of a joint object mapping matrix and the context representation of the visual object of the object; subtracting the second product from the first product, and dot-multiplying the difference value and the convolutional features of a visual relationship candidate region.
- the joint subject mapping matrix and the joint object mapping matrix are mapping matrices that map the visual objects context representation to the joint subspace.
- the visual relationship candidate region is the minimum rectangle box that can fully cover the corresponding visual object candidate regions of the subject and object; the convolutional features are extracted from the visual relationship candidate region by any convolutional neural network.
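As an illustrative, non-limiting sketch of this step, the following PyTorch snippet computes the sharing representation of Eq. (5) below; the feature dimension and module names are assumptions of this example, not part of the claimed method:

```python
import torch
import torch.nn as nn

DIM = 512  # assumed feature dimension

W_es = nn.Linear(DIM, DIM, bias=False)  # joint subject mapping matrix W_es
W_eo = nn.Linear(DIM, DIM, bias=False)  # joint object mapping matrix W_eo

d_i = torch.randn(DIM)   # context representation of the subject visual object
d_j = torch.randn(DIM)   # context representation of the object visual object
f_ij = torch.randn(DIM)  # convolutional features of the visual relationship candidate region

# Eq. (5): E_{i,j}^s = (W_es d_i - W_eo d_j) ∘ f_{i,j}
E_s = (W_es(d_i) - W_eo(d_j)) * f_ij  # "*" is the element-wise (dot) multiplication
print(E_s.shape)  # torch.Size([512])
```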
- the step of obtaining a plurality of preliminary visual relationship enhancing representation is specifically:
- obtaining a third product of a k-th clustering subject mapping matrix and the context representation of the visual object of the subject; obtaining a fourth product of a k-th clustering object mapping matrix and the context representation of the visual object of the object; subtracting the fourth product from the third product, and dot-multiplying the difference value and the convolutional features of a visual relationship candidate region to obtain a k-th preliminary visual relationship enhancing representation.
- the k-th clustering subject mapping matrix and the k-th clustering object mapping matrix are mapping matrices that map the visual object context representations to the k-th clustering subspace.
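A minimal sketch of this step, assuming K clustering subspaces and the same illustrative dimensions as above (this excerpt does not fix these values):

```python
import torch
import torch.nn as nn

K, DIM = 4, 512  # assumed number of clustering subspaces and feature dimension

# one clustering subject/object mapping matrix per subspace k
W_es_k = nn.ModuleList([nn.Linear(DIM, DIM, bias=False) for _ in range(K)])
W_eo_k = nn.ModuleList([nn.Linear(DIM, DIM, bias=False) for _ in range(K)])

d_i, d_j, f_ij = torch.randn(DIM), torch.randn(DIM), torch.randn(DIM)

# Eq. (6): e_{i,j}^k = (W_es^k d_i - W_eo^k d_j) ∘ f_{i,j}, k ∈ [1, K]
e_k = [(W_es_k[k](d_i) - W_eo_k[k](d_j)) * f_ij for k in range(K)]
print(len(e_k), e_k[0].shape)  # 4 torch.Size([512])
```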
- the step of “performing regularization to the preliminary visual relationship enhancing representations of different subspaces by clustering-driven attention mechanisms” is specifically:
- the k-th regularized mapping matrix is the k-th mapping matrix that transforms the preliminary visual relationship enhancing representation.
- the step of “obtaining attentive scores of the clustering subspaces” is specifically:
- the k-th attention mapping matrix is the mapping matrix that transforms the prior distribution over the category labels of the visual relationship predicate.
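One plausible reading of the attentive-score computation (Eq. (7) below) is sketched here; treating the softmax as normalizing over the K subspaces, and the predicate vocabulary size, are assumptions of this illustration:

```python
import torch
import torch.nn as nn

K, PRED_CLASSES = 4, 50  # assumed subspace count and predicate vocabulary size

# one attention mapping matrix W_α^k per clustering subspace
W_alpha = nn.ModuleList([nn.Linear(PRED_CLASSES, 1, bias=False) for _ in range(K)])

# w(ô_i, ô_j): prior distribution over predicate labels for the detected pair
w_prior = torch.rand(PRED_CLASSES)

# Eq. (7): α_{i,j}^k = softmax(W_α^k w(ô_i, ô_j))
scores = torch.cat([W_alpha[k](w_prior) for k in range(K)])
alpha = torch.softmax(scores, dim=0)  # one attentive score per subspace
print(alpha)
```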
- the step of “fusing the visual relationship sharing representation and the regularized visual relationship enhancing representation with a prior distribution over the category labels of visual relationship predicate, to predict visual relationship predicates by synthetic relational reasoning” is specifically:
- the present disclosure avoids ignoring the latent relatedness information between different visual relationships when modeling visual relationships in a unified visual relationship space, and can perform fine-grained recognition to visual relationships of different subclasses through latent relatedness mining;
- the present disclosure improves the accuracy of visual relationship detection and can be applied to any visual relationship dataset.
- FIG. 1 is a schematic structure diagram of the definition of visual objects and visual relationships in an image;
- FIG. 2 is a flowchart of the visual relationship detection method based on adaptive clustering learning;
- FIG. 3 is an example diagram showing the visual relationship data of a common visual relationship dataset.
- the present disclosure provides a visual relationship detection method capable of fully, automatically, and accurately mining latent relatedness information between visual relationships.
- Studies have shown that there exist highly relevant visual relationships in reality.
- highly relevant visual relationships share specific visual modes and characteristics; thus, fine-grained detection of multiple visual relationships can be further completed based on the recognition of highly relevant visual relationships, which can improve the recall rate of visual relationship detection (hereinafter referred to as VRD).
- the present disclosure proposes a VRD method based on adaptive clustering learning. Referring to FIG. 2, the method of the present disclosure includes the following steps:
- the visual relationship data set may be any data set containing images and corresponding visual relationship annotations, including but not limited to a VisualGenome data set.
- the training set samples of the visual relationship data set include training images and corresponding visual relationship true label data.
- the visual relationship true label data of each training image include: a visual object true category label ô_i of the subject, a visual object true category label ô_j of the object, and a corresponding visual relationship predicate true category label r_{i→j}.
- the training data of the visual relationship data set includes: training images, and corresponding visual relationship true region data and true label data.
- the true region data of each training image include: a visual object true region of the subject, a visual object true region of the object, and a corresponding visual relationship predicate true region.
- the true label data of each training image include: a visual object true category label of the subject, a visual object true category label of the object, and a corresponding visual relationship predicate true category label.
- the embodiment uses the initialized VRD model to predict, for each training image, a subject visual object prediction category label, an object visual object prediction category label, and a corresponding visual relationship predicate prediction category label. Category training errors are obtained between the subject visual object prediction category label and the subject visual object true category label, between the object visual object prediction category label and the object visual object true category label, and between the visual relationship predicate prediction category label and the visual relationship predicate true category label. Region training errors are further obtained between the subject visual object prediction region and the subject visual object true region, between the object visual object prediction region and the object visual object true region, and between the visual relationship predicate prediction region and the visual relationship predicate true region.
- the gradient back-propagation operation is performed iteratively on the model according to the category training errors and the region training errors of each training image until the model converges, and the parameters of the trained VRD model are applied in the subsequent steps, as sketched below.
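The training procedure just described might look as follows; this is a hedged sketch only, and the model interface, the choice of cross-entropy for category errors and smooth-L1 for region errors, and the optimizer settings are all assumptions of this example rather than requirements of the disclosure:

```python
import torch

def train_vrd(model, loader, epochs=10, lr=1e-3):
    """Iterate gradient back-propagation on category and region errors."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    cat_loss = torch.nn.CrossEntropyLoss()  # category training errors
    reg_loss = torch.nn.SmoothL1Loss()      # region training errors
    for _ in range(epochs):  # in the disclosure, training runs until convergence
        for batch in loader:  # batch keys below are hypothetical names
            out = model(batch["image"])
            loss = (cat_loss(out["subj_logits"], batch["subj_label"])
                    + cat_loss(out["obj_logits"], batch["obj_label"])
                    + cat_loss(out["pred_logits"], batch["pred_label"])
                    + reg_loss(out["subj_box"], batch["subj_box"])
                    + reg_loss(out["obj_box"], batch["obj_box"])
                    + reg_loss(out["pred_box"], batch["pred_box"]))
            opt.zero_grad()
            loss.backward()  # gradient back-propagation operation
            opt.step()
```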
- a candidate region set and a corresponding candidate region feature set are extracted from the input image.
- any object detector can be used for the extraction operation, including but not limited to the Faster R-CNN object detector used in this embodiment;
- candidate regions include visual object candidate regions and visual relationship candidate regions.
- the visual relationship candidate region is represented by the minimum rectangle box that can fully cover the corresponding visual object candidate regions of the subject and object, and the visual object candidate regions of the subject and object comprise any one of a plurality of the visual object candidate regions; a helper computing this minimum covering rectangle is sketched after this list.
- the candidate region features include: a visual object candidate region convolutional feature f_i, a visual object category label probability l_i, and a visual object candidate region bounding box coordinate b_i;
- the visual relationship candidate region feature includes a visual relationship candidate region convolutional feature f_{i,j}.
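The minimum rectangle box mentioned above can be computed as follows; the (x1, y1, x2, y2) box format is an assumption of this sketch:

```python
def union_box(subj_box, obj_box):
    """Minimum rectangle fully covering the subject and object candidate regions."""
    return (min(subj_box[0], obj_box[0]),   # left
            min(subj_box[1], obj_box[1]),   # top
            max(subj_box[2], obj_box[2]),   # right
            max(subj_box[3], obj_box[3]))   # bottom

# example: subject box and object box -> visual relationship candidate region
print(union_box((10, 20, 60, 80), (40, 10, 120, 70)))  # (10, 10, 120, 80)
```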
- contextual encoding is performed on the visual object candidate region features to obtain the visual object representations.
- the embodiment adopts a bi-directional long short-term memory network (biLSTM) to sequentially encode all the visual object candidate region features to obtain the object context representations C:
- W_1 is a learned parameter matrix obtained in step 102;
- [;] denotes the concatenation operation;
- N is the number of input visual object candidate region features;
- h_i is the hidden state of the LSTM;
- W_2 is a learned parameter matrix obtained in step 102.
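A minimal PyTorch sketch of the contextual encoding of Eqs. (1)-(3) below; all sizes are assumed, and the step-by-step decoder LSTM of Eq. (2), which feeds back ô_{i-1}, is simplified into a single linear decoding for brevity:

```python
import torch
import torch.nn as nn

N, FEAT_DIM, LABEL_DIM, HID = 8, 512, 151, 256  # assumed sizes

f = torch.randn(1, N, FEAT_DIM)   # candidate region convolutional features f_i
l = torch.randn(1, N, LABEL_DIM)  # category label probabilities l_i

W1 = nn.Linear(LABEL_DIM, FEAT_DIM, bias=False)  # W_1, label embedding
bilstm1 = nn.LSTM(2 * FEAT_DIM, HID, bidirectional=True, batch_first=True)

# Eq. (1): C = biLSTM_1([f_i; W_1 l_i]_{i=1,...,N}); [;] is concatenation
C, _ = bilstm1(torch.cat([f, W1(l)], dim=-1))

# Eqs. (2)-(3), simplified: decode a category label ô_i from each context state
W2 = nn.Linear(2 * HID, LABEL_DIM, bias=False)
o_hat = W2(C).argmax(dim=-1)
print(C.shape, o_hat.shape)  # torch.Size([1, 8, 512]) torch.Size([1, 8])
```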
- the visual object context representations are obtained by visual object representations and visual object label embeddings.
- the detected subject visual object context representation is denoted as d_i;
- the object visual object context representation is denoted as d_j;
- the subject and object visual object context representations include any two of a plurality of the visual object context representations;
- f_{i,j} is the convolutional features of the visual relationship candidate region corresponding to the subject visual object and the object visual object;
- W_es and W_eo are the joint subject mapping matrix and the joint object mapping matrix that map the visual object context representations to the joint subspace, which are obtained in step 102;
- ∘ represents the element-wise multiplication operation, and E_{i,j}^s is the visual relationship sharing representation obtained by the calculation.
- W_es^k and W_eo^k are the clustering subject mapping matrix and the clustering object mapping matrix that map the visual object context representations to the k-th clustering subspace, which are obtained in step 102;
- e_{i,j}^k represents the obtained k-th preliminary visual relationship enhancing representation; and
- K is the number of the clustering subspaces.
- W_α^k is the k-th attention mapping matrix, which is obtained in step 102;
- w(·,·) is the visual relationship prior function;
- α_{i,j}^k is the attentive score of the k-th clustering subspace, and softmax(·) represents the following equation:
- i_j represents the j-th input variable of the softmax function;
- n represents the number of input variables of the softmax function.
- W_b^k is the regularized mapping matrix that transforms the k-th preliminary visual relationship enhancing representation, which is obtained in step 102;
- E_{i,j}^p represents the regularized visual relationship enhancing representation;
- E_{i,j}^s is the visual relationship sharing representation;
- w(·,·) is the visual relationship prior function;
Pr(d_{i→j} | B, O) = softmax(W_r^s E_{i,j}^s + W_r^p E_{i,j}^p + w(ô_i, ô_j))  (9)
- W_r^s and W_r^p are the learned visual relationship sharing mapping matrix and the learned visual relationship enhancing mapping matrix, respectively, which are obtained in step 102;
- w(ô_i, ô_j) represents the prior distribution over visual relationship predicate category labels when the subject visual object category label is ô_i and the object visual object category label is ô_j.
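Putting the pieces together, a sketch of the synthetic relational reasoning of Eq. (9) follows; because Eq. (8) is not reproduced in this excerpt, the regularized enhancing representation E_{i,j}^p is formed here, as an assumption, by the attention-weighted sum of the per-subspace representations transformed by W_b^k:

```python
import torch
import torch.nn as nn

K, DIM, PRED_CLASSES = 4, 512, 50  # assumed sizes

W_b = nn.ModuleList([nn.Linear(DIM, DIM, bias=False) for _ in range(K)])  # W_b^k
W_r_s = nn.Linear(DIM, PRED_CLASSES, bias=False)  # sharing mapping matrix W_r^s
W_r_p = nn.Linear(DIM, PRED_CLASSES, bias=False)  # enhancing mapping matrix W_r^p

E_s = torch.randn(DIM)                      # sharing representation, Eq. (5)
e_k = [torch.randn(DIM) for _ in range(K)]  # preliminary representations, Eq. (6)
alpha = torch.softmax(torch.randn(K), 0)    # attentive scores, Eq. (7)

# assumed form of Eq. (8): attention-weighted, regularized enhancing representation
E_p = sum(alpha[k] * W_b[k](e_k[k]) for k in range(K))

w_prior = torch.rand(PRED_CLASSES)  # w(ô_i, ô_j), predicate-label prior

# Eq. (9): Pr(d_{i→j} | B, O) = softmax(W_r^s E_s + W_r^p E_p + w(ô_i, ô_j))
probs = torch.softmax(W_r_s(E_s) + W_r_p(E_p) + w_prior, dim=0)
print(probs.sum())  # sums to 1 over the predicate categories
```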
- the methods and systems of the present disclosure can be implemented on one or more computers or processors.
- the methods and systems disclosed can utilize one or more computers or processors to perform one or more functions in one or more locations.
- the processing of the disclosed methods and systems can also be performed by software components.
- the disclosed systems and methods can be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or devices.
- each server or computer processor can include the program modules described in the above specification.
- These program modules or module related data can be stored on the mass storage device of the server and one or more client devices.
- Each of the operating modules can comprise elements of the programming and the data management software.
- the components of the server can comprise, but are not limited to, one or more processors or processing units, a system memory, a mass storage device, an operating system, an Input/Output Interface, a display device, a display interface, a network adaptor, and a system bus that couples various system components.
- the server and one or more client devices can be implemented over a wired or wireless network connection at physically separate locations, implementing a fully distributed system.
- a server can be a personal computer, portable computer, smartphone, a network computer, a peer device, or other common network node, and so on.
- Logical connections between the server and one or more client devices can be made via a network, such as a local area network (LAN) and/or a general wide area network (WAN).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Multimedia (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Library & Information Science (AREA)
- Image Analysis (AREA)
Abstract
Description
- detecting visual objects from an input image and recognizing the visual objects by contextual message passing mechanisms to obtain context representation of the visual objects;
C = biLSTM_1([f_i; W_1 l_i]_{i=1,…,N})  (1)
h_i = LSTM_1([c_i; ô_{i−1}])  (2)
ô_i = argmax(W_2 h_i)  (3)
D = biLSTM_2([c_i; W_3 ô_i]_{i=1,…,N})  (4)
E_{i,j}^s = (W_es d_i − W_eo d_j) ∘ f_{i,j}  (5)
e_{i,j}^k = (W_es^k d_i − W_eo^k d_j) ∘ f_{i,j},  k ∈ [1, K]  (6)
α_{i,j}^k = softmax(W_α^k w(ô_i, ô_j)),  j ∈ [1, n], k ∈ [1, K]  (7)
Pr(d_{i→j} | B, O) = softmax(W_r^s E_{i,j}^s + W_r^p E_{i,j}^p + w(ô_i, ô_j))  (9)
- [1] Lu C, Krishna R, Bernstein M, et al. Visual relationship detection with language priors[C]//European Conference on Computer Vision. Springer, Cham, 2016: 852-869.
- [2] Johnson J, Krishna R, Stark M, et al. Image retrieval using scene graphs[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 3668-3678.
- [3] Yao T, Pan Y, Li Y, et al. Exploring visual relationship for image captioning[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 684-699.
- [4] Shi J, Zhang H, Li J. Explainable and explicit visual reasoning over scene graphs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 8376-8384.
- [5] Yatskar M, Zettlemoyer L, Farhadi A. Situation recognition: Visual semantic role labeling for image understanding[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 5534-5542.
- [6] Johnson J, Gupta A, Fei-Fei L. Image generation from scene graphs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 1219-1228.
- [7] Norcliffe-Brown W, Vafeias S, Parisot S. Learning conditioned graph structures for interpretable visual question answering [C]//Advances in Neural Information Processing Systems. 2018: 8334-8343.
- [8] Teney D, Liu L, van den Hengel A. Graph-structured representation for visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1-9.
- [9] Sadeghi M A, Farhadi A. Recognition using visual phrases [C]//CVPR 2011. IEEE, 2011: 1745-1752.
- [10] Yu R, Li A, Morariu V I, et al. Visual relationship detection with internal and external linguistic knowledge distillation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1974-1982.
- [11] Dai B, Zhang Y, Lin D. Detecting visual relationships with deep relational networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 3076-3086.
- [12] Xu D, Zhu Y, Choy C B, et al. Scene graph generation by iterative message passing[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 5410-5419.
- [13] Zellers R, Yatskar M, Thomson S, et al. Neural motifs: Scene graph parsing with global context[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5831-5840.
- [14] Liu A A, Su Y T, Nie W Z, et al. Hierarchical clustering multi-task learning for joint human action grouping and recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(1): 102-114.
Claims (16)
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911341230.3A CN111125406B (en) | 2019-12-23 | 2019-12-23 | A Visual Relationship Detection Method Based on Adaptive Clustering Learning |
| CN201911341230.3 | 2019-12-23 | ||
| CN2019113412303 | 2019-12-23 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20210192274A1 US20210192274A1 (en) | 2021-06-24 |
| US11361186B2 true US11361186B2 (en) | 2022-06-14 |
Family
ID=70501453
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/007,213 Active 2041-03-02 US11361186B2 (en) | 2019-12-23 | 2020-08-31 | Visual relationship detection method and system based on adaptive clustering learning |
Country Status (2)
| Country | Link |
|---|---|
| US (1) | US11361186B2 (en) |
| CN (1) | CN111125406B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11599749B1 (en) * | 2019-12-23 | 2023-03-07 | Thales Sa | Method of and system for explainable knowledge-based visual question answering |
Families Citing this family (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111325243B (en) * | 2020-02-03 | 2023-06-16 | 天津大学 | Visual relationship detection method based on regional attention learning mechanism |
| CN111985505B (en) * | 2020-08-21 | 2024-02-13 | 南京大学 | An interest visual relationship detection method and device based on interest propagation network |
| CN112163608B (en) * | 2020-09-21 | 2023-02-03 | 天津大学 | Visual relation detection method based on multi-granularity semantic fusion |
| CN112347965A (en) * | 2020-11-16 | 2021-02-09 | 浙江大学 | A method and system for video relationship detection based on spatiotemporal graph |
| CN113643241B (en) * | 2021-07-15 | 2024-10-29 | 北京迈格威科技有限公司 | Interactive relation detection method, interactive relation detection model training method and device |
| CN113688729B (en) * | 2021-08-24 | 2023-04-07 | 上海商汤科技开发有限公司 | Behavior recognition method and device, electronic equipment and storage medium |
| CN113836339B (en) * | 2021-09-01 | 2023-09-26 | 淮阴工学院 | Scene graph generation method based on global information and position embedding |
| CN114155422B (en) * | 2021-11-24 | 2025-05-27 | 北京百度网讯科技有限公司 | A method, device, equipment and storage medium for answering visual questions |
| CN114239594B (en) * | 2021-12-06 | 2024-03-08 | 西北工业大学 | Natural language visual reasoning method based on attention mechanism |
| CN115565052B (en) * | 2022-08-30 | 2026-02-10 | 电子科技大学 | A method for generating unbiased scene graphs based on a dual-branch hybrid learning network |
| CN115861697B (en) * | 2022-12-07 | 2025-08-22 | 山西大学 | A visual relationship detection method based on relationship label hierarchy |
| CN116740021A (en) * | 2023-06-14 | 2023-09-12 | 北京理工大学 | A graph convolution visual relationship detection method under industrial scene data sets |
| CN119229204B (en) * | 2024-09-30 | 2026-01-13 | 天津大学 | Fine-granularity multi-mode prompt guided visual relationship recognition method and device |
| CN119417947B (en) * | 2024-10-18 | 2025-10-14 | 重庆邮电大学 | A text-knowledge-enhanced scene graph generation method |
| CN119068272B (en) * | 2024-11-07 | 2025-02-25 | 山东省工业技术研究院 | Class-level relationship constraint and structured graph enhancement method, system, medium and terminal |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101894170B (en) * | 2010-08-13 | 2011-12-28 | 武汉大学 | Semantic relationship network-based cross-mode information retrieval method |
| CN109564706B (en) * | 2016-12-01 | 2023-03-10 | 英特吉姆股份有限公司 | User interaction platform based on intelligent interactive augmented reality |
| CN108229272B (en) * | 2017-02-23 | 2020-11-27 | 北京市商汤科技开发有限公司 | Visual relation detection method and device and visual relation detection training method and device |
- 2019-12-23: CN application CN201911341230.3A filed; granted as CN111125406B (active)
- 2020-08-31: US application US17/007,213 filed; granted as US11361186B2 (active)
Non-Patent Citations (16)
| Title |
|---|
| An-An Liu, Yu-Ting Su, Wei-Zhi Nie, and Mohan Kankanhalli, "Hierarchical Clustering Multi-Task Learning for Joint Human Action Grouping and Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, Jan. 2017. |
| Bo Dai, Yuqi Zhang, and Dahua Lin, "Detecting Visual Relationships with Deep Relational Networks." |
| Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei, "Visual Relationship Detection with Language Priors," ECCV 2016, Part I, LNCS 9905, pp. 852-869, 2016. |
| Damien Teney, Lingqiao Liu, and Anton van den Hengel, "Graph-Structured Representations for Visual Question Answering." |
| Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei, "Scene Graph Generation by Iterative Message Passing." |
| Han et al., "Visual Relationship Detection Based on Local Feature and Context Feature," 2018 International Conference on Network Infrastructure and Digital Content (IC-NIDC), pp. 420-424, Aug. 2018. * |
| Jiaxin Shi, Hanwang Zhang, and Juanzi Li, "Explainable and Explicit Visual Reasoning over Scene Graphs." |
| Jung et al., "Visual Relationship Detection with Language Prior and Softmax," 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS), pp. 143-148, Dec. 2018. * |
| Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei, "Image Retrieval Using Scene Graphs." |
| Justin Johnson, Agrim Gupta, and Li Fei-Fei, "Image Generation from Scene Graphs." |
| Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi, "Situation Recognition: Visual Semantic Role Labeling for Image Understanding." |
| Mohammad Amin Sadeghi and Ali Farhadi, "Recognition Using Visual Phrases." |
| Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi, "Neural Motifs: Scene Graph Parsing with Global Context." |
| Ruichi Yu, Ang Li, Vlad I. Morariu, and Larry S. Davis, "Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation." |
| Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei, "Exploring Visual Relationship for Image Captioning," ECCV 2018. |
| Will Norcliffe-Brown, Efstathios Vafeias, and Sarah Parisot, "Learning Conditioned Graph Structures for Interpretable Visual Question Answering," 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada. |
Also Published As
| Publication number | Publication date |
|---|---|
| US20210192274A1 (en) | 2021-06-24 |
| CN111125406A (en) | 2020-05-08 |
| CN111125406B (en) | 2023-08-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11361186B2 (en) | Visual relationship detection method and system based on adaptive clustering learning | |
| Niu et al. | Recurrent attention unit: A new gated recurrent unit for long-term memory of important parts in sequential data | |
| Torfi et al. | Natural language processing advancements by deep learning: A survey | |
| US11768876B2 (en) | Method and device for visual question answering, computer apparatus and medium | |
| US11775574B2 (en) | Method and apparatus for visual question answering, computer device and medium | |
| Liu et al. | Image Captioning in news report scenario | |
| US11301725B2 (en) | Visual relationship detection method and system based on region-aware learning mechanisms | |
| US11599749B1 (en) | Method of and system for explainable knowledge-based visual question answering | |
| US20240152770A1 (en) | Neural network search method and related device | |
| WO2021223323A1 (en) | Image content automatic description method based on construction of chinese visual vocabulary list | |
| US20190095788A1 (en) | Supervised explicit semantic analysis | |
| CN111797241A (en) | Event argument extraction method and device based on reinforcement learning | |
| Yan et al. | Multimodal feature fusion based on object relation for video captioning | |
| Agarwal et al. | From methods to datasets: A survey on Image-Caption Generators | |
| Rasool et al. | WRS: a novel word-embedding method for real-time sentiment with integrated LSTM-CNN model | |
| US20240311267A1 (en) | Efficient hardware accelerator configuration exploration | |
| Tian et al. | Attention aware bidirectional gated recurrent unit based framework for sentiment analysis | |
| Elsayed et al. | LiteLSTM architecture based on weights sharing for recurrent neural networks | |
| Verma et al. | Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos | |
| Ma et al. | Aspect-based attention LSTM for aspect-level sentiment analysis | |
| US20250371004A1 (en) | Techniques for joint context query rewrite and intent detection | |
| Liu et al. | GCN-LSTM: multi-label educational emotion prediction based on graph Convolutional network and long and short term memory network fusion label correlation in online social networks | |
| Zhu et al. | Enhance sketch recognition’s explainability via semantic component-level parsing | |
| Jaiswal et al. | An Efficient Image Captioning Method Based on Beam Search | |
| Njikam et al. | An evaluation of machine learning and deep learning approach on Ekman sentiment classification |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: TIANJIN UNIVERSITY, CHINA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: LIU, ANAN; WANG, YANHUI; XU, NING; AND OTHERS. REEL/FRAME: 053642/0230. Effective date: 20200611 |
| | FEPP | Fee payment procedure | Entity status set to undiscounted (original event code: BIG.); entity status of patent owner: small entity |
| | FEPP | Fee payment procedure | Entity status set to small (original event code: SMAL); entity status of patent owner: small entity |
| | STPP | Information on status: patent application and granting procedure in general | Docketed new case - ready for examination |
| | STPP | Information on status: patent application and granting procedure in general | Notice of allowance mailed - application received in Office of Publications |
| | STPP | Information on status: patent application and granting procedure in general | Publications - issue fee payment verified |
| | STCF | Information on status: patent grant | Patented case |
| | MAFP | Maintenance fee payment | Payment of maintenance fee, 4th yr, small entity (original event code: M2551); entity status of patent owner: small entity. Year of fee payment: 4 |