JP7345050B2 - Contextual grounding of natural language phrases in images - Google Patents
Contextual grounding of natural language phrases in images
- Publication number
- JP7345050B2 (application JP2022506821A)
- Authority
- JP
- Japan
- Prior art keywords
- image
- text
- entity
- input
- grounding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/255—Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/768—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Description
Claims (7)
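- A method for text-image retrieval including a text branch and an image branch, the method comprising:
receiving a text query and an image as input;
parsing the input text query into tokens and converting them into entity embedding vectors;
identifying visual object candidates in the input image;
scoring the correspondence between the entity embeddings and the visual object candidates;
providing, to a user of the system, the object corresponding to the query text entity with the highest probability score, visualized with a bounding box;
receiving, by the image branch, region-of-interest (RoI) features from an object detector as input objects; and
training a two-layer multilayer perceptron (MLP) to generate a spatial embedding given absolute spatial information on the position and size of the RoI, normalized to the whole image,
wherein no specific embedding or object feature extraction is used in the method.
- The method of claim 1, further comprising pre-training the text branch using a BERT (Bidirectional Encoder Representations from Transformers)-based model.
- The method of claim 1, further comprising adding, by both branches, positional and spatial embeddings to the tokens and the RoIs, respectively, as input to a first interaction layer of the MLP.
- The method of claim 3, further comprising, at each layer of the MLP, performing self-attention among the hidden representations to generate new hidden representations as the layer output.
- The method of claim 4, further comprising, at the end of each branch, providing the final hidden states to a grounding head to provide a cross-modal attention response having the hidden states of the text entities as queries and the hidden representations of the image objects as keys.
- The method of claim 5, wherein a matching correspondence is determined from the attention response.
- The method of claim 6, further comprising back-propagating a mean binary cross-entropy loss per entity when the correspondence does not match the ground truth.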
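The spatial embedding of claim 1, the cross-modal attention response of claims 5 and 6, and the loss of claim 7 can be illustrated with a minimal sketch in plain Python. It is not the patented implementation. The six-value box layout, the ReLU activation, the scaled dot product followed by a sigmoid, the random initial weights, and every function name below are assumptions made for the example, and the hidden states are taken as given rather than produced by the interaction layers.

```python
import math
import random


def spatial_features(box, img_w, img_h):
    """Normalize an RoI box (x1, y1, x2, y2) by the image size.

    The [x1, y1, x2, y2, width, height] layout is an assumed encoding
    of "position and size normalized to the whole image".
    """
    x1, y1, x2, y2 = box
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
            (x2 - x1) / img_w, (y2 - y1) / img_h]


def init_params(in_dim, hidden_dim, out_dim, rng):
    """Random weights for a two-layer MLP (hypothetical initialization)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    return {"w1": mat(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "w2": mat(out_dim, hidden_dim), "b2": [0.0] * out_dim}


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def spatial_embedding(feats, params):
    """Two-layer MLP: linear, ReLU (assumed), linear."""
    hidden = [max(0.0, h) for h in linear(feats, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def grounding_scores(entity_states, object_states):
    """Attention response with entity states as queries, object states as keys.

    Returns one probability per (entity, object) pair. Scaling by
    sqrt(d) and squashing with a sigmoid are assumptions.
    """
    d = len(object_states[0])
    return [[sigmoid(sum(q * k for q, k in zip(query, key)) / math.sqrt(d))
             for key in object_states]
            for query in entity_states]


def mean_bce(scores, targets):
    """Binary cross-entropy averaged per entity, then over entities."""
    eps = 1e-12
    per_entity = []
    for row, target_row in zip(scores, targets):
        losses = [-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
                  for p, t in zip(row, target_row)]
        per_entity.append(sum(losses) / len(losses))
    return sum(per_entity) / len(per_entity)


# Usage: embed one RoI, then ground one entity against two candidate objects.
params = init_params(6, 8, 4, random.Random(0))
embedding = spatial_embedding(spatial_features((50, 40, 150, 240), 200, 400), params)
scores = grounding_scores([[1.0, 0.0]], [[4.0, 0.0], [-4.0, 0.0]])
best_object = max(range(len(scores[0])), key=scores[0].__getitem__)
loss = mean_bce(scores, [[1, 0]])  # ground truth: object 0 matches the entity
print(len(embedding), best_object, loss)
```

In this sketch the matching object for an entity is simply the candidate with the highest score, and the loss would be back-propagated through the scores during training; the gradient step itself is omitted.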
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962899307P | 2019-09-12 | 2019-09-12 | |
US62/899,307 | 2019-09-12 | | |
US17/014,984 | 2020-09-08 | | |
US17/014,984 US11620814B2 (en) | 2019-09-12 | 2020-09-08 | Contextual grounding of natural language phrases in images |
PCT/US2020/050258 WO2021050776A1 (en) | 2019-09-12 | 2020-09-10 | Contextual grounding of natural language phrases in images |
Publications (2)
Publication Number | Publication Date |
---|---|
JP2022543123A (ja) | 2022-10-07 |
JP7345050B2 (ja) | 2023-09-14 |
Family
ID=74865601
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2022506821A Active JP7345050B2 (ja) | 2019-09-12 | 2020-09-10 | Contextual grounding of natural language phrases in images |
Country Status (4)
Country | Link |
---|---|
US (1) | US11620814B2 (ja) |
JP (1) | JP7345050B2 (ja) |
DE (1) | DE112020004321T5 (ja) |
WO (1) | WO2021050776A1 (ja) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11620814B2 (en) * | 2019-09-12 | 2023-04-04 | Nec Corporation | Contextual grounding of natural language phrases in images |
US11809822B2 (en) * | 2020-02-27 | 2023-11-07 | Adobe Inc. | Joint visual-semantic embedding and grounding via multi-task training for image searching |
US11699275B2 (en) * | 2020-06-17 | 2023-07-11 | Tata Consultancy Services Limited | Method and system for visio-linguistic understanding using contextual language model reasoners |
US11615567B2 (en) * | 2020-11-18 | 2023-03-28 | Adobe Inc. | Image segmentation using text embedding |
EP4248446A1 (en) * | 2020-11-23 | 2023-09-27 | NE47 Bio, Inc. | Protein database search using learned representations |
US11775617B1 (en) * | 2021-03-15 | 2023-10-03 | Amazon Technologies, Inc. | Class-agnostic object detection |
CN113378815B (zh) * | 2021-06-16 | 2023-11-24 | 南京信息工程大学 | System for scene text localization and recognition, and training and recognition methods thereof |
WO2022261570A1 (en) * | 2021-08-04 | 2022-12-15 | Innopeak Technology, Inc. | Cross-attention system and method for fast video-text retrieval task with image clip |
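CN114691847B (zh) * | 2022-03-10 | 2024-04-26 | 华中科技大学 | Visual question answering method using a relation attention network based on depth perception and semantic guidance |
CN115098722B (zh) * | 2022-08-25 | 2022-12-27 | 北京达佳互联信息技术有限公司 | Text and image matching method and apparatus, electronic device, and storage medium |
CN116702094B (zh) * | 2023-08-01 | 2023-12-22 | 国家计算机网络与信息安全管理中心 | Method for representing group application preference features |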
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170262475A1 (en) | 2014-12-16 | 2017-09-14 | A9.Com, Inc. | Approaches for associating terms with image regions |
US20190130206A1 (en) | 2017-10-27 | 2019-05-02 | Salesforce.Com, Inc. | Interpretable counting in visual question answering |
US20190266236A1 (en) | 2019-05-14 | 2019-08-29 | Intel Corporation | Early exit for natural language processing models |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7695960B2 (en) * | 2003-06-05 | 2010-04-13 | Transgene S.A. | Composition comprising the polyprotein NS3/NS4 and the polypeptide NS5B of HCV, expression vectors including the corresponding nucleic sequences and their therapeutic use |
US8521670B2 (en) * | 2011-05-25 | 2013-08-27 | HGST Netherlands B.V. | Artificial neural network application for magnetic core width prediction and modeling for magnetic disk drive manufacture |
US10831820B2 (en) * | 2013-05-01 | 2020-11-10 | Cloudsight, Inc. | Content based image management and selection |
WO2015189603A1 (en) * | 2014-06-09 | 2015-12-17 | University Of Lincoln | Assembly, apparatus, system and method |
US10146768B2 (en) * | 2017-01-25 | 2018-12-04 | Google Llc | Automatic suggested responses to images received in messages using language model |
US11288508B2 (en) * | 2017-10-02 | 2022-03-29 | Sensen Networks Group Pty Ltd | System and method for machine learning-driven object detection |
US10579897B2 (en) * | 2017-10-02 | 2020-03-03 | Xnor.ai Inc. | Image based object detection |
US11250299B2 (en) * | 2018-11-01 | 2022-02-15 | Nec Corporation | Learning representations of generalized cross-modal entailment tasks |
NL2021956B1 (en) * | 2018-11-08 | 2020-05-15 | Univ Johannesburg | Method and system for high speed detection of diamonds |
US20200250398A1 (en) * | 2019-02-01 | 2020-08-06 | Owkin Inc. | Systems and methods for image classification |
US11620814B2 (en) * | 2019-09-12 | 2023-04-04 | Nec Corporation | Contextual grounding of natural language phrases in images |
2020
- 2020-09-08: US application US17/014,984 (US11620814B2), status: Active
- 2020-09-10: WO application PCT/US2020/050258 (WO2021050776A1), status: Application Filing
- 2020-09-10: JP application JP2022506821A (JP7345050B2), status: Active
- 2020-09-10: DE application DE112020004321.5T (DE112020004321T5), status: Pending
Also Published As
Publication number | Publication date |
---|---|
US20210081728A1 (en) | 2021-03-18 |
WO2021050776A1 (en) | 2021-03-18 |
US11620814B2 (en) | 2023-04-04 |
DE112020004321T5 (de) | 2022-06-09 |
JP2022543123A (ja) | 2022-10-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7345050B2 (ja) | | Contextual grounding of natural language phrases in images |
WO2021233112A1 (zh) | | Translation method, apparatus, device, and storage medium based on multimodal machine learning |
Lee et al. | | Sentiment classification with word localization based on weakly supervised learning with a convolutional neural network |
Gong et al. | | Natural language inference over interaction space |
JP2021166046A (ja) | | Method for training a convolutional neural network for image recognition using image-conditioned masked language modeling |
CN108960338B (zh) | | Automatic image sentence annotation method based on an attention feedback mechanism |
CN115618045B (zh) | | Visual question answering method, apparatus, and storage medium |
Wang et al. | | Stroke constrained attention network for online handwritten mathematical expression recognition |
Peng et al. | | UMass at ImageCLEF Medical Visual Question Answering (Med-VQA) 2018 Task. |
CN115116066A (zh) | | Scene text recognition method based on character distance awareness |
EP4302234A1 (en) | | Cross-modal processing for vision and language |
Wang et al. | | Tag: Boosting text-vqa via text-aware visual question-answer generation |
CN116595195A (zh) | | Knowledge graph construction method, apparatus, and medium |
CN111597816A (zh) | | Self-attention named entity recognition method, apparatus, device, and storage medium |
Parvin et al. | | Transformer-based local-global guidance for image captioning |
Hafeth et al. | | Semantic representations with attention networks for boosting image captioning |
Merdivan et al. | | Image-based text classification using 2d convolutional neural networks |
CN115759262A (zh) | | Visual commonsense reasoning method and system based on a knowledge-aware attention network |
Tannert et al. | | FlowchartQA: the first large-scale benchmark for reasoning over flowcharts |
Lai et al. | | Contextual grounding of natural language entities in images |
Beltr et al. | | Semantic text recognition via visual question answering |
Wang et al. | | TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering |
El-Gayar | | Automatic Generation of Image Caption Based on Semantic Relation using Deep Visual Attention Prediction |
Chandrasekar et al. | | Indic visual question answering |
Peng et al. | | Transformer-based Sparse Encoder and Answer Decoder for Visual Question Answering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
2022-02-02 | A621 | Written request for application examination | Free format text: JAPANESE INTERMEDIATE CODE: A621; Effective date: 20220202 |
2023-04-04 | A131 | Notification of reasons for refusal | Free format text: JAPANESE INTERMEDIATE CODE: A131; Effective date: 20230404 |
2023-05-12 | A521 | Request for written amendment filed | Free format text: JAPANESE INTERMEDIATE CODE: A523; Effective date: 20230512 |
 | TRDD | Decision of grant or rejection written | |
2023-08-29 | A01 | Written decision to grant a patent or to grant a registration (utility model) | Free format text: JAPANESE INTERMEDIATE CODE: A01; Effective date: 20230829 |
2023-09-04 | A61 | First payment of annual fees (during grant procedure) | Free format text: JAPANESE INTERMEDIATE CODE: A61; Effective date: 20230904 |
 | R150 | Certificate of patent or registration of utility model | Ref document number: 7345050; Country of ref document: JP; Free format text: JAPANESE INTERMEDIATE CODE: R150 |
 | S111 | Request for change of ownership or part of ownership | Free format text: JAPANESE INTERMEDIATE CODE: R313113 |
 | R350 | Written notification of registration of transfer | Free format text: JAPANESE INTERMEDIATE CODE: R350 |