GB2627092A - Deep learning techniques for extraction of embedded data from documents - Google Patents

Deep learning techniques for extraction of embedded data from documents

Info

Publication number
GB2627092A
Authority
GB
United Kingdom
Prior art keywords
text
sub
embeddings
groupings
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2405984.2A
Other languages
English (en)
Other versions
GB202405984D0 (en)
Inventor
Zhong Xu
Don Thanuja Samodhye Dharmasiri Yakupitiyage
Long Duong Thanh
Edward Johnson Mark
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oracle International Corp
Original Assignee
Oracle International Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-10-29
Filing date
2022-08-15
Publication date
2024-08-14
Application filed by Oracle International Corp
Publication of GB202405984D0
Publication of GB2627092A
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G06F40/35: Discourse or dialogue representation
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/103: Formatting, i.e. changing of presentation of documents
    • G06F40/109: Font handling; Temporal or kinetic typography
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
GB2405984.2A 2021-10-29 2022-08-15 Deep learning techniques for extraction of embedded data from documents Pending GB2627092A (en)

Applications Claiming Priority (3)

Application Number | Publication | Priority Date | Filing Date | Title
US202163273761P | | 2021-10-29 | 2021-10-29 |
US17/819,445 | US12367352B2 (en) | 2021-10-29 | 2022-08-12 | Deep learning techniques for extraction of embedded data from documents
PCT/US2022/074974 | WO2023076754A1 (en) | 2021-10-29 | 2022-08-15 | Deep learning techniques for extraction of embedded data from documents

Publications (2)

Publication Number | Publication Date
GB202405984D0 (en) | 2024-06-12
GB2627092A (en) | 2024-08-14

Family

ID=86147364

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
GB2405984.2A | Pending | GB2627092A (en) | 2021-10-29 | 2022-08-15 | Deep learning techniques for extraction of embedded data from documents

Country Status (6)

Country | Link
US (2) | US12367352B2
JP (1) | JP2024540111A
KR (1) | KR20240091051A
CN (1) | CN118202344A
GB (1) | GB2627092A
WO (1) | WO2023076754A1

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12158900B2 (en) * 2022-10-28 2024-12-03 Abbyy Development Inc. Extracting information from documents using automatic markup based on historical data
US12315052B2 (en) * 2022-12-15 2025-05-27 Accenture Global Solutions Limited Generation of context-aware word embedding vectors for given semantic properties of a word using few texts
US12314318B2 (en) * 2023-02-17 2025-05-27 Snowflake Inc. Enhanced searching using fine-tuned machine learning models
US12562163B2 (en) * 2023-05-12 2026-02-24 Servicenow, Inc. Bidirectional assistant for development platforms
US11928569B1 (en) * 2023-06-30 2024-03-12 Intuit, Inc. Automated user experience orchestration using natural language based machine learning techniques
CN116561602B (zh) * 2023-07-10 2023-09-19 三峡高科信息技术有限责任公司 Method for automatically matching sales and procurement materials for sales cost carry-over
US12277150B2 (en) * 2023-07-20 2025-04-15 Quantem Healthcare, Inc. Computing technologies for hierarchies of chatbot application programs operative based on data structures containing unstructured texts
CN117097790A (zh) * 2023-08-08 2023-11-21 北京字跳网络技术有限公司 Information push method, apparatus, computer device, and storage medium
US20250371272A1 (en) * 2024-06-04 2025-12-04 Optum, Inc. Modified large language model architecture with span-level attention mechanism for conversion of natural language text to structured knowledge graph

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004326600A (ja) * 2003-04-25 2004-11-18 Fujitsu Ltd Clustering device for structured documents
US20190073420A1 (en) * 2017-09-04 2019-03-07 Borislav Agapiev System for creating a reasoning graph and for ranking of its nodes
KR20190058935A (ko) * 2017-11-22 2019-05-30 주식회사 와이즈넛 System and method for extracting core keywords in a document
US20200073882A1 (en) * 2018-08-31 2020-03-05 Accenture Global Solutions Limited Artificial intelligence based corpus enrichment for knowledge population and query response
US10607042B1 (en) * 2019-02-12 2020-03-31 Live Objects, Inc. Dynamically trained models of named entity recognition over unstructured data

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10380259B2 (en) * 2017-05-22 2019-08-13 International Business Machines Corporation Deep embedding for natural language content based on semantic dependencies
US11914954B2 (en) * 2019-12-08 2024-02-27 Virginia Tech Intellectual Properties, Inc. Methods and systems for generating declarative statements given documents with questions and answers
US11861314B2 (en) * 2020-04-03 2024-01-02 Asapp, Inc. Extracting clinical follow-ups from discharge summaries
US11741146B2 (en) * 2020-07-13 2023-08-29 Nec Corporation Embedding multi-modal time series and text data
US20220093088A1 (en) * 2020-09-24 2022-03-24 Apple Inc. Contextual sentence embeddings for natural language processing applications
CN113011169B (zh) * 2021-01-27 2022-11-11 北京字跳网络技术有限公司 Method, apparatus, device, and medium for processing meeting minutes

Also Published As

Publication number Publication date
JP2024540111A (ja) 2024-10-31
US20250307566A1 (en) 2025-10-02
GB202405984D0 (en) 2024-06-12
KR20240091051A (ko) 2024-06-21
US20230139397A1 (en) 2023-05-04
US12367352B2 (en) 2025-07-22
WO2023076754A1 (en) 2023-05-04
CN118202344A (zh) 2024-06-14

Similar Documents

Publication Publication Date Title
GB2627092A (en) Deep learning techniques for extraction of embedded data from documents
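JP2024540111A5
CN110770735B (zh) Encoding conversion of documents with embedded mathematical expressions
JP2618832B2 (ja) Method and system for analyzing the logical structure of a document
CN111914825B (zh) Character recognition method, apparatus, and electronic device
KR101851785B1 (ko) Apparatus and method for generating a training set for a chatbot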
WO2007005937A2 (en) Grammatical parsing of document visual structures
Layton et al. Recentred local profiles for authorship attribution
CN114417871A (zh) Model training and named entity recognition method, apparatus, electronic device, and medium
US20220414328A1 (en) Method and system for predicting field value using information extracted from a document
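CN111737949B (zh) Question content extraction method, apparatus, readable storage medium, and computer device
CN110413972B (zh) Intelligent completion method for table names and field names based on NLP technology
KR101851789B1 (ko) Apparatus and method for generating domain-similar phrases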
Fréry et al. Ujm at clef in author verification based on optimized classification trees
Clausner et al. Efficient ocr training data generation with aletheia
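CN106446147A (zh) Sentiment analysis method based on structured features
CN116822634A (zh) Document visual-language reasoning method based on layout-aware prompts
CN114492437B (zh) Keyword recognition method, apparatus, electronic device, and storage medium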
Sun et al. Squared english word: A method of generating glyph to use super characters for sentiment analysis
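CN112528682A (zh) Language detection method, apparatus, electronic device, and storage medium
CN111984845B (zh) Method and system for identifying typos on websites
CN107797986A (zh) Mixed-corpus word segmentation method based on LSTM-CNN
CN109325237B (zh) Complete-sentence recognition method and system for machine translation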
Se et al. AMRITA_CEN@ FIRE 2015: Extracting entities for social media texts in Indian languages
Dinu et al. Romanian syllabication using machine learning