CA3156204A1 - Domain based text extraction - Google Patents

Domain based text extraction

Info

Publication number
CA3156204A1
Authority
CA
Canada
Prior art keywords
text
text entities
entities
entity
pattern
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3156204A
Other languages
English (en)
Inventor
Madhusudan Singh
Kaushik Halder
Nirmal VANAPALLI VENKATA RAMESH RAYULU
Aritra Ghosh Dastidar
Ajay SHA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
L&T Technology Services Ltd
Original Assignee
L&T Technology Services Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-30
Filing date
2020-12-30
Publication date
2021-07-08
Application filed by L&T Technology Services Ltd
Publication of CA3156204A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/10 Recognition assisted with metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a method and system for extracting information from the content of an input file. The method may include identifying text data in the input file, receiving a text input provided by a user in order to identify relevant text entities among a plurality of text entities, and automatically generating a search pattern corresponding to the text input. The method may further include determining a pattern associated with each of the plurality of text entities and mapping the search pattern corresponding to the text input to the patterns associated with the plurality of text entities. The method may further include identifying, based on the mapping, one or more matching patterns among the patterns associated with the plurality of text entities and extracting, from the plurality of text entities, the relevant text entities corresponding to the one or more matching patterns.
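
The flow outlined in the abstract (derive a search pattern from a user's sample text, derive a pattern for each text entity, compare the two, and extract the entities whose patterns match) can be illustrated with a minimal Python sketch. The abstract does not say how patterns are generated or compared, so everything here is an assumption made for illustration: the character-class "shape" encoding, the whitespace tokenization, the exact-equality comparison, and the names `shape_pattern`, `split_entities`, and `extract_relevant` are all hypothetical and are not taken from the patent.

```python
from typing import List


def shape_pattern(text: str) -> str:
    """Encode a string as a run-collapsed character-class pattern.

    Assumed scheme (not from the patent): uppercase letters -> 'A',
    lowercase letters -> 'a', digits -> '9'; any other character is kept
    literally. Consecutive identical symbols collapse into one.
    """
    symbols: List[str] = []
    for ch in text:
        if ch.isdigit():
            sym = "9"
        elif ch.isalpha():
            sym = "A" if ch.isupper() else "a"
        else:
            sym = ch
        # Collapse runs so that "2041" and "7" share the same pattern.
        if not symbols or symbols[-1] != sym:
            symbols.append(sym)
    return "".join(symbols)


def split_entities(text_data: str) -> List[str]:
    """Split raw text into whitespace-delimited text entities (an assumption)."""
    return text_data.split()


def extract_relevant(text_data: str, user_input: str) -> List[str]:
    """Return the text entities whose pattern equals the search pattern."""
    # Search pattern generated automatically from the user's sample text.
    search = shape_pattern(user_input)
    # Compare the search pattern against each entity's own pattern.
    return [e for e in split_entities(text_data) if shape_pattern(e) == search]


if __name__ == "__main__":
    text = "Invoice INV-2041 dated 2020-12-30 replaces INV-1987 issued 2019-12-30"
    # The user supplies one example of what they are looking for.
    print(extract_relevant(text, "INV-0001"))  # ['INV-2041', 'INV-1987']
```

Exact equality of collapsed shapes is the simplest possible comparison; a real system would likely need a tolerance or distance measure between patterns, which this sketch does not attempt.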
CA3156204A 2019-12-30 2020-12-30 Domain based text extraction Pending CA3156204A1 (fr)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
IN201941054421 2019-12-30
IN201941054421 2019-12-30
PCT/IB2020/062535 WO2021137166A1 (fr) 2019-12-30 2020-12-30 Domain based text extraction

Publications (1)

Publication Number Publication Date
CA3156204A1 (fr) 2021-07-08

Family

ID=76685920

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3156204A Pending CA3156204A1 (fr) 2019-12-30 2020-12-30 Domain based text extraction

Country Status (5)

Country Link
EP (1) EP4085343A4 (fr)
JP (1) JP2023507881A (fr)
AU (1) AU2020418619A1 (fr)
CA (1) CA3156204A1 (fr)
WO (1) WO2021137166A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912845B (zh) * 2023-06-16 2024-03-19 广东电网有限责任公司佛山供电局 Intelligent content recognition and analysis method and device based on NLP and AI

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10318804B2 (en) * 2014-06-30 2019-06-11 First American Financial Corporation System and method for data extraction and searching

Also Published As

Publication number Publication date
EP4085343A1 (fr) 2022-11-09
WO2021137166A1 (fr) 2021-07-08
EP4085343A4 (fr) 2024-01-03
AU2020418619A1 (en) 2022-05-26
JP2023507881A (ja) 2023-02-28

Similar Documents

Publication Publication Date Title
US11055327B2 (en) Unstructured data parsing for structured information
US8489388B2 (en) Data detection
US11914968B2 (en) Official document processing method, device, computer equipment and storage medium
CN111460827B (zh) Text information processing method, system, device, and computer-readable storage medium
US20200159993A1 (en) Methods, devices and systems for data augmentation to improve fraud detection
US9098487B2 (en) Categorization based on word distance
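RU2768233C1 (ru) Fuzzy search using word forms for working with big data
CN112258144B (zh) Policy document information matching and pushing method based on an automatically constructed target entity set
CN114298035A (zh) Text recognition and desensitization method and system
EP4141818A1 (fr) Document digitization, transformation and validation
CN115223188A (zh) Bill information processing method and apparatus, electronic device, and computer storage medium
CN112464927B (zh) Information extraction method, device, and system
CA3156204A1 (fr) Domain based text extraction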
US20240020473A1 (en) Domain Based Text Extraction
CN112989820B (zh) Legal document positioning method, apparatus, device, and storage medium
CA3170100A1 (fr) Text processing method and apparatus, and computer-readable data medium
CN115294593A (zh) Image information extraction method and apparatus, computer device, and storage medium
US20240143632A1 (en) Extracting information from documents using automatic markup based on historical data
US20240020479A1 (en) Training machine learning models for multi-modal entity matching in electronic records
Pandey et al. A Robust Approach to Plagiarism Detection in Handwritten Documents
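CN114254627A (zh) Text error correction method, apparatus, device, and readable storage medium
CN118013970A (zh) Vocabulary enhancement method, apparatus, device, and storage medium
CN115098642A (zh) Data processing method and apparatus, computer device, and storage medium
CN115310462A (zh) Metadata recognition and translation method and system based on NLP technology
CN111539605A (zh) Method and apparatus for constructing an enterprise profile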