CA3156204A1 - Domain based text extraction - Google Patents
Domain based text extraction
- Publication number
- CA3156204A1
- Authority
- CA
- Canada
- Prior art keywords
- text
- text entities
- entities
- entity
- pattern
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Concepts
Concept | Type | Sections | Count |
---|---|---|---|
extraction | Methods | title, claims, description | 51 |
method | Methods | claims, abstract, description | 51 |
mapping | Methods | claims, abstract, description | 30 |
machine learning | Methods | claims, description | 42 |
gene expression | Effects | claims, description | 24 |
approach | Methods | description | 44 |
natural language processing | Methods | description | 25 |
diagram | Methods | description | 10 |
oil well | Substances | description | 7 |
optical character recognition | Methods | description | 6 |
data extraction | Methods | description | 2 |
pre-processing | Methods | description | 2 |
adaptation | Effects | description | 1 |
artificial intelligence | Methods | description | 1 |
communication | Methods | description | 1 |
conventional method | Methods | description | 1 |
function | Effects | description | 1 |
matrix | Substances | description | 1 |
modification | Methods | description | 1 |
modification | Effects | description | 1 |
scratching | Methods | description | 1 |
scratching | Effects | description | 1 |
testing | Methods | description | 1 |
training | Methods | description | 1 |
transient | Effects | description | 1 |
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/26—Techniques for post-processing, e.g. correcting the recognition result
- G06V30/262—Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/10—Recognition assisted with metadata
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present disclosure relates to a method and a system for extracting information from the content of an input file. The method may include identifying text data in the input file, receiving a text input provided by a user for identifying relevant text entities among a plurality of text entities, and automatically generating a search pattern corresponding to the text input. The method may further include determining a pattern associated with each of the plurality of text entities and mapping the search pattern corresponding to the text input to the patterns associated with the plurality of text entities. The method may further include identifying, based on the mapping, one or more matching patterns among the patterns associated with the plurality of text entities, and extracting, from the plurality of text entities, the relevant text entities corresponding to the one or more matching patterns.
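The flow described in the abstract (generate a search pattern from the user's text input, determine a pattern for each text entity, map one to the other, extract the matches) can be sketched roughly as below. This is only an illustration under stated assumptions, not the method claimed in the patent: the abstract does not define what a "pattern" is, so the sketch assumes a simple character-class shape, and the names `shape` and `extract_relevant`, the whitespace tokenization, and the sample text are all hypothetical.

```python
from typing import List


def shape(text: str) -> str:
    """Assumed pattern function: digits become '9', uppercase letters 'A',
    lowercase letters 'a'; every other character is kept unchanged."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def extract_relevant(text_data: str, text_input: str) -> List[str]:
    """Return the text entities whose pattern equals the search pattern
    generated from the user's text input."""
    # Automatically generate the search pattern from the user's example.
    search_pattern = shape(text_input)
    # Assumption: text entities are whitespace-separated tokens.
    entities = text_data.split()
    # Determine a pattern for each entity, then map and keep exact matches.
    return [e for e in entities if shape(e) == search_pattern]


if __name__ == "__main__":
    data = "Well 42-501-20130 drilled 2019-12-30 near 42-501-20131"
    # The user supplies an example with the same layout as the wanted values.
    print(extract_relevant(data, "11-222-33333"))
    # ['42-501-20130', '42-501-20131']
```

Exact equality of shapes is the simplest possible mapping; a real system would likely tolerate variation in length or separators, which this sketch does not attempt.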
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN201941054421 | 2019-12-30 | ||
PCT/IB2020/062535 WO2021137166A1 (fr) | 2019-12-30 | 2020-12-30 | Domain based text extraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CA3156204A1 (fr) | 2021-07-08 |
Family
ID=76685920
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA3156204A (Pending; CA3156204A1) | 2019-12-30 | 2020-12-30 | Domain based text extraction |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP4085343A4 (fr) |
JP (1) | JP2023507881A (fr) |
AU (1) | AU2020418619A1 (fr) |
CA (1) | CA3156204A1 (fr) |
WO (1) | WO2021137166A1 (fr) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116912845B (zh) * | 2023-06-16 | 2024-03-19 | 广东电网有限责任公司佛山供电局 | Intelligent content recognition and analysis method and device based on NLP and AI |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10318804B2 (en) * | 2014-06-30 | 2019-06-11 | First American Financial Corporation | System and method for data extraction and searching |
2020
- 2020-12-30: WO application PCT/IB2020/062535 (WO2021137166A1), status unknown
- 2020-12-30: EP application EP20910797.8A (EP4085343A4), status pending (active)
- 2020-12-30: CA application CA3156204A (CA3156204A1), status pending (active)
- 2020-12-30: AU application AU2020418619A (AU2020418619A1), status pending (active)
- 2020-12-30: JP application JP2022525481A (JP2023507881A), status pending (active)
Also Published As
Publication number | Publication date |
---|---|
EP4085343A1 (fr) | 2022-11-09 |
WO2021137166A1 (fr) | 2021-07-08 |
EP4085343A4 (fr) | 2024-01-03 |
AU2020418619A1 (en) | 2022-05-26 |
JP2023507881A (ja) | 2023-02-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11055327B2 (en) | | Unstructured data parsing for structured information |
US8489388B2 (en) | | Data detection |
US11914968B2 (en) | | Official document processing method, device, computer equipment and storage medium |
CN111460827B (zh) | | Text information processing method, system, device and computer-readable storage medium |
US20200159993A1 (en) | | Methods, devices and systems for data augmentation to improve fraud detection |
US9098487B2 (en) | | Categorization based on word distance |
RU2768233C1 (ru) | | Fuzzy search using word forms for working with big data |
CN112258144B (zh) | | Policy document information matching and pushing method based on automatic construction of a target entity set |
CN114298035A (zh) | | Text recognition and desensitization method and system |
EP4141818A1 (fr) | | Document digitization, transformation and validation |
CN115223188A (zh) | | Bill information processing method and apparatus, electronic device and computer storage medium |
CN112464927B (zh) | | Information extraction method, apparatus and system |
CA3156204A1 (fr) | | Domain based text extraction |
US20240020473A1 (en) | | Domain Based Text Extraction |
CN112989820B (zh) | | Legal document positioning method, apparatus, device and storage medium |
CA3170100A1 (fr) | | Text processing method and apparatus, and computer-readable storage medium |
CN115294593A (zh) | | Image information extraction method and apparatus, computer device and storage medium |
US20240143632A1 (en) | | Extracting information from documents using automatic markup based on historical data |
US20240020479A1 (en) | | Training machine learning models for multi-modal entity matching in electronic records |
Pandey et al. | | A Robust Approach to Plagiarism Detection in Handwritten Documents |
CN114254627A (zh) | | Text error correction method, apparatus, device and readable storage medium |
CN118013970A (zh) | | Vocabulary enhancement method, apparatus, device and storage medium |
CN115098642A (zh) | | Data processing method and apparatus, computer device and storage medium |
CN115310462A (zh) | | Metadata recognition and translation method and system based on NLP technology |
CN111539605A (zh) | | Method and apparatus for constructing an enterprise portrait |