WO2023092211A1 - Método para extração e estruturação de informações (Method for extracting and structuring information) - Google Patents
- Publication number: WO2023092211A1 (PCT/BR2022/050465)
- Authority: WIPO (PCT)
Classifications
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V30/10—Character recognition
- G06V30/413—Classification of content, e.g. text, photographs or tables
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
Definitions
- The present invention relates to information retrieval in documents of interest to the oil and gas (O&G) industry. The invention extracts information from technical documents; this information can then be enriched with metadata of interest in the domain, indexed and searched by search engines.
- Extraction and structuring of information is an automatic task, performed by a computer and composed of several subprocesses.
- Different challenges arise in this type of task. For example, it may be necessary to extract the information from a page correctly without mixing text, images and tables, or to structure images and tables and relate them to their descriptive captions.
- Document US20200167558A1 discloses a system and method for using one or more computing devices to categorize text regions of an electronic document into types of document objects based on a combination of semantic information and appearance information of the electronic document.
- Document US20210158093A1 discloses a system that creates computer-generated synthetic documents with precisely labeled page elements.
- The synthetic document generation system determines layout parameters for a plurality of image layouts.
- Document CN110334346B discloses a method and device to extract information from a PDF file, based on marking positions of images and texts.
- The aim of that process is to structure textual information into key and value collections, organized hierarchically based on document layouts.
- Its method for extracting text regions uses an abstraction of line segments based on the extraction of character coordinates, data that is immediately available from the internal structure of PDF files, and therefore it cannot be applied to documents that require OCR. It thus departs from the more general approach, based on computer vision with neural networks, that is used in this invention.
- Document CN111259830A discloses a method for obtaining training data from PDF documents after manual labeling, using these data to train a convolutional neural network, and using the resulting trained model to extract information from PDF documents in the field of international agricultural trade.
- It includes a method for obtaining training data from real PDF documents and the subsequent training of the convolutional neural network for classifying PDF file content fragments.
- It differs fundamentally from this invention in how the training data are obtained: in this invention they are synthetic documents, which means a much greater potential supply of training examples for the neural network, and therefore greater accuracy for the object detection model.
- Document CN113343658A discloses a method, device and computer equipment for extracting information from tables in PDF files.
- The information in a PDF file is mainly divided into text paragraphs, tables and images. Extracting images is relatively simple, while extracting text paragraphs and tables is more complicated, especially extracting complex nested tables.
- The method works by extracting the simplest possible form of a table and then proceeding recursively through it, finding the nested tables, until the complete table is extracted.
- The document claims that the method has "advantages of being simple to implement, having high extraction efficiency, high speed and the ability to retain the internal logical relationships of complex tables". It is specialized only in extracting information from tables in PDF files, and therefore is not applicable to extracting images and captions.
- The invention aims to automatically extract textual data, images and tables from scanned documents in different formats.
- The method uses computational models of artificial intelligence developed specifically to meet the particularities of the specialized domain of the oil and gas (O&G) industry.
- The invention was designed to support execution in a supercomputing environment, offering support for highly parallel processing, in order to allow efficient extraction from a large volume of unstructured documents.
- The invention proposes a method that receives a set of unstructured documents as input, extracts and structures their information, and reorganizes and makes this information available in files so that it can be consumed by other systems.
- The method for extracting and structuring information comprises: (1) PDF page separator, (2) block detection and segmentation model, (3) table extractor, (4) image extractor, (5) image classification model, (6) text extractor, (7) computer vision model to improve text image quality, (8) optical character recognition model, (9) model for spelling correction, (10) models for semantic text enrichment, (11) output file organizer, (12) metadata aggregator for information enrichment.
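The twelve components above form a per-page routing pipeline: blocks are detected, then each block is dispatched to a type-specific extractor. A minimal Python sketch of that dispatch; the class, handler names and return shapes are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    kind: str        # "text", "image" or "table"
    bbox: tuple      # (x0, y0, x1, y1) in page coordinates
    payload: object  # content cropped from the page image

def extract_table(block):  # step (3): structure table content as CSV
    return {"format": "csv", "bbox": block.bbox}

def extract_image(block):  # steps (4)-(5): save the image, classify it
    return {"format": "png", "bbox": block.bbox}

def extract_text(block):   # steps (6)-(10): OCR, correct and enrich text
    return {"format": "xml", "bbox": block.bbox}

HANDLERS = {"table": extract_table, "image": extract_image, "text": extract_text}

def process_page(blocks: List[Block]) -> list:
    """Route each detected block (output of step 2) to its type-specific
    extractor, as in the classification flow of Fig. 3."""
    return [HANDLERS[b.kind](b) for b in blocks]
```

The real extractors would write CSV, image and XML files; here they only record which output format each block type produces.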
- The invention also proposes a complementary process for generating synthetic documents that emulate real documents, used to train and update the artificial intelligence models used in the main information extraction process.
- The method for generating synthetic documents and training the artificial intelligence models comprises: (1) Generation of synthetic documents, (2) Training/tuning of computer vision and classification models, (3) Quality control of models on synthetic and real sets, (4) Evaluation of extraction results in the oil and gas (O&G) domain, (5) Identification of new formats or changes in existing formats, (6) Adjustment of parameters / configuration of new synthetic formats.
- FIG. 1 illustrates a flowchart of the method for extracting and structuring information.
- FIG. 2 represents a diagram that describes the iterative process that comprises the generation of synthetic documents, the training of models based on these generated documents and the quality control of the model, until the point where the model is able to be used in the extraction of documents in the field of oil and gas (O&G), with acceptable performance.
- FIG. 3 presents an example of segmentation into blocks from a document, the classification of blocks according to the type of content, and the processing for extracting information according to the respective classification of each block (text, image or table).
- The method for extracting and structuring information is a process that receives an unstructured document as input, extracts its information, and reorganizes and makes this information available in files that can be consumed by other systems.
- The method proposed here, as illustrated in the diagram in Figure 1, comprises: (1) document page separator, (2) block detection and segmentation model, (3) table extractor, (4) image extractor, (5) image classification model, (6) text extractor, (7) computer vision model to improve text image quality, (8) optical character recognition model, (9) model for spelling correction, (10) models for semantic text enrichment, (11) output file organizer and (12) metadata aggregator for information enrichment.
- The first step of the method consists of (1) transforming the document pages into images and using (2) artificial intelligence models based on convolutional neural networks to identify the main blocks that make up these pages, segmenting them into text, image and table blocks.
- The detection, delimitation and classification of these blocks can be done using deep neural networks typical of this type of application, such as Mask R-CNN, but not limited to these.
- Each block then receives the treatment most appropriate to its type.
- The blocks identified as tables are processed by a (3) table extractor, so that the information contained in the tables is structured in a file in CSV format.
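Whatever technique recovers the cells from a table block, the final serialization in step (3) amounts to writing a row-major grid of cell strings to CSV. A small standard-library sketch of only that last step (the cell-recovery itself is out of scope, and the sample cells in the usage note are fabricated):

```python
import csv
import io

def table_to_csv(cells):
    """Serialize a recovered table as CSV text.

    `cells` is a row-major list of rows, each row a list of cell strings,
    as a table extractor might produce from a detected table block."""
    buf = io.StringIO()
    writer = csv.writer(buf)       # handles quoting of commas/newlines in cells
    writer.writerows(cells)
    return buf.getvalue()
```

For example, `table_to_csv([["Depth (m)", "Pressure (psi)"], ["1200", "5400"]])` yields two CSV lines ready to be written to the output file.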
- The images, with their respective captions, are submitted to an (4) image extractor, recorded in individual files and processed by an (5) image classification model.
- Blocks identified as text, list or equation are submitted to a (6) text extractor and, if it is not possible to recover the information directly from the main file, they are pre-processed by (7) computer vision models that improve image quality by reducing noise, geometric deformations, or irregularities in the background of the text image.
- Such models can be, for example, but without loss of generality, based on convolutional neural networks coupled to conditional generative adversarial networks (CNN+GAN), which learn to map a low-quality input image to a corresponding image with more readable text.
- Figure 5 shows the text treatment flow from left to right. The system is divided into four processes: the text alignment corrector; a neural network that improves image quality, named TextCleaner-Net; the optical character recognition (OCR) model itself; and finally a classifier, based on the MobileNet neural network, that determines the font type of each word.
- The alignment corrector is composed of a convolutional neural network (CNN) that estimates the angle of inclination of the text in the image, followed by a geometric transformation matrix that rotates the image in the direction opposite to the angle estimated by the network.
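The corrector's second stage is just an inverse rotation by the estimated angle. A dependency-free sketch that applies the rotation matrix to point coordinates rather than pixels; the CNN angle regressor is replaced here by a plain parameter, and an imaging library would apply the same matrix to the image raster:

```python
import math

def deskew_points(points, estimated_angle_deg, center=(0.0, 0.0)):
    """Rotate `points` by minus the estimated skew angle about `center`.

    In the patent the angle comes from a CNN regressor; here it is simply
    an input. Each point is an (x, y) pair; the standard 2D rotation
    matrix for angle -theta is applied to each one."""
    t = math.radians(-estimated_angle_deg)
    cx, cy = center
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        out.append((cx + dx * math.cos(t) - dy * math.sin(t),
                    cy + dx * math.sin(t) + dy * math.cos(t)))
    return out
```

For a page skewed by +90 degrees, the point (1, 0) maps back to (0, -1), undoing the tilt.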
- The TextCleanerNet network is a generative adversarial network (GAN) that takes an image as input and produces a clean version of it.
- The OCR algorithm selected was Tesseract 5, which represents the state of the art in the area and which, in addition, offers multi-language support at low computational cost.
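Tesseract can emit per-word results as TSV (e.g. `tesseract page.png out tsv`), which downstream steps can parse into word boxes. A sketch of such a parser; the column names follow Tesseract's TSV output format, and the sample row used for testing is fabricated:

```python
import csv
import io

def parse_tesseract_tsv(tsv_text, min_conf=0):
    """Parse Tesseract TSV output into a list of word records.

    Rows with empty text (page/block/line structure rows) or confidence
    below `min_conf` are skipped. Each kept word carries its text, its
    (left, top, width, height) box and its confidence."""
    words = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["text"].strip() and float(row["conf"]) >= min_conf:
            words.append({
                "text": row["text"],
                "box": (int(row["left"]), int(row["top"]),
                        int(row["width"]), int(row["height"])),
                "conf": float(row["conf"]),
            })
    return words
```

Running the real binary and feeding its TSV into this function would yield the word boxes the font classifier needs in the next step.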
- The font detector is a classifier based on a MobileNet network, used to determine the font type of each word recognized by the OCR. To do this, the classifier takes advantage of the bounding boxes detected by the OCR to extract the image clippings used as its input.
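Extracting a clipping for one OCR word box is plain array slicing. A minimal sketch on a row-major pixel grid; the `(left, top, width, height)` box layout mirrors the OCR output above, and the MobileNet classifier that consumes the clipping is not shown:

```python
def crop_word(image, box):
    """Cut the clipping for one word box out of a page image.

    `image` is a row-major 2D grid of pixel values (list of rows);
    `box` is (left, top, width, height). The returned sub-grid is what
    a font classifier would receive as input."""
    left, top, width, height = box
    return [row[left:left + width] for row in image[top:top + height]]
```

With a real image library the same slice would be a single array-indexing expression; the list version keeps the sketch dependency-free.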
- B) Use the (2) block detection model to identify the main elements of each page, segmenting them into blocks of text, images and tables;
- The textual content is also subjected to steps of (9) spelling correction considering the oil and gas (O&G) domain vocabulary and (10) enrichment with semantic metadata (including processes for named entity recognition, relationship identification and part-of-speech tagging), being stored in XML files.
- The method for generating synthetic documents and training the artificial intelligence models comprises: (1) Generation of synthetic documents, (2) Training/tuning of computer vision and classification models, (3) Quality control of models on synthetic and real sets, (4) Evaluation of extraction results in the oil and gas (O&G) domain, (5) Identification of new formats or changes in existing formats, (6) Adjustment of parameters / configuration of new synthetic formats.
- Some of the parameters to be adjusted, which are associated with the synthetic document formats, are: coordinates and dimensions of objects on the page; the synthetic annotation label identifying the object type (text, equation, image, table, line); object grouping, which allows figure captions, table captions and equation captions to be associated with their objects; and font (typography), style and font size of the text.
- Values for these parameters are randomly chosen according to ranges with predefined probabilities for each format, and fragments of synthesized objects are positioned on the page obeying the chosen values.
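The parameter sampling described above can be sketched as drawing each object's label, coordinates, dimensions and typography from predefined ranges. All ranges and choice sets below are illustrative placeholders, not the patent's actual configuration:

```python
import random

LABELS = ["text", "equation", "image", "table", "line"]
FONTS = ["Arial", "Times", "Courier"]  # illustrative typography choices

def sample_page_layout(n_objects, page_w=2100, page_h=2970, seed=None):
    """Draw one synthetic page layout.

    Each object gets a random annotation label, position, dimensions,
    font and font size, mimicking the randomized parameter choice the
    patent describes. Sampling width/height first and then constraining
    x and y keeps every object fully inside the page."""
    rng = random.Random(seed)
    objects = []
    for _ in range(n_objects):
        w = rng.randint(100, page_w // 2)
        h = rng.randint(50, page_h // 4)
        objects.append({
            "label": rng.choice(LABELS),      # synthetic annotation label
            "x": rng.randint(0, page_w - w),  # coordinates on the page
            "y": rng.randint(0, page_h - h),
            "w": w, "h": h,                   # object dimensions
            "font": rng.choice(FONTS),        # typography for text objects
            "size": rng.randint(8, 24),       # font size in points
        })
    return objects
```

Because each object carries its label and box, the generated page comes pre-annotated, which is exactly what makes synthetic data attractive for training the detection model.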
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202280067231.7A CN118076982A (zh) | 2021-11-26 | 2022-11-28 | 信息提取和结构化方法 |
EP22896898.8A EP4439494A1 (en) | 2021-11-26 | 2022-11-28 | Method for extracting and structuring information |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
BR102021023977-8 | 2021-11-26 | | |
BR102021023977-8A BR102021023977A2 (pt) | 2021-11-26 | Método para extração e estruturação de informações |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023092211A1 (pt) | 2023-06-01 |
Family
ID=86538468
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/BR2022/050465 WO2023092211A1 (pt) | 2021-11-26 | 2022-11-28 | Método para extração e estruturação de informações |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP4439494A1 (pt) |
CN (1) | CN118076982A (pt) |
WO (1) | WO2023092211A1 (pt) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190080164A1 (en) | 2017-09-14 | 2019-03-14 | Chevron U.S.A. Inc. | Classification of character strings using machine-learning |
CN110334346A (zh) | 2019-06-26 | 2019-10-15 | 京东数字科技控股有限公司 | 一种pdf文件的信息抽取方法和装置 |
US20200167558A1 (en) | 2017-07-21 | 2020-05-28 | Adobe Inc. | Semantic page segmentation of vector graphics documents |
CN111259830A (zh) | 2020-01-19 | 2020-06-09 | 中国农业科学院农业信息研究所 | 一种海外农业pdf文档内容碎片化方法及系统 |
CN111291619A (zh) * | 2020-01-14 | 2020-06-16 | 支付宝(杭州)信息技术有限公司 | 一种在线识别理赔单据中文字的方法、装置及客户端 |
US20210117667A1 (en) * | 2019-10-17 | 2021-04-22 | Adobe Inc. | Document structure identification using post-processing error correction |
US20210158093A1 (en) | 2019-11-21 | 2021-05-27 | Adobe Inc. | Automatically generating labeled synthetic documents |
CN113343658A (zh) | 2021-07-01 | 2021-09-03 | 湖南四方天箭信息科技有限公司 | 一种pdf文件信息抽取方法、装置以及计算机设备 |
2022
- 2022-11-28 CN CN202280067231.7A patent/CN118076982A/zh active Pending
- 2022-11-28 WO PCT/BR2022/050465 patent/WO2023092211A1/pt active Application Filing
- 2022-11-28 EP EP22896898.8A patent/EP4439494A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
EP4439494A1 (en) | 2024-10-02 |
CN118076982A (zh) | 2024-05-24 |
Legal Events
Code | Title | Details |
---|---|---|
121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22896898; Country: EP; Kind code: A1 |
WWE | WIPO information: entry into national phase | Ref document number: 202280067231.7; Country: CN |
WWE | WIPO information: entry into national phase | Ref document number: 2022896898; Country: EP |
NENP | Non-entry into the national phase | Ref country code: DE |
ENP | Entry into the national phase | Ref document number: 2022896898; Country: EP; Effective date: 20240626 |