CN114492345A - Method and device for identifying and extracting index data of electronic forms in report - Google Patents

Method and device for identifying and extracting index data of electronic forms in report

Info

Publication number
CN114492345A
Authority
CN
China
Prior art keywords
tag
value
index data
boundary
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111613661.8A
Other languages
Chinese (zh)
Inventor
陆培丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ruige Artificial Intelligence Technology Co ltd
Original Assignee
Ruige Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ruige Artificial Intelligence Technology Co ltd filed Critical Ruige Artificial Intelligence Technology Co ltd
Priority to CN202111613661.8A priority Critical patent/CN114492345A/en
Publication of CN114492345A publication Critical patent/CN114492345A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method for identifying and extracting index data of spreadsheets in a report comprises: processing the spreadsheets in a received report document to obtain the index tags and values of the required index data; determining the boundary between the tags and the values of the spreadsheet index data; and assigning tags to the values one by one. The tag-value boundary is searched for by judging candidate positions one by one, from right to bottom, until a point is found at which the tag of each value is unique; for each value, the tags positioned above the row boundary and to the right of the column boundary form the tag set of that value. Processing the spreadsheet also includes normalizing the spreadsheet format.

Description

Method and device for identifying and extracting index data of electronic forms in report
Technical Field
The invention belongs to the technical field of electronic document processing, and particularly relates to a method and a device for identifying and extracting index data of an electronic form in a report.
Background
At present, business data independently disclosed by listed companies, such as annual reports, ESG reports, social responsibility reports, and numerous public opinion information sources, are an important basis for industry analysis and lie at the core of digital transformation. These report documents mainly present their data in the form of tables.
Tables are a key form of presentation for financial and similar data. They are simple, easy to use, and very common, and they contain rich and valuable statistical and experimental data, which makes them highly attractive to fields such as information extraction and data mining.
A project usually contains a large number of EXCEL tables, each with a different layout, so processing the data manually is extremely laborious. Taking current financial statements as an example, a department needs to gauge the overall strength of an enterprise from the operating conditions shown in the statements and to score its business condition. If the key content is looked up manually, the key indexes of the annual financial statements are prone to review errors or omissions, which seriously undermines the fairness of the indexes.
In an increasingly competitive market, enterprises pay more and more attention to work efficiency, accuracy, and input cost. The financial industry places high demands on data: if the data extracted from a table differs from the original by even a few digits or characters, results computed from the wrong data may deviate greatly from the correct ones, and careless use of such erroneous data may even cause the user heavy losses.
Therefore, the problem to be solved at present is how to extract structured information from raw data effectively and in batches, so as to automate most business processes, manage data systematically and reuse it comprehensively, reduce manual input and intervention, improve the accuracy and efficiency of business processing, and spare staff the heavy manual labor and psychological burden of tedious mechanical data entry.
Disclosure of Invention
One embodiment of the invention provides a model for identifying the indexes of EXCEL tables in a document and extracting them. The model finds the tag-value boundary of the table data and then assigns tags to the values one by one. The tag-value boundary is searched for by judging candidate positions one by one, from right to bottom, until a point is found at which the tag of each value is unique; for each value, the tags positioned above the row boundary and to the right of the column boundary form the tag set of that value.
An advantage of this embodiment is that the tags of a value can be found without adding many rules, while ensuring that the tags are not redundant.
Detailed Description
Spreadsheets such as EXCEL appear straightforward, but they are structurally complex, and information is difficult to extract from them. For example, one prior-art table extraction method, based on the Python language, proposes a general method for extracting table index data in batches and builds an extraction model from it. Because that model is a general-purpose table index recognition and extraction algorithm that emphasizes generality, its accuracy may not be particularly high. Table formats vary widely, and some tables share no obvious regularity. For tables that are too sparse, the model may place the boundary inaccurately, so that some index data is imprecise or the values of index data are missing.
According to one or more embodiments, a method for identifying and extracting index data of spreadsheets in a report includes the following processing steps.
Step one: process the spreadsheets in the received report document to obtain the index tags and values of the required index data. The specific process is as follows.
Using Python, the batch of EXCEL files is read with pandas, Python's data analysis module, and each workbook is traversed in turn.
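The batch-reading step might be sketched as follows. This is a minimal sketch rather than the patent's implementation: the helper names `list_excel_files`, `read_all_sheets`, and `iter_sheets` are hypothetical, and the code assumes that pandas and an Excel engine it can use are installed. Passing `header=None` keeps every cell as raw data, so that the tag-value boundary can be determined later instead of being guessed by the reader.

```python
from pathlib import Path


def list_excel_files(folder):
    """Return the .xlsx/.xls files directly inside `folder`, sorted by path."""
    folder = Path(folder)
    return sorted(p for p in folder.iterdir() if p.suffix.lower() in {".xlsx", ".xls"})


def read_all_sheets(path):
    """Read every sheet of one workbook into a dict of raw DataFrames."""
    import pandas as pd  # imported lazily so that listing files works without pandas

    # sheet_name=None returns {sheet name: DataFrame}; header=None keeps row 0 as data.
    return pd.read_excel(path, sheet_name=None, header=None)


def iter_sheets(folder):
    """Yield (file name, sheet name, DataFrame) for each sheet of each workbook."""
    for path in list_excel_files(folder):
        for sheet_name, frame in read_all_sheets(path).items():
            yield path.name, sheet_name, frame
```

Keeping the file name alongside each sheet makes it possible to record later which file a value came from.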
Because the tables in an EXCEL file contain many merged cells, the data read in may have missing values and an irregular format. A first function is therefore written to restore and normalize the EXCEL table format. This function resolves the format irregularities caused by merged cells, so that the content of the EXCEL table file object is more regular and easier to process after this step.
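One way such a format-restoring function could work is sketched below. The patent does not give its code, so this is only an illustration under stated assumptions: the table is held as a list of rows, a merged range keeps its value only in the top-left cell with the other cells read as empty (`None`), and the merged ranges are known as 0-based inclusive `(first_row, first_col, last_row, last_col)` tuples. The function name `fill_merged_cells` is hypothetical.

```python
def fill_merged_cells(grid, merged_ranges):
    """Copy the top-left value of each merged range into every cell it covers.

    grid: list of rows (lists of cell values).
    merged_ranges: (first_row, first_col, last_row, last_col) tuples,
    0-based and inclusive. Returns a new grid; the input is not modified.
    """
    filled = [list(row) for row in grid]
    for first_row, first_col, last_row, last_col in merged_ranges:
        anchor = grid[first_row][first_col]
        for r in range(first_row, last_row + 1):
            for c in range(first_col, last_col + 1):
                filled[r][c] = anchor
    return filled


raw = [
    ["Item", "Amount", None],
    [None, "2019", "2020"],
    ["Revenue", 100, 120],
]
# "Item" spans rows 0-1 of column 0; "Amount" spans columns 1-2 of row 0.
restored = fill_merged_cells(raw, [(0, 0, 1, 0), (0, 1, 0, 2)])
```

Filling only the known merged ranges, rather than forward-filling every empty cell, avoids inventing values for cells that are genuinely blank.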
A second function is then written. It first finds the boundary between the index tags and the values of the table data, and then assigns index tags to the values one by one. The boundary between index tags and values is searched for from the right, judging candidate positions one by one, until a point is found at which the index tag corresponding to each value is unique. For each value, the index tags positioned above the row boundary and to the right of the column boundary form the index tag set of that value.
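The boundary search and tag assignment could be sketched as below. This is an interpretation, not the patent's algorithm, and it rests on several assumptions: the table is a rectangular list of rows with merged cells already restored; values are numeric cells; a value's tag set combines the labels to the left of the column boundary in its row with the labels above the row boundary in its column; and candidates are tried with the column boundary moving right inside a row boundary that moves down. All function names are hypothetical.

```python
def is_number(cell):
    """True for int/float cells; bool is excluded on purpose."""
    return isinstance(cell, (int, float)) and not isinstance(cell, bool)


def tags_for(grid, row, col, top, left):
    """Tags of one value: labels left of the column boundary in its row,
    followed by labels above the row boundary in its column."""
    row_tags = [grid[row][j] for j in range(left) if grid[row][j] is not None]
    col_tags = [grid[i][col] for i in range(top) if grid[i][col] is not None]
    return tuple(row_tags + col_tags)


def assign_tags(grid, top, left):
    """Return [(tag tuple, value)] for every non-empty cell in the value region."""
    pairs = []
    for r in range(top, len(grid)):
        for c in range(left, len(grid[r])):
            if grid[r][c] is not None:
                pairs.append((tags_for(grid, r, c, top, left), grid[r][c]))
    return pairs


def find_boundary(grid):
    """Try candidate boundaries one by one and return the first (top, left)
    whose value region is all numeric and whose tag tuples are all distinct."""
    for top in range(len(grid)):
        for left in range(len(grid[0])):
            pairs = assign_tags(grid, top, left)
            tags = [t for t, _ in pairs]
            if (pairs and all(is_number(v) for _, v in pairs)
                    and len(set(tags)) == len(tags)):
                return top, left
    return None


table = [
    [None, "2019", "2020"],
    ["Revenue", 100, 120],
    ["Profit", 10, 15],
]
boundary = find_boundary(table)        # (1, 1): one header row, one label column
tagged = assign_tags(table, *boundary)  # first pair: (("Revenue", "2019"), 100)
```

Stopping at the first boundary that makes every tag tuple distinct keeps each tag set as small as possible, which matches the stated aim of avoiding redundant tags.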
Therefore, the method and the device have the advantages that the index tag of the value can be found without adding too many rules, and meanwhile, the index tag is ensured not to be redundant. The extraction model of the invention has good universality and is not limited by the specific structure of the table.
To make the final data easy to view and uniform, the table extraction model stores the data from the batch of EXCEL files in a single table, in which each field has its own meaning. For example, the final result contains the following columns:
dataname column: string; the index name (the combination of row and column coordinates).
Value column: numeric; the value (the row and column coordinates locate a unique value for data analysis).
Year column: numeric; the year (years appearing in the index are extracted separately for later time-series analysis).
File_name column: the EXCEL file name, which can be used for classifying files and the like.
These columns are the main columns of the table extraction result.
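A record of the summary table with the four columns listed above might be built as in the following sketch. The function name `to_record`, the underscore used to join tags into the index name, and the rule that a year is a standalone four-digit number starting with 19 or 20 are all assumptions made for illustration; the patent does not specify them.

```python
import re

# A four-digit year starting with 19 or 20 that is not part of a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def to_record(tags, value, file_name):
    """Build one summary-table row from a value, its tag set, and its source file."""
    dataname = "_".join(str(tag) for tag in tags)
    match = YEAR_PATTERN.search(dataname)
    return {
        "dataname": dataname,
        "Value": value,
        "Year": int(match.group()) if match else None,
        "File_name": file_name,
    }


record = to_record(("Revenue", "2020"), 120, "report_a.xlsx")
```

Records from all files share the same keys, so they can be appended to one list and written out as a single table.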
The table extraction result of this embodiment can undergo dedicated data cleaning and analysis according to the needs of different users. Because the extraction is general and its extraction rate is high, the fields a user needs can be found among the extracted fields. There will certainly also be fields the user does not need, so a partial data cleaning is first performed according to the model's own rules, for example obtaining through cleaning the revenue-related fields and their values in the 2020 annual reports of a number of listed companies. Corresponding rules can of course also be formulated to meet the differing requirements of users.
Therefore, the method and the device can extract the various index data of tables such as EXCEL, making analysis and prediction convenient for the user; the general table index data obtained from the algorithm model can then be cleaned further so that the indexes are accurate.
The invention provides a model for identifying and extracting index data of spreadsheets in a report. It is a general table index recognition and extraction algorithm model that can process all EXCEL tables and extract the indexes completely, without omissions. The invention effectively extracts data and structured information, automates business processes, and identifies and extracts the table indexes of annual reports, ESG reports, social responsibility reports, and numerous public opinion information sources independently disclosed by listed companies, covering essentially all index data. Based on the extracted general index data, the user can state requirements and carry out further data cleaning to obtain the final desired indexes.
The model of the invention has a higher level of intelligence and performs better in precision and execution time. With the model, the index information of tables can be acquired quickly for further data analysis, and finally the table index extraction results of financial reports, research reports, and social responsibility reports can be analyzed.
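A user-defined cleaning rule of the kind described above could look like the sketch below. It assumes that each extracted record is a dictionary with `dataname` and `Year` keys, and that a rule is simply a list of keywords plus an optional year; the function name `clean_records` and the sample data are hypothetical.

```python
def clean_records(records, keywords, year=None):
    """Keep records whose dataname contains any keyword (case-insensitive)
    and, when `year` is given, whose Year equals it."""
    lowered = [keyword.lower() for keyword in keywords]
    kept = []
    for record in records:
        name = record["dataname"].lower()
        if not any(keyword in name for keyword in lowered):
            continue
        if year is not None and record["Year"] != year:
            continue
        kept.append(record)
    return kept


extracted = [
    {"dataname": "Revenue_2020", "Value": 120, "Year": 2020, "File_name": "a.xlsx"},
    {"dataname": "Revenue_2019", "Value": 100, "Year": 2019, "File_name": "a.xlsx"},
    {"dataname": "Profit_2020", "Value": 15, "Year": 2020, "File_name": "a.xlsx"},
]
revenue_2020 = clean_records(extracted, ["revenue"], year=2020)
```

Different users can pass different keyword lists without any change to the extraction step itself.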
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that, to illustrate clearly the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice. For example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical, or of another form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A method for identifying and extracting index data of an electronic form in a report is characterized by comprising the following steps,
processing the spreadsheet in the received report document to obtain the index tag and value of the required index data,
determining a boundary of tag-value of the spreadsheet index data, and performing tag assignment on the values one by one.
2. The method of claim 1, wherein the search of the tag-value boundary is determined from right to bottom until finding a unique point that makes the tag of each value unique, wherein the tag of the row boundary of each value to the right of the column boundary is the tag set of the value.
3. The method of claim 1, further comprising a process for formatting the spreadsheet.
4. The method of claim 1, wherein after the plurality of electronic forms are batch processed, the obtained index data is stored in a summary table, and the summary table comprises index names, index data, time or file names.
5. An apparatus for extracting metric data for spreadsheet identification in a report, the apparatus comprising a memory; and
a processor coupled to the memory, the processor configured to execute instructions stored in the memory, the processor to:
processing the spreadsheet in the received report document to obtain the index tag and value of the required index data,
determining a boundary of tag-value of the spreadsheet index data, and performing tag assignment on the values one by one.
6. The apparatus of claim 5, wherein the search for the tag-value boundary is determined from right to bottom until finding a unique point that makes the tag of each value unique, wherein the tag of the row boundary of each value to the right of the column boundary is the tag set of the value.
7. The apparatus of claim 5, further comprising a process for formatting the spreadsheet.
8. The apparatus of claim 5, wherein after the batch processing of the plurality of electronic forms, the obtained index data is stored in a summary table, and the summary table comprises index names, index data, time or file names.
9. A storage medium on which a computer program is stored which, when executed by a processor, carries out the method of any one of claims 1 to 4.
CN202111613661.8A 2021-12-27 2021-12-27 Method and device for identifying and extracting index data of electronic forms in report Pending CN114492345A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111613661.8A CN114492345A (en) 2021-12-27 2021-12-27 Method and device for identifying and extracting index data of electronic forms in report

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111613661.8A CN114492345A (en) 2021-12-27 2021-12-27 Method and device for identifying and extracting index data of electronic forms in report

Publications (1)

Publication Number Publication Date
CN114492345A (en) 2022-05-13

Family

ID=81496236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111613661.8A Pending CN114492345A (en) 2021-12-27 2021-12-27 Method and device for identifying and extracting index data of electronic forms in report

Country Status (1)

Country Link
CN (1) CN114492345A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination