CN114492345A

CN114492345A - Method and device for identifying and extracting index data of electronic forms in report

Info

Publication number: CN114492345A
Application number: CN202111613661.8A
Authority: CN
Inventors: 陆培丽
Original assignee: Ruige Artificial Intelligence Technology Co ltd
Current assignee: Ruige Artificial Intelligence Technology Co ltd
Priority date: 2021-12-27
Filing date: 2021-12-27
Publication date: 2022-05-13

Abstract

A method for identifying and extracting index data from spreadsheets in reports comprises processing the spreadsheets in a received report document to obtain the index tags and values of the required index data, determining the boundary between the tags and the values of the spreadsheet index data, and assigning tags to the values one by one. The tag-value boundary is searched for from the right toward the bottom, one candidate at a time, until the unique point is found at which the tag of each value is unique; the tags to the right of the row boundary and above the column boundary of each value form the tag set of that value. Processing the spreadsheet also includes normalizing the spreadsheet format.

Description

Method and device for identifying and extracting index data of electronic forms in report

Technical Field

The invention belongs to the technical field of electronic document processing, and particularly relates to a method and a device for identifying and extracting index data of an electronic form in a report.

Background

At present, business data such as annual reports, ESG reports, social responsibility reports and the many public-opinion information sources independently disclosed by listed companies are an important basis for industry analysis and are at the core of digital transformation. These report documents present their data mainly in the form of tables.

Tables are a key presentation form for financial and similar data. They are simple, easy to use and very common, and they contain rich and valuable statistical and experimental data, which makes them highly attractive to fields such as information extraction and data mining.

A project usually contains a large number of EXCEL tables, each with a different layout, so processing the data manually is extremely laborious. Taking financial statements as an example, a department needs to assess the overall strength of an enterprise from the operating conditions shown in the statement and to score the enterprise's business condition. If key content is searched for manually, the key indexes of the annual financial statement are prone to audit errors or omissions, which seriously affects the fairness of the indexes.

In an increasingly competitive market, enterprises pay growing attention to work efficiency, accuracy and input cost. The financial industry places high demands on data: if the data extracted from a table differs from the original by even a few digits or characters, results computed from the wrong data may deviate greatly from the correct results. If such erroneous data is used carelessly, it can even cause great losses to the user.

Therefore, the problem to be solved at present is how to effectively extract structured information from raw data in batches, so as to automate most business processes, manage data systematically and reuse it comprehensively, reduce manual input and intervention, improve the accuracy and efficiency of business processing, and save a large amount of manual labor, thereby relieving the psychological burden of tedious mechanical data entry.

Disclosure of Invention

One embodiment of the invention provides a model for identifying the indexes of excel tables in a document and extracting those indexes. The model finds the tag-value boundary of the table data and then assigns tags to the values one by one. The tag-value boundary is searched for from the right toward the bottom, one candidate at a time, until the unique point is found at which the tag of each value is unique; the tags to the right of the row boundary and above the column boundary of each value form the tag set of that value.

An advantage of this embodiment is that the tag of a value can be found without adding many rules, while ensuring that the tags are not redundant.

Detailed Description

Spreadsheets such as EXCEL appear straightforward but are complex in structure, which makes information extraction difficult. For example, one prior-art table extraction method, based on the Python language, proposes a general method for extracting table index data in batches and forms an extraction model. Because that model is a general table index recognition and extraction algorithm that emphasizes generality, its accuracy may not be particularly high. Table formats vary, so some tables share no obvious regularities. For tables that are too sparse, the algorithm model may not place the dividing line accurately, so that some index data are inaccurate or their values are missing.

According to one or more embodiments, a method for identifying and extracting index data of a spreadsheet in a report includes the following processing steps.

Step one: process the spreadsheets in the received report document to obtain the index tags and values of the required index data. The specific process is as follows.

Using Python, the batch of excel files is read with pandas, a data analysis library, and each workbook is traversed in turn.
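The batch read described above might look like the following minimal sketch. It assumes the workbooks are .xlsx files stored in one folder and that an Excel engine such as openpyxl is installed alongside pandas. The function name read_workbooks and the folder layout are illustrative assumptions, not taken from the patent.

```python
from pathlib import Path

import pandas as pd


def read_workbooks(folder):
    """Read every sheet of every .xlsx file in `folder` (hypothetical helper).

    Returns {(file_name, sheet_name): DataFrame}. header=None keeps the
    header rows as ordinary data, so the tag-value boundary can be
    searched for later instead of being guessed by pandas.
    """
    tables = {}
    for path in sorted(Path(folder).glob("*.xlsx")):
        # sheet_name=None makes pandas return a dict holding all sheets.
        sheets = pd.read_excel(path, sheet_name=None, header=None)
        for sheet_name, frame in sheets.items():
            tables[(path.name, sheet_name)] = frame
    return tables
```

Keeping the file name in the key is a design choice of this sketch; it lets a later step record which workbook each extracted value came from.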

Because an excel file contains many merged cells, the data read in may have missing values and an irregular format. A first function is therefore written to restore and normalize the excel table format. It resolves the format irregularities caused by merged cells, so that the contents of the excel file object after this step are more regular and easier to process.
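The patent does not disclose the body of this first function. One possible sketch, in plain Python, assumes that a merged range is read in with its content in the top-left cell only and None everywhere else, and that the caller supplies how many leading rows and columns hold tags. The name fill_merged and both parameters are hypothetical.

```python
def fill_merged(grid, header_rows, label_cols):
    """Return a copy of `grid` with merged-cell gaps filled (hypothetical helper).

    Blank cells (None) in the first `header_rows` rows inherit the value to
    their left; blank cells in the first `label_cols` columns inherit the
    value above them. Cells in the value region are left untouched.
    """
    filled = [list(row) for row in grid]
    # Horizontally merged headers: copy the left neighbour into the gap.
    for row in filled[:header_rows]:
        for j in range(1, len(row)):
            if row[j] is None:
                row[j] = row[j - 1]
    # Vertically merged labels: copy the cell above into the gap.
    for i in range(1, len(filled)):
        for j in range(label_cols):
            if filled[i][j] is None:
                filled[i][j] = filled[i - 1][j]
    return filled


raw = [
    ["Item", "2020", None],   # "2020" was merged across two columns
    [None, "Q1", "Q2"],       # "Item" was merged across two rows
    ["Revenue", 1.0, 2.0],
]
normalized = fill_merged(raw, header_rows=2, label_cols=1)
```

With pandas, a similar effect can be obtained by forward-filling the relevant rows or columns with DataFrame.ffill.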

A second function is written that first finds the boundary between the index tags and the values of the table data and then assigns index tags to the values one by one. The boundary between index tags and values is searched for from the right, judging one candidate at a time, until the unique point is found at which each value has a unique index tag; the index tags to the right of the row dividing line and above the column dividing line of each value form the index tag set of that value.
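The following sketch shows one possible reading of this second function; it is not the patented implementation. It assumes the table is a rectangular list of rows that has already been normalized, tries candidate boundaries one at a time, and accepts the first one at which every value cell receives a unique tag set. The requirement that the value region contain only numbers is an extra assumption added for the sketch, and all function names are hypothetical.

```python
def is_number(x):
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def tag_set(grid, i, j, r, c):
    """Tags of cell (i, j): row labels left of column c, then column headers above row r."""
    row_tags = tuple(grid[i][k] for k in range(c))
    column_tags = tuple(grid[k][j] for k in range(r))
    return row_tags + column_tags


def find_boundary(grid):
    """Return (r, c): the first r rows and first c columns hold tags, the rest values."""
    n_rows, n_cols = len(grid), len(grid[0])
    for r in range(1, n_rows):        # move the row boundary downward
        for c in range(1, n_cols):    # move the column boundary rightward
            cells = [(i, j) for i in range(r, n_rows)
                     for j in range(c, n_cols) if grid[i][j] is not None]
            # Assumption of this sketch: the value region is purely numeric.
            if not all(is_number(grid[i][j]) for i, j in cells):
                continue
            tags = [tag_set(grid, i, j, r, c) for i, j in cells]
            if len(set(tags)) == len(tags):   # every value has a unique tag set
                return r, c
    raise ValueError("no tag-value boundary found")


def assign_tags(grid):
    """Return (dataname, value) pairs, one per non-empty value cell."""
    r, c = find_boundary(grid)
    records = []
    for i in range(r, len(grid)):
        for j in range(c, len(grid[0])):
            if grid[i][j] is not None:
                name = "_".join(str(t) for t in tag_set(grid, i, j, r, c))
                records.append((name, grid[i][j]))
    return records


grid = [
    ["Item", "Segment", "2019", "2020"],
    ["Revenue", 1, 100.0, 120.0],
    ["Revenue", 2, 30.0, 45.0],
]
records = assign_tags(grid)
```

In this example the numeric "Segment" column passes the number check, but treating it as a value column would give two cells the same tag set ("Revenue", "Segment"), so the search moves the column boundary one step further and stops at (1, 2).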

An advantage of the method and device is therefore that the index tag of a value can be found without adding many rules, while ensuring that the index tags are not redundant. The extraction model of the invention generalizes well and is not limited by the specific structure of the table.

To make the final data easy to inspect and general in form, the table extraction model stores the data from the batch of excel files in a single table. Each field has its own meaning. For example, the final result contains the following columns:

dataname column: string; the index name (the combination of row and column coordinates).

Value column: numerical value (the row and column coordinates locate a unique value for data analysis).

Year column: numerical value (the year in the index is extracted separately for later time series analysis).

File_name column: the excel file name, which allows files to be classified by name and the like.

These columns are the main columns of the table extraction result.
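A minimal sketch of how rows of such a summary table could be built is given below. It assumes that each extracted record is a (dataname, value) pair and that a year appears in the index name as a four-digit number starting with 19 or 20. The helper name to_summary_rows, the regular expression and the example file name are illustrative assumptions.

```python
import re

# Assumption: a year is a standalone four-digit number beginning with 19 or 20.
YEAR = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def to_summary_rows(records, file_name):
    """Turn (dataname, value) pairs into rows with the four summary columns."""
    rows = []
    for dataname, value in records:
        match = YEAR.search(dataname)
        rows.append({
            "dataname": dataname,
            "Value": value,
            "Year": int(match.group()) if match else None,
            "File_name": file_name,
        })
    return rows


# "acme_annual_report.xlsx" is a made-up file name used only for illustration.
rows = to_summary_rows(
    [("Revenue_2020", 120.0), ("Employees", 35)],
    "acme_annual_report.xlsx",
)
```

Rows from all workbooks can be appended to one list, which corresponds to storing the batch results in the same table.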

The table extraction result of this embodiment can undergo specialized data cleaning and analysis according to the needs of different users. Because the extraction is general and its extraction rate is high, the fields a user needs can be found among the extracted fields. There will, however, certainly be fields the user does not need, so partial data cleaning is first performed according to the rules built into the model, for example cleaning the revenue-related fields and their values in the 2020 annual reports of a number of listed companies. Corresponding rules can of course also be formulated according to user requirements to meet different needs.
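The cleaning rules themselves are not specified in the patent. As a hedged illustration, a keyword rule could separate the rows whose index name mentions a topic of interest from the rest, leaving the decision to keep or discard either group to the user. The function name split_by_keywords and the sample keywords are assumptions.

```python
def split_by_keywords(rows, keywords):
    """Split rows into (matched, rest) by case-insensitive keyword search in `dataname`."""
    lowered = [k.lower() for k in keywords]
    matched, rest = [], []
    for row in rows:
        name = str(row["dataname"]).lower()
        if any(k in name for k in lowered):
            matched.append(row)
        else:
            rest.append(row)
    return matched, rest


sample = [
    {"dataname": "Operating revenue_2020", "Value": 120.0},
    {"dataname": "Employees_2020", "Value": 35},
]
revenue_rows, other_rows = split_by_keywords(sample, ["revenue", "income"])
```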

An advantage of the method and device is therefore that all kinds of index data can be extracted from tables such as excel, which makes analysis and prediction convenient for the user. Furthermore, the general table index data obtained from the algorithm model can be cleaned further so that the indexes are accurate.

The invention provides a model for identifying and extracting index data of a spreadsheet in a report. It is a general table index identification and extraction algorithm model that can process all EXCEL tables and extract the indexes completely, without omissions. The invention effectively extracts data and structured information and automates the business process: it identifies and extracts table indexes from annual reports, ESG reports, social responsibility reports and the many public-opinion information sources independently disclosed by listed companies, and can cover essentially all index data. Based on the extracted general index data, the user can state requirements, and further data cleaning is then carried out to obtain the indexes finally desired.

The model of the invention has a higher level of intelligence and performs better in precision and execution time. With the model, the index information of a table can be acquired quickly and data analysis can then be performed. Finally, the table index extraction results of financial reports, research reports and social responsibility reports are analyzed.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To illustrate clearly the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether such functions are implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present invention.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for identifying and extracting index data of an electronic form in a report is characterized by comprising the following steps,

processing the spreadsheet in the received report document to obtain the index tag and value of the required index data,

determining a boundary of tag-value of the spreadsheet index data, and performing tag assignment on the values one by one.

2. The method of claim 1, wherein the tag-value boundary is searched for from the right toward the bottom until the unique point is found at which the tag of each value is unique, wherein the tags to the right of the row boundary and above the column boundary of each value form the tag set of the value.

3. The method of claim 1, further comprising a process for formatting the spreadsheet.

4. The method of claim 1, wherein after the plurality of electronic forms are batch processed, the obtained index data is stored in a summary table, and the summary table comprises index names, index data, time or file names.

5. An apparatus for extracting metric data for spreadsheet identification in a report, the apparatus comprising a memory; and

a processor coupled to the memory, the processor configured to execute instructions stored in the memory, the processor to:

6. The apparatus of claim 5, wherein the tag-value boundary is searched for from the right toward the bottom until the unique point is found at which the tag of each value is unique, wherein the tags to the right of the row boundary and above the column boundary of each value form the tag set of the value.

7. The apparatus of claim 5, further comprising a process for formatting the spreadsheet.

8. The apparatus of claim 5, wherein after the batch processing of the plurality of electronic forms, the obtained index data is stored in a summary table, and the summary table comprises index names, index data, time or file names.

9. A storage medium on which a computer program is stored which, when executed by a processor, carries out the method of any one of claims 1 to 4.