CN114417798A

CN114417798A - Document structured extraction method and device, computer equipment and storage medium

Info

Publication number: CN114417798A
Application number: CN202210059932.8A
Authority: CN
Inventors: 丁家奎
Original assignee: Guangzhou Tiancom Information Technology Co ltd
Current assignee: Guangzhou Tiancom Information Technology Co ltd
Priority date: 2022-01-19
Filing date: 2022-01-19
Publication date: 2022-04-29

Abstract

An embodiment of the invention provides a document structured extraction method and device, computer equipment, and a storage medium. The method comprises the following steps: converting or decrypting the document to be extracted; automatically recognizing the converted or decrypted file by OCR (optical character recognition), exporting the result as an excel file, and importing the excel file into a database for storage; performing data processing on the excel table; performing anomaly detection on the OCR recognition results of the processed data based on data characteristics, and manually correcting the detected anomalies; and standardizing the excel table after data processing. The method uses OCR to extract table and text data and export it to excel, performs anomaly detection on the OCR recognition results based on data characteristic rules, and corrects the detected anomalies so that the extraction result is more accurate. Text and tables are recognized by OCR in a single pass and the data is then structured in a single pass, which improves efficiency.

Description

Document structured extraction method and device, computer equipment and storage medium

Technical Field

The invention relates to the technical field of document extraction, in particular to a document structured extraction method and device, computer equipment and a storage medium.

Background

In the prior art, document extraction is generally designed around specific business requirements, and data reports such as financial reports follow uniform format requirements. Because no general extraction method or tool is available for documents of different formats, processing efficiency is low.

Disclosure of Invention

The embodiment of the invention provides a document structured extraction method, a document structured extraction device, computer equipment and a storage medium, and aims to solve the general extraction problem of documents with different formats.

In order to achieve the purpose, the technical scheme provided by the invention is as follows:

in a first aspect, the present invention provides a document structured extraction method, which includes the following steps:

converting or decrypting the document to be extracted;

performing OCR (optical character recognition) on the converted or decrypted file to automatically recognize and export an excel file, and importing the excel file into a database for storage;

processing data of the excel table;

performing anomaly detection on the OCR recognition results of the processed data based on data characteristics, and manually correcting the detected anomalies;

and standardizing the excel table after data processing.

The step of processing the data of the excel table comprises the following steps:

extracting data of the table;

and performing data cleaning on the table.

Wherein the step of extracting data from the table comprises:

calculating a table boundary;

calculating a table unit;

and calculating a table title.

The step of cleaning the data of the excel table comprises the following steps:

cleaning the values of the cells based on a cleaning rule;

extracting the starting row and the ending row of all non-numerical value areas in the table;

extracting the initial columns and the end columns of all non-numerical regions in the table;

determining the boundary of the numerical region on which each non-numerical region of the table acts;

extracting a source index name;

and establishing a mapping index.

In a second aspect, an embodiment of the present invention provides a document structured extraction apparatus, including:

the document conversion and decryption unit is used for converting or decrypting the document to be extracted;

the document OCR recognition unit is used for carrying out OCR automatic recognition on the converted or decrypted file to export an excel file and importing the excel file into a database for storage;

the data processing unit is used for processing data of the excel form;

the anomaly detection unit is used for performing anomaly detection on the OCR recognition results of the processed data based on the data characteristics, and for manual correction of the detected anomalies;

and the standardization unit is used for standardizing the cleaned and corrected excel table.

Wherein the data processing unit comprises:

the data extraction unit is used for extracting data of the table;

and the data cleaning unit is used for cleaning the table.

Wherein the data extraction unit includes:

a boundary calculation unit for calculating a table boundary;

a unit calculating unit for calculating a table unit;

and a title calculation unit for calculating the title of the table.

Wherein the data cleaning unit includes:

a cell value cleaning unit for cleaning the cell values based on a cleaning rule;

the non-numerical value row extraction unit is used for extracting the starting rows and the ending rows of all non-numerical value areas in the table;

the non-numerical value column extraction unit is used for extracting the initial columns and the end columns of all non-numerical value areas in the table;

the boundary determining unit is used for determining the boundary of the numerical region on which each non-numerical region of the table acts;

a source index extraction unit for extracting a source index name;

and the mapping establishing unit is used for establishing a mapping index.

In a third aspect, an embodiment of the present invention provides a computer device, which includes a memory and a processor, where the memory stores a computer program thereon, and the processor implements the document structured extraction method as described above when executing the computer program.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program can implement the document structured extraction method as described above.

Compared with the prior art, the embodiments of the invention provide a document structured extraction method and device, computer equipment, and a storage medium. The method comprises the following steps: converting or decrypting the document to be extracted; automatically recognizing the converted or decrypted file by OCR (optical character recognition), exporting the result as an excel file, and importing the excel file into a database for storage; performing data processing on the excel table; performing anomaly detection on the OCR recognition results of the processed data based on data characteristics, and manually correcting the detected anomalies; and standardizing the excel table after data processing. The method uses OCR to extract table and text data and export it to excel, performs anomaly detection on the OCR recognition results based on data characteristic rules, and corrects the detected anomalies so that the extraction result is more accurate. Text and tables are recognized by OCR in a single pass and the data is then structured in a single pass, which improves efficiency.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a document structured extraction method according to an embodiment of the present invention;

FIG. 2 is a main flow chart of a document structured extraction method according to an embodiment of the present invention;

FIG. 3 is a sub-flowchart of a document structured extraction method according to an embodiment of the present invention;

FIG. 4 is a sub-flowchart of a document structured extraction method according to an embodiment of the present invention;

FIG. 5 is a sub-flowchart of a document structured extraction method according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a document structured extraction apparatus provided by an embodiment of the present invention; and

FIG. 7 is a schematic block diagram of a computer device provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Referring to fig. 1 to 5, fig. 1 is a flowchart of a document structured extraction method according to an embodiment of the present invention, fig. 2 is a main flowchart, and fig. 3 to 5 are sub-flowcharts of the embodiment, where the document structured extraction method according to the embodiment of the present invention takes financial data extraction as an example, and the method includes the following steps:

Step S100, converting or decrypting the document to be extracted: files in different formats such as docx and doc are uniformly converted into PDF (Portable Document Format), and original PDF documents that are encrypted are decrypted, so that all documents share a uniform format that facilitates subsequent OCR (optical character recognition);
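The dispatch in step S100 can be sketched as follows. This is only an illustration under stated assumptions: the source names no conversion tool, so LibreOffice's headless `soffice` command is assumed here, and `plan_preprocessing` and the `converted` output directory are hypothetical names. The function only builds the command and does not run it.

```python
from pathlib import Path


def plan_preprocessing(path, out_dir="converted"):
    """Decide how a document reaches a plain PDF before OCR (hypothetical helper).

    Returns a command list for Word documents, or None when the file is
    already a PDF and only needs decryption if it is protected.
    """
    suffix = Path(path).suffix.lower()
    if suffix in {".doc", ".docx"}:
        # Assumed converter: LibreOffice in headless mode.
        return ["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", out_dir, str(path)]
    if suffix == ".pdf":
        return None
    raise ValueError(f"unsupported format: {suffix}")


command = plan_preprocessing("reports/annual.docx")
```

In a real pipeline the returned command could be passed to `subprocess.run`, and encrypted PDFs would be handled by a separate decryption step.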

Step S200, automatically recognizing the converted or decrypted file by OCR, exporting an excel file, and importing the excel file into a database for storage: an RPA program is written that uses an OCR algorithm to automatically recognize the PDF documents and export excel files (one workbook per page) to a specified directory; the excel files are then imported into a database for storage. Importing a whole excel file is supported, as is importing only the workbook of an abnormal page.
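The storage half of step S200 might look like the sketch below. The source does not name a database or schema, so SQLite, the `cell` table, and the `import_sheets` function are assumptions made for illustration; plain nested lists stand in for the exported excel pages.

```python
import sqlite3


def import_sheets(conn, file_name, sheets):
    """Store every non-empty cell as one row of a `cell` table.

    `sheets` maps a page number to a grid (list of rows); it stands in for
    the excel workbooks exported by the OCR step.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cell ("
        "file TEXT, page INTEGER, row INTEGER, col INTEGER, value TEXT, "
        "PRIMARY KEY (file, page, row, col))"
    )
    for page, grid in sheets.items():
        # Re-importing one page replaces its old cells, so a corrected
        # page can be loaded again without touching the others.
        conn.execute("DELETE FROM cell WHERE file = ? AND page = ?",
                     (file_name, page))
        for r, row in enumerate(grid):
            for c, value in enumerate(row):
                if value not in (None, ""):
                    conn.execute("INSERT INTO cell VALUES (?, ?, ?, ?, ?)",
                                 (file_name, page, r, c, str(value)))
    conn.commit()


conn = sqlite3.connect(":memory:")
# Whole-file import: two pages.
import_sheets(conn, "report.pdf", {
    1: [["Item", "Current period"], ["Revenue", "1,200"]],
    2: [["Item", "Current period"], ["Cost", "8OO"]],
})
# Import of a single corrected page.
import_sheets(conn, "report.pdf", {2: [["Item", "Current period"], ["Cost", "800"]]})
```

Keying cells by file, page, row and column is one way to support both whole-file import and single-page re-import.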

Step S300, data processing is carried out on the excel form;

Referring to fig. 3 again, the step S300 of performing data processing on the excel table includes:

Step S301, extracting data of the table: the table boundary, title and unit are extracted, that is, the start row, end row, start column and end column of the table are calculated, and the unit and title corresponding to the table are extracted.

Referring to fig. 4 again, the step S301 of extracting data from the table includes:

step S3011, calculating a table boundary;

step S3012, calculating a table unit;

step S3013, calculating a table title.

Wherein:

the judgment basis of the table boundary is as follows:

the table has at least 2 rows and 2 columns;

the row interval between tables is at least 2 and the column interval between tables is at least 2;

all cells of the table are contiguous;

merged cells have the same cell value.
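One possible reading of the boundary criteria above is a search for blocks of contiguous non-empty cells that span at least 2 rows and 2 columns. The sketch below assumes that reading; `find_tables` is a hypothetical name, the grid is a plain list of lists, and merged cells are assumed to have been expanded so that each covered cell carries the same value.

```python
from collections import deque


def find_tables(grid, min_rows=2, min_cols=2):
    """Return (start_row, end_row, start_col, end_col) for each block of
    contiguous non-empty cells spanning at least min_rows x min_cols."""
    n_rows = len(grid)
    n_cols = max((len(row) for row in grid), default=0)

    def filled(r, c):
        return c < len(grid[r]) and grid[r][c] not in (None, "")

    seen = set()
    tables = []
    for r0 in range(n_rows):
        for c0 in range(n_cols):
            if (r0, c0) in seen or not filled(r0, c0):
                continue
            # Breadth-first search over 4-connected non-empty cells.
            queue = deque([(r0, c0)])
            seen.add((r0, c0))
            top, bottom, left, right = r0, r0, c0, c0
            while queue:
                r, c = queue.popleft()
                top, bottom = min(top, r), max(bottom, r)
                left, right = min(left, c), max(right, c)
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < n_rows and 0 <= nc < n_cols
                            and (nr, nc) not in seen and filled(nr, nc)):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            if bottom - top + 1 >= min_rows and right - left + 1 >= min_cols:
                tables.append((top, bottom, left, right))
    return tables


sheet = [
    ["Balance sheet", "", ""],
    ["Unit: CNY 10,000", "", ""],
    ["", "", ""],
    ["Item", "Current period", "Prior period"],
    ["Revenue", "1,200", "1,000"],
    ["Cost", "800", "700"],
]
tables = find_tables(sheet)  # [(3, 5, 0, 2)]
```

The title and unit lines form a block that is only one column wide, so the size filter discards it. A title placed directly above the table with no blank row would merge into the table block in this simplified version.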

The judgment basis of the table unit is as follows:

judging based on index unit keywords;

searching within the 4 rows before the table start row; if multiple matches occur, the text closest to the table start row is preferred.

The judgment basis of the table title is as follows:

judging based on the title keyword library;

searching within the 4 rows before the table start row; if multiple matches occur, the text closest to the table start row is preferred.
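The unit and title lookups share the same shape: scan a few rows above the table, nearest row first, and take the first keyword hit. A minimal sketch follows, assuming the "4 rows" are the rows immediately before the table start row. The keyword lists and `find_nearest_match` are hypothetical, since the source does not publish its keyword library.

```python
def find_nearest_match(grid, table_start_row, keywords, lookback=4):
    """Scan up to `lookback` rows above the table, nearest row first, and
    return the first cell text that contains any keyword, or None."""
    first = max(0, table_start_row - lookback)
    for r in range(table_start_row - 1, first - 1, -1):
        for text in grid[r]:
            if text and any(k in str(text) for k in keywords):
                return str(text)
    return None


UNIT_KEYWORDS = ["Unit:"]                                # hypothetical
TITLE_KEYWORDS = ["Balance sheet", "Income statement"]   # hypothetical

sheet = [
    ["Consolidated Balance sheet", ""],
    ["Unit: CNY 10,000", ""],
    ["", ""],
    ["Item", "Current period"],
    ["Revenue", "1,200"],
]
unit = find_nearest_match(sheet, 3, UNIT_KEYWORDS)
title = find_nearest_match(sheet, 3, TITLE_KEYWORDS)
```

Iterating from the row just above the table upward is what gives preference to the closest match when several rows contain a keyword.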

Step S302, performing data cleaning on the table: all cell values of the table are extracted based on the table boundary and then cleaned; after unit and date characters are replaced, the cells are divided into non-numerical regions and numerical regions by regular-expression matching. The numerical regions hold the index data, and the corresponding non-numerical regions hold the index entity names.

Referring to fig. 5 again, the step S302 of performing data cleansing on the excel table includes:

step S3021, cleaning the values of the cells based on a cleaning rule;

step S3022, extracting the start line and the end line of all the non-numerical value areas in the table;

step S3023, extracting the initial columns and the end columns of all the non-numerical regions in the table;

step S3024 determining a numerical region boundary on which a table non-numerical region acts;

wherein:

cell cleaning rule: for cells marked as strings, characters such as "(", ")" and "%" are removed;

cell type correction: the cell type marked at import time is string; if the cleaned cell matches a date format, the cell is corrected to the date type, and if the cleaned cell matches a numerical value, it is corrected to the numerical type.
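The cleaning rule and type correction can be illustrated with the sketch below. Removing "," as a thousands separator, the two date formats, and the name `clean_cell` are assumptions added for the example; the source lists only "(", ")" and "%" explicitly.

```python
import re
from datetime import date, datetime

STRIP_CHARS = re.compile(r"[()%,]")          # "," is an added assumption
NUMBER = re.compile(r"-?\d+(\.\d+)?")
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d")      # assumed date formats


def clean_cell(raw):
    """Return (value, type); type is 'null', 'date', 'number' or 'string'."""
    if raw is None:
        return None, "null"
    cleaned = STRIP_CHARS.sub("", str(raw)).strip()
    if cleaned == "":
        return None, "null"
    # Date check first, then numeric check, following the order in the text.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date(), "date"
        except ValueError:
            pass
    if NUMBER.fullmatch(cleaned):
        return float(cleaned), "number"
    return cleaned, "string"


examples = [clean_cell(v) for v in ("12.5%", "1,200", "2021-12-31", "Revenue (net)", "")]
```

Cells that are neither dates nor numbers keep the string type, which is what later marks them as part of a non-numerical region.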

non-numerical region: a region of cells whose corrected type is not numerical, excluding null values;

the non-numerical regions are divided into column descriptions and row descriptions; a table may have multiple row descriptions or multiple column descriptions;

action regions of the row descriptions and column descriptions of the table:

action region of each row description in the table: the column number is greater than the end column number of the row description; if the row description is the last row description of the table, the column number is less than or equal to the end column of the table, otherwise the column number is less than the start column of the next row description;

action region of each column description in the table: the row number is greater than the end row number of the column description; if the column description is the last column description of the table, the row number is less than or equal to the end row of the table, otherwise the row number is less than the start row of the next column description.
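The action-region rule can be sketched for row descriptions as follows. The sketch assumes the upper bound for a description that is not the last one is "before the start column of the next row description", which is the only reading under which the regions do not overlap. The name `row_description_regions` is hypothetical.

```python
def row_description_regions(row_descs, table_end_col):
    """Map each row description to the inclusive column range it acts on.

    row_descs: (start_col, end_col) of each row description, in any order.
    """
    ordered = sorted(row_descs)
    regions = []
    for i, (start, end) in enumerate(ordered):
        first = end + 1                      # strictly after the description
        if i == len(ordered) - 1:
            last = table_end_col             # last description: to table end
        else:
            last = ordered[i + 1][0] - 1     # stop before the next description
        regions.append(((start, end), (first, last)))
    return regions


# Two side-by-side blocks: labels in column 0 with values in columns 1-2,
# labels in column 3 with values in columns 4-5.
regions = row_description_regions([(0, 0), (3, 3)], table_end_col=5)
```

Column descriptions follow the same rule with rows and columns swapped, so the same function can be reused by passing start and end rows and the table end row.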

Step S3025, extracting a source index name: the non-numerical regions extracted in step S300 are cleaned, numerical values and serial numbers in the cells are replaced, and the row descriptions and column descriptions are concatenated in order of their row and column numbers;

wherein the source index name is a row description followed by a column description, or a column description followed by a row description;
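A minimal sketch of building a source index name is shown below. It reads "replacing" serial numbers as stripping them from the start of the text; the serial-number pattern, the "_" separator, and `source_index_name` are assumptions for illustration.

```python
import re

SERIAL = re.compile(r"^\s*\(?\d+[.)]?\s*")   # assumed serial-number pattern


def source_index_name(row_desc, col_desc, sep="_"):
    """Join the cleaned row description and column description."""
    parts = [SERIAL.sub("", part).strip() for part in (row_desc, col_desc)]
    return sep.join(part for part in parts if part)


name = source_index_name("1. Operating revenue", "Current period")
```

Swapping the two arguments gives the "column description followed by row description" variant.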

Step S3026, establishing a mapping index, that is, the user maps the indexes obtained in step S300.

Step S400, performing anomaly detection on the OCR recognition results of the processed data based on the data characteristics, and manually correcting the detected anomalies; the corrected excel table is returned to step S200 for reprocessing.
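The source does not list the data characteristic rules used in step S400. As one hypothetical example of such a rule, the sketch below flags cells that lie inside a numerical region but do not parse as numbers, a typical symptom of OCR misreads. The function name and the rule itself are assumptions.

```python
import re

NUMERIC = re.compile(r"-?\d+(\.\d+)?")


def detect_anomalies(grid, value_region):
    """Flag non-empty cells in the numerical region that are not numeric.

    value_region is (start_row, end_row, start_col, end_col), inclusive.
    Returns (row, col, original_value) tuples for manual correction.
    """
    r0, r1, c0, c1 = value_region
    anomalies = []
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            cell = grid[r][c]
            text = "" if cell is None else str(cell).replace(",", "").strip()
            if text and not NUMERIC.fullmatch(text):
                anomalies.append((r, c, cell))
    return anomalies


grid = [
    ["Item", "Current period", "Prior period"],
    ["Revenue", "1,200", "1,000"],
    ["Cost", "8OO", "700"],   # letter O misread for the digit 0
]
flagged = detect_anomalies(grid, (1, 2, 1, 2))
```

The flagged coordinates identify which page and cell a reviewer needs to correct before the table is re-imported.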

Step S500, standardizing the excel table after data processing: according to the index mapping relation and the table value data extracted in step S300, the index data is converted, and key information such as the data date and index unit of each index is cleaned.
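Standardization in step S500 could be sketched as applying the user-defined index mapping and normalizing units. The record layout, the unit factor table, and `standardize` are assumptions made for this illustration; the source does not define a target schema.

```python
def standardize(records, index_mapping, unit_factors):
    """Convert extracted records into standard-index records.

    records: dicts with keys source_index, value, unit and date.
    index_mapping: source index name -> standard index name (user-defined).
    unit_factors: unit text -> multiplier to the base unit (assumed).
    """
    result = []
    for rec in records:
        standard = index_mapping.get(rec["source_index"])
        if standard is None:
            continue  # unmapped indexes are skipped in this sketch
        result.append({
            "index": standard,
            "value": rec["value"] * unit_factors.get(rec["unit"], 1),
            "date": rec["date"],
        })
    return result


records = [
    {"source_index": "Operating revenue_Current period", "value": 1200.0,
     "unit": "CNY 10,000", "date": "2021-12-31"},
    {"source_index": "Unknown_Current period", "value": 5.0,
     "unit": "CNY", "date": "2021-12-31"},
]
standardized = standardize(
    records,
    index_mapping={"Operating revenue_Current period": "revenue"},
    unit_factors={"CNY 10,000": 10_000},
)
```

Here the first record becomes a `revenue` value of 12,000,000 in the base unit, and the unmapped record is dropped.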

Referring to fig. 6, an embodiment of the present invention provides a document structured extraction apparatus 100, which includes:

a document conversion decryption unit 101 for converting or decrypting a document to be extracted;

the document OCR recognition unit 102 is used for carrying out OCR automatic recognition on the converted or decrypted file to export an excel file and importing the excel file into a database for storage;

the data processing unit 103 is used for processing data of the excel table;

an anomaly detection unit 104, configured to perform anomaly detection on the OCR recognition results of the processed data based on the data characteristics, and to support manual correction of the detected anomalies;

a standardization unit 105 for standardizing the cleaned and corrected excel table.

Wherein the data processing unit 103 comprises:

a data extraction unit 1031, configured to perform data extraction on the table;

and a data cleaning unit 1032, configured to perform data cleaning on the table.

Wherein the data extraction unit 1031 includes:

a boundary calculation unit 10311 for calculating a table boundary;

a unit calculation unit 10312 for calculating a table unit;

a title calculation unit 10313, configured to calculate a table title.

Wherein the data cleaning unit 1032 includes:

a cell value cleaning unit 10321 for cleaning the cell values based on the cleaning rule;

a non-numerical value row extracting unit 10322 configured to extract a start row and an end row of all non-numerical value regions in the table;

a non-numerical value column extraction unit 10323 configured to extract start columns and end columns of all non-numerical value regions in the table;

a boundary determining unit 10324 for determining a numerical region boundary of the table on which the non-numerical region acts;

a source index extraction unit 10325 for extracting a source index name;

a mapping establishing unit 10326, configured to establish a mapping index.

Referring to fig. 7, an embodiment of the present invention provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the method when executing the computer program. The program instructions include:

Step S100, converting or decrypting the document to be extracted;

Step S200, automatically recognizing the converted or decrypted file by OCR (optical character recognition), exporting an excel file, and importing the excel file into a database for storage;

Step S300, performing data processing on the excel table;

Step S400, performing anomaly detection on the OCR recognition results of the processed data based on the data characteristics, and manually correcting the detected anomalies;

and Step S500, standardizing the excel table after data processing.

The computer equipment can be a terminal or a server, wherein the terminal can be an electronic equipment with a communication function, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant and a wearable equipment. The server may be an independent server or a server cluster composed of a plurality of servers.

The computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.

The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer programs 5032 include program instructions that, when executed, cause the processor 502 to perform a document structured extraction method.

The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.

The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to execute a document structured extraction method.

The network interface 505 is used for network communication with other devices. Those skilled in the art will appreciate that the configuration shown in fig. 7 is a block diagram of only the portion of the configuration relevant to the present application and does not limit the computer device 500 to which the present application may be applied; a particular computer device 500 may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

Embodiments of the present invention also provide a storage medium storing a computer program comprising program instructions which, when executed by a processor, implement the above-described method. The program instructions include the steps of:

Step S300, data processing is carried out on the excel form;

step S3011, calculating a table boundary;

step S3012, calculating a table unit;

step S3013, calculating a table title;

step S3021, cleaning the values of the cells based on a cleaning rule;

The storage medium may be a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium capable of storing program code.

Compared with the prior art, the embodiments of the invention provide a document structured extraction method and device, computer equipment, and a storage medium, in which an OCR anomaly detection program is used and the recognition results are corrected according to the detection results, so that the structured data is more accurate; the text and tables of a document are recognized by OCR in a single pass, and all table data is then structured in a single pass, which improves the efficiency of data extraction.

The above-mentioned embodiments are merely preferred examples of the present invention, and not intended to limit the present invention, and those skilled in the art can easily make various changes and modifications according to the main concept and spirit of the present invention, so that the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A document structured extraction method is characterized by comprising the following steps:

converting or decrypting the document to be extracted;

processing data of the excel table;

and standardizing the excel table after data processing.

2. The document structured extraction method according to claim 1, wherein the step of processing data of the excel form comprises:

extracting data of the table;

and performing data cleaning on the table.

3. The document structured extraction method of claim 2, wherein the step of performing data extraction on the table comprises:

calculating a table boundary;

calculating a table unit;

and calculating a table title.

4. The document structured extraction method according to claim 2, wherein the step of performing data cleansing on the excel form comprises:

cleaning the values of the cells based on a cleaning rule;

extracting a source index name;

and establishing a mapping index.

5. A document structured extraction device is characterized by comprising the following units:

the data processing unit is used for processing data of the excel form;

6. The document structured extraction apparatus according to claim 5, wherein said data processing unit includes:

the data extraction unit is used for extracting data of the table;

and the data cleaning unit is used for cleaning the table.

7. The document structured extraction apparatus according to claim 6, wherein the data extraction unit includes:

a boundary calculation unit for calculating a table boundary;

a unit calculating unit for calculating a table unit;

and a title calculation unit for calculating the title of the table.

8. The document structured extraction device according to claim 6, wherein the data cleaning unit includes:

a source index extraction unit for extracting a source index name;

and the mapping establishing unit is used for establishing a mapping index.

9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory stores a computer program, the processor realizes the document structured extraction method according to any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the document structured extraction method according to any one of claims 1 to 4.