CN113672739A

CN113672739A - Data extraction method for image-format financial report documents

Info

Publication number: CN113672739A
Application number: CN202110856109.5A
Authority: CN
Inventors: 江琪; 高翔; 纪达麒; 陈运文
Original assignee: Daguan Intelligent Shenzhen Co ltd
Current assignee: Daguan Intelligent Shenzhen Co ltd
Priority date: 2021-07-28
Filing date: 2021-07-28
Publication date: 2021-11-19

Abstract

The invention discloses a data extraction method for image-format financial report documents, comprising the following steps: extracting subject data; verifying the extracted subject data against configured financial report business formulas; and correcting the subject data that fails verification using several error correction methods, wherein the error correction methods comprise: decimal point error correction, error correction using the OCR candidate set, error correction combining financial report business formulas with unmatched items, neural-network-based error correction, and ant-colony-algorithm-based error correction.

Description

Data extraction method for image-format financial report documents

Technical Field

The invention belongs to the field of text processing, and particularly relates to a data extraction method for image-format financial report documents.

Background

With the continuing development of new technologies such as cloud computing and big data, demand for financial services is strong, online financial services are growing rapidly, and the daily transaction volume carried by financial information systems is huge and rising fast. At the same time, the financial industry faces accelerating change and increasingly intense competition. Financial institutions and internet enterprises hope to expand their business scale by centralizing business and data, and informatization of the financial industry is increasingly on the agenda. Financial auditing is a very important task in the industry, yet it still relies on traditional manual entry and proofreading. This wastes manpower and, because report templates are so varied, requires experienced staff to do the checking. A mature, automated system for entering and proofreading financial data is therefore needed. The problems to be solved include:

1. Diversity of financial report subjects

Because the subjects and formats of financial reports are diverse, different enterprises and organizations use a large number of different financial report formats, and different organizations often use different expressions for subjects with the same meaning.

2. Sample quality problems lead to OCR recognition problems

Because of quality issues in some PDF samples, OCR makes errors when recognizing text and numbers. Common examples are: a digit is recognized as another, similar digit, as a Chinese character, or as some other character; thousands separators and decimal points are misrecognized; Chinese characters are recognized as garbled text or are not recognized at all.

3. Diversity of sign handling in financial report calculations

Some subject amounts are negative numbers, and the subject names themselves may contain words such as "add" or "less" that indicate an operation. Under these two constraints together, the way a formula is calculated is highly flexible.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a data extraction method for image-format financial report documents.

To achieve this purpose, the invention adopts the following technical solution:

A data extraction method for image-format financial report documents comprises the following steps: extracting subject data; verifying the extracted subject data against configured financial report business formulas; and correcting the subject data that fails verification using several error correction methods, wherein the error correction methods comprise: decimal point error correction, error correction using the OCR candidate set, error correction combining financial report business formulas with unmatched items, neural-network-based error correction, and ant-colony-algorithm-based error correction.

Preferably, the subject data is stored in the form of triples, each comprising: subject, subject time, subject amount.

Preferably, extracting the subject data includes: matching the subjects recognized in the document to be extracted against the subject character strings in the extraction template, so as to obtain the subject time and the subject amount corresponding to each subject.

Preferably, a synonym dictionary is configured from a domain dictionary, and the subject character strings are partially or completely replaced using it.

Preferably, the decimal point error correction includes: preprocessing the context data; calculating the proportion of values with decimals in every column; and inferring the decimal point of a value from that proportion.

Preferably, the error correction using the OCR candidate set includes: substituting values from the high-probability candidate set that OCR recognized for a subject amount, and checking whether the formula then balances.

Preferably, the error correction combining the financial report business formula with unmatched items includes: preprocessing the financial report business formula and the unmatched original text; constructing all combinations of unmatched items; and solving for the optimal combination according to the formula difference.

Preferably, the error correction combining the financial report business formula with unmatched items further includes: preprocessing the financial report business formula and the formula's subject values; constructing all combinations of positive and negative signs for the subject values; and solving for the optimal combination according to the formula difference.

Preferably, the neural-network-based error correction includes: constructing a data set from an existing corpus; training an inference model on features such as subject context semantics and position; and verifying the formula by combining the inference result with the amount values, and applying the result if the verification succeeds.

Preferably, the ant-colony-algorithm-based error correction includes: preprocessing and filtering out junk characters based on features such as the amount information and subject information of the unmatched items; constructing an ant colony algorithm from the existing information; solving for the top k combinations according to the formula difference; calculating edit distances against corpus information and filtering out non-conforming combinations; and scoring and ranking by the similarity of all subjects to obtain an optimal solution.

Compared with the prior art, the invention has the following beneficial effects:

1. A domain dictionary is constructed from domain business knowledge and used for matching.

2. To address OCR recognition problems, non-standard financial reports, diverse financial report structures and diverse calculation methods, multiple error correction strategies are applied by combining the OCR data, the formulas used for template verification, and the unmatched items.

3. For formulas that cannot be balanced, suggested items for balancing the formula are obtained using the unmatched items and an algorithm.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic overall flow chart of an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.

In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.

This embodiment provides an extraction method and an extraction and auditing system that combines multiple error correction strategies. The system uses multiple extraction and matching methods and multiple error correction modes, and uses general financial report formulas to correct problems in OCR recognition.

The extraction and error correction system is mainly divided into four parts: an extraction module, a verification module, a rescue strategy module and a data export module.

5.1 Extraction Module

The extraction module mainly extracts triples (time, subject, amount; e.g., 2019, operating cost, 1 billion yuan) to facilitate the subsequent verification and rescue processes.

Extraction flow:

① Clean the subject data and remove interfering characters.

② Extract the data by applying an extraction strategy.

Common extraction strategies:

① Match subjects directly by string-level comparison.

② Configure a synonym dictionary using the domain dictionary.
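The two matching strategies above can be sketched as follows. This is a minimal illustration, assuming the OCR output has already been reduced to (label, time, amount) rows. The names `Triple`, `SYNONYMS`, `normalize` and `extract`, and the dictionary entries, are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    time: str
    amount: float


# Hypothetical synonym dictionary: variant label -> template subject string.
SYNONYMS = {"cost of sales": "operating cost", "turnover": "operating revenue"}


def normalize(label: str) -> str:
    """Drop interfering characters, then map known synonyms to the template subject."""
    cleaned = "".join(ch for ch in label if ch.isalnum() or ch == " ")
    cleaned = " ".join(cleaned.lower().split())
    return SYNONYMS.get(cleaned, cleaned)


def extract(rows, template_subjects):
    """Split rows into matched triples and unmatched items (kept for later rescue)."""
    wanted = set(template_subjects)
    triples, unmatched = [], []
    for label, time, amount in rows:
        subject = normalize(label)
        if subject in wanted:
            triples.append(Triple(subject, time, amount))
        else:
            unmatched.append((label, time, amount))
    return triples, unmatched
```

Rows that match neither directly nor through the synonym dictionary are returned separately, since the later rescue strategies work on unmatched items.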

5.2 Verification Module

This module mainly feeds back on the extraction result through the verification result: whether each formula is balanced, the algebraic sums on the two sides of the formula's equals sign, and the difference between them.
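A balance check of this kind might look like the sketch below. It assumes each formula has the form "total subject = signed sum of term subjects" and that a small rounding tolerance is acceptable. The function name `check_formula`, the formula representation and the tolerance value are illustrative assumptions.

```python
from decimal import Decimal


def check_formula(total_subject, terms, amounts, tolerance=Decimal("0.01")):
    """Check one formula: total_subject == sum(sign * term).

    terms   -- list of (sign, subject) pairs, sign being +1 or -1
    amounts -- dict mapping subject -> Decimal amount (the extracted triples)
    """
    needed = [total_subject] + [subject for _, subject in terms]
    missing = [s for s in needed if s not in amounts]
    if missing:
        # The formula cannot be evaluated until the missing subjects are filled.
        return {"balanced": False, "left": None, "right": None,
                "difference": None, "missing": missing}
    left = amounts[total_subject]
    right = sum((sign * amounts[s] for sign, s in terms), Decimal("0"))
    difference = left - right
    return {"balanced": abs(difference) <= tolerance, "left": left,
            "right": right, "difference": difference, "missing": []}
```

The returned difference is what the rescue strategies later try to explain, for example with a candidate value or a combination of unmatched items.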

5.3 Rescue Error Correction Module

The rescue error correction module includes five parts:

① Decimal point error correction.

② Error correction using the OCR candidate set.

③ Error correction combining formulas with unmatched items.

④ Neural-network-based error correction.

⑤ Ant-colony-algorithm-based error correction.

The rescue error correction strategies are described in the following parts:

5.3.1 Decimal point error correction

For decimal points lost during OCR recognition, the context is used for error correction, as follows:

① Preprocess the context data.

② Calculate the proportion of values with decimals in every column.

③ Infer the decimal point of the value from that proportion.
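One way to read these three steps is sketched below for a single column of OCR strings. The disclosure does not specify the inference rule, so the rule used here is an assumption: if at least `min_ratio` of the values in the column carry a decimal point, values without one are given the column's most common number of decimal places. The function name and threshold are hypothetical.

```python
from collections import Counter


def fix_decimal_points(column, min_ratio=0.5):
    """Restore decimal points that OCR dropped, using the rest of the column as context."""
    # Step 1: preprocess, removing thousands separators and stray spaces.
    cleaned = [v.replace(",", "").replace(" ", "") for v in column]
    # Step 2: proportion of values in the column that contain a decimal point.
    with_point = [v for v in cleaned if "." in v]
    if not cleaned or len(with_point) / len(cleaned) < min_ratio:
        return cleaned
    # Most common number of digits after the decimal point in this column.
    places = Counter(len(v.split(".")[1]) for v in with_point).most_common(1)[0][0]
    # Step 3: infer the decimal point for plain integers that are long enough.
    fixed = []
    for v in cleaned:
        digits = v.lstrip("-")
        if places > 0 and "." not in v and digits.isdigit() and len(digits) > places:
            v = v[:-places] + "." + v[-places:]
        fixed.append(v)
    return fixed
```

A corrected value would still need to pass the formula check before being accepted, because a column may legitimately mix integers and decimals.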

5.3.2 Error correction using the OCR candidate set

The method is specifically as follows:

For a formula that does not balance, the high-probability candidate values that OCR recognized for some of its subjects are substituted in, and the formula is checked again to determine whether it balances.
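The substitution check can be sketched as an exhaustive search over candidate values. This assumes the OCR engine exposes, for each subject amount, a short list of (value, probability) candidates. Choosing the balancing assignment with the highest probability product is an assumption of this sketch, as are all names in it.

```python
from itertools import product


def correct_with_candidates(total, terms, candidates, tolerance=0.005):
    """Return the most probable candidate assignment that balances the formula.

    total      -- subject on the left of the equals sign
    terms      -- list of (sign, subject) pairs on the right
    candidates -- dict: subject -> list of (value, probability) from OCR
    """
    subjects = [total] + [s for _, s in terms]
    best = None
    for combo in product(*(candidates[s] for s in subjects)):
        values = {s: v for s, (v, _) in zip(subjects, combo)}
        difference = values[total] - sum(sign * values[s] for sign, s in terms)
        if abs(difference) <= tolerance:
            score = 1.0
            for _, probability in combo:
                score *= probability
            if best is None or score > best[0]:
                best = (score, values)
    return None if best is None else best[1]
```

The search grows with the product of the candidate list lengths, so in practice the lists would be kept short.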

5.3.3 Error correction combining formulas with unmatched items

5.3.3.1 Error correction combining the formula with unmatched original text and unfilled formula template subjects

① Preprocess the formula and the unmatched original text.

② Construct all combinations of unmatched items.

③ Solve for the optimal combination according to the formula difference.
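The combination search above can be sketched as a brute-force enumeration. It assumes each unmatched item has been reduced to a (text, amount) pair and that "optimal" means the combination whose total is closest to the formula difference. The `max_items` cap and the function name are assumptions added to keep the enumeration small.

```python
from itertools import combinations


def best_unmatched_combination(difference, unmatched, max_items=3):
    """Find the combination of unmatched items whose sum best explains the difference.

    difference -- left side minus right side of the unbalanced formula
    unmatched  -- list of (text, amount) pairs not matched to any template subject
    Returns (chosen items, remaining gap); an empty list means nothing helped.
    """
    best_combo, best_gap = (), abs(difference)
    for size in range(1, min(max_items, len(unmatched)) + 1):
        for combo in combinations(unmatched, size):
            gap = abs(difference - sum(amount for _, amount in combo))
            if gap < best_gap:
                best_combo, best_gap = combo, gap
    return list(best_combo), best_gap
```

For example, with a difference of 9.0 and unmatched amounts 4.0, 5.0 and 12.0, the pair 4.0 + 5.0 closes the gap exactly.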

5.3.3.2 Error correction combining the formula with positive and negative signs of subject amounts so that the formula balances

① Preprocess the formula and the formula's subject values.

② Construct all combinations of positive and negative signs for the subject values.

③ Solve for the optimal combination according to the formula difference.
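The sign enumeration described above can be sketched as follows, assuming the magnitudes of the subject amounts are trusted and only their signs are in doubt. Selecting the assignment with the smallest formula difference is consistent with the summary of this strategy earlier in the text. The function name and data layout are hypothetical.

```python
from itertools import product


def best_sign_combination(total, values):
    """Try every +/- assignment of the term magnitudes and keep the closest one.

    total  -- amount on the left of the equals sign
    values -- list of (subject, magnitude) pairs on the right
    Returns ({subject: sign}, remaining gap).
    """
    best_signs, best_gap = None, None
    for signs in product((1, -1), repeat=len(values)):
        right = sum(sign * magnitude for sign, (_, magnitude) in zip(signs, values))
        gap = abs(total - right)
        if best_gap is None or gap < best_gap:
            best_signs, best_gap = signs, gap
    return dict(zip((name for name, _ in values), best_signs)), best_gap
```

The enumeration has 2^n cases for n subjects, which is manageable for the handful of terms in a typical report formula.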

5.3.4 Neural-network-based error correction

① Construct a data set from the existing corpus.

② Train an inference model on features such as subject context semantics and position. This model can infer subject semantics from contextual features.

③ Verify the formula by combining the inference result with the amount values, and apply the result if the verification succeeds.

5.3.5 Ant-colony-algorithm-based error correction using unmatched items

① Preprocess and filter out junk characters based on features such as the amount information and subject information of the unmatched items.

② Construct an ant colony algorithm from the existing information.

③ Solve for the top k combinations according to the formula difference.

④ Calculate edit distances against the corpus information and filter out non-conforming combinations.

⑤ Score and rank by the similarity of all subjects to obtain the optimal solution.
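The disclosure does not describe how the ant colony algorithm is constructed, so the following is only a rough sketch of the colony construction and top-k steps under stated assumptions: each unmatched amount has an include/exclude pheromone pair, each ant samples a subset, and the iteration-best ant reinforces its choices in proportion to how small its formula gap is. All names and parameter values are hypothetical. The edit-distance filtering and similarity scoring steps are not covered.

```python
import random


def ant_colony_top_k(difference, amounts, k=3, ants=20, iterations=50,
                     evaporation=0.1, seed=0):
    """Search for subsets of `amounts` whose sum is close to `difference`.

    Returns up to k (sorted index list, gap) pairs, smallest gap first.
    """
    rng = random.Random(seed)
    n = len(amounts)
    tau = [[1.0, 1.0] for _ in range(n)]  # pheromone per item: [exclude, include]
    found = {}  # frozenset of chosen indices -> gap
    for _ in range(iterations):
        trails = []
        for _ in range(ants):
            # Each ant includes item i with probability proportional to its pheromone.
            choice = [int(rng.random() < tau[i][1] / (tau[i][0] + tau[i][1]))
                      for i in range(n)]
            gap = abs(difference - sum(a for a, c in zip(amounts, choice) if c))
            trails.append((gap, choice))
            if any(choice):
                found[frozenset(i for i, c in enumerate(choice) if c)] = gap
        for pair in tau:  # evaporation weakens all trails
            pair[0] *= 1 - evaporation
            pair[1] *= 1 - evaporation
        best_gap, best_choice = min(trails, key=lambda trail: trail[0])
        for i, c in enumerate(best_choice):  # reinforce the iteration-best ant
            tau[i][c] += 1.0 / (1.0 + best_gap)
    ranked = sorted(found.items(), key=lambda item: (item[1], sorted(item[0])))
    return [(sorted(indices), gap) for indices, gap in ranked[:k]]
```

Unlike exhaustive enumeration, this search is heuristic and is not guaranteed to find the exact balancing subset. Its appeal is that the cost is fixed by the number of ants and iterations rather than by the number of possible combinations.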

5.4 Data Export Module

This module generates the following data according to the verification results:

① Balanced formulas and unbalanced formulas.

② Unmatched items.

③ Combinations of unmatched items that would bring a formula into balance.

The system application scenario is as follows: to enter the financial reports of a number of enterprises, user A applies the system, which automatically enters and corrects the financial reports. The user then makes any further modifications according to the displayed results, and the data is finally entered into the user's risk control system.

Although the present invention has been described in detail with respect to the above embodiments, it will be understood by those skilled in the art that modifications or improvements based on the disclosure of the present invention may be made without departing from the spirit and scope of the invention, and these modifications and improvements are within the spirit and scope of the invention.

Claims

1. A data extraction method for image-format financial report documents, characterized by comprising the following steps:

extracting subject data;

verifying the extracted subject data against configured financial report business formulas;

and correcting the subject data that fails verification using several error correction methods, wherein the error correction methods comprise: decimal point error correction, error correction using the OCR candidate set, error correction combining financial report business formulas with unmatched items, neural-network-based error correction, and ant-colony-algorithm-based error correction.

2. The data extraction method for image-format financial report documents according to claim 1, wherein the subject data is stored in the form of triples, each comprising: subject, subject time, subject amount.

3. The data extraction method for image-format financial report documents according to claim 2, wherein extracting the subject data includes: matching the subjects recognized in the document to be extracted against the subject character strings in the extraction template, so as to obtain the subject time and the subject amount corresponding to each subject.

4. The data extraction method for image-format financial report documents according to claim 3, wherein a synonym dictionary is configured from a domain dictionary, and the subject character strings are partially or completely replaced using it.

5. The data extraction method for image-format financial report documents according to claim 1, wherein the decimal point error correction comprises: preprocessing the context data; calculating the proportion of values with decimals in every column; and inferring the decimal point of a value from that proportion.

6. The data extraction method for image-format financial report documents according to claim 1, wherein the error correction using the OCR candidate set comprises: substituting values from the high-probability candidate set that OCR recognized for a subject amount, and checking whether the formula then balances.

7. The method of claim 1, wherein the error correction combining the financial report business formula with unmatched items comprises:

preprocessing the financial report business formula and the unmatched original text; constructing all combinations of unmatched items; and solving for the optimal combination according to the formula difference.

8. The method of claim 7, wherein the error correction combining the financial report business formula with unmatched items further comprises:

preprocessing the financial report business formula and the formula's subject values; constructing all combinations of positive and negative signs for the subject values; and solving for the optimal combination according to the formula difference.

9. The method of claim 1, wherein the neural-network-based error correction comprises: constructing a data set from an existing corpus; training an inference model on features such as subject context semantics and position; and verifying the formula by combining the inference result with the amount values, and applying the result if the verification succeeds.

10. The method of claim 1, wherein the ant-colony-algorithm-based error correction comprises: preprocessing and filtering out junk characters based on features such as the amount information and subject information of the unmatched items; constructing an ant colony algorithm from the existing information; solving for the top k combinations according to the formula difference; calculating edit distances against corpus information and filtering out non-conforming combinations; and scoring and ranking by the similarity of all subjects to obtain an optimal solution.