CN113672739A - Data extraction method for image-format financial report documents - Google Patents


Info

Publication number
CN113672739A
CN113672739A CN202110856109.5A CN202110856109A CN113672739A CN 113672739 A CN113672739 A CN 113672739A CN 202110856109 A CN202110856109 A CN 202110856109A CN 113672739 A CN113672739 A CN 113672739A
Authority
CN
China
Prior art keywords
subject
error correction
formula
financial
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110856109.5A
Other languages
Chinese (zh)
Inventor
江琪
高翔
纪达麒
陈运文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daguan Intelligent Shenzhen Co ltd
Original Assignee
Daguan Intelligent Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daguan Intelligent Shenzhen Co ltd
Priority to CN202110856109.5A
Publication of CN113672739A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention discloses a data extraction method for image-format financial report documents, comprising the following steps: extracting subject data; verifying the extracted subject data against configured financial report business formulas; and correcting the subject data that fails verification using multiple error correction methods, wherein the error correction methods comprise: decimal point error correction, error correction using the OCR candidate set, error correction combining the financial report business formulas with unmatched items, neural-network-based error correction, and ant-colony-algorithm-based error correction.

Description

Data extraction method for image-format financial report documents
Technical Field
The invention belongs to the field of text processing, and particularly relates to a data extraction method for image-format financial report documents.
Background
With the continued development of technologies such as cloud computing and big data, demand for financial services is strong and online financial services are growing rapidly; the transaction volume that financial information systems carry each day is huge and rising quickly. At the same time, the financial industry faces accelerating change and increasingly intense competition. Financial institutions and internet companies hope to expand their business through business and data centralization, and informatization of the financial industry is increasingly on the agenda. Financial auditing is a very important task in this industry, yet it still relies on traditional manual entry and proofreading. This wastes manpower, and because report templates are so varied, experienced staff are needed for checking. A mature, automated system for entering and proofreading financial data is therefore needed. The existing problems include:
1. Diversity of financial report subjects
Because the subjects and formats of financial reports are diverse, different enterprises and organizations use a large number of different financial report formats, and different organizations often use different wording for subjects that have the same meaning.
2. Sample quality problems lead to OCR recognition problems
Because some PDF samples are of poor quality, OCR makes errors when recognizing text and numbers. Common examples are: a digit recognized as another similar-looking digit, as a Chinese character, or as some other character; errors in recognizing thousands separators and decimal points; and Chinese characters recognized as garbled text or not recognized at all.
3. Diversity of sign handling in financial report calculations
Some subject amounts are recorded as negative numbers, while some subject names themselves contain words such as "add" or "less" that indicate an operation. With both conventions in play, the way a formula is calculated varies considerably.
Disclosure of Invention
To address the problems in the prior art, the invention provides a data extraction method for image-format financial report documents.
To achieve this purpose, the invention adopts the following technical solution:
A data extraction method for image-format financial report documents comprises the following steps: extracting subject data; verifying the extracted subject data against configured financial report business formulas; and correcting the subject data that fails verification using multiple error correction methods, wherein the error correction methods comprise: decimal point error correction, error correction using the OCR candidate set, error correction combining the financial report business formulas with unmatched items, neural-network-based error correction, and ant-colony-algorithm-based error correction.
Preferably, the subject data is stored as triples comprising: subject, subject time, and subject amount.
Preferably, extracting the subject data includes: matching the subjects recognized in the document to be extracted against the subject strings in the extraction template, so as to obtain the subject time and subject amount corresponding to each subject.
Preferably, a synonym dictionary is configured from a domain dictionary, and the subject strings are partially or completely replaced.
Preferably, the decimal point error correction includes: preprocessing the contextual data; calculating the proportion of decimal values in every column; and inferring the decimal point of a value from that proportion.
Preferably, the error correction using the OCR candidate set includes: substituting the high-probability candidates that OCR recognized for a subject amount and checking whether the formula then balances.
Preferably, the error correction combining the financial report business formulas with unmatched items includes: preprocessing the financial report business formula and the unmatched source text; constructing all combinations of unmatched items; and solving for the optimal combination according to the formula difference.
Preferably, the error correction combining the financial report business formulas with unmatched items further includes: preprocessing the financial report business formula and the values of the formula's subjects; constructing all combinations of positive and negative signs for the subject values; and solving for the optimal combination according to the formula difference.
Preferably, the neural-network-based error correction includes: constructing a data set from the existing corpus; training an inference model on features such as subject context semantics and position; and combining the inference result with the amount values to verify the formula, applying the result if verification succeeds.
Preferably, the ant-colony-algorithm-based error correction includes: preprocessing and filtering out junk characters based on the amount information and subject information features of the unmatched items; constructing the ant colony algorithm from the existing information; solving for the best top-k combinations according to the formula difference; calculating edit distances using corpus information and filtering out non-conforming combinations; and scoring and ranking by the similarity of all subjects to obtain the optimal solution.
Compared with the prior art, the invention has the following beneficial effects:
1. A domain dictionary is built from domain business knowledge and used for matching.
2. To address OCR recognition problems, non-standard financial reports, diverse report structures, and diverse calculation methods, multiple error correction strategies are applied that combine the OCR data, the formulas used for template verification, and the unmatched items.
3. For formulas that cannot be balanced, suggested balancing items are obtained from the unmatched items by an algorithm.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic overall flow chart of an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.
This embodiment provides an extraction method and an extraction-and-audit system that combines multiple error correction strategies. The system uses several extraction matching methods and several error correction modes, and uses general financial report formulas to correct problems introduced by OCR recognition.
The extraction and error correction system is divided into four main parts: an extraction module, a verification module, a rescue strategy module, and a data export module.
5.1 Extraction module
The extraction module extracts triples (time, subject, amount), e.g. (2019, operating cost, 1 billion yuan), for use in the subsequent verification and rescue steps.
Extraction flow:
① Clean the subject data and remove interfering characters.
② Extract the data by applying an extraction strategy.
Common extraction strategies:
① Match subjects directly by string-level comparison.
② Configure a synonym dictionary using the domain dictionary.
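The extraction flow and the two matching strategies can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `SubjectRecord`, `SYNONYMS`, `normalize`, and `extract`, the sample subjects, and the cleaning rule (keep only letters, digits, and spaces) are all assumptions made for the example.

```python
from typing import NamedTuple

class SubjectRecord(NamedTuple):
    """One extracted triple: (time, subject, amount)."""
    time: str
    subject: str
    amount: float

# Hypothetical synonym dictionary: variant wording -> template subject string.
SYNONYMS = {
    "cost of sales": "operating cost",
    "business cost": "operating cost",
    "revenue": "operating revenue",
}

def normalize(raw: str) -> str:
    """Remove interfering characters, then map synonyms to the template wording."""
    kept = "".join(ch for ch in raw if ch.isalnum() or ch == " ")
    cleaned = " ".join(kept.lower().split())
    return SYNONYMS.get(cleaned, cleaned)

def extract(rows, template_subjects, time):
    """Split OCR rows into matched triples and unmatched items."""
    records, unmatched = [], []
    for raw_subject, amount in rows:
        subject = normalize(raw_subject)
        if subject in template_subjects:      # string-level match against the template
            records.append(SubjectRecord(time, subject, amount))
        else:
            unmatched.append((raw_subject, amount))
    return records, unmatched

rows = [("Cost of sales:", 600.0), ("* Revenue", 1000.0), ("Misc. note", 5.0)]
template = {"operating cost", "operating revenue"}
records, unmatched = extract(rows, template, "2019")
```

Keeping the unmatched rows alongside the matched triples is a design choice of this sketch; it mirrors the later rescue steps, which draw on unmatched items.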
5.2 Verification module
This module feeds back on the extraction result through the verification result: whether each formula balances, the algebraic sum on each side of the formula's equals sign, and the difference between the two sides.
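A formula check of this kind might look like the sketch below. The representation of a formula as two lists of `(sign, subject)` terms, the function name `check_formula`, the tolerance of 0.01, and the sample formula are assumptions for illustration only.

```python
from decimal import Decimal

def check_formula(lhs_terms, rhs_terms, amounts, tolerance=Decimal("0.01")):
    """Return the sum of each side of a formula, their difference, and a balance flag.

    Each term is a (sign, subject) pair; subjects missing from `amounts` count as 0.
    """
    def side_total(terms):
        return sum(
            (sign * amounts.get(subject, Decimal(0)) for sign, subject in terms),
            Decimal(0),
        )

    lhs, rhs = side_total(lhs_terms), side_total(rhs_terms)
    diff = lhs - rhs
    return {"lhs": lhs, "rhs": rhs, "diff": diff, "balanced": abs(diff) <= tolerance}

# Hypothetical formula: gross profit = operating revenue - operating cost
lhs_terms = [(1, "gross profit")]
rhs_terms = [(1, "operating revenue"), (-1, "operating cost")]

good = check_formula(lhs_terms, rhs_terms, {
    "gross profit": Decimal("400.00"),
    "operating revenue": Decimal("1000.00"),
    "operating cost": Decimal("600.00"),
})
bad = check_formula(lhs_terms, rhs_terms, {
    "gross profit": Decimal("480.00"),   # e.g. an OCR misread of 400.00
    "operating revenue": Decimal("1000.00"),
    "operating cost": Decimal("600.00"),
})
```

`Decimal` is used here instead of `float` so that amounts in currency units compare exactly; the non-zero `diff` of an unbalanced formula is what the rescue strategies would then try to explain.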
5.3 Rescue error correction module
The rescue error correction module includes five parts:
① Decimal point error correction.
② Error correction using the OCR candidate set.
③ Error correction combining formulas with unmatched items.
④ Neural-network-based error correction.
⑤ Ant colony algorithm.
Each strategy is described below.
5.3.1 Decimal point error correction
When OCR recognition loses a decimal point, the surrounding context is used to correct the error:
① Preprocess the contextual data (the surrounding values).
② Calculate the proportion of decimal values in every column.
③ Infer the decimal point of the value from that proportion.
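One possible reading of these three steps is sketched below. The source does not define how the proportion is computed or applied, so the threshold `min_ratio`, the use of the most common number of decimal places in the column, and the function names are assumptions.

```python
from collections import Counter

def infer_decimal_places(column, min_ratio=0.6):
    """Return the dominant number of decimal places, or None if too few values have one."""
    with_point = [value for value in column if "." in value]
    if not column or len(with_point) / len(column) < min_ratio:
        return None
    places = Counter(len(value.split(".")[1]) for value in with_point)
    return places.most_common(1)[0][0]

def restore_decimal_point(value, places):
    """Re-insert a lost decimal point so that `value` has `places` decimal digits."""
    if "." in value or not places:
        return value
    digits = value.replace(",", "")
    sign = "-" if digits.startswith("-") else ""
    digits = digits.lstrip("-").rjust(places + 1, "0")   # pad so "5" becomes "005"
    return f"{sign}{digits[:-places]}.{digits[-places:]}"

# "60000" is assumed to be 600.00 with its decimal point lost by OCR.
column = ["1,000.00", "60000", "400.00", "25.50"]
places = infer_decimal_places(column)
fixed = [restore_decimal_point(value, places) for value in column]
```

In this sketch a corrected value would still need to pass the formula check before being accepted, since a column can legitimately mix whole numbers and decimals.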
5.3.2 Error correction using the OCR candidate set
The method is as follows:
For a formula that does not balance, the high-probability candidate values that OCR produced for some subjects are substituted in, and the formula is re-checked to see whether it now balances.
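The substitution check can be sketched as below, assuming the OCR engine exposes alternative readings with confidence scores. The data layout (`candidates` mapping a subject to `(value, confidence)` pairs), the restriction to single-subject substitutions, and the simple "sum of left side equals sum of right side" formula are assumptions of the example.

```python
from decimal import Decimal

def residual(amounts, lhs, rhs):
    """Difference between the summed left-side and right-side subjects."""
    left = sum((amounts[s] for s in lhs), Decimal(0))
    right = sum((amounts[s] for s in rhs), Decimal(0))
    return left - right

def correct_with_candidates(amounts, candidates, lhs, rhs, tolerance=Decimal("0.01")):
    """Try single-subject substitutions, most confident candidate first."""
    trials = [
        (confidence, subject, value)
        for subject, options in candidates.items()
        for value, confidence in options
    ]
    for _, subject, value in sorted(trials, key=lambda trial: -trial[0]):
        trial_amounts = {**amounts, subject: value}
        if abs(residual(trial_amounts, lhs, rhs)) <= tolerance:
            return subject, value
    return None   # no single substitution balances the formula

# Hypothetical formula: total assets = current assets + non-current assets
amounts = {
    "total assets": Decimal("1000"),
    "current assets": Decimal("300"),
    "non-current assets": Decimal("100"),   # assumed OCR misread of 700
}
candidates = {
    "non-current assets": [(Decimal("700"), 0.35), (Decimal("190"), 0.10)],
    "current assets": [(Decimal("800"), 0.20)],
}
fix = correct_with_candidates(
    amounts, candidates, ["total assets"], ["current assets", "non-current assets"]
)
```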
5.3.3 Error correction combining formulas with unmatched items
5.3.3.1 Error correction combining the formula with unmatched source text and unfilled formula template subjects
① Preprocess the formula and the unmatched source text.
② Construct all combinations of unmatched items.
③ Solve for the optimal combination according to the formula difference.
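Steps ② and ③ amount to a subset search over the unmatched items. The sketch below enumerates combinations exhaustively with `itertools.combinations`; the cap `max_size`, the function name, and the sample items are assumptions, and the source does not say how "optimal" is scored beyond the formula difference.

```python
from decimal import Decimal
from itertools import combinations

def best_unmatched_combination(diff, unmatched, max_size=3):
    """Find the combination of unmatched items whose amounts best explain `diff`.

    `unmatched` is a list of (name, amount) pairs. Returns (combination, remaining
    difference); the empty combination is the baseline.
    """
    best_combo, best_remaining = (), abs(diff)
    for size in range(1, min(max_size, len(unmatched)) + 1):
        for combo in combinations(unmatched, size):
            total = sum((amount for _, amount in combo), Decimal(0))
            remaining = abs(diff - total)
            if remaining < best_remaining:
                best_combo, best_remaining = combo, remaining
    return best_combo, best_remaining

# The formula is short by 150.00; two unmatched rows together explain the gap.
unmatched = [
    ("other receivables", Decimal("100.00")),
    ("prepayments", Decimal("50.00")),
    ("goodwill", Decimal("75.00")),
]
combo, remaining = best_unmatched_combination(Decimal("150.00"), unmatched)
```

The number of combinations grows exponentially with the number of unmatched items, which is why this sketch caps the combination size.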
5.3.3.2 Balancing the formula by correcting the signs of the subject amounts
① Preprocess the formula and the values of the formula's subjects.
② Construct all combinations of positive and negative signs for the subject values.
③ Solve for the optimal combination according to the formula difference.
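The sign search can be sketched with `itertools.product` over `(+1, -1)` for each subject. The function name, the sample formula, and the choice to keep the first assignment found among ties are assumptions of this example.

```python
from decimal import Decimal
from itertools import product

def best_sign_assignment(total, terms):
    """Choose +/- signs for each term so that the signed sum is closest to `total`.

    `terms` is a list of (name, unsigned amount) pairs.
    """
    best_signs, best_diff = None, None
    for signs in product((1, -1), repeat=len(terms)):
        value = sum((sign * amount for sign, (_, amount) in zip(signs, terms)), Decimal(0))
        diff = abs(total - value)
        if best_diff is None or diff < best_diff:
            best_signs, best_diff = signs, diff
    return best_signs, best_diff

# Hypothetical report where deductions are printed as positive amounts.
terms = [
    ("operating revenue", Decimal("1000")),
    ("operating cost", Decimal("600")),
    ("taxes and surcharges", Decimal("150")),
]
signs, diff = best_sign_assignment(Decimal("250"), terms)
```

With n subjects there are 2^n sign assignments, so this exhaustive form is only practical for the short formulas typical of a single report section.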
5.3.4 Neural-network-based error correction
① Construct a data set from the existing corpus.
② Train an inference model on features such as subject context semantics and position. The model can infer a subject's semantics from its contextual features.
③ Combine the inference result with the amount values to verify the formula, and apply the result if verification succeeds.
5.3.5 Ant colony algorithm based on unmatched items
① Preprocess and filter out junk characters based on features of the unmatched items such as amount information and subject information.
② Construct the ant colony algorithm from the existing information.
③ Solve for the best top-k combinations according to the formula difference.
④ Calculate edit distances using corpus information and filter out non-conforming combinations.
⑤ Score and rank by the similarity of all subjects to obtain the optimal solution.
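Steps ② to ④ might be sketched as below. This is a deliberately simplified ant-colony-style search (one pheromone value per item, evaporation, reinforcement by the best ant of each iteration, no heuristic term), followed by a Levenshtein filter. The source gives no parameters or pheromone rule, so every name and constant here is an assumption, and the final similarity scoring of step ⑤ is not shown.

```python
import random

def ant_colony_top_k(target, amounts, k=3, ants=20, iterations=30,
                     evaporation=0.3, seed=0):
    """Search subsets of `amounts` whose sum is close to `target`; return the best k."""
    rng = random.Random(seed)
    n = len(amounts)
    pheromone = [0.5] * n      # chance that an ant picks item i
    found = {}                 # subset (tuple of indices) -> residual
    for _ in range(iterations):
        best_subset, best_residual = None, None
        for _ in range(ants):
            subset = tuple(i for i in range(n) if rng.random() < pheromone[i])
            residual = abs(target - sum(amounts[i] for i in subset))
            found[subset] = residual
            if best_residual is None or residual < best_residual:
                best_subset, best_residual = subset, residual
        for i in range(n):
            # Evaporate, then deposit on items used by this iteration's best ant.
            deposit = evaporation if i in best_subset else 0.0
            updated = (1 - evaporation) * pheromone[i] + deposit
            pheromone[i] = min(0.9, max(0.1, updated))   # keep some exploration
    ranked = sorted(found.items(), key=lambda item: (item[1], len(item[0]), item[0]))
    return ranked[:k]

def edit_distance(a, b):
    """Levenshtein distance computed with one rolling row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def plausible(subset, names, corpus, max_distance=2):
    """Keep a combination only if every item name is near some known subject."""
    return all(
        min(edit_distance(names[i], known) for known in corpus) <= max_distance
        for i in subset
    )

# Amounts in cents, to avoid floating-point comparison issues.
names = ["other receivables", "prepayments", "goodwi11", "page 12", "x"]
amounts_cents = [10000, 5000, 7500, 1234, 99]
corpus = ["other receivables", "prepayments", "goodwill"]
top = ant_colony_top_k(15000, amounts_cents)
kept = [(subset, res) for subset, res in top if plausible(subset, names, corpus)]
```

Because the search is randomized, the sketch fixes a seed for repeatability and makes no claim about which combinations end up in `top`; only the ranking by residual is guaranteed.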
5.4 Data export module
This module generates the following data from the verification results:
① Balanced and unbalanced formulas.
② Unmatched items.
③ Combinations of unmatched items that would bring a formula into balance.
The system is applied as follows: to enter the financial reports of many enterprises, user A uses the system, which enters and corrects the reports automatically. The user then makes any modifications based on the displayed results, and the data is finally entered into the user's risk control system.
Although the present invention has been described in detail with respect to the above embodiments, it will be understood by those skilled in the art that modifications or improvements based on the disclosure of the present invention may be made without departing from the spirit and scope of the invention, and these modifications and improvements are within the spirit and scope of the invention.

Claims (10)

1. A data extraction method for image-format financial report documents, characterized by comprising the following steps:
extracting subject data;
verifying the extracted subject data against configured financial report business formulas;
and correcting the subject data that fails verification using multiple error correction methods, wherein the error correction methods comprise: decimal point error correction, error correction using the OCR candidate set, error correction combining the financial report business formulas with unmatched items, neural-network-based error correction, and ant-colony-algorithm-based error correction.
2. The data extraction method for image-format financial report documents according to claim 1, wherein the subject data is stored as triples comprising: subject, subject time, and subject amount.
3. The data extraction method for image-format financial report documents according to claim 2, wherein extracting the subject data includes: matching the subjects recognized in the document to be extracted against the subject strings in the extraction template, so as to obtain the subject time and subject amount corresponding to each subject.
4. The data extraction method for image-format financial report documents according to claim 3, wherein a synonym dictionary is configured from a domain dictionary, and the subject strings are partially or completely replaced.
5. The data extraction method for image-format financial report documents according to claim 1, wherein the decimal point error correction comprises: preprocessing the contextual data; calculating the proportion of decimal values in every column; and inferring the decimal point of a value from that proportion.
6. The data extraction method for image-format financial report documents according to claim 1, wherein the error correction using the OCR candidate set comprises: substituting the high-probability candidates that OCR recognized for a subject amount and checking whether the formula then balances.
7. The method of claim 1, wherein the error correction combining the financial report business formulas with unmatched items comprises:
preprocessing the financial report business formula and the unmatched source text; constructing all combinations of unmatched items; and solving for the optimal combination according to the formula difference.
8. The method of claim 7, wherein the error correction combining the financial report business formulas with unmatched items further comprises:
preprocessing the financial report business formula and the values of the formula's subjects; constructing all combinations of positive and negative signs for the subject values; and solving for the optimal combination according to the formula difference.
9. The method of claim 1, wherein the neural-network-based error correction comprises: constructing a data set from the existing corpus; training an inference model on features such as subject context semantics and position; and combining the inference result with the amount values to verify the formula, applying the result if verification succeeds.
10. The method of claim 1, wherein the ant-colony-algorithm-based error correction comprises: preprocessing and filtering out junk characters based on the amount information and subject information features of the unmatched items; constructing the ant colony algorithm from the existing information; solving for the best top-k combinations according to the formula difference; calculating edit distances using corpus information and filtering out non-conforming combinations; and scoring and ranking by the similarity of all subjects to obtain the optimal solution.
CN202110856109.5A 2021-07-28 2021-07-28 Data extraction method for image-format financial report documents Pending CN113672739A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110856109.5A CN113672739A (en) 2021-07-28 2021-07-28 Data extraction method for image-format financial report documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110856109.5A CN113672739A (en) 2021-07-28 2021-07-28 Data extraction method for image-format financial report documents

Publications (1)

Publication Number Publication Date
CN113672739A true CN113672739A (en) 2021-11-19

Family

ID=78540438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110856109.5A Pending CN113672739A (en) 2021-07-28 2021-07-28 Data extraction method for image-format financial report documents

Country Status (1)

Country Link
CN (1) CN113672739A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023212278A1 (en) * 2022-04-28 2023-11-02 R.P. Scherer Technologies, Llc Data analysis and reporting systems and methods

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150093033A1 (en) * 2013-09-30 2015-04-02 Samsung Electronics Co., Ltd. Method, apparatus, and computer-readable recording medium for converting document image captured by using camera to dewarped document image
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition
CN110909226A (en) * 2019-11-28 2020-03-24 达而观信息科技(上海)有限公司 Financial document information processing method and device, electronic equipment and storage medium
CN112015727A (en) * 2020-09-01 2020-12-01 民生科技有限责任公司 Automatic checking and correcting system and method for financial statement data and readable storage device
CN112036145A (en) * 2020-09-01 2020-12-04 平安国际融资租赁有限公司 Financial statement identification method and device, computer equipment and readable storage medium
US20200394431A1 (en) * 2019-06-13 2020-12-17 Wipro Limited System and method for machine translation of text
CN112668571A (en) * 2020-12-08 2021-04-16 安徽经邦软件技术有限公司 Financial statement recognition system based on artificial intelligence OCR technology
CN113094447A (en) * 2021-03-22 2021-07-09 北京三行科技有限公司 Structured information extraction method oriented to financial statement image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination