CN113094447A - Structured information extraction method oriented to financial statement image - Google Patents

Info

Publication number
CN113094447A
Authority
CN
China
Prior art keywords
financial statement
subject
data
structured information
information extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110304028.4A
Other languages
Chinese (zh)
Inventor
王博涛
李蒙阳
陈磊勇
孙亚茹
宋寒
刘建洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sanhang Technology Co ltd
Original Assignee
Beijing Sanhang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-03-22
Filing date
2021-03-22
Publication date
2021-07-09
Application filed by Beijing Sanhang Technology Co ltd
Priority to CN202110304028.4A
Publication of CN113094447A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Abstract

The invention relates to the technical field of image processing in the financial industry, and in particular to a structured information extraction method for financial statement images, comprising the following steps: S1, collecting all subject data of the financial statements and labeling the subject data with standardized subject categories to obtain standard financial statements; S2, augmenting the subjects of the standard financial statements using a data augmentation strategy; and S3, training a fastText model on the augmented data. When the OCR output contains partial errors, for example when two subjects are separated by a small edit distance and differ by only one character, the scheme can effectively eliminate such errors, standardizes subjects with better generalization and robustness, and is suitable for wide application. In particular, financial statement data from a large number of websites can be collected and processed quickly to obtain subject data with consistent subject wording and few errors.

Description

Structured information extraction method oriented to financial statement image
Technical Field
The invention relates to the technical field of image processing in the financial industry, and in particular to a structured information extraction method for financial statement images.
Background
Finance is the business of managing risk. Risk management in financial activities such as investment and operations is gradually shifting toward quantitative analysis and management, and data is the basis for managing risk quantitatively. Financial statement data underpins the business of financial institutions: in fields such as investment, commissioning, and risk control, whoever can acquire high-quality data efficiently gains a business advantage. Publicly disclosed financial statement data, however, still has to be entered manually, which is neither efficient nor reliable.
OCR (optical character recognition) is an efficient technique for recognizing text in images and is widely used in large-scale business applications. Running OCR on a financial statement yields only the characters and figures in the image; it does not directly produce structured data. Three main problems remain: 1. companies word the subjects of their financial statements inconsistently; 2. interference from seals and similar artifacts means OCR cannot be guaranteed to recognize every subject character correctly; 3. subject text contains extraneous fragments such as "other", "(one)", and "(description)". These uncertainties make it very difficult to standardize financial statement subjects.
Approaches based on regular-expression matching can only handle subject interference of a fixed format, and they become almost useless when the OCR output contains partial errors. Because the edit distance between distinct subjects is small, sometimes a single character, soft matching is not feasible either. Existing solutions can only keep adding patches as usage scenarios grow, and their robustness is poor. A subject standardization algorithm that generalizes better is therefore needed.
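To make this failure mode concrete, the sketch below uses two real balance-sheet subjects that differ by a single character, 应收账款 (accounts receivable) and 应付账款 (accounts payable). The misread OCR string and the helper function are hypothetical illustrations and are not taken from the patent.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Two standard subjects that differ by one character (illustrative choice).
STANDARD = ["应收账款", "应付账款"]

# Hypothetical OCR output in which the second character was misread.
ocr_text = "应攸账款"

# Exact pattern matching finds nothing once a single character is wrong.
regex_hits = [s for s in STANDARD if re.fullmatch(re.escape(s), ocr_text)]

# Soft matching by edit distance cannot decide between the two candidates,
# because the misread string is equally close to both.
distances = {s: levenshtein(ocr_text, s) for s in STANDARD}

print(regex_hits)
print(distances)
```

Both candidates end up at the same distance from the misread string, which is the ambiguity that motivates a learned classifier over pure string matching.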
Disclosure of Invention
The invention provides a structured information extraction method for a financial statement image, which solves the technical problem that the text recognition scheme for the financial statement image is poor in effect.
To solve the above technical problem, the invention provides a structured information extraction method oriented to financial statement images, comprising the following steps:
S1, collecting all subject data of the financial statements and labeling the subject data with standardized subject categories to obtain standard financial statements;
S2, augmenting the subjects of the standard financial statements using a data augmentation strategy;
and S3, training a fastText model on the augmented data.
Optionally, the data augmentation strategy includes randomly cropping subject text, randomly replacing subject text with near-synonyms, and randomly adding subject text.
Optionally, S3 specifically includes: resolving inconsistent subject wording by means of natural language processing (NLP).
Optionally, the NLP approach includes model selection;
the model selection is as follows: in view of the characteristics of the task, word vectors (embeddings) are used as the bottom-layer features, with the embedding dimensionality set to 50 and the maximum subject length set to 20 characters; a lightweight BiLSTM is used as the model backbone, with a 256-dimensional hidden-state output per direction; the forward and backward features are concatenated into a 512-dimensional feature vector, and a fully connected layer outputs 288-dimensional logits.
Optionally, the NLP approach includes a loss function composed of two parts: one part is the ordinary classification cross-entropy loss, and the other part is an edit distance loss, added to take into account the character-length relationship between variant subjects and standard subjects. The edit distance loss is:
[Formula image BDA0002987395310000031 is not reproduced in this text.]
where 288 is the number of standard subject categories, y is the true label, p is the predicted label, β is a weighting factor, a and b are the character strings of the true label and the predicted label, i and j are the lengths of the strings a and b, and lev is the string edit distance function.
Optionally, the string edit distance function is implemented through the python-Levenshtein function interface.
Optionally, the financial statement is a file in PDF format.
Optionally, S1 specifically includes: segmenting the financial statement in PDF format to obtain a number of independent original statements, stitching all of the original statements together to obtain a complete statement, and finally reconstructing the tables of the complete statement by OCR so as to standardize the subjects and obtain a standard financial statement.
Optionally, S2 specifically includes: performing erroneous-character replacement on the OCR-recognized standard financial statement, specifically by looking up visually similar characters and substituting them at random.
Optionally, S2 specifically includes: simulating missed characters in the OCR-recognized standard financial statement, specifically by trimming characters from the head, tail, or middle of the subject name.
Beneficial effects: the invention provides a structured information extraction method oriented to financial statement images, comprising the following steps: S1, collecting all subject data of the financial statements and labeling the subject data with standardized subject categories to obtain standard financial statements; S2, augmenting the subjects of the standard financial statements using a data augmentation strategy; and S3, training a fastText model on the augmented data. When the OCR output contains partial errors, for example when two subjects are separated by a small edit distance and differ by only one character, the scheme can effectively eliminate such errors, standardizes subjects with better generalization and robustness, and is suitable for wide application. In particular, financial statement data from a large number of websites can be collected and processed quickly to obtain subject data with consistent subject wording and few errors.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings. The detailed description of the present invention is given in detail by the following examples and the accompanying drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic flow chart of the structured information extraction method for financial statement images according to the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention. The invention is described in more detail in the following paragraphs by way of example with reference to the accompanying drawings. Advantages and features of the present invention will become apparent from the following description and from the claims. It is to be noted that the drawings are in a very simplified form and are not to precise scale, which is merely for the purpose of facilitating and distinctly claiming the embodiments of the present invention.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present. When a component is referred to as being "connected" to another component, it can be directly connected to the other component or intervening components may also be present. When a component is referred to as being "disposed on" another component, it can be directly on the other component or intervening components may also be present. The terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
As shown in FIG. 1, the invention provides a structured information extraction method oriented to financial statement images, comprising the following steps:
S1, collecting all subject data of the financial statements and labeling the subject data with standardized subject categories to obtain standard financial statements. Specifically, all subject items of the financial statements are first collected, the standardized subject categories are compiled manually, and the subject data is then labeled with those categories to obtain the standard financial statements. This is the data preparation stage. In one specific implementation, the data comes from the websites of the Shanghai Stock Exchange and the Shenzhen Stock Exchange: a web crawler collects companies' financial statement data from these websites, yielding more than 55,000 statements. The standardized subject categories are then compiled manually. Because different companies' financial statements contain subjects that belong to the same category but are worded differently, conventional rule-based methods cannot align them. The invention therefore first sorts the subjects in the collected data manually, classifying those of the income statement, the cash flow statement, and the balance sheet into 288 standard categories. Examples of the standard categories are shown in the following table:
Figure BDA0002987395310000061
The existing financial statement subject data is then labeled with the standardized categories. 100 financial statements are drawn at random, each containing three tables (balance sheet, income statement, and cash flow statement), for a total of 300 tables of subject data. These 300 tables are labeled manually, with each subject assigned one of the 288 standard categories.
S2, augmenting the subjects of the standard financial statements using a data augmentation strategy. Because this task runs downstream of OCR, recognition errors caused by seal interference, blurred scans, and similar factors affect the subsequent subject extraction. Simulating OCR misrecognition makes the model resistant to wrong and missing characters. Two augmentation methods are used: first, randomly replacing characters with visually similar characters; second, randomly cropping characters from the head and tail (generally 1 to 2 characters).
S3, training a fastText model on the augmented data. This addresses inconsistent subject wording, wrong characters, missing characters, and similar problems.
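As a rough illustration of how labeled subjects could be prepared for fastText supervised training, the sketch below writes each sample in the `__label__<name> text` line format that fastText expects. The sample subjects, the category ids, and the choice to split Chinese text into single characters are all assumptions made for this example; the patent does not specify them.

```python
def to_fasttext_line(subject: str, category_id: int) -> str:
    """Format one labeled subject as a fastText supervised-training line.

    fastText expects each line to hold one or more ``__label__<name>`` tokens
    followed by whitespace-separated text tokens. Splitting the subject into
    single characters is an assumption made here for Chinese text.
    """
    tokens = " ".join(ch for ch in subject if not ch.isspace())
    return f"__label__{category_id} {tokens}"


# Hypothetical (subject, category id) pairs; the real 288-category table is
# given in a figure of the patent and is not reproduced here.
samples = [("应收账款", 12), ("应收帐款", 12), ("货币资金", 3)]

lines = [to_fasttext_line(text, cat) for text, cat in samples]
for line in lines:
    print(line)

# With the `fasttext` Python package installed, a file containing such lines
# could be passed to fasttext.train_supervised(input="train.txt").
```

Character-level tokens keep a single misread character from changing every feature of a short subject name, which is one plausible reason to tokenize this way.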
Optionally, the data augmentation strategy includes randomly cropping subject text, randomly replacing subject text with near-synonyms, and randomly adding subject text. The collected financial statements are processed by OCR, and OCR may make recognition errors due to seal interference, blurred scans, and similar factors, which affects the subsequent subject extraction. Simulating OCR misrecognition during training therefore further reduces the interference caused by wrong and missing characters.
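The three augmentation operations described above (replacement, cropping, and addition) might be sketched as follows. The lookup table of visually similar characters, the noise characters, and all function names are hypothetical; the patent does not give the actual table or the sampling probabilities.

```python
import random

# Hypothetical lookup of visually similar characters; a real table would be
# far larger and is not specified in the patent.
SIMILAR = {"收": ["攸", "牧"], "账": ["帐", "胀"], "款": ["欺"]}


def replace_similar(text: str, rng: random.Random) -> str:
    """Swap one character for a visually similar one, if any is known."""
    positions = [i for i, ch in enumerate(text) if ch in SIMILAR]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + rng.choice(SIMILAR[text[i]]) + text[i + 1:]


def crop_ends(text: str, rng: random.Random, max_cut: int = 2) -> str:
    """Drop 1 to max_cut characters from the head or the tail."""
    cut = rng.randint(1, max_cut)
    if len(text) <= cut:  # never return an empty subject
        return text
    return text[cut:] if rng.random() < 0.5 else text[:-cut]


def insert_noise(text: str, rng: random.Random, noise: str = "一二其他") -> str:
    """Insert one stray character at a random position."""
    i = rng.randint(0, len(text))
    return text[:i] + rng.choice(noise) + text[i:]


def augment(text: str, n: int, seed: int = 0) -> list[str]:
    """Generate n noisy variants of one standard subject name."""
    rng = random.Random(seed)
    ops = [replace_similar, crop_ends, insert_noise]
    return [rng.choice(ops)(text, rng) for _ in range(n)]


variants = augment("应收账款", n=5)
print(variants)
```

Each variant keeps the label of the standard subject it was derived from, so the classifier sees the same category under simulated OCR noise.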
Optionally, inconsistent subject wording is resolved by means of natural language processing (NLP). NLP is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language, and it draws on linguistics, computer science, and mathematics. The NLP approach here comprises model selection and a loss function.
In view of the characteristics of the task, word vectors (embeddings) are used as the bottom-layer features, with the embedding dimensionality set to 50 and the maximum subject length set to 20 characters. A lightweight BiLSTM is used as the model backbone, with a 256-dimensional hidden-state output per direction. The forward and backward features are concatenated into a 512-dimensional feature vector, and a fully connected layer outputs 288 dimensions.
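The following sketch only does the dimension bookkeeping for the architecture just described, so that the chain 20 × 50 → 256 + 256 → 512 → 288 can be checked by hand. It is not a trainable model. The vocabulary size is a hypothetical placeholder, and the parameter count assumes one bias vector per LSTM gate.

```python
from dataclasses import dataclass


@dataclass
class SubjectClassifierDims:
    """Sizes stated in the text; vocab_size is a hypothetical placeholder."""
    vocab_size: int = 5000   # assumption: not given in the patent
    max_len: int = 20        # longest subject, in characters
    embed_dim: int = 50      # embedding dimensionality
    hidden_dim: int = 256    # hidden size of each LSTM direction
    num_classes: int = 288   # standard subject categories

    def lstm_params_one_direction(self) -> int:
        # Four gates, each with input weights, recurrent weights and one bias.
        # (Some libraries keep two bias vectors per gate, which adds
        # 4 * hidden_dim more parameters per direction.)
        return 4 * (self.embed_dim * self.hidden_dim
                    + self.hidden_dim * self.hidden_dim
                    + self.hidden_dim)

    def shapes(self) -> dict[str, tuple[int, ...]]:
        """Per-sample tensor shapes at each stage (batch dimension omitted)."""
        return {
            "char_ids": (self.max_len,),
            "embedded": (self.max_len, self.embed_dim),
            "forward_state": (self.hidden_dim,),
            "backward_state": (self.hidden_dim,),
            "concatenated": (2 * self.hidden_dim,),
            "logits": (self.num_classes,),
        }

    def param_counts(self) -> dict[str, int]:
        """Trainable parameters per layer under the assumptions above."""
        return {
            "embedding": self.vocab_size * self.embed_dim,
            "bilstm": 2 * self.lstm_params_one_direction(),
            "fully_connected": (2 * self.hidden_dim * self.num_classes
                                + self.num_classes),
        }


dims = SubjectClassifierDims()
print(dims.shapes())
print(dims.param_counts())
```

Under these assumptions the recurrent layers dominate the parameter budget, which is consistent with the text calling the backbone lightweight only in a relative sense.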
LSTM stands for Long Short-Term Memory and is a type of RNN (recurrent neural network). Its design makes LSTM well suited to modeling sequential data such as text. BiLSTM is short for Bi-directional Long Short-Term Memory and combines a forward LSTM with a backward LSTM. Both are often used to model context in natural language processing tasks.
The loss function is composed of two parts: one part is the ordinary classification cross-entropy loss, and the other is an edit distance loss, added to take into account the character-length relationship between variant subjects and standard subjects:
[Formula image BDA0002987395310000081 is not reproduced in this text.]
where 288 is the number of standard subject categories, y is the true label, p is the predicted label, β is a weighting factor, a and b are the character strings of the true label and the predicted label, i and j are the lengths of the strings a and b, and lev is the string edit distance function.
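Because the formula itself survives only as an image reference, the sketch below shows one plausible reading of the symbols defined above: cross-entropy over the class probabilities plus β times the edit distance between the true and predicted label strings. The additive form, the default β, the label names, and the pure-Python `levenshtein` helper (a stand-in for python-Levenshtein) are all assumptions.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance; stand-in for python-Levenshtein."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_loss(logits: list[float], true_idx: int,
                  label_names: list[str], beta: float = 0.1) -> float:
    """Cross-entropy plus beta * edit distance (assumed additive form)."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[true_idx])
    pred_idx = max(range(len(probs)), key=probs.__getitem__)
    a = label_names[true_idx]   # string of the true label
    b = label_names[pred_idx]   # string of the predicted label
    return cross_entropy + beta * levenshtein(a, b)


# Hypothetical three-class example; the real task has 288 classes.
names = ["应收账款", "应付账款", "货币资金"]
loss_correct = combined_loss([2.0, 1.0, 0.0], true_idx=0, label_names=names)
loss_wrong = combined_loss([1.0, 2.0, 0.0], true_idx=0, label_names=names)
print(loss_correct, loss_wrong)
```

As written, the edit-distance term depends on the arg-max prediction and therefore contributes no gradient of its own; how the patent handles this is not stated in the text.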
The string edit distance function is implemented through the python-Levenshtein function interface. python-Levenshtein is an existing library, so calling its interface directly is efficient and convenient.
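For reference, the textbook recursive definition of the Levenshtein distance is given below, with lev_{a,b}(i, j) denoting the distance between the first i characters of a and the first j characters of b. This is the standard definition and is not copied from the patent's formula image; it is included only to show what the edit distance function computes.

```latex
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0,\\[4pt]
\min
\begin{cases}
\operatorname{lev}_{a,b}(i-1,j) + 1\\
\operatorname{lev}_{a,b}(i,j-1) + 1\\
\operatorname{lev}_{a,b}(i-1,j-1) + \mathbf{1}_{(a_i \neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
```

Evaluating it at i = |a| and j = |b| gives the distance between the full strings.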
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. A structured information extraction method oriented to financial statement images, characterized by comprising the following steps:
S1, collecting all subject data of the financial statements and labeling the subject data with standardized subject categories to obtain standard financial statements;
S2, augmenting the subjects of the standard financial statements using a data augmentation strategy;
and S3, training a fastText model on the augmented data.
2. The financial statement image-oriented structured information extraction method according to claim 1, wherein the data augmentation strategy includes randomly cropping subject text, randomly replacing subject text with synonyms, and randomly adding subject text.
3. The financial statement image-oriented structured information extraction method according to claim 1, wherein S3 specifically comprises: resolving inconsistent subject wording by means of natural language processing (NLP).
4. The financial statement image-oriented structured information extraction method according to claim 3, characterized in that the natural language processing (NLP) approach includes model selection;
the model selection is as follows: in view of the characteristics of the task, word vectors (embeddings) are used as the bottom-layer features, with the embedding dimensionality set to 50 and the maximum subject length set to 20 characters; a lightweight BiLSTM is used as the model backbone, with a 256-dimensional hidden-state output per direction; the forward and backward features are concatenated into a 512-dimensional feature vector, and a fully connected layer outputs 288-dimensional logits.
5. The financial statement image-oriented structured information extraction method according to claim 4, wherein the natural language processing (NLP) approach includes a loss function composed of two parts: one part is the ordinary classification cross-entropy loss, and the other part is an edit distance loss, added to take into account the character-length relationship between variant subjects and standard subjects; the edit distance loss is:
[Formula image FDA0002987395300000011 is not reproduced in this text.]
where 288 is the number of standard subject categories, y is the true label, p is the predicted label, β is a weighting factor, a and b are the character strings of the true label and the predicted label, i and j are the lengths of the strings a and b, and lev is the string edit distance function.
6. The financial statement image-oriented structured information extraction method according to claim 5, wherein the string edit distance function is implemented through the python-Levenshtein function interface.
7. The financial statement image-oriented structured information extraction method according to claim 1, wherein the financial statement is a file in PDF format.
8. The financial statement image-oriented structured information extraction method according to claim 7, wherein S1 specifically comprises: segmenting the financial statement in PDF format to obtain a number of independent original statements, stitching all of the original statements together to obtain a complete statement, and finally reconstructing the tables of the complete statement by OCR so as to standardize the subjects and obtain a standard financial statement.
9. The financial statement image-oriented structured information extraction method according to claim 8, wherein S2 specifically comprises: performing erroneous-character replacement on the OCR-recognized standard financial statement, specifically by looking up visually similar characters and substituting them at random.
10. The financial statement image-oriented structured information extraction method according to claim 8, wherein S2 specifically comprises: simulating missed characters in the OCR-recognized standard financial statement, specifically by trimming characters from the head, tail, or middle of the subject name.
CN202110304028.4A 2021-03-22 2021-03-22 Structured information extraction method oriented to financial statement image Pending CN113094447A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110304028.4A CN113094447A (en) 2021-03-22 2021-03-22 Structured information extraction method oriented to financial statement image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110304028.4A CN113094447A (en) 2021-03-22 2021-03-22 Structured information extraction method oriented to financial statement image

Publications (1)

Publication Number Publication Date
CN113094447A true CN113094447A (en) 2021-07-09

Family

ID=76669023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110304028.4A Pending CN113094447A (en) 2021-03-22 2021-03-22 Structured information extraction method oriented to financial statement image

Country Status (1)

Country Link
CN (1) CN113094447A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113672739A (en) * 2021-07-28 2021-11-19 达而观智能(深圳)有限公司 Data extraction method for image-format financial report documents

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909226A (en) * 2019-11-28 2020-03-24 达而观信息科技(上海)有限公司 Financial document information processing method and device, electronic equipment and storage medium
CN112036145A (en) * 2020-09-01 2020-12-04 平安国际融资租赁有限公司 Financial statement identification method and device, computer equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111985229B (en) Sequence labeling method and device and computer equipment
CN111767732B (en) Document content understanding method and system based on graph attention model
CN113177124A (en) Vertical domain knowledge graph construction method and system
CN113312108B (en) SWIFT message verification method and device, electronic equipment and storage medium
CN116049397B (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
WO2023071745A1 (en) Information labeling method, model training method, electronic device and storage medium
CN110245349A (en) A kind of syntax dependency parsing method, apparatus and a kind of electronic equipment
US20220076109A1 (en) System for contextual and positional parameterized record building
Kuang et al. Visual information extraction in the wild: practical dataset and end-to-end solution
CN113094447A (en) Structured information extraction method oriented to financial statement image
CN115130437B (en) Intelligent document filling method and device and storage medium
US20220335335A1 (en) Method and system for identifying mislabeled data samples using adversarial attacks
CN114818718A (en) Contract text recognition method and device
CN113094446A (en) Subject information extraction method oriented to financial statement image
CN111046934B (en) SWIFT message soft clause recognition method and device
CN114239576A (en) Issue label classification method based on topic model and convolutional neural network
CN114356924A (en) Method and apparatus for extracting data from structured documents
Tamrin et al. Simultaneous detection of regular patterns in ancient manuscripts using GAN-Based deep unsupervised segmentation
CN117332761B (en) PDF document intelligent identification marking system
CN117332180B (en) Method, equipment and storage medium for intelligent writing of research report based on large language model
CN112651246B (en) Service demand conflict detection method integrating deep learning and workflow modes
CN115391569B (en) Method for automatically constructing industry chain map from research report and related equipment
US20240152699A1 (en) Enhanced named entity recognition (ner) using custombuilt regular expression (regex) matcher and heuristic entity ruler
US11783605B1 (en) Generalizable key-value set extraction from documents using machine learning models
CN116523032B (en) Image text double-end migration attack method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination