CN113094447A - Structured information extraction method oriented to financial statement image

Info

Publication number: CN113094447A
Application number: CN202110304028.4A
Authority: CN
Inventors: 王博涛; 李蒙阳; 陈磊勇; 孙亚茹; 宋寒; 刘建洋
Original assignee: Beijing Sanhang Technology Co ltd
Current assignee: Beijing Sanhang Technology Co ltd
Priority date: 2021-03-22
Filing date: 2021-03-22
Publication date: 2021-07-09

Abstract

The invention relates to the technical field of image processing in the financial industry, and in particular to a structured information extraction method for financial statement images, which comprises the following steps: S1, collecting all subject data of the financial statement and labeling the subject data with standardized subject categories to obtain a standard financial statement; S2, augmenting the subjects of the standard financial statement using a data augmentation strategy; and S3, training a fastText model on the augmented data. When the OCR output contains partial errors, for example when the edit distance between subjects is small and two subjects differ by only one character, the scheme can effectively eliminate the errors, standardizes subjects with better generalization and robustness, and is suitable for wide application. In particular, financial statement data from a large number of websites can be collected and processed quickly to obtain subject data with consistent subject expressions and few errors.

Description

Structured information extraction method oriented to financial statement image

Technical Field

The invention relates to the technical field of image processing in the financial industry, and in particular to a structured information extraction method for financial statement images.

Background

Financial business is the activity of managing risk. Risk management in financial businesses such as investment and operations is gradually shifting to quantitative analysis and management, and data is the basis for managing risk quantitatively. Financial statement data underpins the business of financial institutions: in fields such as investment, commissioning, and risk control, the ability to acquire high-quality data efficiently confers a business advantage. Publicly disclosed financial statement data, however, still has to be entered manually, which is neither efficient nor reliably accurate.

OCR is an efficient technique for recognizing characters in images and is used in business at scale. Running OCR on a financial statement, however, yields only the characters and figures in the image, not structured data. Three main problems arise: 1. companies word the subjects of their financial statements inconsistently; 2. interference such as seals means OCR cannot be guaranteed to recognize every subject character correctly; 3. subjects carry extraneous text such as "other", "(one)", and "(description)". These uncertainties make it very difficult to standardize financial statement subjects.

Regular-expression matching can only handle subject interference in a fixed format and is almost useless when the OCR output contains partial errors. Soft matching is not feasible either, because the edit distance between different subjects is small and some differ by only a single character. Existing solutions can only add patch after patch as usage scenarios grow, so their robustness is poor. A subject standardization algorithm that generalizes better is needed.
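The failure mode described above can be illustrated with a small sketch. The two standard subjects, the misread OCR string, and the use of `difflib.SequenceMatcher` as a stand-in for soft matching are all assumptions chosen for illustration; none of them comes from the source.

```python
import re
from difflib import SequenceMatcher

# Hypothetical standard subjects: accounts receivable / accounts payable.
# They differ by a single character, so their edit distance is 1.
STANDARD = ["应收账款", "应付账款"]

# Hypothetical OCR output in which one character was misread.
ocr_text = "应牧账款"


def regex_match(text, standard):
    """Return the standard subjects that match the text exactly."""
    return [s for s in standard if re.fullmatch(re.escape(s), text)]


def similarity(a, b):
    """Similarity ratio in [0, 1]; a stand-in for a soft-matching score."""
    return SequenceMatcher(None, a, b).ratio()


exact = regex_match(ocr_text, STANDARD)
scores = {s: similarity(ocr_text, s) for s in STANDARD}

print(exact)   # no exact match survives the OCR error
print(scores)  # both candidates receive the same score
```

Under these assumptions the exact match returns nothing and the similarity scores tie, so neither approach can decide which subject was meant.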

Disclosure of Invention

The invention provides a structured information extraction method for financial statement images, which solves the technical problem that existing text recognition schemes for financial statement images perform poorly.

To solve the above technical problem, the invention provides a structured information extraction method for financial statement images, which comprises the following steps:

S1, collecting all subject data of the financial statement and labeling the subject data with standardized subject categories to obtain a standard financial statement;

S2, augmenting the subjects of the standard financial statement using a data augmentation strategy;

S3, training a fastText model on the augmented data.

Optionally, the data augmentation strategy includes randomly trimming subject characters, randomly replacing subject characters with visually similar characters, and randomly adding subject characters.

Optionally, S3 specifically includes: solving the problem of inconsistent subject expressions by means of Natural Language Processing (NLP).

Optionally, the Natural Language Processing (NLP) approach comprises model selection;

the model selection comprises: in view of the characteristics of the current task, word vectors (embeddings) are chosen as the bottom-layer features, with the word vector dimension set to 50 and the maximum subject length set to 20 characters; a lightweight BiLSTM is selected as the model backbone network and outputs hidden-layer vectors of 256 dimensions; the forward and backward features are concatenated into a 512-dimensional feature vector, and a fully connected layer outputs 288-dimensional logits.

Optionally, the Natural Language Processing (NLP) approach includes a loss function composed of two parts: one part is the ordinary classification cross-entropy loss, and the other is an edit distance loss, added to account for the character-length relationship between variant subjects and standard subjects.

The edit distance loss is defined in terms of the following quantities: 288 is the number of subject classes, y is the true label, p is the predicted label, β is a weighting factor, a and b are the character strings of the true label and the predicted label, i and j are the lengths of a and b respectively, and lev is the string edit distance function.
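The formula itself does not survive in this text. One form consistent with the symbols defined above is sketched here; the way the two terms are combined is an assumption, while the recurrence for lev is the standard Levenshtein definition.

```latex
\mathcal{L} \;=\; -\sum_{k=1}^{288} y_k \log p_k \;+\; \beta \,\mathrm{lev}_{a,b}(i, j)

\mathrm{lev}_{a,b}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0, \\[4pt]
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1, j) + 1 \\
\mathrm{lev}_{a,b}(i, j-1) + 1 \\
\mathrm{lev}_{a,b}(i-1, j-1) + [\,a_i \neq b_j\,]
\end{cases} & \text{otherwise.}
\end{cases}
```

Here $[\,a_i \neq b_j\,]$ is 1 when the i-th character of a differs from the j-th character of b, and 0 otherwise.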

Optionally, the string edit distance function is implemented by a python-Levenshtein function interface.

Optionally, the financial statement is a PDF format file.

Optionally, S1 specifically includes: segmenting the financial statement in PDF format to obtain a plurality of independent original statements, splicing all the original statements to obtain a complete statement, and finally reconstructing the tables of the complete statement by OCR so as to standardize the subjects and obtain the standard financial statement.

Optionally, S2 specifically includes: performing error replacement on the OCR-recognized standard financial statement, specifically by looking up visually similar characters and substituting them at random.

Optionally, S2 specifically includes: simulating missed recognition on the OCR-recognized standard financial statement, specifically by trimming characters from the head, the tail, or the middle of the subject name.

Beneficial effects: the invention provides a structured information extraction method for financial statement images, which comprises the following steps: S1, collecting all subject data of the financial statement and labeling the subject data with standardized subject categories to obtain a standard financial statement; S2, augmenting the subjects of the standard financial statement using a data augmentation strategy; and S3, training a fastText model on the augmented data. When the OCR output contains partial errors, for example when the edit distance between subjects is small and two subjects differ by only one character, the scheme can effectively eliminate the errors, standardizes subjects with better generalization and robustness, and is suitable for wide application. In particular, financial statement data from a large number of websites can be collected and processed quickly to obtain subject data with consistent subject expressions and few errors.

The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings. The detailed description of the present invention is given in detail by the following examples and the accompanying drawings.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic flow chart of the structured information extraction method for financial statement images according to the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the drawings, which are given by way of illustration only and are not intended to limit the scope of the invention. The invention is described in more detail in the following paragraphs by way of example with reference to the accompanying drawings. Advantages and features of the present invention will become apparent from the following description and from the claims. Note that the drawings are in a very simplified form and are not to precise scale; they serve only to illustrate the embodiments of the present invention conveniently and clearly.

It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present. When a component is referred to as being "connected" to another component, it can be directly connected to the other component or intervening components may also be present. When a component is referred to as being "disposed on" another component, it can be directly on the other component or intervening components may also be present. The terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

As shown in FIG. 1, the invention provides a structured information extraction method for financial statement images, which comprises the following steps:

S1, collecting all subject data of the financial statement and labeling the subject data with standardized subject categories to obtain a standard financial statement. Specifically, all subject items of the financial statements are collected first, the standardized subject categories are compiled manually, and the subject data is then labeled with these categories to obtain the standard financial statement. This is the data preparation stage. In one specific implementation, the data comes from the websites of the Shanghai Stock Exchange and the Shenzhen Stock Exchange: a crawler collects the companies' financial statement data from these websites, yielding more than 55,000 statements. The standardized subject categories are then compiled manually. Because different companies' financial reports contain subjects that belong to the same category but are worded differently, conventional rule-based methods cannot align them. The invention therefore first sorts the collected data into subject categories by hand, classifying the subjects of the income statement, the cash flow statement, and the balance sheet into 288 standard categories. The standard category classification is shown in the following table:

The existing financial statement subject data is then labeled with the standardized categories. 100 financial statements are drawn at random, each containing three tables (balance sheet, income statement, and cash flow statement), for 300 subject tables in total. The 300 tables of subject data are labeled manually, with the 288 standard subjects as labels.

S2, augmenting the subjects of the standard financial statement using a data augmentation strategy. Because this task runs on OCR output, factors such as seal interference and blurred scans can cause OCR recognition errors that affect the subsequent subject extraction. Simulating OCR misrecognition during training makes the model resistant to wrong characters and missing characters. Two augmentation methods are used: first, randomly replacing characters with visually similar characters; second, randomly trimming characters from the head and tail (generally 1 to 2 characters).
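The two augmentation methods can be sketched as follows. The lookup table of visually similar characters, the function names, and the choice to trim from one end per call are hypothetical; the source does not give the table or the sampling procedure.

```python
import random

# Hypothetical lookup table of visually similar characters. A real table
# would be much larger; the source does not provide one.
SIMILAR = {"收": ["牧", "攻"], "付": ["什", "讨"], "款": ["欺"], "未": ["末"]}


def replace_similar(name, rng, table=SIMILAR):
    """Replace one randomly chosen character with a visually similar one."""
    positions = [k for k, ch in enumerate(name) if ch in table]
    if not positions:
        return name  # nothing in the table applies; leave the name as is
    k = rng.choice(positions)
    return name[:k] + rng.choice(table[name[k]]) + name[k + 1:]


def trim_ends(name, rng, max_cut=2):
    """Drop 1..max_cut characters from the head or the tail."""
    cut = rng.randint(1, max_cut)
    if cut >= len(name):
        return name  # too short to trim without emptying it
    if rng.choice(["head", "tail"]) == "head":
        return name[cut:]
    return name[:-cut]


def augment(name, n, seed=0):
    """Produce n noisy variants of a standard subject name."""
    rng = random.Random(seed)
    ops = [replace_similar, trim_ends]
    return [rng.choice(ops)(name, rng) for _ in range(n)]


variants = augment("应收账款", 5)
print(variants)
```

Each variant keeps the label of the standard subject it was derived from, so the classifier sees the same category under simulated OCR noise.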

S3, training a fastText model on the augmented data. This addresses inconsistently worded subjects, wrong characters, missing characters, and similar problems.
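A minimal sketch of preparing the augmented data for fastText and training on it is shown below. The character-level tokenization, the file layout, and the hyperparameters are assumptions; the source states only that a fastText model is trained. The `fasttext` package is imported lazily so that the data-preparation helpers run without it.

```python
def to_fasttext_line(label_id, subject):
    """Format one sample in fastText's supervised format.

    Characters are separated by spaces so that each one is a token
    (an assumption; the source does not describe tokenization).
    """
    tokens = " ".join(ch for ch in subject if not ch.isspace())
    return f"__label__{label_id} {tokens}"


def write_training_file(samples, path):
    """Write (label_id, subject) pairs to a fastText training file."""
    with open(path, "w", encoding="utf-8") as fh:
        for label_id, subject in samples:
            fh.write(to_fasttext_line(label_id, subject) + "\n")


def train(path):
    """Train a supervised fastText classifier (hyperparameters are guesses)."""
    import fasttext  # third-party package, imported only when training

    return fasttext.train_supervised(
        input=path, epoch=25, wordNgrams=2, minCount=1
    )


print(to_fasttext_line(12, "应收账款"))
```

A trained model could then be queried with `model.predict("应 牧 账 款")`, using the same character-level tokenization as the training file.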

Optionally, the data augmentation strategy includes randomly trimming subject characters, randomly replacing subject characters with visually similar characters, and randomly adding subject characters. Collected financial statements go through OCR, and factors such as seal interference and blurred scans can cause recognition errors that affect the subsequent subject extraction. Simulating OCR misrecognition therefore further reduces the interference caused by wrong characters and missing characters.

Optionally, the problem of inconsistent subject expressions is solved by means of Natural Language Processing (NLP). NLP is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language, and it draws on linguistics, computer science, and mathematics. The NLP approach here comprises model selection and a loss function.

In view of the characteristics of the current task, word vectors (embeddings) are chosen as the bottom-layer features, with the word vector dimension set to 50 and the maximum subject length set to 20 characters. A lightweight BiLSTM is selected as the model backbone network and outputs hidden-layer vectors of 256 dimensions. The forward and backward features are concatenated into a 512-dimensional feature vector, and a fully connected layer outputs 288 dimensions.
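The dimensions above can be checked with a short calculation. This sketch assumes a single-layer BiLSTM and the textbook LSTM parameterization with one bias vector per gate; the constant and function names are illustrative, and the source states only the dimensions.

```python
MAX_LEN = 20        # longest subject, in characters
EMBED_DIM = 50      # word vector dimension
HIDDEN = 256        # hidden size per LSTM direction
NUM_CLASSES = 288   # standard subject categories


def lstm_params(input_dim, hidden):
    """Parameters of one LSTM direction: four gates, each with input
    weights, recurrent weights, and one bias vector."""
    return 4 * (input_dim * hidden + hidden * hidden + hidden)


def layer_shapes(batch):
    """Tensor shapes at each stage for a batch of subjects."""
    return {
        "embedded": (batch, MAX_LEN, EMBED_DIM),
        "bilstm_features": (batch, 2 * HIDDEN),  # forward + backward
        "logits": (batch, NUM_CLASSES),
    }


bilstm_params = 2 * lstm_params(EMBED_DIM, HIDDEN)
fc_params = 2 * HIDDEN * NUM_CLASSES + NUM_CLASSES

print(layer_shapes(32))
print(bilstm_params, fc_params)
```

Frameworks that keep two bias vectors per gate, as PyTorch does, report a slightly larger LSTM count. The embedding table is left out because the vocabulary size is not given.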

LSTM stands for Long Short-Term Memory and is a kind of RNN (Recurrent Neural Network). Its design makes LSTM well suited to modeling sequential data such as text. BiLSTM is short for Bi-directional Long Short-Term Memory and combines a forward LSTM with a backward LSTM. Both are commonly used to model context in natural language processing tasks.

The loss function consists of two parts: one is the ordinary classification cross-entropy loss; the other is an edit distance loss, added to account for the character-length relationship between variant subjects and standard subjects.

The string edit distance function is implemented through the python-Levenshtein function interface. python-Levenshtein is existing technology, and calling its interface directly is more efficient and convenient.
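The two-part loss can be sketched in plain Python. The exact formula is not present in this text, so the combination below (cross-entropy plus β times the edit distance between the true and predicted subject names) is an assumption, as are the function names and the default β. The pure-Python `levenshtein` stands in for `Levenshtein.distance` from python-Levenshtein, which computes the same quantity.

```python
import math


def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against the target class index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def combined_loss(logits, target, class_names, beta=0.1):
    """Cross-entropy plus beta times the edit distance between the true
    and predicted subject names (assumed form, not the source formula)."""
    predicted = max(range(len(logits)), key=logits.__getitem__)
    dist = levenshtein(class_names[target], class_names[predicted])
    return cross_entropy(logits, target) + beta * dist


names = ["应收账款", "应付账款"]  # hypothetical two-class example
print(combined_loss([0.0, 1.0], 0, names))
```

Because the edit distance term depends on the arg-max prediction, it is not differentiable; in this sketch it acts as a penalty that scales the loss rather than as a source of gradients.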

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims

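1. A structured information extraction method for financial statement images, characterized by comprising the following steps:

S1, collecting all subject data of the financial statement and labeling the subject data with standardized subject categories to obtain a standard financial statement;

S2, augmenting the subjects of the standard financial statement using a data augmentation strategy;

S3, training a fastText model on the augmented data.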

2. The financial statement image-oriented structured information extraction method of claim 1, wherein the data augmentation strategy includes randomly trimming subject characters, randomly replacing subject characters with visually similar characters, and randomly adding subject characters.

3. The financial statement image-oriented structured information extraction method according to claim 1, wherein S3 specifically comprises: solving the problem of inconsistent subject expressions by means of Natural Language Processing (NLP).

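4. The financial statement image-oriented structured information extraction method according to claim 3, wherein the Natural Language Processing (NLP) approach comprises model selection;

the model selection comprises: in view of the characteristics of the current task, word vectors (embeddings) are chosen as the bottom-layer features, with the word vector dimension set to 50 and the maximum subject length set to 20 characters; a lightweight BiLSTM is selected as the model backbone network and outputs hidden-layer vectors of 256 dimensions; the forward and backward features are concatenated into a 512-dimensional feature vector, and a fully connected layer outputs 288-dimensional logits.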

5. The financial statement image-oriented structured information extraction method according to claim 4, wherein the Natural Language Processing (NLP) approach comprises a loss function composed of two parts: one part is the ordinary classification cross-entropy loss, and the other is an edit distance loss, added to account for the character-length relationship between variant subjects and standard subjects; the edit distance loss is:

6. The financial statement image-oriented structured information extraction method according to claim 5, wherein the string edit distance function is implemented by a python-Levenshtein function interface.

7. A financial statement image-oriented structured information extraction method as claimed in claim 1, wherein said financial statement is a PDF formatted file.

8. The financial statement image-oriented structured information extraction method according to claim 7, wherein S1 specifically comprises: segmenting the financial statement in PDF format to obtain a plurality of independent original statements, splicing all the original statements to obtain a complete statement, and finally reconstructing the tables of the complete statement by OCR so as to standardize the subjects and obtain the standard financial statement.

9. The financial statement image-oriented structured information extraction method according to claim 8, wherein S2 specifically comprises: performing error replacement on the OCR-recognized standard financial statement, specifically by looking up visually similar characters and substituting them at random.

10. The financial statement image-oriented structured information extraction method according to claim 8, wherein S2 specifically comprises: simulating missed recognition on the OCR-recognized standard financial statement, specifically by trimming characters from the head, the tail, or the middle of the subject name.