CN114691919A - Text format auditing module for financial long text rechecking system

Info

Publication number: CN114691919A
Application number: CN202210350155.2A
Authority: CN
Inventors: 马文翔; 朱乐为; 崔子锋
Original assignee: Guangzhou Guxin Intelligent Technology Co ltd
Current assignee: Guangzhou Guxin Intelligent Technology Co ltd
Priority date: 2022-04-02
Filing date: 2022-04-02
Publication date: 2022-07-01

Abstract

A text format auditing module for a financial long text rechecking system rechecks the formats of a financial long text, such as its tables and directory (table of contents), and judges whether the text reads continuously. The module comprises a preprocessing module, a serial number acquisition module, a serial number matching module, a continuity auditing module, a reference auditing module and an output module. Through the cooperation of these modules, the format of a financial long text can be audited quickly: based on natural language processing technology, the audit result is output directly by artificial intelligence, which greatly shortens the audit time and improves audit efficiency and accuracy. The module addresses the case in which a worker deletes or adds paragraphs while modifying a document and forgets to update the paragraph serial numbers, so that the directory becomes discontinuous or a cited directory entry no longer exists. The system mainly detects format errors of this kind and prompts the user, ensuring the continuous readability of the document.

Description

Text format auditing module for financial long text rechecking system

Technical Field

The invention belongs to the field of financial text analysis, and particularly relates to a text format auditing module for a financial long text review system.

Background

A financial long text mainly refers to financial documents such as annual reports, prospectuses and audit reports, which record and evaluate the financial condition of an enterprise over a period of time or present its operating condition to the outside. Such a text is usually written by finance professionals in light of the company's actual situation, and its body contains a large amount of unstructured financial data, consisting mainly of text paragraphs, financial indicators and table data. Financial institutions and enterprises often need to check a financial long text before issuing it to ensure that it is correct. At present, most of this rechecking is done manually. Because financial text data is complicated and demands a high level of expertise, manual auditing is inefficient and error-prone; the workload is heavy and much of the time goes into repetitive work, so auditors tire easily and omissions and errors occur. Replacing manual review with review by artificial intelligence is therefore the trend. Financial text rechecking must check the accuracy of the data and the correctness of the wording, and must also check whether the format of the overall text structure is compliant, whether the titles are continuous, and so on. Format auditing has to run through the whole text and covers many items, so omissions occur easily.
To improve the efficiency and precision of format auditing for financial long texts, a text format auditing module for a financial long text rechecking system is provided. It combines the natural language processing capabilities of artificial intelligence with the computing power of a computer, so as to solve the problems of low efficiency and low precision in format auditing of financial long texts and to save a large amount of audit time.

Disclosure of Invention

In view of the above problems, the invention provides a text format auditing module for a financial long text rechecking system. It analyzes and processes the content and structure of a financial long text and uses artificial intelligence to analyze and compare the overall structure and continuity of the text, thereby auditing the format of the financial long text accurately and efficiently.

In order to achieve the above object, the present invention provides a text format auditing module for a financial long text review system, which is used for auditing the directory and title formats of a financial long text and judging the continuity of the financial long text. The text format auditing module comprises a preprocessing module, a serial number acquisition module, a serial number matching module, a continuity auditing module, a reference auditing module and an output module. The preprocessing module is used for dividing the text data of an input financial long text, analyzing the text data according to an NLP (natural language processing) model to obtain paragraph data, title data, table data and the data relations among them, reading the text directory structure and formatting the title data hierarchically according to the text directory structure; the preprocessing module organizes the analyzed paragraph data, title data and table data according to a predefined data model, outputs them and stores them in a database. The serial number acquisition module is used for acquiring the title serial numbers and the serial number formats and classifying the serial numbers, wherein the categories of title serial numbers comprise directory title serial numbers, text title serial numbers and appendix table title serial numbers. The serial number matching module is used for matching each serial number acquired by the serial number acquisition module with the feature words before and after it and storing the result.
The continuity auditing module is used for auditing the directory title serial numbers, the text title serial numbers and the appendix table title serial numbers acquired by the serial number acquisition module, judging each serial number to be continuous or discontinuous, and at the same time auditing the format of the title data and the format of the corresponding serial number. The reference auditing module is used for auditing the accuracy of the serial numbers cited in the text data according to the relation between the serial numbers and the feature words output by the serial number matching module. The output module outputs and stores the audit results of the continuity auditing module and the reference auditing module.

Preferably, the preprocessing module further includes a text conversion unit, configured to convert the financial long text in the PDF format into a picture format text, stretch and binarize the picture format text, and then obtain text data according to a CV model.
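As a rough illustration of the stretching and binarization step, the sketch below treats a page image as a grid of grayscale values from 0 to 255. It assumes that "stretch" means a min-max contrast stretch and that binarization uses a fixed threshold of 128; the text specifies neither, and the function names are hypothetical.

```python
from typing import List

Gray = List[List[int]]  # rows of grayscale pixel values in 0..255


def stretch_contrast(image: Gray) -> Gray:
    """Linearly rescale pixels so the darkest becomes 0 and the brightest 255."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # uniform image: nothing to stretch
        return [row[:] for row in image]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in image]


def binarize(image: Gray, threshold: int = 128) -> Gray:
    """Map each pixel to 0 (dark) or 255 (light) using a fixed threshold."""
    return [[255 if p >= threshold else 0 for p in row] for row in image]


# A tiny low-contrast "page": values only span 90..170 before stretching.
page = [[90, 100, 160], [95, 150, 170]]
binary = binarize(stretch_contrast(page))
```

Stretching first means the fixed threshold separates dark from light pixels even when the scanned page is washed out; the binary grid would then be handed to whatever CV model performs the text extraction.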

Preferably, the continuity auditing module includes a comparing unit, configured to compare the serial numbers of adjacent titles with the same format and determine whether each serial number is continuous or discontinuous.
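The comparison performed by the comparing unit can be sketched as follows. This is a minimal illustration, assuming the serial numbers of sibling titles that share one format have already been parsed into integers; the function name `find_discontinuities` is hypothetical.

```python
from typing import List, Tuple


def find_discontinuities(numbers: List[int]) -> List[Tuple[int, int]]:
    """Return (previous, current) pairs where current is not previous + 1."""
    return [
        (prev, cur)
        for prev, cur in zip(numbers, numbers[1:])
        if cur != prev + 1
    ]


# Sibling titles that share one serial number format, e.g. "1.", "2.", "4.", "5."
same_format_numbers = [1, 2, 4, 5]
gaps = find_discontinuities(same_format_numbers)
is_continuous = not gaps
```

Here `gaps` holds the pair `(2, 4)`, which is the kind of break left behind when a section "3." is deleted without renumbering the sections after it.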

Preferably, the output module comprises a display unit and a marking unit, and the display unit is used for displaying the auditing result in a labeling form; the marking unit is used for marking sequence number data or title data of the text with problems and displaying the sequence number data or the title data in the display unit.

Preferably, the NLP model is a pre-trained model obtained by training on a large-scale general corpus and a financial corpus.

Preferably, the text format auditing module processes the financial long text by the following steps:

S1: inputting a PDF financial long text;

S2: preprocessing the financial long text by using an NLP model, and storing the processed data in a preset format;

S3: acquiring the title serial numbers and the serial number formats, and classifying the serial numbers;

S4: matching each serial number with the feature words before and after it, and storing the result;

S5: auditing the continuity of the title serial numbers, the title formats and the corresponding serial number formats;

S6: auditing whether the serial numbers cited in the text data are accurate;

S7: outputting and saving the audit results of steps S5 and S6 in a predefined format.

Preferably, the specific steps of using the NLP model to preprocess the financial long text are as follows:

S20: converting the PDF format financial long text into a picture format text, detecting the picture format text according to a CV model to acquire table, header, footer, picture and formula data, and extracting and organizing the text data other than the table, header, footer, picture and formula data;

S21: dividing the text data into paragraph data and title data according to an NLP model, and acquiring the table data in the paragraph data;

S22: performing data cleaning, data length cutting and data extraction position positioning on the paragraph data and the title data, analyzing the processed paragraph data and title data according to the NLP model, extracting the data relations, and outputting and storing them according to a predefined data model;

S23: performing data cleaning and set division on the table data, analyzing the processed table data according to the NLP model, extracting the data relations, and outputting and storing them according to a predefined data model;

S24: acquiring the text directory structure;

S25: reading the title data of S23 and performing title level formatting.

Preferably, the title level formatting method is as follows:

S250: determining the title leading relationship among the title data;

S251: determining the title level according to the title leading relationship;

S252: formatting the titles of different levels.

The beneficial effects of the invention are as follows. The invention provides a text format auditing module for a financial long text rechecking system. The preprocessing module cleans, cuts and classifies the text to be checked and extracts its text data, obtains paragraph data, title data, table data and the corresponding data relations in a preset format through analysis with the NLP model, and formats the title hierarchy. The serial number acquisition module obtains the serial numbers and their formats, and the serial number matching module associates each serial number with the related feature words. The continuity auditing module audits the continuity of the acquired serial numbers to judge the continuity of the titles, and at the same time audits the title format and the corresponding serial number format. The reference auditing module audits the serial numbers cited in the text data and judges citation accuracy through the feature words matched with the serial numbers. Finally, the output module outputs the audit result. Through the cooperation of these modules, the format of a financial long text can be audited quickly. Based on natural language processing technology, the audit result is output directly by artificial intelligence, which greatly shortens the time required for auditing and improves audit efficiency and accuracy. The audit result is clear, so personnel can quickly find and correct problems, and the continuous readability of the document is ensured.

Drawings

Fig. 1 is a block diagram of the text format auditing module of a financial long text review system according to the present invention.

Fig. 2 is a flowchart of the steps by which the text format auditing module provided by the present invention audits a financial long text.

Fig. 3 is a flowchart of the preprocessing S2 provided by the present invention.

Fig. 4 is a flowchart of the title level formatting S25 provided by the present invention.

Detailed Description

To describe the present invention in further detail, reference is now made to the following descriptions taken in conjunction with the accompanying drawing. It is to be noted that the embodiments described below are only a part of the embodiments of the present invention, and not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to Fig. 1, the present invention provides a text format auditing module for a financial long text review system, which is used for auditing the directory and title formats of a financial long text and determining the continuity of the financial long text. The text format auditing module comprises a preprocessing module, a serial number acquisition module, a serial number matching module, a continuity auditing module, a reference auditing module and an output module. Wherein:

The preprocessing module is used for dividing the text data of an input financial long text, analyzing the text data according to an NLP (natural language processing) model to obtain paragraph data, title data, table data and the data relations among them, reading the text directory structure and formatting the title data hierarchically according to the text directory structure; the preprocessing module organizes the analyzed paragraph data, title data and table data according to a predefined data model, outputs them and stores them in a database.
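The text does not define the predefined data model, so the following is only one possible shape for it, sketched with dataclasses. Every class and field name here is an assumption made for illustration, and JSON serialization stands in for the database write.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Title:
    serial: str                  # serial number text, e.g. "3.2"
    serial_format: str           # hypothetical format label, e.g. "decimal"
    text: str                    # title wording without the serial number
    level: Optional[int] = None  # filled in later by title level formatting


@dataclass
class ParsedDocument:
    titles: List[Title] = field(default_factory=list)
    paragraphs: List[str] = field(default_factory=list)
    # Each table is a list of rows; each row is a list of cell strings.
    tables: List[List[List[str]]] = field(default_factory=list)


doc = ParsedDocument(
    titles=[Title("3.2", "decimal", "Operating costs", level=2)],
    paragraphs=["Operating costs rose during the period."],
)
# A plain-dict form that could be written to a database or a JSON file.
serialized = json.dumps(asdict(doc))
```

Keeping titles, paragraphs and tables in one record gives the later audit modules a single structure to read from, whatever template the source document followed.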

The serial number acquisition module is used for acquiring the title serial numbers and the serial number formats and classifying the serial numbers, wherein the categories of title serial numbers comprise directory title serial numbers, text title serial numbers and appendix table title serial numbers.
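One way to picture serial number acquisition is as pattern matching on the start of each title. The sketch below is an assumption-laden illustration, not the module's actual method: the four format names and their regular expressions are hypothetical, and a real document set would need its own list (including Chinese numbering styles).

```python
import re
from typing import Optional, Tuple

# Hypothetical serial number formats, tried in order.
FORMAT_PATTERNS = [
    ("decimal", re.compile(r"^(\d+(?:\.\d+)+)\s")),   # "3.2 Operating costs"
    ("arabic_dot", re.compile(r"^(\d+)\.\s")),        # "3. Revenue"
    ("parenthesized", re.compile(r"^\((\d+)\)\s")),   # "(3) Revenue"
    ("table", re.compile(r"^Table\s+(\d+)\b")),       # "Table 3 Revenue"
]


def extract_serial(title: str) -> Optional[Tuple[str, str]]:
    """Return (format_name, number_text) for the first matching pattern, else None."""
    for name, pattern in FORMAT_PATTERNS:
        match = pattern.match(title)
        if match:
            return name, match.group(1)
    return None


found = extract_serial("3.2 Operating costs")
```

The category of a serial number (directory title, text title or appendix table title) would then follow from where the title was found in the document, which this sketch does not model.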

The serial number matching module is used for matching each serial number acquired by the serial number acquisition module with the feature words before and after it and storing the result.
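A minimal sketch of this matching, assuming that "feature words" can be approximated by the whitespace-separated tokens immediately around a serial number. The window size of two tokens and the function name are hypothetical choices.

```python
import re
from typing import Dict, List

NUMBER = re.compile(r"\d+(?:\.\d+)*")  # "3", "3.2", "3.2.1"


def match_feature_words(text: str, window: int = 2) -> List[Dict[str, object]]:
    """Pair each serial number in text with up to `window` tokens on either side."""
    tokens = text.split()
    records = []
    for i, token in enumerate(tokens):
        cleaned = token.strip(".,;:()")  # drop surrounding punctuation
        if NUMBER.fullmatch(cleaned):
            records.append({
                "serial": cleaned,
                "before": tokens[max(0, i - window):i],
                "after": tokens[i + 1:i + 1 + window],
            })
    return records


records = match_feature_words("see Section 3.2 Operating revenue for details")
```

Each record keeps the serial number together with its surrounding words, which is the association a later reference audit can compare against the title that actually carries that serial number.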

The continuity auditing module is used for auditing the directory title serial numbers, the text title serial numbers and the appendix table title serial numbers acquired by the serial number acquisition module, judging each serial number to be continuous or discontinuous, and at the same time auditing the format of the title data and the format of the corresponding serial number.

The reference auditing module is used for auditing the accuracy of the serial numbers cited in the text data according to the relation between the serial numbers and the feature words output by the serial number matching module.
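The reference audit can be sketched as a lookup followed by a word-overlap test. This is an illustrative simplification: it assumes titles are available as a mapping from serial number to title text, and it treats any shared word between the cited feature words and the title as evidence that the citation is accurate. The three result labels are hypothetical.

```python
from typing import Dict, Iterable


def audit_reference(serial: str, feature_words: Iterable[str],
                    titles: Dict[str, str]) -> str:
    """Classify one citation as 'ok', 'missing' or 'mismatch'."""
    if serial not in titles:
        return "missing"  # the cited serial number does not exist
    title_words = {w.lower() for w in titles[serial].split()}
    cited_words = {w.lower() for w in feature_words}
    # Any shared word counts as agreement between citation and title.
    return "ok" if title_words & cited_words else "mismatch"


titles = {"3.1": "Operating revenue", "3.2": "Operating costs"}
results = [
    audit_reference("3.2", ["operating", "costs"], titles),
    audit_reference("3.3", ["cash", "flow"], titles),
    audit_reference("3.1", ["cash", "flow"], titles),
]
```

The second citation points at a section that no longer exists, and the third points at an existing section whose title shares no words with the citation, the typical result of renumbering after an insertion or deletion.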

The output module outputs and stores the audit results of the continuity auditing module and the reference auditing module.

Preferably, the NLP model is a pre-trained model obtained by training on a large-scale general corpus and a financial corpus.

The text format auditing module for the financial long text review system processes text data with the NLP model, extracts key information such as title data through natural language processing technology, and judges whether the titles in the text are continuous by judging the continuity of the serial numbers corresponding to the titles. For auditing citations, each serial number is associated with feature words, and the serial number serves as the link for judging whether a citation in the text is accurate. Analyzing the text with the NLP model allows the complicated data, wording and tables in a financial long text to be cleaned and organized effectively. Financial texts are usually written in natural language, and the writing requirements and text formats of financial long texts produced by different enterprises and financial institutions differ. Financial long texts that follow different templates therefore need to be converted into the file format predefined by the auditing module before their format is audited, and natural language processing technology is used for this conversion.
First, the text data is cleaned by removing abnormal values, missing values and characters that carry no semantics, and the data is cut and divided by length. The processed data is then analyzed and extracted with the pre-trained NLP model. Because the model has been trained on a large-scale general corpus and a financial corpus, it can extract text, picture and table data and classify them jointly into paragraph data, title data and table data. The NLP model can also recognize the context-rich semantic information of the text and thereby extract data relations. Finally, the text directory structure is read, and the title data is formatted hierarchically and stored in a predefined format for the format auditing module to audit further.
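The cleaning and length-cutting steps can be sketched as below. This is a minimal illustration under stated assumptions: "characters without semantics" is approximated by control characters and redundant whitespace, and the cut is a greedy split at spaces. Both function names are hypothetical.

```python
import re
from typing import List


def clean_text(raw: str) -> str:
    """Drop control characters and collapse runs of whitespace into single spaces."""
    no_control = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", raw)
    return re.sub(r"\s+", " ", no_control).strip()


def cut_by_length(text: str, max_len: int) -> List[str]:
    """Greedily split text at spaces into chunks of up to max_len characters.

    A single word longer than max_len is kept whole rather than broken.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split(" "):
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_len and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


cleaned = clean_text("Total\x00  revenue \n rose\t 5%")
chunks = cut_by_length(cleaned, 13)
```

Cutting at word boundaries keeps each chunk readable on its own, which matters when the chunks are later fed to a model with a limited input length.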

Format auditing of a financial text is divided into continuity auditing and title format auditing. Together they determine the continuity and readability of the text and guard against the case in which a person deletes or adds a section and forgets to modify a title or serial number, leaving the directory discontinuous or a citation pointing at something that does not exist. The serial number acquisition module and the serial number matching module acquire the title serial numbers and their formats and associate each serial number with feature words, in preparation for the subsequent continuity audit and reference audit. The continuity auditing module mainly audits the title format and the continuity of the serial numbers in order to judge the continuity of the titles. Serial numbers are ordered, and the continuity of serial numbers refers to that order: if the order is incorrect the serial numbers are discontinuous, otherwise they are continuous. The serial number format is judged before continuity, because titles are divided into levels and titles at different levels use different serial number formats; continuity is therefore judged among serial numbers of the same format. For a citation in the text, the serial number is matched and associated with the feature words before and after it, so that it can be determined whether the content the serial number points to is related to the content at the place of citation, and the accuracy of the citation is judged accordingly. Finally, the audit result is output in a specific format so that staff can understand it intuitively.

Preferably, the specific steps of preprocessing the financial long text by using the NLP model are as follows:

S20: converting the PDF format financial long text into a picture format text, detecting the picture format text according to a CV model to acquire table, header, footer, picture and formula data, and extracting and organizing the text data other than the table, header, footer, picture and formula data;

S21: dividing the text data into paragraph data and title data according to an NLP model, and acquiring the table data in the paragraph data;

S22: performing data cleaning, data length cutting and data extraction position positioning on the paragraph data and the title data, analyzing the processed paragraph data and title data according to the NLP model, extracting the data relations, and outputting and storing them according to a predefined data model;

S23: performing data cleaning and set division on the table data, analyzing the processed table data according to the NLP model, extracting the data relations, and outputting and storing them according to a predefined data model;

S24: acquiring the text directory structure;

S25: reading the title data of S23 and performing title level formatting.

Preferably, the title level formatting method is as follows:

S250: determining the title leading relationship among the title data;

S251: determining the title level according to the title leading relationship;

S252: formatting the titles of different levels.

The titles in a document are related hierarchically, and the hierarchical relationship may include subordinate relationships and parallel relationships between titles. A subordinate relationship means that, in terms of content logic, one title (the upper title) summarizes the content corresponding to another title (the lower title); a parallel relationship means that the contents summarized by two titles are logically parallel. Classifying the titles of the text by level and formatting them lays the groundwork for the subsequent continuity audit.
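One simple way to derive title levels is from the order in which serial number formats appear. The sketch below is an assumption, not the method the text prescribes: it supposes that each level uses its own serial number format, that a format not currently open starts a deeper (subordinate) level, and that a format already open marks a parallel title at that level. The format labels are hypothetical.

```python
from typing import List


def assign_levels(formats: List[str]) -> List[int]:
    """Assign a 1-based level to each title from the sequence of its formats.

    A format not on the stack opens a deeper level; a format already on the
    stack closes every level beneath it.
    """
    stack: List[str] = []
    levels: List[int] = []
    for fmt in formats:
        if fmt in stack:
            del stack[stack.index(fmt) + 1:]  # return to that format's level
        else:
            stack.append(fmt)                 # first use: one level deeper
        levels.append(len(stack))
    return levels


# Formats of successive titles: "1.", "(1)", "(2)", "2.", "(1)", "a)"
formats = ["arabic_dot", "paren", "paren", "arabic_dot", "paren", "letter"]
levels = assign_levels(formats)
```

With levels assigned, the serial numbers under one parent title can be grouped and checked for continuity separately, so that a numbering restart such as "(1)" under section "2." is not reported as a break.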

The text format auditing module for the financial long text rechecking system extracts and analyzes text data with the NLP model, judges the continuity of the titles with the continuity auditing module while also auditing the title format, and can detect and prompt the user when the directory is discontinuous or a cited directory entry does not exist, thereby ensuring the continuous readability of the document.

The above-disclosed embodiments are merely illustrative of the present invention, which should not be construed as limiting the scope of the invention, and therefore, the present invention is not limited thereto. The scope of the present invention should be determined by the following claims. Any modification, equivalent replacement, improvement or the like that would occur to one skilled in the art and which are within the spirit and principle of this application should be included in the scope of the claims of this application.

Claims

1. A text format auditing module for a financial long text rechecking system is used for auditing the catalog and title formats of a financial long text and judging the continuity of the financial long text, and is characterized in that: the text format auditing module comprises a preprocessing module, a serial number acquiring module, a serial number matching module, a continuity auditing module, a reference auditing module and an output module; wherein,

the preprocessing module is used for dividing text data of an input financial long text, analyzing the text data according to an NLP (natural language processing) model to obtain paragraph data, title data, table data and data relations among the paragraph data, the title data and the table data, reading a text directory structure and carrying out hierarchical formatting on the title data according to the text directory structure; the preprocessing module organizes and outputs the analyzed paragraph data, the analyzed title data and the analyzed table data according to a predefined data model and stores the data into a database;

the serial number acquisition module is used for acquiring a title serial number and a serial number format and classifying the serial numbers, wherein the category of the title serial number comprises a directory title serial number, a text title serial number and an appendix table title serial number;

the serial number matching module is used for matching and storing the serial number acquired by the serial number acquisition module with the characteristic words before and after the serial number;

the continuity auditing module is used for auditing the directory title serial number, the text title serial number and the appendix table title serial number which are acquired by the serial number acquiring module, judging the serial numbers as continuity serial numbers or non-continuity serial numbers, and simultaneously auditing the format of the title data and the format of the corresponding serial number;

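the reference auditing module is used for auditing the accuracy of the serial numbers cited in the text data according to the relation between the serial number output by the serial number matching module and the characteristic word;

the output module outputs and stores the auditing results of the continuity auditing module and the reference auditing module.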

2. The text format auditing module for a financial long text review system of claim 1, characterized in that: the preprocessing module further comprises a text conversion unit which is used for converting the financial long text in the PDF format into a picture format text, stretching and binarizing the picture format text, and then obtaining text data according to a CV model.

3. The text format auditing module for a financial long text review system of claim 1, characterized in that: the continuity auditing module comprises a comparing unit used for comparing the continuity of the serial numbers of adjacent titles with the same format and judging whether the serial numbers are continuous serial numbers or discontinuous serial numbers.

4. The text format auditing module for a financial long text review system of claim 1, characterized in that: the output module comprises a display unit and a marking unit, and the display unit is used for displaying the auditing result in a labeling form; the marking unit is used for marking sequence number data or title data of the text with problems and displaying the sequence number data or the title data in the display unit.

5. The text format auditing module for a financial long text review system of claim 1, characterized in that: the NLP model is a pre-trained model obtained by training on a large-scale general corpus and a financial corpus.

6. The text format auditing module for a financial long text review system of claim 1 that processes financial long text by:

S1: inputting a PDF financial long text;

S2: preprocessing the financial long text by using an NLP model, and storing the processed data in a preset format;

S3: acquiring a title serial number and a serial number format, and classifying the serial numbers;

S4: matching and storing the serial number with the characteristic words before and after the serial number;

S5: checking the continuity of the title serial number, the title format and the corresponding serial number format;

S6: checking whether the serial numbers cited in the text data are accurate;

S7: outputting and saving the auditing results of steps S5 and S6 in a predefined format.

7. The text format auditing module for the financial long text review system of claim 6, wherein the specific steps of using the NLP model to preprocess the financial long text are:

S20: converting the PDF format financial long text into a picture format text, detecting the picture format text according to a CV model to acquire table, header, footer, picture and formula data, and extracting and organizing the text data other than the table, header, footer, picture and formula data;

S21: dividing the text data into paragraph data and title data according to an NLP model, and acquiring the table data in the paragraph data;

S22: performing data cleaning, data length cutting and data extraction position positioning on the paragraph data and the title data, analyzing the processed paragraph data and title data according to the NLP model, extracting the data relations, and outputting and storing them according to a predefined data model;

S23: carrying out data cleaning and set division on the table data, analyzing the processed table data according to the NLP model, extracting a data relation, and outputting and storing according to a predefined data model;

S24: acquiring a text directory structure;

S25: reading the title data of S23 and performing title level formatting.

8. The text format auditing module for a financial long text review system of claim 7, wherein the title level formatting method is:

S250: determining the title leading relationship among the title data;

S251: determining the title level according to the title leading relationship;

S252: formatting the titles of different levels.