WO2019237540A1

WO2019237540A1 - Method and device for acquiring financial data, terminal device, and medium

Info

Publication number: WO2019237540A1
Application number: PCT/CN2018/105532
Authority: WO
Inventors: 苏晓明; 汪伟; 王晓伟; 徐冰; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-06-12
Filing date: 2018-09-13
Publication date: 2019-12-19
Also published as: CN109062874A; CN109062874B

Abstract

A method and device for acquiring financial data, a terminal device, and a medium, applicable to the technical field of data processing, which reduce the difficulty of acquiring corporate financial data and achieve multidimensional acquisition of financial data. The method comprises: acquiring pre-released text to be analyzed, the initial format of said text being the Portable Document Format (PDF) (S101); converting said text from PDF to the DOC format via a preset text conversion tool (S102); acquiring a text encoding corresponding to said text on the basis of said text in the DOC format, where the text encoding comprises multiple types of page labels (S103); searching for a table label among the page labels and locating, on the basis of the text position to which the table label belongs, a table in said text (S104); extracting field values associated with the table and table description information (S105); and outputting the table description information and the field values to a pre-created text file, thus allowing a service system to perform recognition processing on the text file and thereby acquire the financial data associated with said text (S106).

Description

Method, device, terminal equipment and medium for obtaining financial data

This application claims priority to Chinese patent application No. 201810600697.4, filed with the Chinese Patent Office on June 12, 2018 and entitled "Methods for Obtaining Financial Data, Terminal Equipment and Media", the entire contents of which are incorporated herein by reference.

Technical field

The present application belongs to the technical field of data processing, and particularly relates to a method, an apparatus, a terminal device, and a computer-readable storage medium for acquiring financial data.

Background technique

Documents such as quarterly reports, annual reports and prospectuses are public documents of an enterprise. These public documents contain a large amount of valuable financial data, for example, corporate accounts receivable, accounts payable, income and expenditure status, profit and loss amounts, and overall debt status. After reprocessing and analysis, these financial data have great reference value. For example, in various applications, they can be used to independently analyze the operating status of an enterprise and to determine the state of the industrial chain of the industry to which the enterprise belongs.

However, because public documents such as quarterly reports, annual reports, and prospectuses are complex, the industry has not yet disclosed a way to automatically extract and analyze financial data from these public documents. Therefore, multi-dimensional acquisition of financial data cannot be achieved.

Technical problem

In view of this, embodiments of the present application provide a method, an apparatus, a terminal device, and a medium for acquiring financial data, so as to solve the problem that multiple-dimensional acquisition of financial data cannot be achieved in the prior art.

Technical solutions

A first aspect of the embodiments of the present application provides a method for acquiring financial data, including:

Obtaining a pre-published text to be analyzed, an initial format of the text to be analyzed is a portable document pdf format;

Converting a text format of the text to be analyzed from the pdf format to a document doc format through a preset text conversion tool;

Obtaining a text encoding corresponding to the text to be analyzed based on the text to be analyzed in the doc format; wherein the text encoding includes multiple types of page tags;

Finding a form tag among the page tags, and locating a form existing in the text to be analyzed according to a text position to which the form tag belongs;

Extracting each field value and form description information associated with the form;

Outputting the form description information and each of the field values to a pre-created text document, so that the business system obtains the financial data associated with the text to be analyzed after the text document is identified.

A second aspect of the embodiments of the present application provides an apparatus for acquiring financial data, and the apparatus includes units for executing the method for acquiring financial data described in the first aspect.

A third aspect of the embodiments of the present application provides a terminal device including a memory and a processor. The memory stores computer-readable instructions executable on the processor, and when the processor executes the computer-readable instructions, the steps of the method for obtaining financial data described in the first aspect are implemented.

A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps of the method for obtaining financial data described in the first aspect are implemented.

Beneficial effect

In the embodiments of the present application, public documents such as prospectuses, annual reports, and quarterly reports exist in the pdf format when originally downloaded. By converting the text format of these public documents to the doc format, the text encoding corresponding to the text to be analyzed can be read, so that the location area to which a table belongs can be determined according to the table tag in the text encoding, which realizes automatic positioning of the table. In such public documents, the data contained in tables is usually financial data of high mining value. Therefore, after the position of each table is located, the field values associated with the table and the table description information are extracted and output to a pre-created text document; because text documents are highly compatible, other business systems can read and analyze the text document. This achieves rapid analysis of corporate financial data and avoids the need to read corporate financial data from public documents with complex styles, thereby reducing the difficulty of obtaining corporate financial data. Because the business system can automatically identify, through the text document, the financial data contained in various types of public documents, multi-dimensional acquisition of financial data is also achieved compared with the prior art.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an implementation flowchart of a method for acquiring financial data provided by an embodiment of the present application;

FIG. 2 is a detailed implementation flowchart of step S104 of the method for obtaining financial data according to an embodiment of the present application;

FIG. 3 is a detailed implementation flowchart of step S105 of the method for obtaining financial data provided by an embodiment of the present application;

FIG. 4 is another specific implementation flowchart of step S105 of the method for obtaining financial data according to an embodiment of the present application;

FIG. 5 is an implementation flowchart of a method for acquiring financial data provided by another embodiment of the present application;

FIG. 6 is a structural block diagram of an apparatus for acquiring financial data provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a terminal device according to an embodiment of the present application.

Embodiments of the invention

In order to explain the technical solution described in this application, the following description is made through specific embodiments.

FIG. 1 illustrates an implementation flow of a method for acquiring financial data provided by an embodiment of the present application. The method flow includes steps S101 to S106. The specific implementation principle of each step is as follows:

S101: Obtain a pre-published text to be analyzed, and an initial format of the text to be analyzed is a portable document pdf format.

In the embodiment of the present application, the texts to be analyzed are public documents issued by enterprises, including quarterly reports, annual reports, and prospectuses. The text to be analyzed is downloaded periodically from the corresponding public website according to preset website information. When enterprises create these public documents, they output them in the Portable Document Format (PDF), so the text to be analyzed downloaded from the public website is in PDF format.

S102: Convert a text format of the text to be analyzed from the pdf format to a document doc format by using a preset text conversion tool.

Each text to be analyzed in pdf format is imported into a preset text conversion tool, and after a format conversion instruction sent by the user is detected, the text to be analyzed is output in the document (doc) format. The text conversion tool may be, for example, a Foxit converter, a PDF converter, or a quick converter.

S103: Obtain a text encoding corresponding to the text to be analyzed based on the text to be analyzed in the doc format; wherein the text encoding includes multiple types of page tags.

For the text to be analyzed in doc format, the text encoding of the text to be analyzed is read. The text encoding contains many types of page tags, such as table tags and paragraph tags.

S104: Find a form label in the page label, and locate a form existing in the text to be analyzed according to a text position to which the form label belongs.

In the embodiment of the present application, the text encoding corresponding to the text to be analyzed is traversed to sequentially detect various types of page tags appearing in the text encoding through a preset regular expression. And, among the detected page tags, each form tag is located based on a tag character element corresponding to the form tag.

If any table label in the text to be analyzed is located, it is determined that the text encoding adjacent to the table label is the text encoding that matches a table in the text to be analyzed. Therefore, according to the text position to which the table label belongs, the position of the table in the text to be analyzed can be determined.
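The regular-expression scan described above can be sketched as follows. The source does not say which markup the text encoding uses, so the tag names `tbl` (table) and `p` (paragraph), as well as the function names, are assumptions made only for illustration.

```python
import re

# Hypothetical tag names: "tbl" and "p" stand in for whatever table and
# paragraph tags the real text encoding contains.
TAG_PATTERN = re.compile(r"<(tbl|p)\b[^>]*>")


def find_page_tags(text_encoding):
    """Return (tag_name, character_offset) for every opening page tag."""
    return [(m.group(1), m.start()) for m in TAG_PATTERN.finditer(text_encoding)]


def table_positions(text_encoding):
    """Offsets of table tags only; each offset locates one table."""
    return [pos for name, pos in find_page_tags(text_encoding) if name == "tbl"]


sample = "<p>Revenue summary</p><tbl><tr><tc>2017</tc></tr></tbl><p>Notes</p>"
tags = find_page_tags(sample)
positions = table_positions(sample)
```

Here the offset of each table tag plays the role of the "text position to which the table label belongs".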

As an embodiment of the present application, FIG. 2 shows a specific implementation process of the method S104 for obtaining financial data provided by an embodiment of the present application, which is detailed as follows:

S1041: traverse each coding block in the text coding in sequence.

S1042: For each of the coding blocks, determine whether a page tag type corresponding to the coding block is a table type.

S1043: If the page tag type corresponding to the coding block is a table type, set the attribute value of the built-in flag bit to a logical truth value to mark the text position corresponding to the coding block as the starting position of the table.

S1044: Return to the operation of traversing each coding block in the text encoding in sequence until the page tag type corresponding to the extracted coding block is a non-table type and a non-null value, and mark the text position corresponding to that coding block as the end position of the table.

In the embodiment of the present application, the text encoding includes a plurality of encoding blocks, and each block has a corresponding page label. Through the preset Document python plugin, each block in the text encoding is read in turn. According to the different page tags, the page tag type of each block is determined. If the page tag corresponding to the block is a table tag, it is determined that the page tag type of the block is a table type; if the page tag corresponding to the block is a paragraph tag, the page tag type of the block is determined to be a paragraph type.

In the embodiment of the present application, if the page tag type of any block is detected as the table type, the attribute value of the start_table flag bit of the text position to which the block belongs is set to the logical truth value true, to mark the text position as the starting position of a currently detected table. After that, the method returns to step S1041 to find the next block in the text encoding from the current text position, and executes the subsequent steps S1042 to S1044.

After the attribute value of the start_table flag bit at the above text position is set to the logical truth value, if a page tag is detected in any subsequent block and its page tag type is a non-table type (for example, a paragraph type), the attribute value of the end_table flag bit of the text position to which that block belongs is set to the logical truth value, to mark the text position as the end position of the currently detected table.

According to the flag bit information corresponding to each text position in the text to be analyzed, the region between the first text position, where the start_table flag bit is true, and the second text position, where the end_table flag bit is first set to true after the first text position, is determined as the text area corresponding to one table.

The embodiment of the present application is applicable to a scenario in which a table displayed across pages exists in the text to be analyzed. For example, in the text to be analyzed in pdf format, if a table is tall, it is displayed across pages; that is, the table is divided into at least two sub-tables, and each sub-table is displayed on a separate page of the text to be analyzed. Therefore, after the text format of the text to be analyzed is converted to doc format, in order to restore the same table from different blocks in the text encoding, when the page tag types of two consecutive blocks are both detected as the table type, it can be determined that the text positions to which the two blocks belong are both within the location area of the table. If the page tag type of the next block is detected as a paragraph type, the table has ended. Therefore, based on the text position to which that block belongs and the text positions to which the preceding blocks belong, a complete table in the text to be analyzed can be located and extracted.

In the embodiment of the present application, by detecting the page tag type of each coding block in the text to be analyzed, the attribute value of the built-in flag bit corresponding to each text position can be determined, so that the starting and ending positions of the tables in the text to be analyzed can be accurately identified based on the attribute values. This realizes automatic identification of tables displayed across pages, so that the extracted financial data can be grouped under the same table, thereby improving the accuracy of table data extraction.
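Steps S1041 to S1044 can be sketched as a single pass over the page tag types. This is a minimal illustration, assuming each coding block has already been reduced to one of the strings "table" or "paragraph", or to None for a block without a page tag; the function name locate_tables is hypothetical.

```python
def locate_tables(block_types):
    """Return (start_index, end_index) pairs, one per table.

    end_index is the first block after the start whose type is neither
    "table" nor None, mirroring the "non-table and non-null" rule of S1044.
    """
    spans = []
    start_table = False  # named after the start_table flag bit in the source
    start = None
    for i, kind in enumerate(block_types):
        if kind == "table":
            if not start_table:  # first block of a new table
                start_table, start = True, i
        elif kind is not None and start_table:
            spans.append((start, i))  # this block marks the end position
            start_table = False
    if start_table:  # the table runs to the end of the document
        spans.append((start, len(block_types)))
    return spans


types = ["paragraph", "table", "table", None, "paragraph", "table", "paragraph"]
spans = locate_tables(types)
```

Consecutive table blocks, such as the two halves of a table split across pages, fall into one span, and an empty block does not end the table.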

S105: Extract each field value and table description information associated with the table.

After each table contained in the text to be analyzed is located, the cell content of each block corresponding to the table is read through the Document python plug-in and stored into the preset table_data array. The data in the table_data array are the field values associated with the table.
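As a rough sketch of this step, the cell contents of all blocks belonging to one table can be flattened into a single table_data list. The source does not detail the Document plug-in, so plain nested lists (block, row, cell) stand in for it here; that representation and the function name are assumptions.

```python
def extract_field_values(table_blocks):
    """Flatten the cells of every block that belongs to one table."""
    table_data = []  # name taken from the table_data array in the source
    for block in table_blocks:  # one block per sub-table
        for row in block:
            for cell in row:
                table_data.append(cell.strip())  # drop surrounding whitespace
    return table_data


# Two blocks, as when one table was split across two pages.
blocks = [
    [["Item", "Amount"], ["Rent", " 1200 "]],
    [["Salaries", "5400"]],
]
table_data = extract_field_values(blocks)
```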

In the embodiment of the present application, the form description information is used to describe the main content of the form data, including but not limited to the title, name, or descriptive information of the form. For example, if the table data is the financial expenditure data of Enterprise A in March, the table description information may be "March fiscal expenditure data".

Exemplarily, according to the location area to which each table belongs, multiple character values before the location area or after the location area may be extracted to determine it as the table description information of the table.

As an embodiment of the present application, FIG. 3 shows a specific implementation process of the method S105 for obtaining financial data provided in the embodiment of the present application, which is detailed as follows:

S10501: Create a FIFO queue.

S10502: traverse each coding block in the text encoding in sequence, and obtain the page tag type corresponding to the currently traversed coding block.

S10503: If the type of the page tag corresponding to the encoding block is a paragraph type, sequentially store each character contained in the encoding block into the FIFO queue, and read the real-time queue length of the FIFO queue.

S10504: if the real-time queue length of the FIFO queue is greater than a preset threshold, remove a plurality of the characters existing at the bottom of the FIFO queue, and return to execute the sequential traversal of each encoding block in the text encoding and obtain The operation of the page label type corresponding to the currently traversed coding block.

S10505: If the page tag type corresponding to the coding block is a table type, stitch each character in the FIFO queue, and output the stitching result as table description information associated with the table.

For each located table, in order to extract the table description information of the table, a first-in, first-out (First Input First Output, FIFO) queue with a preset length is first created. According to the text position to which the table belongs, the blocks before the text position are determined, and the page tag type of each block is read in turn. If the page tag of any block is non-empty and its page tag type is the paragraph type, the cell content of the block is pushed into the FIFO queue.

In the embodiment of the present application, before the cell content of the block is pushed into the FIFO queue, the real-time queue length of the FIFO queue is obtained according to the number of characters contained in the FIFO queue. If the real-time queue length is greater than the preset queue length value, the FIFO queue is full. Therefore, the data that entered the FIFO queue first is removed, so that the cell content of the currently read block can be pushed into the FIFO queue. Thereafter, the method returns to S10502, and when the page tag type of the read block is the table type, pushing of block cell content into the FIFO queue stops.

In the embodiment of the present application, after the cell content of the block is stopped from being pushed into the FIFO queue, each character contained in the FIFO queue is extracted, and a character string obtained by splicing each character is output as table description information associated with a table.

In the embodiment of the present application, when a block whose page tag type is the table type is detected, pushing of block cell content into the FIFO queue stops, which ensures that the characters stored in the FIFO queue are the text information closest to the location area of the table. Generally, the text information closest to the location area of the table best reflects the main content of the table data (for example, the header information at the top of the table). Therefore, by splicing the characters in the FIFO queue and outputting the splicing result as the table description information associated with the table, the table description information is located automatically and the accuracy of extracting it is improved.
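Steps S10501 to S10505 can be sketched with a bounded queue of characters. The sketch assumes blocks are given as (page_tag_type, text) pairs and uses Python's collections.deque with maxlen, which discards the oldest entries automatically once the queue is full; the function name and the max_chars parameter are hypothetical.

```python
from collections import deque


def description_before_table(blocks, max_chars=20):
    """Return the last max_chars paragraph characters preceding the first table."""
    fifo = deque(maxlen=max_chars)  # oldest characters drop out when full
    for kind, text in blocks:
        if kind == "table":
            return "".join(fifo)  # splice the queued characters
        if kind == "paragraph":
            fifo.extend(text)  # push characters one by one
    return None  # no table found


blocks = [
    ("paragraph", "Company overview. "),
    ("paragraph", "March expenditure table"),
    ("table", ""),
]
description = description_before_table(blocks, max_chars=23)
```

Using maxlen replaces the explicit length check and removal in S10503 and S10504; an explicit loop that pops from the left would behave the same way.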

As an embodiment of the present application, FIG. 4 shows another specific implementation process of the method S105 for obtaining financial data provided in the embodiment of the present application, which is detailed as follows:

S10506: If the page tag type corresponding to the coding block is a table type, obtain a regular expression associated with a preset keyword.

S10507: Perform detection processing on each character string in the FIFO queue based on the regular expression.

S10508: If the character string matching the regular expression exists in the FIFO queue, output the character string as form description information associated with the form.

S10509: If the character string matching the regular expression does not exist in the FIFO queue, calculate a tag distance value between each of the character strings in the FIFO queue and the table label in the coding block to which the character string belongs.

S10510: Output one of the character strings with the smallest tag distance value as table description information associated with the table.

In the embodiment of the present application, extracting the table description information associated with the table based on the text information before the table specifically includes: when a block whose page tag type is the table type is reached, obtaining a regular expression associated with a preset keyword. The preset keyword is a character that is highly relevant to descriptive information of the table, such as the table title. For example, common table titles usually take the form "XXX table" (that is, a title ending with the character "表"), so the regular expression corresponding to such table titles can be "[\s\S]*\表$". Based on the obtained regular expression, each character string stored in the FIFO queue is then checked.

If a character string satisfying the above regular expression is detected in the FIFO queue, the character string is extracted and output as the table description information associated with the table.

If no character string that satisfies the above regular expression is detected in the FIFO queue, there is no descriptive information resembling a table title before the text position to which the table belongs. In this case, every N adjacent characters in the FIFO queue are treated as one character string (N is a preset value and an integer greater than 1). For each character string, the tag distance value of the block is read according to the style tag of the block to which its last character belongs. The tag distance value indicates the distance between the text position of the character and the bottom of the current page. After the tag distance value of each character string in the FIFO queue is obtained in this way, the character string with the smallest tag distance value is selected and output as the table description information associated with the table.

In the embodiment of the present application, since the character string with the smallest tag distance value is closest to the bottom of the page, and the block to which the character string belongs is located before the table, it can be determined that the text position to which the character string belongs is also the closest to the starting position of the table. Generally, the text information closest to the starting position of the table describes the subject content of the table data most clearly. Therefore, outputting this character string as the table description information associated with the table also improves the accuracy of the table description information to a certain extent.
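The selection logic of S10506 to S10510 can be sketched as below. The candidates are assumed to be (string, tag_distance) pairs already built from the FIFO queue, with the distance values supplied by the caller; how they are read from style tags is not shown. Returning the first matching string when several match is also an assumption, since the source does not say which one wins.

```python
import re

# Title pattern for strings ending in "表" (table); written without the
# extra backslash, which does not change what the pattern matches.
TITLE_PATTERN = re.compile(r"[\s\S]*表$")


def pick_description(candidates, pattern=TITLE_PATTERN):
    """candidates: list of (string, tag_distance) pairs in queue order."""
    for text, _ in candidates:
        if pattern.match(text):  # a title-like string wins outright
            return text
    if not candidates:
        return None
    # Fallback: the string with the smallest tag distance value.
    return min(candidates, key=lambda c: c[1])[0]


with_title = [("单位：元", 120.0), ("合并资产负债表", 300.0)]
without_title = [("单位：元", 120.0), ("2017年度", 80.0)]
```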

S106: Output the form description information and each of the field values to a pre-created text document, so that the business system obtains the financial data associated with the text to be analyzed after the text document is identified.

In the embodiment of the present application, after each field value in the form and the form description information associated with the form are obtained, the form description information and each field value are sequentially output to a pre-created text document in the order in which the characters were obtained. The text format of the text document is the txt format.

Preferably, in the text document, a preset separator is inserted between any two adjacent field values.

Preferably, the form description information is output at the top position of the above text document, and a line break is inserted between the form description information and the field value.
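The output layout described above (description on the first line, a line break, then the field values joined by a separator) can be sketched as follows. The separator "|" and both function names are assumptions; the source only says that the separator is preset.

```python
def render_text_document(description, field_values, separator="|"):
    """Description on the first line, then the separated field values."""
    return description + "\n" + separator.join(field_values) + "\n"


def write_text_document(path, description, field_values, separator="|"):
    """Write the rendered content to a txt file at the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_text_document(description, field_values, separator))


content = render_text_document(
    "March fiscal expenditure data", ["rent", "1200", "salaries", "5400"]
)
```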

In the embodiment of the present application, the text document is sent to each service system connected in advance. Because the business systems of each version type have better compatibility with text files in txt format, the business system can identify and process the text files to extract the financial data associated with the text to be analyzed.

The embodiment of the present application realizes rapid analysis of corporate financial data and avoids the need to read corporate financial data from public documents with complex styles, thereby reducing the difficulty of obtaining corporate financial data. Because the business system can automatically identify, through the above text document, the financial data contained in various types of public documents, multi-dimensional acquisition of financial data is also achieved compared with the prior art.

As another embodiment of the present application, as shown in FIG. 5, after the above S106, the method further includes:

S107: Load a report template, and import each of the financial data into a corresponding table body according to a preset header in the report template.

S108: Generate and display financial data analysis reports based on the import results.

In the embodiment of the present application, a pre-generated report template is loaded. The report template includes various headers, and each header corresponds to a table body; each header describes a field attribute of a field value in the form, and each table body records field values. For each header set in the report template, the field values corresponding to the field attribute described by the header are filtered from the data of the text document generated in S106 and imported into the table body corresponding to that header. According to the field values of each field attribute imported into the report template, statistical values are calculated through a preset calculation formula, the statistical results are imported into the footer of the report template, and the financial data analysis report is then output and displayed.

In the embodiment of the present application, the field values in the text document are imported into a pre-generated report template, so that the final financial data analysis report can list the field values in the data analysis process in detail, which is convenient for users to check the analysis of financial data Whether the process is wrong, thereby further improving the reliability and accuracy of financial data analysis reports.
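A minimal sketch of the template import in S107 and S108, assuming the data parsed from the text document is available as (field_attribute, value) pairs and that the preset calculation formula is a simple sum. Both assumptions, and the name fill_report, are illustrative only.

```python
def fill_report(headers, records, formula=sum):
    """Group values under matching headers and compute one footer value each."""
    body = {
        header: [value for attribute, value in records if attribute == header]
        for header in headers
    }
    footer = {header: formula(values) for header, values in body.items()}
    return {"body": body, "footer": footer}


records = [("rent", 1200.0), ("salaries", 5400.0), ("rent", 1300.0), ("travel", 90.0)]
report = fill_report(["rent", "salaries"], records)
```

Records whose attribute has no header in the template, such as "travel" here, are simply left out of the report.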

It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

Corresponding to the method for acquiring financial data described in the foregoing embodiment, FIG. 6 shows a structural block diagram of a device for acquiring financial data provided by an embodiment of the present application. For convenience of explanation, only the parts relevant to the embodiment of the present application are shown.

Referring to FIG. 6, the device includes:

The first obtaining unit 61 is configured to obtain a pre-published text to be analyzed, and an initial format of the text to be analyzed is a portable document pdf format.

The conversion unit 62 is configured to convert a text format of the text to be analyzed from the pdf format to a document doc format by using a preset text conversion tool.

The second obtaining unit 63 is configured to obtain a text encoding corresponding to the text to be analyzed based on the text to be analyzed in the doc format, where the text encoding includes multiple types of page tags.

The searching unit 64 is configured to search for a form tag in the page tag, and locate a form existing in the text to be analyzed according to a text position to which the form tag belongs.

The extraction unit 65 is configured to extract various field values and table description information associated with the table.

The output unit 66 is configured to output the form description information and each of the field values to a pre-created text document, so that the business system obtains the financial data associated with the text to be analyzed after the text document is identified.

Optionally, the search unit 64 includes:

The traversing subunit is used to sequentially traverse each coding block in the text coding.

The judging subunit is configured to judge, for each of the coding blocks, whether a page tag type corresponding to the coding block is a table type.

A marking subunit, configured to set the attribute value of the built-in flag bit to a logical truth value if the page tag type corresponding to the coding block is a table type, so as to mark the text position corresponding to the coding block as the start position of the table.

A return subunit, for returning to perform the operation of successively traversing each encoding block in the text encoding until the page tag type corresponding to the extracted encoding block is a non-table type and a non-null value, the encoding block The corresponding text position is marked as the end position of the table.

Optionally, the extraction unit 65 includes:

A creating subunit, configured to create a FIFO queue.

An acquisition subunit is configured to sequentially traverse each coding block in the text encoding, and obtain a page tag type corresponding to the currently traversed coding block.

A storage subunit, configured to sequentially store each character contained in the encoding block into the FIFO queue if the page tag type corresponding to the encoding block is a paragraph type, and read the real-time queue length of the FIFO queue.

A removing subunit, configured to remove a plurality of the characters existing at the bottom of the FIFO queue if the real-time queue length of the FIFO queue is greater than a preset threshold, and return to the operation of sequentially traversing each encoding block in the text encoding and obtaining the page tag type corresponding to the currently traversed encoding block.

The splicing subunit is configured to splice each character in the FIFO queue if the page tag type corresponding to the coding block is a table type, and output the splicing result as table description information associated with the table.

Optionally, the splicing subunit is specifically configured to: if the page tag type corresponding to the coding block is a table type, obtain a regular expression associated with a preset keyword;

Performing detection processing on each character string in the FIFO queue based on the regular expression;

If the character string matching the regular expression exists in the FIFO queue, outputting the character string as table description information associated with the table;

If the string matching the regular expression does not exist in the FIFO queue, calculating a tag distance value between each of the strings in the FIFO queue and the table label in the coding block to which the string belongs;

Outputting one of the character strings with the smallest tag distance value as table description information associated with the table.
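
The two-stage selection above (keyword match first, smallest tag distance as a fallback) might look like the sketch below. The function `pick_description`, the keyword pattern, and the candidate model are hypothetical. In particular, the tag distance is assumed here to be the absolute difference between a string's block index and the table tag's block index, and the first match in queue order is returned when several strings match.

```python
import re
from typing import List, Optional, Tuple


def pick_description(candidates: List[Tuple[str, int]],
                     table_index: int,
                     keyword_pattern: str = r"\b(?:table|statement)\b") -> Optional[str]:
    """Choose the table description from (string, block_index) candidates."""
    pattern = re.compile(keyword_pattern, re.IGNORECASE)
    # Stage 1: a string matching the keyword regular expression wins.
    for text, _ in candidates:
        if pattern.search(text):
            return text
    if not candidates:
        return None
    # Stage 2: otherwise take the string with the smallest tag distance
    # (assumed to be the index difference to the table tag).
    text, _ = min(candidates, key=lambda c: abs(table_index - c[1]))
    return text
```

For example, with candidates `[("Notes", 2), ("Unit: RMB", 7)]` and a table at index 8, no string matches the pattern, so the string at index 7 is chosen because it is nearest to the table.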

Optionally, the apparatus for acquiring financial data further includes: a loading unit, configured to load a report template and import each item of the financial data into a corresponding table body according to a preset header in the report template.

A generating unit is used to generate and display financial data analysis reports based on the import results.
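
As a rough illustration of the loading and generating units, the sketch below treats a report template as nothing more than an ordered list of preset headers and renders the imported data as aligned plain text. The function `fill_report`, the record layout, and the plain-text rendering are assumptions; the template format actually used is not specified here.

```python
from typing import Dict, List


def fill_report(headers: List[str], records: List[Dict[str, str]]) -> str:
    """Place each record's values under the matching preset header."""
    rows = [headers]
    for record in records:
        # A value missing from a record leaves its cell empty.
        rows.append([record.get(h, "") for h in headers])
    # Column width = widest cell in that column, header included.
    widths = [max(len(row[i]) for row in rows) for i in range(len(headers))]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    ]
    return "\n".join(lines)
```

Calling `print(fill_report(["Item", "2017", "2018"], records))` with a list of dictionaries keyed by those headers would display one aligned row per record beneath the header row.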

FIG. 7 is a schematic diagram of a terminal device according to an embodiment of the present application. As shown in FIG. 7, the terminal device 7 of this embodiment includes a processor 70 and a memory 71. The memory 71 stores computer-readable instructions 72 that can be run on the processor 70, such as a program for acquiring financial data. When the processor 70 executes the computer-readable instructions 72, the steps in the foregoing embodiments of the method for acquiring financial data are implemented, for example, steps 101 to 106. Alternatively, when the processor 70 executes the computer-readable instructions 72, the functions of the modules / units in the foregoing device embodiments are implemented, for example, the functions of the units 61 to 66 shown in FIG. 6.

Exemplarily, the computer-readable instructions 72 may be divided into one or more modules / units, which are stored in the memory 71 and executed by the processor 70 to implement the present application. The one or more modules / units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 72 in the terminal device 7.
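
One way to picture instructions divided into units that each perform a specific function is a list of named callables run in sequence. Everything in the sketch below (the `run_units` helper, the unit bodies, and the `(tag, text)` block model) is a hypothetical placeholder, not the division used by the embodiments.

```python
from typing import Any, Callable, List, Tuple

# (name, function) pairs; each function stands in for one module / unit.
Unit = Tuple[str, Callable[[Any], Any]]


def run_units(units: List[Unit], payload: Any) -> Tuple[Any, List[str]]:
    """Run each unit in order, feeding each result to the next unit."""
    executed: List[str] = []
    for name, func in units:
        payload = func(payload)
        executed.append(name)
    return payload, executed


# Placeholder units: keep table blocks, pull out their text, join for output.
units: List[Unit] = [
    ("searching unit", lambda blocks: [b for b in blocks if b[0] == "table"]),
    ("extraction unit", lambda tables: [text for _, text in tables]),
    ("output unit", lambda values: "\n".join(values)),
]
result, executed = run_units(
    units, [("paragraph", "Unit: RMB"), ("table", "Revenue 120")]
)
```

Keeping each unit as an independent callable mirrors the point that units may be integrated or kept separate without changing what each one does.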

The terminal device 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, a processor 70 and a memory 71. Those skilled in the art can understand that FIG. 7 is only an example of the terminal device 7 and does not constitute a limitation on the terminal device 7. It may include more or fewer components than shown in the figure, combine certain components, or use different components. For example, the terminal device may further include an input / output device, a network access device, a bus, and the like.

The processor 70 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or a memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 7. Further, the memory 71 may include both an internal storage unit of the terminal device 7 and an external storage device. The memory 71 is configured to store the computer-readable instructions and other programs and data required by the terminal device. The memory 71 may also be used to temporarily store data that has been output or is to be output.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The foregoing storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store program code.

The above embodiments are only used to describe the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some of the technical features; these modifications or replacements do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

A method for obtaining financial data, comprising:

Obtaining a pre-published text to be analyzed, wherein an initial format of the text to be analyzed is the Portable Document Format (PDF);

Converting the text to be analyzed from the PDF format to the DOC document format by means of a preset text conversion tool;

Obtaining a text encoding corresponding to the text to be analyzed based on the text to be analyzed in the DOC format, wherein the text encoding includes multiple types of page tags;

Finding a table tag among the page tags, and locating a table in the text to be analyzed according to the text position to which the table tag belongs;

Extracting each field value and the table description information associated with the table;

Outputting the table description information and each of the field values to a pre-created text document, so that a business system obtains the financial data associated with the text to be analyzed after performing recognition processing on the text document.
The method for obtaining financial data according to claim 1, wherein the finding a table tag among the page tags, and locating a table in the text to be analyzed according to the text position to which the table tag belongs, comprises:

Sequentially traversing each coding block in the text encoding;

For each of the coding blocks, determining whether the page tag type corresponding to the coding block is a table type;

If the page tag type corresponding to the coding block is a table type, setting the attribute value of a built-in flag bit to a logical true value, so as to mark the text position corresponding to the coding block as the start position of the table;

Returning to the operation of sequentially traversing each coding block in the text encoding until the page tag type corresponding to the extracted coding block is a non-table type and a non-null value, and marking the text position corresponding to that coding block as the end position of the table.
The method for obtaining financial data according to claim 1, wherein the extracting each field value and the table description information associated with the table comprises:

Creating a FIFO queue;

Sequentially traversing each coding block in the text encoding, and obtaining the page tag type corresponding to the currently traversed coding block;

If the page tag type corresponding to the coding block is a paragraph type, storing each character contained in the coding block into the FIFO queue in order, and reading the real-time queue length of the FIFO queue;

If the real-time queue length of the FIFO queue is greater than a preset threshold, removing a plurality of the characters at the bottom of the FIFO queue, and returning to the operation of sequentially traversing each coding block in the text encoding and obtaining the page tag type corresponding to the currently traversed coding block;

If the page tag type corresponding to the coding block is a table type, splicing the characters in the FIFO queue, and outputting the splicing result as the table description information associated with the table.
The method for obtaining financial data according to claim 3, wherein the splicing the characters in the FIFO queue if the page tag type corresponding to the coding block is a table type, and outputting the splicing result as the table description information associated with the table, comprises:

If the page tag type corresponding to the coding block is a table type, obtaining a regular expression associated with a preset keyword;

Performing detection processing on each character string in the FIFO queue based on the regular expression;

If the character string matching the regular expression exists in the FIFO queue, outputting the character string as table description information associated with the table;

If the string matching the regular expression does not exist in the FIFO queue, calculating a tag distance value between each of the strings in the FIFO queue and the table label in the coding block to which the string belongs;

Outputting one of the character strings with the smallest tag distance value as table description information associated with the table.
The method for obtaining financial data according to claim 1, wherein, after the outputting the table description information and each of the field values to a pre-created text document, so that a business system obtains the financial data associated with the text to be analyzed after performing recognition processing on the text document, the method further comprises:

Loading a report template, and importing each item of the financial data into a corresponding table body according to a preset header in the report template;

Generating and displaying a financial data analysis report based on the import results.
An apparatus for acquiring financial data, comprising:

A first obtaining unit, configured to obtain a pre-published text to be analyzed, wherein an initial format of the text to be analyzed is the Portable Document Format (PDF);

A conversion unit, configured to convert the text to be analyzed from the PDF format to the DOC document format by means of a preset text conversion tool;

A second obtaining unit, configured to obtain a text encoding corresponding to the text to be analyzed based on the text to be analyzed in the DOC format, wherein the text encoding includes multiple types of page tags;

A searching unit, configured to find a table tag among the page tags, and locate a table in the text to be analyzed according to the text position to which the table tag belongs;

An extraction unit, configured to extract each field value and the table description information associated with the table;

An output unit, configured to output the table description information and each of the field values to a pre-created text document, so that a business system obtains the financial data associated with the text to be analyzed after performing recognition processing on the text document.
The apparatus for acquiring financial data according to claim 6, wherein the search unit comprises:

A traversal subunit, for sequentially traversing each encoding block in the text encoding;

A judging subunit, configured to judge, for each of the coding blocks, whether a page tag type corresponding to the coding block is a table type;

A marking subunit, configured to set the attribute value of a built-in flag bit to a logical true value if the page tag type corresponding to the coding block is a table type, so as to mark the text position corresponding to the coding block as the start position of the table;

A return subunit, configured to return to the operation of successively traversing each encoding block in the text encoding until the page tag type corresponding to the extracted encoding block is a non-table type and a non-null value, and to mark the text position corresponding to that encoding block as the end position of the table.
The apparatus for acquiring financial data according to claim 6, wherein the extraction unit comprises:

A creating subunit, configured to create a FIFO queue;

An acquisition subunit, configured to sequentially traverse each encoding block in the text encoding, and obtain the page tag type corresponding to the currently traversed encoding block;

A storage subunit, configured to sequentially store each character contained in the encoding block into the FIFO queue if the page tag type corresponding to the encoding block is a paragraph type, and to read the real-time queue length of the FIFO queue;

A removing subunit, configured to remove a plurality of the characters at the bottom of the FIFO queue if the real-time queue length of the FIFO queue is greater than a preset threshold, and to return to the operation of sequentially traversing each encoding block in the text encoding and obtaining the page tag type corresponding to the currently traversed encoding block;

A splicing subunit, configured to splice each character in the FIFO queue if the page tag type corresponding to the coding block is a table type, and output the splicing result as the table description information associated with the table.
The apparatus for acquiring financial data according to claim 8, wherein the splicing subunit is specifically configured to:

If the page tag type corresponding to the coding block is a table type, obtaining a regular expression associated with a preset keyword;

Performing detection processing on each character string in the FIFO queue based on the regular expression;

If the character string matching the regular expression exists in the FIFO queue, outputting the character string as table description information associated with the table;

If the string matching the regular expression does not exist in the FIFO queue, calculating a tag distance value between each of the strings in the FIFO queue and the table label in the coding block to which the string belongs;

Outputting one of the character strings with the smallest tag distance value as table description information associated with the table.
The apparatus for acquiring financial data according to claim 6, further comprising:

A loading unit, configured to load a report template, and import each of the financial data into a corresponding table body according to a preset header in the report template;

A generating unit, configured to generate and display a financial data analysis report based on the import results.
A terminal device, comprising a memory and a processor, wherein the memory stores computer-readable instructions that can be run on the processor, and the processor implements the following steps when executing the computer-readable instructions:

Obtaining a pre-published text to be analyzed, wherein an initial format of the text to be analyzed is the Portable Document Format (PDF);

Converting the text to be analyzed from the PDF format to the DOC document format by means of a preset text conversion tool;

Obtaining a text encoding corresponding to the text to be analyzed based on the text to be analyzed in the DOC format, wherein the text encoding includes multiple types of page tags;

Finding a table tag among the page tags, and locating a table in the text to be analyzed according to the text position to which the table tag belongs;

Extracting each field value and the table description information associated with the table;

Outputting the table description information and each of the field values to a pre-created text document, so that a business system obtains the financial data associated with the text to be analyzed after performing recognition processing on the text document.
The terminal device according to claim 11, wherein the finding a table tag among the page tags, and locating a table in the text to be analyzed according to the text position to which the table tag belongs, comprises:

Sequentially traversing each coding block in the text encoding;

For each of the coding blocks, determining whether the page tag type corresponding to the coding block is a table type;

If the page tag type corresponding to the coding block is a table type, setting the attribute value of a built-in flag bit to a logical true value, so as to mark the text position corresponding to the coding block as the start position of the table;

Returning to the operation of sequentially traversing each coding block in the text encoding until the page tag type corresponding to the extracted coding block is a non-table type and a non-null value, and marking the text position corresponding to that coding block as the end position of the table.
The terminal device according to claim 11, wherein the extracting each field value and the table description information associated with the table comprises:

Creating a FIFO queue;

Sequentially traversing each coding block in the text encoding, and obtaining the page tag type corresponding to the currently traversed coding block;

If the page tag type corresponding to the coding block is a paragraph type, storing each character contained in the coding block into the FIFO queue in order, and reading the real-time queue length of the FIFO queue;

If the real-time queue length of the FIFO queue is greater than a preset threshold, removing a plurality of the characters at the bottom of the FIFO queue, and returning to the operation of sequentially traversing each coding block in the text encoding and obtaining the page tag type corresponding to the currently traversed coding block;

If the page tag type corresponding to the coding block is a table type, splicing the characters in the FIFO queue, and outputting the splicing result as the table description information associated with the table.
The terminal device according to claim 13, wherein the splicing the characters in the FIFO queue if the page tag type corresponding to the coding block is a table type, and outputting the splicing result as the table description information associated with the table, comprises:

If the page tag type corresponding to the coding block is a table type, obtaining a regular expression associated with a preset keyword;

Performing detection processing on each character string in the FIFO queue based on the regular expression;

If the character string matching the regular expression exists in the FIFO queue, outputting the character string as table description information associated with the table;

If the string matching the regular expression does not exist in the FIFO queue, calculating a tag distance value between each of the strings in the FIFO queue and the table label in the coding block to which the string belongs;

Outputting one of the character strings with the smallest tag distance value as table description information associated with the table.
The terminal device according to claim 11, wherein the processor further implements the following steps when executing the computer-readable instructions:

Loading a report template, and importing each item of the financial data into a corresponding table body according to a preset header in the report template;

Generating and displaying a financial data analysis report based on the import results.
A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions implement the following steps when executed by at least one processor:

Obtaining a pre-published text to be analyzed, wherein an initial format of the text to be analyzed is the Portable Document Format (PDF);

Converting the text to be analyzed from the PDF format to the DOC document format by means of a preset text conversion tool;

Obtaining a text encoding corresponding to the text to be analyzed based on the text to be analyzed in the DOC format, wherein the text encoding includes multiple types of page tags;

Finding a table tag among the page tags, and locating a table in the text to be analyzed according to the text position to which the table tag belongs;

Extracting each field value and the table description information associated with the table;

Outputting the table description information and each of the field values to a pre-created text document, so that a business system obtains the financial data associated with the text to be analyzed after performing recognition processing on the text document.
The computer-readable storage medium according to claim 16, wherein the finding a table tag among the page tags, and locating a table in the text to be analyzed according to the text position to which the table tag belongs, comprises:

Sequentially traversing each coding block in the text encoding;

For each of the coding blocks, determining whether the page tag type corresponding to the coding block is a table type;

If the page tag type corresponding to the coding block is a table type, setting the attribute value of a built-in flag bit to a logical true value, so as to mark the text position corresponding to the coding block as the start position of the table;

Returning to the operation of sequentially traversing each coding block in the text encoding until the page tag type corresponding to the extracted coding block is a non-table type and a non-null value, and marking the text position corresponding to that coding block as the end position of the table.
The computer-readable storage medium according to claim 16, wherein the extracting each field value and the table description information associated with the table comprises:

Creating a FIFO queue;

Sequentially traversing each coding block in the text encoding, and obtaining the page tag type corresponding to the currently traversed coding block;

If the page tag type corresponding to the coding block is a paragraph type, storing each character contained in the coding block into the FIFO queue in order, and reading the real-time queue length of the FIFO queue;

If the real-time queue length of the FIFO queue is greater than a preset threshold, removing a plurality of the characters at the bottom of the FIFO queue, and returning to the operation of sequentially traversing each coding block in the text encoding and obtaining the page tag type corresponding to the currently traversed coding block;

If the page tag type corresponding to the coding block is a table type, splicing the characters in the FIFO queue, and outputting the splicing result as the table description information associated with the table.
The computer-readable storage medium according to claim 18, wherein the splicing the characters in the FIFO queue if the page tag type corresponding to the coding block is a table type, and outputting the splicing result as the table description information associated with the table, comprises:

If the page tag type corresponding to the coding block is a table type, obtaining a regular expression associated with a preset keyword;

Performing detection processing on each character string in the FIFO queue based on the regular expression;

If the character string matching the regular expression exists in the FIFO queue, outputting the character string as table description information associated with the table;

If the string matching the regular expression does not exist in the FIFO queue, calculating a tag distance value between each of the strings in the FIFO queue and the table label in the coding block to which the string belongs;

Outputting one of the character strings with the smallest tag distance value as table description information associated with the table.
The computer-readable storage medium according to claim 16, wherein when the computer-readable instructions are executed by at least one processor, the following steps are further implemented:

Loading a report template, and importing each item of the financial data into a corresponding table body according to a preset header in the report template;

Generating and displaying a financial data analysis report based on the import results.