CN109062874B - Financial data acquisition method, terminal device and medium - Google Patents

Publication number
CN109062874B
Authority
CN
China
Prior art keywords
text
coding block
analyzed
tag
fifo queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810600697.4A
Other languages
Chinese (zh)
Other versions
CN109062874A (en)
Inventor
苏晓明
汪伟
王晓伟
徐冰
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810600697.4A (CN109062874B)
Priority to PCT/CN2018/105532 (WO2019237540A1)
Publication of CN109062874A
Application granted
Publication of CN109062874B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00 Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/06 Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/12 Accounting
    • G06Q40/125 Finance or payroll

Abstract

The invention belongs to the technical field of data processing and provides a financial data acquisition method, a terminal device and a medium. The method comprises: acquiring a pre-published text to be analyzed; converting the text format of the text to be analyzed from the pdf format into the document doc format using a preset text conversion tool; acquiring, based on the text to be analyzed in the doc format, a text code corresponding to the text to be analyzed, where the text code comprises multiple types of page tags; searching for table tags among the page tags, and locating each table in the text to be analyzed according to the text position of its table tag; extracting the field values and the table description information associated with the table; and outputting the table description information and the field values to a pre-created text document, so that a business system can recognize the text document and acquire the financial data associated with the text to be analyzed. The method reduces the difficulty of acquiring enterprise financial data and enables multi-dimensional acquisition of financial data.

Description

Financial data acquisition method, terminal device and medium
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a financial data acquisition method, a terminal device and a computer-readable storage medium.
Background
Documents such as quarterly reports, annual reports and prospectuses are all public documents of enterprises. These public documents contain a great deal of valuable financial data, for example an enterprise's accounts receivable, accounts payable, balance status, amount of profit or loss, and overall debt status. After further processing and analysis, such financial data can offer great reference value. For example, in various applications it may be used to independently analyze the business condition of an enterprise, to determine the state of the industry chain associated with the enterprise, and so on.
However, because the styles of public documents such as quarterly reports, annual reports and prospectuses are complex, no technique for automatically extracting and analyzing the financial data in such documents has yet been disclosed, and multi-dimensional acquisition of financial data therefore cannot be realized.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method for acquiring financial data, a terminal device, and a computer-readable storage medium, so as to solve the problem that multi-dimensional acquisition of financial data cannot be realized in the prior art.
A first aspect of an embodiment of the present invention provides a method for acquiring financial data, including:
acquiring a pre-published text to be analyzed, wherein the initial format of the text to be analyzed is a portable document pdf format;
converting the text format of the text to be analyzed from the pdf format into a document doc format through a preset text conversion tool;
acquiring a text code corresponding to the text to be analyzed based on the text to be analyzed in the doc format; wherein the text code comprises a plurality of types of page tags;
searching for table tags among the page tags, and locating each table in the text to be analyzed according to the text position of its table tag;
extracting the field values and the table description information associated with the table;
and outputting the table description information and the field values to a pre-created text document, so that a business system can recognize the text document and acquire the financial data associated with the text to be analyzed.
A second aspect of the embodiments of the present invention provides a terminal device, including a memory and a processor, where the memory stores a computer program operable on the processor, and the processor implements the following steps when executing the computer program:
acquiring a pre-published text to be analyzed, wherein the initial format of the text to be analyzed is a portable document pdf format;
converting the text format of the text to be analyzed from the pdf format into a document doc format through a preset text conversion tool;
acquiring a text code corresponding to the text to be analyzed based on the text to be analyzed in the doc format; wherein the text code comprises a plurality of types of page tags;
searching for table tags among the page tags, and locating each table in the text to be analyzed according to the text position of its table tag;
extracting the field values and the table description information associated with the table;
and outputting the table description information and the field values to a pre-created text document, so that a business system can recognize the text document and acquire the financial data associated with the text to be analyzed.
A third aspect of embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of:
acquiring a pre-published text to be analyzed, wherein the initial format of the text to be analyzed is a portable document pdf format;
converting the text format of the text to be analyzed from the pdf format into a document doc format through a preset text conversion tool;
acquiring a text code corresponding to the text to be analyzed based on the text to be analyzed in the doc format; wherein the text code comprises a plurality of types of page tags;
searching for table tags among the page tags, and locating each table in the text to be analyzed according to the text position of its table tag;
extracting the field values and the table description information associated with the table;
and outputting the table description information and the field values to a pre-created text document, so that a business system can recognize the text document and acquire the financial data associated with the text to be analyzed.
In the embodiments of the invention, public documents such as prospectuses, annual reports and quarterly reports exist in the pdf format when first obtained. Converting them into the doc format makes it possible to read the text code corresponding to the text to be analyzed, so that the position area of each table can be determined from the table tags in the text code and the tables are located automatically. In such public documents, the data contained in tables is usually the financial data with the highest mining value. Therefore, once each table has been located, the field values and the table description information associated with the table are extracted and output to a pre-created text document, which other business systems can read and parse with good compatibility. This enables rapid analysis of enterprise financial data, avoids having to read the data directly from public documents with complex styles, and reduces the difficulty of acquiring enterprise financial data. Because the business system can automatically recognize, through the text documents, the financial data contained in various public documents, multi-dimensional acquisition of financial data is also achieved compared with the prior art.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of an implementation of the financial data acquisition method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a specific implementation of step S104 of the financial data acquisition method according to an embodiment of the present invention;
Fig. 3 is a flowchart of a specific implementation of step S105 of the financial data acquisition method according to an embodiment of the present invention;
Fig. 4 is a flowchart of another specific implementation of step S105 of the financial data acquisition method according to an embodiment of the present invention;
Fig. 5 is a flowchart of an implementation of the financial data acquisition method according to another embodiment of the present invention;
Fig. 6 is a block diagram of the structure of a financial data acquisition apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Fig. 1 shows an implementation flow of the method for acquiring financial data according to the embodiment of the present invention, where the method flow includes steps S101 to S106. The specific realization principle of each step is as follows:
S101: Acquire a pre-published text to be analyzed, where the initial format of the text to be analyzed is the portable document (pdf) format.
In the embodiment of the invention, the text to be analyzed is a public document released by an enterprise, such as a quarterly report, an annual report or a prospectus. The text to be analyzed is downloaded periodically from the corresponding public website according to preset website information. Because enterprises output their public documents in the Portable Document Format (PDF) when creating them, the text to be analyzed downloaded from the public website is in PDF format.
S102: Convert the text format of the text to be analyzed from the pdf format into the document (doc) format using a preset text conversion tool.
Each text to be analyzed in pdf format is imported into the preset text conversion tool, and after a format conversion instruction issued by the user is detected, the text to be analyzed is output in document (doc) format. The text conversion tool may be, for example, the Foxit converter, a PDF converter, an agile converter, or the like.
S103: Acquire, based on the text to be analyzed in the doc format, a text code corresponding to the text to be analyzed, where the text code comprises multiple types of page tags.
For the text to be analyzed in the doc format, the text code of the text is read. The text code includes various types of page tags, such as table tags and paragraph tags.
S104: Search for table tags among the page tags, and locate each table in the text to be analyzed according to the text position of its table tag.
In the embodiment of the invention, the text code corresponding to the text to be analyzed is traversed, and the page tags appearing in it are detected in sequence using a preset regular expression. Among the detected page tags, each table tag is located based on the tag character element corresponding to table tags.
Once any table tag in the text to be analyzed has been located, the text code adjacent to that table tag is determined to be the text code of one table in the text to be analyzed, so the position of the table in the text to be analyzed can be determined from the text position of the table tag.
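The tag search described above can be illustrated with a minimal sketch. It assumes the text code is an XML-like string in which tables open with a hypothetical `<tbl>` tag and paragraphs with a hypothetical `<p>` tag; the source does not specify the actual tag syntax or the regular expression, so these names are illustrative only.

```python
import re

# Hypothetical tag names: the source does not state the real tag syntax.
PAGE_TAG = re.compile(r"<(tbl|p)>")


def find_page_tags(text_code):
    """Return (tag_name, offset) for every opening page tag, in document order."""
    return [(m.group(1), m.start()) for m in PAGE_TAG.finditer(text_code)]


def find_table_positions(text_code):
    """Return the offsets of the table tags found among the page tags."""
    return [pos for name, pos in find_page_tags(text_code) if name == "tbl"]
```

Each returned offset plays the role of "the text position to which the table tag belongs"; the text code that follows it would then be treated as the table's content.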
As an embodiment of the present invention, Fig. 2 shows a specific implementation flow of step S104 of the financial data acquisition method, detailed as follows:
S1041: Traverse each coding block in the text code in sequence.
S1042: Determine whether the page tag type corresponding to each coding block is the table type.
S1043: If the page tag type corresponding to the coding block is the table type, set the attribute value of a built-in flag bit to a logical true value, so as to mark the text position corresponding to the coding block as the start position of the table.
S1044: Return to the operation of traversing each coding block in the text code in sequence until the page tag type of the fetched coding block is a non-table type and is not null, and mark the text position corresponding to that coding block as the end position of the table.
In the embodiment of the invention, the text code comprises a plurality of coding blocks (blocks), and each block has a corresponding page tag. Each block in the text code is read in turn through a preset Document Python plug-in, and the page tag type of each block is determined from its page tag. If the page tag of the block is a table tag, the page tag type of the block is the table type; if it is a paragraph tag, the page tag type of the block is the paragraph type.
If the page tag type of any block is detected to be the table type, the attribute value of the start_table flag bit of the text position to which the block belongs is set to the logical true value (true), marking that text position as the start position of the currently detected table. The process then returns to step S1041 to fetch the next block in the text code after the current text position, and steps S1042 to S1044 are performed again.
After the start_table flag bit has been set to true, if any subsequent block is detected to have a corresponding page tag whose type is a non-table type (for example, the paragraph type), the end_table flag bit of the text position to which that block belongs is set to the logical true value, marking that text position as the end position of the currently detected table.
Based on the flag bit information of each text position in the text to be analyzed, the span from a first text position, whose start_table flag bit is true, to a second text position, the first position after it whose end_table flag bit is true, is determined to be the text area corresponding to one table.
The embodiment of the invention is suitable for scenarios in which the text to be analyzed contains tables displayed across pages. For example, in a pdf-formatted text to be analyzed, a tall table is displayed across pages: the table is divided into at least two sub-tables, and each sub-table is displayed on a different page. Therefore, after the text is converted into the doc format, in order to restore a single table from several blocks of the text code, when two consecutive blocks are both detected to be of the table type, the text positions of both blocks are determined to belong to the position area of the same table. If the page tag type of the next block is detected to be the paragraph type, the table ends there, so the complete table can be located and extracted from the text positions of all the table-type blocks that precede that block.
In the embodiment of the invention, detecting the page tag type of each coding block in the text to be analyzed determines the attribute values of the built-in flag bits at each text position, so the start and end positions of each table can be identified accurately from these values. Tables displayed across pages are recognized automatically, the financial data extracted from them are grouped under the same table, and the accuracy of table data extraction is improved.
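Steps S1041 to S1044 can be sketched as follows. This is an illustrative simplification rather than the patented implementation: the `Block` class and the `locate_tables` function are hypothetical names, a block with no page tag is modelled as an empty `tag_type`, and only the first block of a run of table blocks receives the start_table flag.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag_type: str              # "table", "paragraph", or "" when the block has no page tag
    content: str
    start_table: bool = False  # marks the start position of a table
    end_table: bool = False    # marks the end position of a table


def locate_tables(blocks):
    """Set the flag bits and return (start, end) index pairs; end is exclusive."""
    regions = []
    start = None
    for i, block in enumerate(blocks):
        if block.tag_type == "table":
            if start is None:          # first table block of a run
                block.start_table = True
                start = i
        elif block.tag_type and start is not None:
            # First non-table, non-null block after a table: the table ends here.
            block.end_table = True
            regions.append((start, i))
            start = None
    if start is not None:              # table runs to the end of the document
        regions.append((start, len(blocks)))
    return regions
```

Because consecutive table blocks extend the same region, a table that was split across pages into several sub-tables is returned as a single (start, end) pair.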
S105: Extract the field values and the table description information associated with the table.
After each table contained in the text to be analyzed has been located, the cell contents of each block corresponding to the table are read through the Document Python plug-in and stored in a preset table_data array. The data in the table_data array are the field values associated with the table.
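A minimal sketch of filling the table_data array is given below. It assumes each table block has already been read into a list of rows, each row being a list of cell strings; the function name and the choice to skip empty cells are assumptions made for illustration, not details from the source.

```python
def extract_field_values(table_blocks):
    """Collect the cell contents of every block of one table into table_data."""
    table_data = []
    for rows in table_blocks:        # one entry per block (sub-table)
        for row in rows:
            for cell in row:
                text = cell.strip()
                if text:             # assumption: empty cells carry no field value
                    table_data.append(text)
    return table_data
```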
In the embodiment of the present invention, the table description information describes the main content of the table data and includes, but is not limited to, the title, name or descriptive text of the table. For example, if the table data is the financial expenditure data of enterprise A for March, the table description information may be "financial expenditure data for March".
For example, several characters immediately before or after the position area of each table may be extracted and taken as the table description information of that table.
As an embodiment of the present invention, Fig. 3 shows a specific implementation flow of step S105 of the financial data acquisition method, detailed as follows:
S10501: Create a first-in first-out (FIFO) queue.
S10502: Traverse each coding block in the text code in sequence, and acquire the page tag type corresponding to the currently traversed coding block.
S10503: If the page tag type corresponding to the coding block is the paragraph type, store each character contained in the coding block into the FIFO queue in sequence, and read the real-time queue length of the FIFO queue.
S10504: If the real-time queue length of the FIFO queue is greater than a preset threshold, remove several of the earliest-stored characters from the bottom of the FIFO queue, and return to the operation of traversing each coding block in the text code in sequence and acquiring the page tag type corresponding to the currently traversed coding block.
S10505: If the page tag type corresponding to the coding block is the table type, splice all the characters in the FIFO queue, and output the splicing result as the table description information associated with the table.
For each located table, a first-in first-out (FIFO) queue with a preset length is first created in order to extract the table description information of the table. Based on the text position to which the table belongs, the blocks before that position are determined and their page tag types are read in sequence. If the page tag of a block is not null and its page tag type is the paragraph type, the content of the block is pushed into the FIFO queue.
In the embodiment of the invention, before the content of a block is pushed into the FIFO queue, the real-time queue length is obtained from the number of characters currently in the queue. If the real-time queue length is greater than the preset queue length value, the FIFO queue is full, so the data that entered the queue first are removed, and the content of the currently read block is then pushed into the queue. The process then returns to step S10502, and pushing stops once the page tag type of a read block is the table type.
After pushing into the FIFO queue has stopped, all the characters contained in the queue are extracted, and the character string obtained by splicing them is output as the table description information associated with the table.
Because pushing stops as soon as a block of the table type is detected, every character stored in the FIFO queue is guaranteed to come from the text area closest to the table position. In general, the text closest to the table position area best represents the main content of the table data (for example, the header information at the top of the table). Splicing the characters in the FIFO queue and outputting the result as the table description information therefore locates the table description information automatically and improves the accuracy of its extraction.
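The FIFO-based extraction in steps S10501 to S10505 can be sketched with `collections.deque`. The function name, the `(tag_type, text)` block representation and the default threshold are hypothetical; as a simplification, the sketch trims the queue after pushing rather than checking its length beforehand.

```python
from collections import deque


def describe_table(blocks, max_chars=20):
    """blocks: list of (tag_type, text). Return the description of the first table found."""
    fifo = deque()
    for tag_type, text in blocks:
        if tag_type == "table":
            # Stop pushing and splice everything still held in the queue.
            return "".join(fifo)
        if tag_type == "paragraph":
            fifo.extend(text)              # store each character in order
            while len(fifo) > max_chars:   # over the threshold: drop the oldest characters
                fifo.popleft()
    return ""                              # no table block was encountered
```

With this arrangement the queue always holds at most `max_chars` characters, and they are the ones nearest to the table, which is the property the description relies on.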
As an embodiment of the present invention, Fig. 4 shows another specific implementation flow of step S105 of the financial data acquisition method, detailed as follows:
S10506: If the page tag type corresponding to the coding block is the table type, acquire a regular expression associated with a preset keyword.
S10507: Check each character string in the FIFO queue against the regular expression.
S10508: If a character string matching the regular expression exists in the FIFO queue, output that character string as the table description information associated with the table.
S10509: If no character string matching the regular expression exists in the FIFO queue, calculate, for each character string in the FIFO queue, the tag distance value between the character string and the table tag in the coding block to which it belongs.
S10510: Output the character string with the smallest tag distance value as the table description information associated with the table.
In the embodiment of the present invention, extracting the table description information associated with the table from the text before the table specifically includes the following.
When a block whose page tag type is the table type is encountered and pushing into the FIFO queue stops, a regular expression associated with a preset keyword is obtained. The preset keywords are characters that are highly associated with table descriptive information such as table titles. For example, common table titles usually take the form "XXX table", so the regular expression corresponding to such titles may be "[\S]*table$". Each character string stored in the FIFO queue is then checked against the obtained regular expression.
If a character string satisfying the regular expression is detected in the FIFO queue, that character string is extracted and output as the table description information associated with the table.
If no character string satisfying the regular expression is detected in the FIFO queue, no descriptive information resembling a table title exists before the text position of the table. In this case, every N adjacent characters in the FIFO queue (N is a preset integer greater than 1) are taken as one character string, and the tag distance value is read from the style tag of the block to which the last character of the string belongs. The tag distance value represents the distance between the text position of the character and the bottom of the current page. After the tag distance values of all the character strings in the FIFO queue have been obtained in this way, the character string with the smallest tag distance value is selected and output as the table description information associated with the table.
Because the character string with the smallest tag distance value is the one closest to the bottom of the page, and the block to which it belongs lies before the table, the text position of that character string can be determined to be the one closest to the start position of the table. In general, the text closest to the start position of the table describes the subject of the table data most clearly, so outputting this character string as the table description information improves the accuracy of the table description information to some extent.
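The selection logic of steps S10506 to S10510 can be sketched as follows, assuming the candidate character strings and their tag distance values have already been obtained. The pattern is an assumed English rendering of a title rule for strings of the form "XXX table", and `pick_description` is a hypothetical helper name.

```python
import re

# Assumed pattern for titles of the form "XXX table"; the exact expression is not given.
TITLE_PATTERN = re.compile(r"\S*table$", re.IGNORECASE)


def pick_description(candidates):
    """candidates: list of (string, tag_distance).

    Prefer the first string that matches the title pattern; otherwise fall
    back to the string with the smallest tag distance value.
    """
    for text, _distance in candidates:
        if TITLE_PATTERN.search(text):
            return text
    if not candidates:
        return ""
    return min(candidates, key=lambda c: c[1])[0]
```

The regular expression is tried first because a title-like string is the strongest signal; the distance-based fallback only applies when no such string is present.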
S106: Output the table description information and the field values to a pre-created text document, so that a business system can recognize the text document and acquire the financial data associated with the text to be analyzed.
In the embodiment of the invention, after the field values in the table and the table description information associated with the table have been obtained, they are output to the pre-created text document in the order in which the characters were obtained. The text document is in txt format.
Preferably, in the text document, a preset separator is inserted between any two adjacent field values.
Preferably, the table description information is output at the top of the text document, and a line break is inserted between the table description information and the field values.
In the embodiment of the invention, the text document is sent to each business system that has been connected in advance. Because business systems of all versions are well compatible with text documents in txt format, they can recognize and process the text document to extract the financial data associated with the text to be analyzed.
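The output layout described for S106 can be sketched as below. The function names and the `|` separator are assumptions; the source only states that a preset separator sits between adjacent field values and that a line break follows the table description information.

```python
def render_text_document(description, field_values, separator="|"):
    """Description on the first line, then the field values joined by the separator."""
    return description + "\n" + separator.join(field_values)


def write_text_document(path, description, field_values, separator="|"):
    """Write the rendered content to a txt file that a business system can read."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(render_text_document(description, field_values, separator))
```

A business system could then recover the fields by splitting the second line on the same separator.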
In the embodiments of the invention, public documents such as prospectuses, annual reports and quarterly reports exist in the pdf format when first obtained. Converting them into the doc format makes it possible to read the text code corresponding to the text to be analyzed, so that the position area of each table can be determined from the table tags in the text code and the tables are located automatically. In such public documents, the data contained in tables is usually the financial data with the highest mining value. Therefore, once each table has been located, the field values and the table description information associated with the table are extracted and output to a pre-created text document, which other business systems can read and parse with good compatibility. This enables rapid analysis of enterprise financial data, avoids having to read the data directly from public documents with complex styles, and reduces the difficulty of acquiring enterprise financial data. Because the business system can automatically recognize, through the text documents, the financial data contained in various public documents, multi-dimensional acquisition of financial data is also achieved compared with the prior art.
As another embodiment of the present invention, as shown in Fig. 5, after S106 the method further includes:
S107: Load a report template, and import each item of financial data into the corresponding table body according to the preset headers in the report template.
S108: Generate and display a financial data analysis report according to the import result.
In the embodiment of the invention, a pre-generated report template is loaded. The report template contains a plurality of headers, each of which corresponds to one table body; each header describes the field attribute of the field values in the table, and each table body records the field values. For each preset header in the report template, the field values corresponding to the field attribute described by that header are screened out of the data items of the text document generated in S106 and imported into the table body corresponding to that header.
According to the field values imported for each field attribute, the statistical values are calculated with preset calculation formulas, the results are imported at the end of the report template, and the financial data analysis report is output and displayed.
In the embodiment of the invention, because the field values in the text document are imported into a pre-generated report template, the finally displayed financial data analysis report lists in detail the field values used in the data analysis process. A user can therefore conveniently check whether the analysis of the financial data contains errors, which further improves the reliability and accuracy of the financial data analysis report.
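The import and statistics steps of S107 and S108 can be sketched as follows. This is only an illustration under stated assumptions: the text document from S106 is assumed to have already been parsed into (field attribute, field value) pairs, the report template is reduced to a list of header names, and `build_report`, the sample headers and the two formulas are hypothetical names that do not appear in the embodiment.

```python
from collections import defaultdict


def build_report(headers, records, formulas):
    """Import field values under matching headers, then append statistics.

    headers  -- preset header names of the (hypothetical) report template
    records  -- (field_attribute, field_value) pairs read from the text document
    formulas -- statistic name -> function computed over the filled table bodies
    """
    bodies = defaultdict(list)
    for attribute, value in records:
        if attribute in headers:  # keep only values the template asks for
            bodies[attribute].append(value)
    # One table body per header, in template order.
    table = {header: bodies[header] for header in headers}
    # Statistics are computed from the imported values and placed at the end.
    statistics = {name: fn(table) for name, fn in formulas.items()}
    return {"table": table, "statistics": statistics}


records = [("revenue", 120.0), ("cost", 80.0), ("revenue", 150.0), ("note", 1.0)]
report = build_report(
    headers=["revenue", "cost"],
    records=records,
    formulas={
        "total_revenue": lambda t: sum(t["revenue"]),
        "gross_profit": lambda t: sum(t["revenue"]) - sum(t["cost"]),
    },
)
print(report)
```

Keeping the imported field values next to the derived statistics is what lets a reader of the report trace each statistic back to its inputs.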
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 6 shows a block diagram of the financial data acquisition apparatus according to the embodiment of the present invention, which corresponds to the financial data acquisition method described in the above embodiment, and only shows the relevant parts according to the embodiment of the present invention for convenience of description.
Referring to Fig. 6, the apparatus includes:
A first obtaining unit 61, configured to obtain a pre-published text to be analyzed, where the initial format of the text to be analyzed is the portable document (pdf) format.
A converting unit 62, configured to convert the text format of the text to be analyzed from the pdf format to the document (doc) format by using a preset text conversion tool.
A second obtaining unit 63, configured to obtain, based on the doc-formatted text to be analyzed, a text code corresponding to the text to be analyzed, where the text code comprises a plurality of types of page tags.
A searching unit 64, configured to search for a table tag among the page tags, and to locate a table existing in the text to be analyzed according to the text position to which the table tag belongs.
An extracting unit 65, configured to extract each field value associated with the table and the table description information.
An output unit 66, configured to output the table description information and each field value to a pre-created text document, so that financial data associated with the text to be analyzed is obtained after the text document is identified and processed by a business system.
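What the output unit 66 writes can be pictured with a small sketch. The layout of the text document is not specified here, so the tab-separated format, the `#` prefix for the description line and the function name `write_table` are all assumptions made for illustration; an in-memory buffer stands in for the pre-created text document.

```python
import io


def write_table(out, description, header, rows):
    """Append one table to a text stream in an assumed tab-separated layout."""
    out.write(f"# {description}\n")      # table description information
    out.write("\t".join(header) + "\n")  # field attributes
    for row in rows:                     # one line of field values per row
        out.write("\t".join(str(value) for value in row) + "\n")
    out.write("\n")                      # blank line separates tables


buffer = io.StringIO()  # stands in for the pre-created text document
write_table(
    buffer,
    "Table 1: Revenue by segment",
    ["segment", "2017", "2016"],
    [["Retail", 120.5, 98.0], ["Wholesale", 80.0, 75.2]],
)
text = buffer.getvalue()
print(text)
```

A plain delimited layout like this is one way to get the compatibility the embodiment aims for, since most downstream systems can parse it without a document-format library.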
Optionally, the searching unit 64 includes:
A traversal subunit, configured to sequentially traverse each coding block in the text code.
A judging subunit, configured to judge, for each coding block, whether the page tag type corresponding to the coding block is a table type.
A marking subunit, configured to set the attribute value of a built-in flag bit to logical true if the page tag type corresponding to the coding block is a table type, so as to mark the text position corresponding to the coding block as the start position of the table.
A return subunit, configured to return to the operation of sequentially traversing each coding block in the text code until the page tag type corresponding to the fetched coding block is a non-table type and is not null, and to mark the text position corresponding to that coding block as the end position of the table.
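The flag-based traversal performed by these subunits can be sketched as below. The representation of a coding block as a dict with a `"tag"` key, the use of `None` for a null tag type, and the name `locate_tables` are assumptions for illustration only; the handling of a table that runs to the end of the document is likewise an added assumption.

```python
def locate_tables(blocks):
    """Return (start, end) index pairs of tables in a list of coding blocks.

    Each block is a dict whose "tag" is "table", another tag name such as
    "paragraph", or None for a null tag type. `end` is the index of the first
    non-table, non-null block after the table, used as an exclusive bound.
    """
    tables = []
    in_table = False  # plays the role of the built-in flag bit
    start = None
    for index, block in enumerate(blocks):
        tag = block.get("tag")
        if tag == "table":
            if not in_table:  # first table-type block: mark the start position
                in_table = True
                start = index
        elif tag is not None and in_table:
            # Non-table and non-null tag: mark the end position, reset the flag.
            tables.append((start, index))
            in_table = False
    if in_table:  # assumption: a table may run to the end of the document
        tables.append((start, len(blocks)))
    return tables


tags = ["paragraph", "table", "table", None, "table", "paragraph", "table"]
blocks = [{"tag": tag} for tag in tags]
print(locate_tables(blocks))
```

Blocks with a null tag type neither start nor end a table, which is why the null block at index 3 stays inside the first table in this example.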
Optionally, the extracting unit 65 includes:
A creating subunit, configured to create a first-in first-out (FIFO) queue.
An acquiring subunit, configured to sequentially traverse each coding block in the text code and acquire the page tag type corresponding to the currently traversed coding block.
A storage subunit, configured to, if the page tag type corresponding to the coding block is a paragraph type, sequentially store each character contained in the coding block into the FIFO queue and read the real-time queue length of the FIFO queue.
A removing subunit, configured to, if the real-time queue length of the FIFO queue is greater than a preset threshold, remove a plurality of characters at the bottom of the FIFO queue, and return to the operation of sequentially traversing each coding block in the text code and acquiring the page tag type corresponding to the currently traversed coding block.
A splicing subunit, configured to, if the page tag type corresponding to the coding block is a table type, splice the characters in the FIFO queue and output the splicing result as the table description information associated with the table.
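A minimal sketch of the FIFO-queue logic follows, using `collections.deque` from the Python standard library. Several points are assumptions rather than statements of the embodiment: coding blocks are dicts with `"tag"` and `"text"` keys, the "bottom" of the queue is read as its oldest end, the threshold value is arbitrary, and the queue is cleared after each table so that one description is not reused for the next table.

```python
from collections import deque


def table_descriptions(blocks, max_chars=16):
    """Collect the text that immediately precedes each table-type block."""
    queue = deque()  # FIFO queue of single characters
    descriptions = []
    for block in blocks:
        tag = block.get("tag")
        if tag == "paragraph":
            queue.extend(block.get("text", ""))  # store each character in order
            while len(queue) > max_chars:        # queue length exceeds threshold
                queue.popleft()                  # drop the oldest characters
        elif tag == "table":
            descriptions.append("".join(queue))  # splice the remaining characters
            queue.clear()                        # assumption: start fresh
    return descriptions


blocks = [
    {"tag": "paragraph", "text": "Intro text. "},
    {"tag": "paragraph", "text": "Table 1: Revenue"},
    {"tag": "table"},
]
print(table_descriptions(blocks))
```

Bounding the queue length keeps only the text closest to the table, which is where a caption or title line normally sits.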
Optionally, the splicing subunit is specifically configured to:
if the page tag type corresponding to the coding block is a table type, acquire a regular expression associated with a preset keyword;
detect each character string in the FIFO queue based on the regular expression;
if a character string matching the regular expression exists in the FIFO queue, output that character string as the table description information associated with the table;
if no character string matching the regular expression exists in the FIFO queue, calculate, for each character string in the FIFO queue, the tag distance value between the character string and the table tag in the coding block to which the character string belongs;
and output the character string with the minimum tag distance value as the table description information associated with the table.
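The selection rule above (regular expression first, nearest string as fallback) can be sketched as follows. The tag distance value is not defined in this passage, so the sketch assumes it is the absolute difference between block indices; the candidate representation, the sample pattern and the name `pick_description` are also hypothetical. When several strings match, the sketch returns the first one, which is a further assumption.

```python
import re


def pick_description(candidates, table_index, pattern):
    """Choose the table description from (block_index, text) candidates."""
    regex = re.compile(pattern)
    for _, text in candidates:
        if regex.search(text):  # keyword match takes priority
            return text
    if not candidates:
        return None
    # Fallback: the string with the smallest (assumed) tag distance.
    _, text = min(candidates, key=lambda item: abs(table_index - item[0]))
    return text


candidates = [
    (3, "Unit: RMB million"),
    (5, "Table 2 Consolidated balance sheet"),
    (6, "See notes below"),
]
print(pick_description(candidates, table_index=7, pattern=r"^Table\s+\d+"))
print(pick_description(candidates, table_index=7, pattern=r"^Figure\s+\d+"))
```

The first call returns the keyword-matched caption even though another string is closer to the table; the second call finds no match and falls back to the nearest string.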
Optionally, the financial data acquisition apparatus further includes:
A loading unit, configured to load a report template and import each item of financial data into the corresponding table body according to the preset headers in the report template.
A generating unit, configured to generate and display a financial data analysis report according to the import result.
Fig. 7 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in Fig. 7, the terminal device 7 of this embodiment includes a processor 70 and a memory 71; a computer program 72 operable on the processor 70, such as a financial data acquisition program, is stored in the memory 71. When executing the computer program 72, the processor 70 implements the steps of the above embodiments of the method for acquiring financial data, such as steps S101 to S106 shown in Fig. 1. Alternatively, when executing the computer program 72, the processor 70 implements the functions of the modules/units in the above device embodiments, such as the functions of the units 61 to 66 shown in Fig. 6.
Illustratively, the computer program 72 may be partitioned into one or more modules/units that are stored in the memory 71 and executed by the processor 70 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 72 in the terminal device 7.
The terminal device 7 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The terminal device may include, but is not limited to, the processor 70 and the memory 71. Those skilled in the art will appreciate that Fig. 7 is merely an example of the terminal device 7 and does not limit it; the terminal device may include more or fewer components than shown, combine some components, or use different components. For example, the terminal device may further include input/output devices, network access devices, buses and the like.
The Processor 70 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or a memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 7. Further, the memory 71 may also include both an internal storage unit and an external storage device of the terminal device 7. The memory 71 is used for storing the computer program and other programs and data required by the terminal device. The memory 71 may also be used to temporarily store data that has been output or is to be output.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (8)

1. A method for acquiring financial data, comprising:
acquiring a pre-published text to be analyzed, wherein the initial format of the text to be analyzed is a portable document pdf format;
converting the text format of the text to be analyzed from the pdf format into a document doc format through a preset text conversion tool;
acquiring a text code corresponding to the text to be analyzed based on the text to be analyzed in the doc format; wherein the text code comprises a plurality of types of page tags;
searching a table tag in the page tag, and positioning a table in the text to be analyzed according to the text position to which the table tag belongs;
extracting each field value associated with the table and table description information;
creating a first-in first-out (FIFO) queue;
sequentially traversing each coding block in the text codes, and acquiring the page tag type corresponding to the currently traversed coding block;
if the page tag type corresponding to the coding block is a paragraph type, sequentially storing each character contained in the coding block into the FIFO queue, and reading the real-time queue length of the FIFO queue;
if the real-time queue length of the FIFO queue is larger than a preset threshold value, removing a plurality of characters at the bottom of the FIFO queue, returning to execute the operation of sequentially traversing each coding block in the text codes and acquiring the page tag type corresponding to the currently traversed coding block;
if the page tag type corresponding to the coding block is a table type, splicing all characters in the FIFO queue, and outputting a splicing result as table description information associated with the table;
and outputting the form description information and each field value to a pre-created text document so that a business system identifies the text document and acquires financial data associated with the text to be analyzed.
2. The method for acquiring financial data according to claim 1, wherein said searching for a table tag in said page tags and locating a table existing in said text to be analyzed according to a text position to which said table tag belongs comprises:
sequentially traversing each coding block in the text codes;
for each coding block, judging whether the page tag type corresponding to the coding block is a table type;
if the page tag type corresponding to the coding block is a table type, setting the attribute value of the built-in flag bit as a logic true value so as to mark the text position corresponding to the coding block as the initial position of the table;
and returning to execute the operation of sequentially traversing each coding block in the text codes until the page tag type corresponding to the taken-out coding block is a non-table type and a non-null value, and marking the text position corresponding to the coding block as the end position of the table.
3. The method according to claim 2, wherein said splicing characters in said FIFO queue and outputting a result of splicing as table description information associated with said table if the page tag type corresponding to said coding block is a table type comprises:
if the page tag type corresponding to the coding block is a table type, acquiring a regular expression associated with a preset keyword;
detecting each character string in the FIFO queue based on the regular expression;
if the character string matched with the regular expression exists in the FIFO queue, outputting the character string as table description information associated with the table;
if the character strings matched with the regular expression do not exist in the FIFO queue, respectively calculating the tag distance value of each character string in the FIFO queue and the table tag in the coding block to which the character string belongs;
and outputting the character string with the minimum label distance value as the table description information associated with the table.
4. The method for acquiring financial data according to claim 1, wherein after said outputting said form description information and each of said field values to a pre-created text document to make a business system perform recognition processing on said text document, acquiring financial data associated with said text to be analyzed, further comprises:
loading a report template, and respectively importing each financial data into a corresponding table body according to a preset table header in the report template;
and generating and displaying a financial data analysis report according to the import result.
5. A terminal device comprising a memory and a processor, the memory having stored therein a computer program operable on the processor, wherein the processor when executing the computer program implements the steps of:
acquiring a pre-published text to be analyzed, wherein the initial format of the text to be analyzed is a portable document pdf format;
converting the text format of the text to be analyzed from the pdf format into a document doc format through a preset text conversion tool;
acquiring a text code corresponding to the text to be analyzed based on the text to be analyzed in the doc format; wherein the text code comprises a plurality of types of page tags;
searching a table tag in the page tag, and positioning a table in the text to be analyzed according to the text position to which the table tag belongs;
extracting each field value associated with the table and table description information;
creating a first-in first-out (FIFO) queue;
sequentially traversing each coding block in the text codes, and acquiring the page tag type corresponding to the currently traversed coding block;
if the page tag type corresponding to the coding block is a paragraph type, sequentially storing each character contained in the coding block into the FIFO queue, and reading the real-time queue length of the FIFO queue;
if the real-time queue length of the FIFO queue is larger than a preset threshold value, removing a plurality of characters at the bottom of the FIFO queue, returning to execute the operation of sequentially traversing each coding block in the text codes and acquiring the page tag type corresponding to the currently traversed coding block;
if the page tag type corresponding to the coding block is a table type, splicing all characters in the FIFO queue, and outputting a splicing result as table description information associated with the table;
and outputting the form description information and each field value to a pre-created text document so that a business system identifies the text document and acquires financial data associated with the text to be analyzed.
6. The terminal device of claim 5, wherein the searching for the table tag in the page tags and locating the table existing in the text to be analyzed according to the text position to which the table tag belongs comprises:
sequentially traversing each coding block in the text codes;
for each coding block, judging whether the page tag type corresponding to the coding block is a table type;
if the page tag type corresponding to the coding block is a table type, setting the attribute value of the built-in flag bit as a logic true value so as to mark the text position corresponding to the coding block as the initial position of the table;
and returning to execute the operation of sequentially traversing each coding block in the text codes until the page tag type corresponding to the taken-out coding block is a non-table type and a non-null value, and marking the text position corresponding to the coding block as the end position of the table.
7. The terminal device according to claim 5, wherein if the page tag type corresponding to the coding block is a table type, the splicing characters in the FIFO queue and outputting a splicing result as table description information associated with the table comprises:
if the page tag type corresponding to the coding block is a table type, acquiring a regular expression associated with a preset keyword;
detecting each character string in the FIFO queue based on the regular expression;
if the character string matched with the regular expression exists in the FIFO queue, outputting the character string as table description information associated with the table;
if the character strings matched with the regular expression do not exist in the FIFO queue, respectively calculating the tag distance value of each character string in the FIFO queue and the table tag in the coding block to which the character string belongs;
and outputting the character string with the minimum label distance value as the table description information associated with the table.
8. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
CN201810600697.4A 2018-06-12 2018-06-12 Financial data acquisition method, terminal device and medium Active CN109062874B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810600697.4A CN109062874B (en) 2018-06-12 2018-06-12 Financial data acquisition method, terminal device and medium
PCT/CN2018/105532 WO2019237540A1 (en) 2018-06-12 2018-09-13 Method and device for acquiring financial data, terminal device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810600697.4A CN109062874B (en) 2018-06-12 2018-06-12 Financial data acquisition method, terminal device and medium

Publications (2)

Publication Number Publication Date
CN109062874A CN109062874A (en) 2018-12-21
CN109062874B true CN109062874B (en) 2022-03-04

Family

ID=64820303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810600697.4A Active CN109062874B (en) 2018-06-12 2018-06-12 Financial data acquisition method, terminal device and medium

Country Status (2)

Country Link
CN (1) CN109062874B (en)
WO (1) WO2019237540A1 (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871524B (en) * 2019-02-21 2023-06-09 腾讯科技(深圳)有限公司 Chart generation method and device
CN110263311A (en) * 2019-05-22 2019-09-20 中国平安财产保险股份有限公司 A kind of generation method and equipment of Webpage
CN110334331A (en) * 2019-05-30 2019-10-15 重庆金融资产交易所有限责任公司 Method, apparatus and computer equipment based on order models screening table
CN110188107B (en) * 2019-06-05 2020-05-01 中科鼎富(北京)科技发展有限公司 Method and device for extracting information from table
CN110297905A (en) * 2019-06-27 2019-10-01 郑州铁路职业技术学院 A kind of computer system for economic management analysis data
CN110909112B (en) * 2019-10-18 2022-08-26 深圳价值在线信息科技股份有限公司 Data extraction method, device, terminal equipment and medium
CN110909123B (en) * 2019-10-23 2023-08-25 深圳价值在线信息科技股份有限公司 Data extraction method and device, terminal equipment and storage medium
CN112287660A (en) * 2019-12-04 2021-01-29 上海柯林布瑞信息技术有限公司 Method and device for analyzing table in PDF file, computing equipment and storage medium
CN111027285B (en) * 2019-12-17 2023-06-16 南京上游软件有限公司 Method and system for automatically extracting order information from pdf format order
CN111401058B (en) * 2020-03-12 2023-05-02 广州大学 Attribute value extraction method and device based on named entity recognition tool
CN111367988A (en) * 2020-03-31 2020-07-03 中国建设银行股份有限公司 Data import method and device
CN111476015B (en) * 2020-04-10 2024-01-05 北京字节跳动网络技术有限公司 Document processing method and device, electronic equipment and storage medium
CN111562965B (en) * 2020-04-27 2024-01-05 深圳手回科技集团有限公司 Page data verification method and device based on decision tree
CN111538750A (en) * 2020-06-24 2020-08-14 深圳壹账通智能科技有限公司 Information restoration method and device, computer system and readable storage medium
CN112035412A (en) * 2020-08-31 2020-12-04 北京奇虎鸿腾科技有限公司 Data file importing method, device, storage medium and device
CN112214987B (en) * 2020-09-08 2023-02-03 深圳价值在线信息科技股份有限公司 Information extraction method, extraction device, terminal equipment and readable storage medium
CN112100366B (en) * 2020-09-17 2023-10-27 广联达科技股份有限公司 Pavement structure layer display method and device, computer equipment and storage medium
CN112434096B (en) * 2020-11-30 2023-05-23 上海天旦网络科技发展有限公司 Intelligent tag-based service analysis system and method
CN112597353B (en) * 2020-12-18 2024-03-08 武汉大学 Text information automatic extraction method
CN112699637B (en) * 2021-01-08 2024-04-12 中南大学 Paragraph type recognition method and system and document structure recognition method and system
CN112949476B (en) * 2021-03-01 2023-09-29 苏州美能华智能科技有限公司 Text relation detection method, device and storage medium based on graph convolution neural network
CN113988011A (en) * 2021-08-19 2022-01-28 中核核电运行管理有限公司 Document content identification method and device
CN113761044A (en) * 2021-08-30 2021-12-07 上海快确信息科技有限公司 Labeling system method for labeling text into table
CN113872963B (en) * 2021-09-26 2023-09-29 中水北方勘测设计研究有限责任公司 Method and system for rapidly analyzing message protocol based on free label splicing technology
CN114428839A (en) * 2022-01-27 2022-05-03 北京百度网讯科技有限公司 Data processing method, paragraph text determination device and electronic equipment
CN114692792B (en) * 2022-03-22 2022-11-04 深圳市利和兴股份有限公司 Makeup radio frequency identification testing platform
CN115545008B (en) * 2022-11-29 2023-04-07 明度智云(浙江)科技有限公司 Spectrogram file analyzing method, device, equipment and storage medium
CN117010349B (en) * 2023-09-28 2023-12-19 杭州今元标矩科技有限公司 Form filling method, system and storage medium based on neural network model
CN117350264B (en) * 2023-12-04 2024-02-23 税友软件集团股份有限公司 PPT file generation method, device, equipment and storage medium
CN117593752B (en) * 2024-01-18 2024-04-09 星云海数字科技股份有限公司 PDF document input method, PDF document input system, storage medium and electronic equipment

Citations (8)

Publication number Priority date Publication date Assignee Title
CN1614534A (en) * 2003-11-04 2005-05-11 北京华安天诚科技有限公司 Handwritten flying data displaying and inputting apparatus and method for air communication control
CN101360100A (en) * 2008-09-16 2009-02-04 浙江汇信科技有限公司 Digital signing, sealing and authenticating method for PDF document
CN103198069A (en) * 2012-01-06 2013-07-10 株式会社理光 Method and device for extracting relational table
CN103605349A (en) * 2013-11-26 2014-02-26 厦门雅迅网络股份有限公司 Remote data real-time collection, analysis and statistical system and method based on CAN-bus
CN104199975A (en) * 2014-09-23 2014-12-10 中国南方电网有限责任公司 Configurable WORD file structured extraction method
CN105589841A (en) * 2016-01-15 2016-05-18 同方知网(北京)技术有限公司 Portable document format (PDF) document form identification method
CN107689070A (en) * 2017-08-31 2018-02-13 平安科技(深圳)有限公司 Chart data structuring extracting method, electronic equipment and computer-readable recording medium
CN107818075A (en) * 2017-10-16 2018-03-20 平安科技(深圳)有限公司 Form data structuring extracting method, electronic equipment and computer-readable recording medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976232B (en) * 2010-09-19 2012-06-20 深圳市万兴软件有限公司 Method for identifying data form in document and device thereof
CN102855243A (en) * 2011-06-28 2013-01-02 北大方正集团有限公司 Method and device for extracting document structure
US9536141B2 (en) * 2012-06-29 2017-01-03 Palo Alto Research Center Incorporated System and method for forms recognition by synthesizing corrected localization of data fields
CN106484663B (en) * 2016-10-12 2019-05-03 天闻数媒科技(湖南)有限公司 A kind of extracting method and device of document content
CN106897690B (en) * 2017-02-22 2018-04-13 南京述酷信息技术有限公司 PDF table extracting methods

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Python Programming from 0 to 1 (Hands-on: Extracting Word Tables and Storing Them in Excel)"; 安和然; https://www.jianshu.com/p/d7cf0e6c5c98; 2018-03-28; pp. 1-2 *

Also Published As

Publication number Publication date
CN109062874A (en) 2018-12-21
WO2019237540A1 (en) 2019-12-19

Similar Documents

Publication Publication Date Title
CN109062874B (en) Financial data acquisition method, terminal device and medium
CN108874928B (en) Resume data information analysis processing method, device, equipment and storage medium
CN110083805B (en) Method and system for converting Word file into EPUB file
US8140468B2 (en) Systems and methods to extract data automatically from a composite electronic document
US9025890B2 (en) Information classification device, information classification method, and information classification program
US8892579B2 (en) Method and system of data extraction from a portable document format file
RU2613846C2 (en) Method and system for extracting data from images of semistructured documents
WO2019080402A1 (en) Text information extraction method for structured text, storage medium and server
US10042880B1 (en) Automated identification of start-of-reading location for ebooks
WO2019028990A1 (en) Code element naming method, device, electronic equipment and medium
CN110909123B (en) Data extraction method and device, terminal equipment and storage medium
US9772991B2 (en) Text extraction
CN112214987B (en) Information extraction method, extraction device, terminal equipment and readable storage medium
CN112395851A (en) Text comparison method and device, computer equipment and readable storage medium
CN115687655A (en) PDF document-based knowledge graph construction method, system, equipment and storage medium
CN110209759B (en) Method and device for automatically identifying page
CN111160445B (en) Bid file similarity calculation method and device
CN107145947B (en) Information processing method and device and electronic equipment
CN115759029A (en) Document template processing method and device, electronic equipment and storage medium
CN110874398B (en) Forbidden word processing method and device, electronic equipment and storage medium
CN110909112B (en) Data extraction method, device, terminal equipment and medium
CN105320716A (en) Automatic labeling method for digital publication
US9311392B2 (en) Document analysis apparatus, document analysis method, and computer-readable recording medium
CN114743012B (en) Text recognition method and device
CN113255369B (en) Text similarity analysis method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant