WO2019237540A1 - Method and device for acquiring financial data, terminal device, and medium - Google Patents
- Publication number
- WO2019237540A1 (application PCT/CN2018/105532)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- coding block
- analyzed
- fifo queue
- encoding
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F5/00—Methods or arrangements for data conversion without changing the order or content of the data handled
- G06F5/06—Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/12—Accounting
- G06Q40/125—Finance or payroll
Definitions
- the present application belongs to the technical field of data processing, and particularly relates to a method, an apparatus, a terminal device, and a computer-readable storage medium for acquiring financial data.
- Documents such as quarterly reports, annual reports and prospectuses are public documents of the enterprise.
- Public documents contain a lot of valuable financial data. For example, corporate accounts receivable, accounts payable, income and expenditure status, profit and loss amounts, and overall debt status. After reprocessing and analysis of these financial data, they can show great reference value. For example, in various applications, these financial data can be used to independently analyze the operating status of an enterprise and determine the status of the industrial chain of the industry to which the enterprise is associated.
- embodiments of the present application provide a method, an apparatus, a terminal device, and a medium for acquiring financial data, so as to solve the problem that multiple-dimensional acquisition of financial data cannot be achieved in the prior art.
- a first aspect of the embodiments of the present application provides a method for acquiring financial data, including:
- an initial format of the text to be analyzed is a portable document pdf format
- a second aspect of the embodiments of the present application provides an apparatus for acquiring financial data
- the monitoring apparatus includes a unit for executing the method for acquiring financial data described in the first aspect.
- a third aspect of the embodiments of the present application provides a terminal device including a memory and a processor.
- the memory stores computer-readable instructions executable on the processor, and when the processor executes the computer-readable instructions, the steps of the method for acquiring financial data described in the first aspect are implemented.
- a fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the method for acquiring financial data described in the first aspect.
- the public documents originally downloaded, such as prospectuses, annual reports, and quarterly reports, exist in the pdf format
- the text encoding corresponding to the text to be analyzed can be read, so that the location area of each table is determined from the table tags in the text encoding, realizing automatic positioning of the tables; in such public documents, the data contained in tables is usually financial data of high mining value.
- FIG. 1 is an implementation flowchart of a method for acquiring financial data provided by an embodiment of the present application
- FIG. 2 is a detailed implementation flowchart of step S104 of the method for acquiring financial data provided by an embodiment of the present application
- FIG. 3 is a detailed implementation flowchart of step S105 of the method for acquiring financial data provided by an embodiment of the present application
- FIG. 4 is a flowchart of another specific implementation of step S105 of the method for acquiring financial data provided by an embodiment of the present application.
- FIG. 5 is an implementation flowchart of a method for acquiring financial data provided by another embodiment of the present application.
- FIG. 6 is a structural block diagram of an apparatus for acquiring financial data provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of a terminal device according to an embodiment of the present application.
- FIG. 1 illustrates an implementation flow of a method for acquiring financial data provided by an embodiment of the present application.
- the method flow includes steps S101 to S106.
- the specific implementation principle of each step is as follows:
- S101 Obtain a pre-published text to be analyzed, the initial format of which is the portable document (pdf) format.
- the texts to be analyzed are public documents issued by the enterprise, including quarterly reports, annual reports, and prospectuses. The text to be analyzed is downloaded regularly from the corresponding public website according to preset website information.
- PDF Portable Document Format
- S102 Convert a text format of the text to be analyzed from the pdf format to a document doc format by using a preset text conversion tool.
- the text conversion tool may be, for example, a Foxit converter, a PDF converter, and a quick converter.
- S103 Obtain a text encoding corresponding to the text to be analyzed based on the text to be analyzed in the doc format; wherein the text encoding includes multiple types of page tags.
- the text encoding contains many types of page tags, such as table tags and paragraph tags.
- S104 Find the table tags among the page tags, and locate the tables existing in the text to be analyzed according to the text positions to which the table tags belong.
- the text encoding corresponding to the text to be analyzed is traversed, and a preset regular expression is used to detect, in sequence, the various types of page tags appearing in it. Among the detected page tags, each table tag is located based on the tag character element corresponding to table tags.
- when any table tag in the text to be analyzed is located, the text encoding adjacent to that table tag is determined to be the text encoding that matches a table in the text to be analyzed. The position of the table in the text to be analyzed can therefore be determined from the text position to which the table tag belongs.
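The tag detection described for S104 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the text encoding is an HTML-like markup string in which tables open with `<table>` and paragraphs with `<p>`. The source does not name the actual markup, and the names `PAGE_TAG`, `find_page_tags`, and `find_table_positions` are hypothetical.

```python
import re

# Assumption: page tags look like HTML opening tags. The patent only says
# that the text encoding contains table tags and paragraph tags.
PAGE_TAG = re.compile(r"<(table|p)\b[^>]*>", re.IGNORECASE)


def find_page_tags(text_encoding):
    """Return (tag_name, offset) for every opening page tag, in document order."""
    return [(m.group(1).lower(), m.start()) for m in PAGE_TAG.finditer(text_encoding)]


def find_table_positions(text_encoding):
    """Return the text positions (character offsets) of the table tags only."""
    return [pos for name, pos in find_page_tags(text_encoding) if name == "table"]


sample = "<p>Revenue by quarter</p><table><tr><td>Q1</td></tr></table><p>Notes</p>"
table_positions = find_table_positions(sample)
```

Each offset in `table_positions` marks where a table begins in the encoding, which is the "text position to which the table tag belongs" in the wording of S104.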
- FIG. 2 shows a specific implementation process of the method S104 for obtaining financial data provided by an embodiment of the present application, which is detailed as follows:
- S1042 For each of the coding blocks, determine whether a page tag type corresponding to the coding block is a table type.
- S1044 Return to the operation of sequentially traversing each coding block in the text encoding until the page tag type corresponding to the extracted coding block is a non-table type and a non-null value, and mark the text position corresponding to that coding block as the end position of the table.
- the text encoding includes a plurality of encoding blocks, and each block has a corresponding page label.
- each block in the text encoding is read in turn.
- the page tag type of each block is determined. If the page tag corresponding to the block is a table tag, it is determined that the page tag type of the block is a table type; if the page tag corresponding to the block is a paragraph tag, the page tag type of the block is determined to be a paragraph type.
- the attribute value of the start_table flag bit of the text position is set to the logical value true, to mark the text position as the starting position of the table currently detected. After that, return to step S1041 to find the next block in the text encoding from the current text position, and execute the subsequent steps S1042 to S1044.
- if the page tag type is a non-table type (for example, a paragraph type), the value of the end_table flag bit of the text position to which the block belongs is set to the logical value true, to mark the text position as the end position of the table currently detected.
- the area between the first text position, where the start_table flag is true, and the second text position, where the end_table flag is first set to true after the first text position, is determined as the text area corresponding to one table.
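The flag-based marking of table boundaries can be sketched as below. This is an illustrative sketch under stated assumptions: the `Block` class and `mark_table_regions` function are hypothetical, a block's index stands in for its text position, and a table that runs to the end of the document is closed at `len(blocks)` (a case the source does not discuss). Only the flag names `start_table` and `end_table` come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    tag_type: Optional[str]      # "table", "paragraph", or None for an empty block
    start_table: bool = False    # set on the first block of a table
    end_table: bool = False      # set on the first non-table, non-null block after it


def mark_table_regions(blocks: List[Block]) -> List[Tuple[int, int]]:
    """Set the flags on the blocks and return (start, end) index pairs, end exclusive."""
    regions = []
    start = None
    for i, block in enumerate(blocks):
        if block.tag_type == "table":
            if start is None:            # a new table begins here
                block.start_table = True
                start = i
        elif block.tag_type is not None and start is not None:
            block.end_table = True       # first non-table, non-null block closes the table
            regions.append((start, i))
            start = None
    if start is not None:                # assumption: table runs to the end of the document
        regions.append((start, len(blocks)))
    return regions
```

Because empty blocks are skipped and consecutive table blocks extend the open region, two table blocks separated only by an empty block are reported as a single table.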
- the embodiment of the present application is applicable to a scenario in which a table displayed across pages exists in the text to be analyzed.
- such a table is divided into at least two sub-tables, each of which is displayed on a separate page of the text to be analyzed. Therefore, after the text format of the text to be analyzed is converted to the doc format, and in order to restore a single table from different blocks in the text encoding, when the page tag types of two consecutive blocks are both detected as the table type, the text positions to which the two blocks belong are determined to be the position area of the same table.
- the attribute value of the built-in flag bit corresponding to each text position can be determined, so that the starting and ending positions of the tables in the text to be analyzed are accurately identified from these attribute values. Tables displayed across pages are thereby recognized automatically, the financial data extracted from them is classified under the same table, and the accuracy of table data extraction is improved.
- the form description information is used to describe the main content of the form data, including but not limited to the title, name, or descriptive information of the form.
- for example, if the table data is the financial expenditure data of Enterprise A in March, the table description information may be "March fiscal expenditure data".
- multiple characters before or after the location area of the table may be extracted and used as the table description information of the table.
- FIG. 3 shows a specific implementation process of the method S105 for obtaining financial data provided in the embodiment of the present application, which is detailed as follows:
- S10501 Create a FIFO queue.
- S10502 Traverse each coding block in the text encoding in sequence, and obtain the page tag type corresponding to the currently traversed coding block.
- S10503 If the page tag type corresponding to the coding block is a paragraph type, store each character contained in the coding block into the FIFO queue in sequence, and read the real-time queue length of the FIFO queue.
- S10504 If the real-time queue length of the FIFO queue is greater than a preset threshold, remove a plurality of the characters at the bottom of the FIFO queue, and return to the operation of sequentially traversing each coding block in the text encoding and obtaining the page tag type corresponding to the currently traversed coding block.
- FIFO First Input First Output
- the real-time queue length of the FIFO queue is obtained from the number of characters the queue contains. If the real-time queue length is greater than the preset queue length value, the FIFO queue is full, so the data that entered the queue first is removed in order to push the cell content of the currently read block into the queue. Thereafter, return to and execute the above S1052, and when the page tag type of the read block is a table type, stop pushing the cell content of any block into the FIFO queue.
- each character contained in the FIFO queue is extracted, and a character string obtained by splicing each character is output as table description information associated with a table.
- the characters stored in the FIFO queue are the text information closest to the location area of the table.
- the text information closest to the location area of the table best reflects the main content of the table data (for example, the header information at the top of the table). Splicing the characters in the FIFO queue and outputting the result as the table description information associated with the table therefore locates the table description information automatically and improves the accuracy of its extraction.
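The FIFO procedure of FIG. 3 can be sketched with a bounded queue. This is a minimal sketch, assuming blocks arrive as `(tag_type, content)` pairs; the function name `description_before_table` and the default of 20 characters are hypothetical. Python's `collections.deque` with `maxlen` discards the oldest items automatically when full, which stands in for the explicit length check and removal in S10504.

```python
from collections import deque


def description_before_table(blocks, max_chars=20):
    """Return the last `max_chars` paragraph characters seen before the first table.

    `blocks` is an iterable of (tag_type, content) pairs, where tag_type is
    "paragraph" or "table". Returns None if no table block is found.
    """
    fifo = deque(maxlen=max_chars)       # oldest characters drop out when full
    for tag_type, content in blocks:
        if tag_type == "paragraph":
            fifo.extend(content)         # store each character in sequence
        elif tag_type == "table":
            return "".join(fifo)         # splice the queued characters
    return None


blocks = [
    ("paragraph", "Company overview. "),
    ("paragraph", "March expenditure"),
    ("table", "Rent|12000"),
]
description = description_before_table(blocks, max_chars=17)
```

With a limit of 17 characters, only the text of the paragraph immediately preceding the table survives in the queue, which matches the idea that the text nearest the table best describes it.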
- FIG. 4 shows another specific implementation process of the method S105 for obtaining financial data provided in the embodiment of the present application, which is detailed as follows:
- S10507 Perform detection processing on each character string in the FIFO queue based on the regular expression.
- S10510 Output one of the character strings with the smallest tag distance value as table description information associated with the table.
- extracting the table description information associated with the table based on the text information before the table specifically includes: after the cell content of the block whose page tag type is the table type is pushed into the FIFO queue, obtaining a regular expression associated with a preset related word.
- the preset related words are characters highly relevant to descriptive information of the table, such as the table title.
- common table titles usually take the form "XXX table", so the regular expression for this class of table title can be "[\s\S]*table$".
- each string stored in the FIFO queue is detected and processed.
- the character string is extracted and output as the table description information associated with the table.
- N is a preset value, and N is an integer greater than 1.
- the character is a character string.
- the tag distance value of the block is read from the style tag of the block to which the last character belongs.
- the tag distance value indicates the distance between the text position of the character and the bottom of the current page. After the tag distance value of each character string in the FIFO queue is obtained in this way, the character string with the smallest tag distance value is selected and output as the table description information associated with the table.
- the text position to which this character string belongs is the closest to the starting position of the table.
- the text information closest to the starting position of the table describes the subject of the table data most clearly. Outputting this character string as the table description information associated with the table therefore also improves, to a certain extent, the accuracy of the table description information.
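The selection rule of FIG. 4 can be sketched as follows. This is an illustration under assumptions: candidates are given as `(string, tag_distance)` pairs, the pattern is an English analogue of a title ending in "table", and the names `TITLE_PATTERN` and `pick_description` are hypothetical. How the tag distance is read from a style tag is not shown, since the source does not specify the tag's format.

```python
import re

# Assumption: an English analogue of a title pattern of the form "XXX table".
TITLE_PATTERN = re.compile(r"[\s\S]*table$", re.IGNORECASE)


def pick_description(candidates):
    """Choose the table description from (string, tag_distance) pairs.

    Strings that do not match the title pattern are ignored. Among the
    matching strings, the one with the smallest tag distance value wins.
    Returns None when nothing matches.
    """
    matches = [(text, distance) for text, distance in candidates
               if TITLE_PATTERN.match(text.strip())]
    if not matches:
        return None
    return min(matches, key=lambda pair: pair[1])[0]


candidates = [
    ("Balance sheet summary table", 320.0),
    ("Unit: RMB", 180.0),
    ("Accounts receivable table", 210.0),
]
chosen = pick_description(candidates)
```

"Unit: RMB" has the smallest distance overall but is filtered out because it does not look like a title, so the choice falls to the matching string with the smallest distance.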
- S106 Output the table description information and each of the field values to a pre-created text document, so that the business system, after identifying and processing the text document, obtains the financial data associated with the text to be analyzed.
- the table description information and each field value are output in sequence to a pre-created text document.
- the text format of the text document is txt format.
- a preset separator is inserted between any two adjacent field values.
- the table description information is output at the top of the text document, and a line break is inserted between the table description information and the field values.
- the text document is sent to each business system connected in advance. Because business systems of every version type are highly compatible with text files in txt format, the business system can identify and process the text document to extract the financial data associated with the text to be analyzed.
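The txt layout described for S106 can be sketched as below. This is a minimal sketch: the function names are hypothetical, and the `|` separator and UTF-8 encoding are assumptions, since the source only says that a preset separator is used.

```python
from pathlib import Path
from typing import Iterable


def render_text_document(description: str, field_values: Iterable[object],
                         separator: str = "|") -> str:
    """Put the table description on the first line and the field values,
    joined by the separator, on the line after it."""
    values = separator.join(str(value) for value in field_values)
    return f"{description}\n{values}"


def write_text_document(path: Path, description: str, field_values: Iterable[object],
                        separator: str = "|") -> None:
    """Write the rendered content to a txt file (UTF-8 is an assumption)."""
    path.write_text(render_text_document(description, field_values, separator),
                    encoding="utf-8")


content = render_text_document("March fiscal expenditure data",
                               ["Rent", 12000, "Salaries", 85000])
```

A downstream system can recover the parts with two plain string operations: split once on the line break, then split the second line on the separator.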
- the embodiment of the present application realizes rapid analysis of corporate financial data and avoids the need to read corporate financial data from public files with complex styles, thereby reducing the difficulty of obtaining corporate financial data; because the business system can use the above text document to automatically identify the financial data contained in various types of public documents, multi-dimensional acquisition of financial data is also achieved, compared with the prior art.
- the method further includes:
- S107 Load a report template, and import each of the financial data into a corresponding table body according to a preset header in the report template.
- a pre-generated report template is loaded. The report template includes various headers, each header corresponds to a table body, each header describes a field attribute of a field value in the table, and each table body records a field value.
- the field values corresponding to each field attribute are filtered out and imported into the table body corresponding to the matching header of the report template.
- each statistical value is calculated through a preset calculation formula, the statistical results are imported into the footer of the report template, and the financial data analysis report is then output and displayed.
- the field values in the text document are imported into a pre-generated report template, so that the final financial data analysis report lists in detail the field values used in the data analysis process. This makes it convenient for users to check whether the financial data analysis process contains errors, further improving the reliability and accuracy of financial data analysis reports.
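The template filling of S107 can be sketched as follows. This is an illustrative sketch under several assumptions: records are dictionaries keyed by field attribute, the report is a plain dictionary with a body and a footer, and the "preset calculation formula" is taken to be a column sum. The function name `fill_report` is hypothetical, and none of these choices is specified by the source.

```python
from typing import Dict, List


def fill_report(headers: List[str], records: List[Dict[str, object]]) -> Dict[str, dict]:
    """Import field values under their headers and compute footer statistics.

    Each header names a field attribute. The body under a header collects
    the values of that attribute from every record that has it. The footer
    holds the sum of each column whose values are all numeric
    (assumption: sum is the preset formula).
    """
    body = {header: [record[header] for record in records if header in record]
            for header in headers}
    footer = {header: sum(values) for header, values in body.items()
              if values and all(isinstance(v, (int, float)) for v in values)}
    return {"body": body, "footer": footer}


report = fill_report(
    ["item", "amount"],
    [{"item": "Rent", "amount": 12000}, {"item": "Salaries", "amount": 85000}],
)
```

Keeping the individual values in the body next to the computed footer is what lets a reader of the report check each statistic against the values it was derived from.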
- FIG. 6 shows a structural block diagram of a device for acquiring financial data provided by an embodiment of the present application. For convenience of explanation, only the parts relevant to the embodiment of the present application are shown.
- the device includes:
- the first obtaining unit 61 is configured to obtain a pre-published text to be analyzed, and an initial format of the text to be analyzed is a portable document pdf format.
- the conversion unit 62 is configured to convert a text format of the text to be analyzed from the pdf format to a document doc format by using a preset text conversion tool.
- the second obtaining unit 63 is configured to obtain a text encoding corresponding to the text to be analyzed based on the text to be analyzed in the doc format, where the text encoding includes multiple types of page tags.
- the searching unit 64 is configured to search for a form tag in the page tag, and locate a form existing in the text to be analyzed according to a text position to which the form tag belongs.
- the extraction unit 65 is configured to extract various field values and table description information associated with the table.
- the output unit 66 is configured to output the table description information and each of the field values to a pre-created text document, so that the business system, after identifying and processing the text document, obtains the financial data associated with the text to be analyzed.
- the search unit 64 includes:
- the traversing subunit is used to sequentially traverse each coding block in the text coding.
- the judging subunit is configured to judge, for each of the coding blocks, whether a page tag type corresponding to the coding block is a table type.
- a marking subunit configured to set the attribute value of the built-in flag bit to a logical true value if the page tag type corresponding to the coding block is a table type, so as to mark the text position corresponding to the coding block as the starting position of the table.
- a return subunit configured to return to the operation of sequentially traversing each coding block in the text encoding until the page tag type corresponding to the extracted coding block is a non-table type and a non-null value, and to mark the text position corresponding to that coding block as the end position of the table.
- the extraction unit 65 includes:
- An acquisition subunit is configured to sequentially traverse each coding block in the text encoding, and obtain a page tag type corresponding to the currently traversed coding block.
- a storage subunit configured to store each character contained in the coding block into the FIFO queue in sequence if the page tag type corresponding to the coding block is a paragraph type, and to read the real-time queue length of the FIFO queue.
- a removing subunit configured to remove a plurality of the characters at the bottom of the FIFO queue if the real-time queue length of the FIFO queue is greater than a preset threshold, and to return to the operation of sequentially traversing each coding block in the text encoding and obtaining the page tag type corresponding to the currently traversed coding block.
- the splicing subunit is configured to splice each character in the FIFO queue if the page tag type corresponding to the coding block is a table type, and output the splicing result as table description information associated with the table.
- the splicing subunit is specifically configured to: if the page tag type corresponding to the coding block is a table type, obtain a regular expression associated with a preset keyword;
- the apparatus for acquiring financial data further includes: a loading unit for loading a report template and importing each item of the financial data into the corresponding table body according to the preset headers in the report template.
- a generating unit is used to generate and display financial data analysis reports based on the import results.
- FIG. 7 is a schematic diagram of a terminal device according to an embodiment of the present application.
- the terminal device 7 of this embodiment includes a processor 70 and a memory 71.
- the memory 71 stores computer-readable instructions 72 that can be run on the processor 70, such as a program for acquiring financial data.
- the processor 70 executes the computer-readable instructions 72
- the steps in the embodiment of the method for acquiring financial data are implemented, for example, steps 101 to 106 shown in FIG.
- the processor 70 executes the computer-readable instructions 72
- the functions of the modules / units in the foregoing device embodiments are implemented, for example, the functions of the units 61 to 66 shown in FIG. 6.
- the computer-readable instructions 72 may be divided into one or more modules / units, the one or more modules / units are stored in the memory 71 and executed by the processor 70, To complete this application.
- the one or more modules / units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 72 in the terminal device 7.
- the terminal device 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
- the terminal device may include, but is not limited to, a processor 70 and a memory 71.
- FIG. 7 is only an example of the terminal device 7 and does not constitute a limitation on the terminal device 7. It may include more or fewer components than shown in the figure, or combine some components or different components.
- the terminal device may further include an input / output device, a network access device, a bus, and the like.
- the processor 70 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
- CPU Central Processing Unit
- DSP Digital Signal Processor
- ASIC Application Specific Integrated Circuit
- FPGA Field-Programmable Gate Array
- a general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or a memory of the terminal device 7.
- the memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the terminal device 7. Further, the memory 71 may include both an internal storage unit of the terminal device 7 and an external storage device.
- the memory 71 is configured to store the computer-readable instructions and other programs and data required by the terminal device.
- the memory 71 may also be used to temporarily store data that has been output or is to be output.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
- the above integrated unit may be implemented in the form of hardware or in the form of software functional unit.
- if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
- the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
- the foregoing storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, compact discs, and other media that can store program code.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Finance (AREA)
- General Engineering & Computer Science (AREA)
- Accounting & Taxation (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Economics (AREA)
- General Business, Economics & Management (AREA)
- Technology Law (AREA)
- Strategic Management (AREA)
- Marketing (AREA)
- Development Economics (AREA)
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
Abstract
A method and device for acquiring financial data, a terminal device, and a medium, applicable in the technical field of data processing, reducing the difficulty in acquiring corporate financial data, and effecting a multidimensional acquisition of financial data. The method comprises: acquiring pre-released text to be analyzed, the initial format of said text being the Portable Document Format (PDF) (S101); converting said text from PDF to the DOC format via a preset text conversion tool (S102); acquiring a text encoding corresponding to said text on the basis of said text in the DOC format, where the text encoding comprises multiple types of page labels (S103); searching for a table label in the page labels and positioning, on the basis of the position of text pertaining to the table label, a table in said text (S104); extracting field values associated with the table and table description information (S105); outputting the table description information and the field values to a pre-created text file, thus allowing a service system to perform recognition processing on the text file and then to acquire financial data associated with said text (S106).
Description
This application claims priority to Chinese patent application No. 201810600697.4, filed with the Chinese Patent Office on June 12, 2018 and entitled "Method for Acquiring Financial Data, Terminal Device, and Medium", the entire contents of which are incorporated herein by reference.
The present application belongs to the technical field of data processing, and particularly relates to a method, an apparatus, a terminal device, and a computer-readable storage medium for acquiring financial data.
Documents such as quarterly reports, annual reports, and prospectuses are public documents of an enterprise. Public documents contain a great deal of valuable financial data, for example corporate accounts receivable, accounts payable, income and expenditure status, profit and loss amounts, and overall debt status. After reprocessing and analysis, these financial data can have great reference value. For example, in various applications, they can be used to independently analyze the operating status of an enterprise and to determine the status of the industrial chain of the industry with which the enterprise is associated.
However, because public documents such as quarterly reports, annual reports, and prospectuses have relatively complex styles, the industry has not yet disclosed any way to automatically extract and analyze financial data from them. As a result, multi-dimensional acquisition of financial data cannot be achieved.
In view of this, embodiments of the present application provide a method, an apparatus, a terminal device, and a medium for acquiring financial data, so as to solve the problem that multi-dimensional acquisition of financial data cannot be achieved in the prior art.
A first aspect of the embodiments of the present application provides a method for acquiring financial data, including:
obtaining a pre-published text to be analyzed, where an initial format of the text to be analyzed is a portable document pdf format;
converting a text format of the text to be analyzed from the pdf format to a document doc format by using a preset text conversion tool;
obtaining a text encoding corresponding to the text to be analyzed based on the text to be analyzed in the doc format, where the text encoding includes multiple types of page tags;
finding a table tag among the page tags, and locating a table existing in the text to be analyzed according to a text position to which the table tag belongs;
extracting each field value and the table description information associated with the table;
outputting the table description information and each of the field values to a pre-created text document, so that a business system, after identifying and processing the text document, obtains the financial data associated with the text to be analyzed.
A second aspect of the embodiments of the present application provides an apparatus for acquiring financial data, the apparatus including units for executing the method for acquiring financial data described in the first aspect.
A third aspect of the embodiments of the present application provides a terminal device including a memory and a processor. The memory stores computer-readable instructions executable on the processor, and the processor, when executing the computer-readable instructions, implements the steps of the method for acquiring financial data described in the first aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement the steps of the method for acquiring financial data described in the first aspect.
In the embodiments of the present application, public documents such as prospectuses, annual reports, and quarterly reports are originally loaded in pdf format. Converting them to doc format makes it possible to read the text encoding corresponding to the text to be analyzed, and the table tags in that encoding determine the region in which each table lies, so tables are located automatically. The data contained in the tables of such public documents is usually financial data with high mining value. Therefore, once each table has been located, the field values and the table description information associated with it are extracted and output to a pre-created text document. Because text documents are highly compatible, other business systems can read and analyze them, which enables rapid analysis of enterprise financial data, avoids having to read that data from public documents with complex layouts, and thus reduces the difficulty of acquiring it. Because a business system can automatically identify the financial data contained in the various public documents through the text document, multi-dimensional acquisition of financial data is also achieved compared with the prior art.
FIG. 1 is an implementation flowchart of a method for acquiring financial data provided by an embodiment of the present application;
FIG. 2 is a detailed implementation flowchart of step S104 of the method for acquiring financial data provided by an embodiment of the present application;
FIG. 3 is a detailed implementation flowchart of step S105 of the method for acquiring financial data provided by an embodiment of the present application;
FIG. 4 is another detailed implementation flowchart of step S105 of the method for acquiring financial data provided by an embodiment of the present application;
FIG. 5 is an implementation flowchart of a method for acquiring financial data provided by another embodiment of the present application;
FIG. 6 is a structural block diagram of an apparatus for acquiring financial data provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a terminal device provided by an embodiment of the present application.
To illustrate the technical solution described in the present application, specific embodiments are described below.
FIG. 1 shows the implementation flow of the method for acquiring financial data provided by an embodiment of the present application. The flow includes steps S101 to S106, whose implementation principles are as follows:
S101: Obtain a pre-published text to be analyzed, where the initial format of the text to be analyzed is the portable document pdf format.
In the embodiments of the present application, the text to be analyzed is a public document published by an enterprise, such as a quarterly report, an annual report, or a prospectus. According to preset website information, the text to be analyzed is downloaded periodically from the corresponding public website. Because enterprises output such public documents in Portable Document Format (PDF) when creating them, every text to be analyzed downloaded from the public website is in PDF format.
S102: Convert the text format of the text to be analyzed from the pdf format to the document doc format through a preset text conversion tool.
Each text to be analyzed in pdf format is imported into the preset text conversion tool, and after a format conversion instruction issued by the user is detected, the text to be analyzed is output in document (doc) format. The text conversion tool may be, for example, a Foxit converter, a PDF converter, or a Xunjie (迅捷) converter.
S103: Obtain, based on the text to be analyzed in the doc format, the text encoding corresponding to the text to be analyzed, where the text encoding contains multiple types of page tags.
For the text to be analyzed in doc format, its text encoding is read. The text encoding contains multiple types of page tags, such as table tags and paragraph tags.
S104: Find the table tags among the page tags, and locate the tables present in the text to be analyzed according to the text positions of the table tags.
In the embodiments of the present application, the text encoding corresponding to the text to be analyzed is traversed, and a preset regular expression is used to detect, in order, each type of page tag that appears in it. Among the detected page tags, each table tag is located based on the tag character element corresponding to table tags.
If any table tag is located, the text encoding immediately following that table tag is determined to be the text encoding of one table in the text to be analyzed. The position of the table in the text to be analyzed can therefore be determined from the text position of the table tag.
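The tag detection described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the source does not specify how page tags are spelled in the text encoding, so the `<table>` and `<paragraph>` tag names, the `TAG_PATTERN` expression, and the `find_table_tags` helper are all assumptions made for the example.

```python
import re
from typing import List

# Hypothetical tag spelling: the source only says the encoding contains
# "table" and "paragraph" page tags, not what they look like.
TAG_PATTERN = re.compile(r"<(table|paragraph)\b[^>]*>")


def find_table_tags(text_encoding: str) -> List[int]:
    """Return the character offset of every opening table tag."""
    return [
        match.start()
        for match in TAG_PATTERN.finditer(text_encoding)
        if match.group(1) == "table"
    ]


sample = (
    "<paragraph>Balance sheet</paragraph>"
    "<table><cell>1</cell></table>"
    "<paragraph>Notes</paragraph>"
)
offsets = find_table_tags(sample)
```

Each returned offset is the text position of a table tag; the encoding that follows it would be treated as the table's content.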
As an embodiment of the present application, FIG. 2 shows the specific implementation flow of step S104 of the method for acquiring financial data provided by an embodiment of the present application, detailed as follows:
S1041: Traverse the coding blocks in the text encoding in sequence.
S1042: For each coding block, determine whether the page tag type corresponding to the coding block is the table type.
S1043: If the page tag type corresponding to the coding block is the table type, set the attribute value of a built-in flag bit to logical true, so as to mark the text position corresponding to the coding block as the start position of a table.
S1044: Return to the operation of traversing the coding blocks in the text encoding in sequence, until the page tag type corresponding to the retrieved coding block is a non-table type and is non-null; then mark the text position corresponding to that coding block as the end position of the table.
In the embodiments of the present application, the text encoding contains multiple coding blocks (blocks), each of which has a corresponding page tag. Each block in the text encoding is read in turn through a preset Document python plugin, and the page tag type of each block is determined from its page tag. If the page tag corresponding to a block is a table tag, the page tag type of the block is the table type; if the page tag corresponding to a block is a paragraph tag, the page tag type of the block is the paragraph type.
In the embodiments of the present application, if the page tag type of any block is detected to be the table type, the attribute value of the start_table flag bit at the text position of that block is set to logical true, so as to mark that text position as the start position of the table currently detected. The flow then returns to step S1041 to find the next block in the text encoding from the current text position, and the subsequent steps S1042 to S1044 are executed.
After the attribute value of the start_table flag bit at that text position has been set to logical true, if a subsequent block is detected to have a corresponding page tag whose type is a non-table type (for example, the paragraph type), the value of the end_table flag bit at the text position of that block is set to logical true, so as to mark that text position as the end position of the table currently detected.
According to the flag bit information of each text position in the text to be analyzed, the region between a first text position whose start_table flag bit is true and a second text position, namely the first position after the first text position whose end_table flag bit is true, is determined to be the text region corresponding to one table.
The embodiments of the present application are applicable to scenarios in which the text to be analyzed contains tables displayed across pages. For example, in a text to be analyzed in pdf format, a tall table is displayed across pages: it is split into at least two sub-tables, each shown on one page of the text to be analyzed. Therefore, after the text to be analyzed is converted to doc format, in order to restore one table from different blocks in the text encoding, when two consecutive blocks are both detected to have the table type, the text positions of both blocks are determined to belong to the region occupied by the table. If the page tag type of the next block is detected to be the paragraph type, the table has ended. One complete table in the text to be analyzed can then be located and extracted based on the text position of that block and the text positions of the preceding blocks.
In the embodiments of the present application, detecting the page tag type of each coding block in the text to be analyzed determines the attribute value of the built-in flag bit at each text position, and these attribute values accurately identify the start and end positions of each table in the text to be analyzed. Tables displayed across pages are thereby recognized automatically, so the extracted financial data items are grouped under the same table, which improves the accuracy of table data extraction.
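Steps S1041 to S1044 can be sketched as a single pass over the blocks. This is a minimal sketch under stated assumptions: the `Block` dataclass is a hypothetical stand-in for one coding block (the source reads blocks through a Document python plugin whose API is not given), and closing a table that runs to the end of the document is an added assumption not covered by the source.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for one coding block of the text encoding."""
    tag_type: Optional[str]  # "table", "paragraph", or None when no page tag
    content: str = ""
    start_table: bool = False
    end_table: bool = False


def mark_tables(blocks: List[Block]) -> List[Tuple[int, int]]:
    """Set the start_table/end_table flags and return (start, end) index pairs.

    `end` is the index of the first non-table, non-null block after the table,
    so the table itself occupies blocks[start:end].
    """
    regions: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for index, block in enumerate(blocks):
        if block.tag_type == "table":
            block.start_table = True  # S1043
            if start is None:  # consecutive table blocks form one table
                start = index
        elif block.tag_type is not None and start is not None:
            block.end_table = True  # S1044: non-table and non-null
            regions.append((start, index))
            start = None
    if start is not None:  # assumption: table runs to the end of the text
        regions.append((start, len(blocks)))
    return regions
```

Because consecutive table-type blocks extend the same region, a table that was split across pages is recovered as one region; blocks without a page tag neither start nor end a table.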
S105: Extract the field values and the table description information associated with each table.
After each table contained in the text to be analyzed is located, the cell content of each block corresponding to the table is read through the Document python plugin and stored in a preset table_data array. The data in the table_data array are the field values associated with the table.
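Collecting the cell contents into a `table_data` array might look like the sketch below. The nested-list representation of a table (blocks, then rows, then cells) and the `collect_field_values` helper are assumptions for illustration; the actual plugin interface is not described in the source.

```python
from typing import List


def collect_field_values(table_blocks: List[List[List[str]]]) -> List[str]:
    """Flatten the cells of every block of one table into table_data.

    table_blocks is a hypothetical structure: a list of blocks, each block a
    list of rows, each row a list of cell strings.
    """
    table_data: List[str] = []
    for block in table_blocks:
        for row in block:
            for cell in row:
                table_data.append(cell.strip())  # trim layout whitespace
    return table_data


# A table split across two pages arrives as two blocks.
table_data = collect_field_values(
    [
        [["Item", "Amount"], ["Rent", " 1200 "]],
        [["Salaries", "5400"]],
    ]
)
```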
In the embodiments of the present application, the table description information describes the main content of the table data and includes, but is not limited to, the title, name, or descriptive information of the table. For example, if the table data are Enterprise A's expenditure data for March, the table description information may be "March financial expenditure data".
For example, according to the region in which each table lies, several character values before or after that region may be extracted and determined to be the table description information of the table.
As an embodiment of the present application, FIG. 3 shows the specific implementation flow of step S105 of the method for acquiring financial data provided by an embodiment of the present application, detailed as follows:
S10501: Create a first-in, first-out (FIFO) queue.
S10502: Traverse the coding blocks in the text encoding in sequence, and obtain the page tag type corresponding to the coding block currently traversed.
S10503: If the page tag type corresponding to the coding block is the paragraph type, store the characters contained in the coding block into the FIFO queue in order, and read the real-time queue length of the FIFO queue.
S10504: If the real-time queue length of the FIFO queue is greater than a preset threshold, remove a plurality of the characters at the bottom of the FIFO queue, and return to the operation of traversing the coding blocks in the text encoding in sequence and obtaining the page tag type corresponding to the coding block currently traversed.
S10505: If the page tag type corresponding to the coding block is the table type, concatenate the characters in the FIFO queue and output the concatenation result as the table description information associated with the table.
For each located table, to extract its table description information, a first-in, first-out (FIFO) queue whose length is a preset value is first created. According to the text position of the table, the blocks preceding that position are determined and their page tag types are read in turn. If the page tag of a block is non-null and its page tag type is the paragraph type, the cell content of that block is pushed into the FIFO queue.
In the embodiments of the present application, before the cell content of a block is pushed into the FIFO queue, the real-time queue length is obtained from the number of characters the queue contains. If the real-time queue length is greater than the preset queue length value, the FIFO queue is full, so the data that entered the queue first is evicted and the cell content of the currently read block is pushed into the queue. The flow then returns to S10502, until a block whose page tag type is the table type is read, at which point no further cell content is pushed into the FIFO queue.
In the embodiments of the present application, after pushing of cell content into the FIFO queue has stopped, the characters contained in the FIFO queue are extracted, and the character string obtained by concatenating them is output as the table description information associated with the table.
In the embodiments of the present application, when a block whose page tag type is the table type is detected, the cell content of that block is not pushed into the FIFO queue, which ensures that the characters stored in the queue are the text closest to the table region. The text closest to the table region (for example, the title above the table) usually reflects the main content of the table data best. Concatenating the characters in the FIFO queue and outputting the result as the table description information associated with the table therefore locates the table description information automatically and improves the accuracy of its extraction.
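Steps S10501 to S10505 can be sketched with `collections.deque` serving as the FIFO queue. This is an illustrative sketch only: the `(tag_type, content)` tuple representation of a block, the `description_before_table` name, and the default threshold of 20 characters are assumptions, and the function stops at the first table-type block for brevity.

```python
from collections import deque
from typing import Deque, Iterable, Optional, Tuple


def description_before_table(
    blocks: Iterable[Tuple[str, str]], max_chars: int = 20
) -> Optional[str]:
    """Return the text nearest to the first table, or None if no table exists.

    blocks yields hypothetical (tag_type, content) pairs in document order.
    """
    fifo: Deque[str] = deque()  # S10501
    for tag_type, content in blocks:  # S10502
        if tag_type == "paragraph":  # S10503
            for char in content:
                fifo.append(char)
                while len(fifo) > max_chars:  # S10504: evict oldest characters
                    fifo.popleft()
        elif tag_type == "table":  # S10505
            return "".join(fifo)
    return None


description = description_before_table(
    [
        ("paragraph", "Intro text. "),
        ("paragraph", "Q1 expenses"),
        ("table", ""),
    ],
    max_chars=11,
)
```

Evicting from the left end keeps only the most recent characters, which are the ones nearest to the table. `deque(maxlen=max_chars)` would give the same behavior more compactly; the explicit loop is used here to mirror the described steps.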
As an embodiment of the present application, FIG. 4 shows another specific implementation flow of step S105 of the method for acquiring financial data provided by an embodiment of the present application, detailed as follows:
S10506: If the page tag type corresponding to the coding block is the table type, obtain a regular expression associated with a preset keyword.
S10507: Based on the regular expression, perform detection on each character string in the FIFO queue.
S10508: If a character string matching the regular expression exists in the FIFO queue, output that character string as the table description information associated with the table.
S10509: If no character string matching the regular expression exists in the FIFO queue, calculate, for each character string in the FIFO queue, the tag distance value between the character string and the table tag in the coding block to which it belongs.
S10510: Output the character string with the smallest tag distance value as the table description information associated with the table.
In the embodiments of the present application, the table description information associated with a table is extracted from the text preceding the table. Specifically, when a block whose page tag type is the table type is reached, that is, after the content preceding it has been pushed into the FIFO queue, the regular expression associated with the preset keyword is obtained. The preset keyword is a character that is strongly associated with descriptive information such as table titles. For example, table titles commonly take the form "XXX表" ("XXX table"), so the regular expression for this kind of title may be "[\s\S]*\表$". At the block whose page tag type is the table type, each character string stored in the FIFO queue is checked against the obtained regular expression.
If a character string satisfying the regular expression is detected in the FIFO queue, it is extracted and output as the table description information associated with the table.
If no character string satisfying the regular expression is detected in the FIFO queue, there is no descriptive information resembling a table title before the text position of the table. In this case, every N adjacent characters in the FIFO queue (N is a preset value and an integer greater than 1) are treated as one character string, and the tag distance value of the block to which the last character of the string belongs is read from that block's style tag. The tag distance value represents the distance between the text position of the character and the bottom of the current page. After the tag distance value of each character string in the FIFO queue has been obtained in this way, the character string with the smallest tag distance value is selected and output as the table description information associated with the table.
In the embodiments of the present application, because the character string with the smallest tag distance value is closest to the bottom of the page and the block it belongs to precedes the table, the text position of that character string is also the closest to the start position of the table. The text closest to the start of a table usually describes the subject of the table data fairly clearly, so outputting this character string as the table description information associated with the table also improves the accuracy of the table description information to some extent.
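The two-stage selection in S10506 to S10510 can be sketched as follows. This is a hedged illustration: the `(string, tag_distance)` pair representation and the `pick_description` name are assumptions, the pattern is written as `[\s\S]*表$` (the stray backslash before 表 in the quoted expression is dropped), and returning the first match when several strings match is a choice the source does not specify.

```python
import re
from typing import List, Optional, Tuple

# Matches strings that end in "表" ("table"), e.g. "资产负债表".
TITLE_PATTERN = re.compile(r"[\s\S]*表$")


def pick_description(candidates: List[Tuple[str, float]]) -> Optional[str]:
    """Choose the table description from hypothetical (string, distance) pairs.

    The distance is the tag distance value of the block the string belongs to.
    """
    for text, _distance in candidates:  # S10507 and S10508
        if TITLE_PATTERN.match(text):
            return text
    if not candidates:
        return None
    # S10509 and S10510: fall back to the smallest tag distance value.
    return min(candidates, key=lambda pair: pair[1])[0]


by_keyword = pick_description([("单位:元", 40.0), ("资产负债表", 55.0)])
by_distance = pick_description([("Unit: CNY", 40.0), ("Notes", 12.5)])
```

In the first call the keyword match wins even though its distance is larger; in the second call no string matches, so the string with the smallest distance is returned.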
S106: Output the table description information and each field value to a pre-created text document, so that a business system, after identifying and processing the text document, obtains the financial data associated with the text to be analyzed.
In the embodiments of the present application, after the field values in the table and the table description information associated with the table are obtained, the table description information and the field values are output to the pre-created text document in the order in which the characters were obtained. The text document is in txt format.
Preferably, in the text document, a preset separator is inserted between any two adjacent field values.
Preferably, the table description information is output to the top of the text document, and a line break is inserted between the table description information and the field values.
In the embodiments of the present application, the text document is sent to each pre-connected business system. Because business systems of every version and type are highly compatible with text documents in txt format, the business system can identify and process the text document to extract the financial data associated with the text to be analyzed.
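The output layout described in S106 (description on top, a line break, then separator-delimited field values) can be sketched as below. The `|` separator, the function names, and the UTF-8 encoding are assumptions for illustration; the source only says that a preset separator is used.

```python
from pathlib import Path
from typing import List


def render_table_text(
    description: str, field_values: List[str], separator: str = "|"
) -> str:
    """Build the txt content: description, line break, separated field values."""
    return description + "\n" + separator.join(field_values)


def write_text_document(path: str, description: str, field_values: List[str]) -> None:
    """Write the rendered content to a txt file (hypothetical helper)."""
    Path(path).write_text(
        render_table_text(description, field_values), encoding="utf-8"
    )


rendered = render_table_text(
    "March financial expenditure data", ["Rent", "1200", "Salaries", "5400"]
)
```

A receiving business system could recover the fields by splitting the first line off as the description and splitting the rest on the same separator, provided the separator never appears inside a field value.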
The embodiments of the present application enable rapid analysis of enterprise financial data and avoid having to read that data from public documents with complex layouts, which reduces the difficulty of acquiring it. Because a business system can automatically identify the financial data contained in the various public documents through the text document, multi-dimensional acquisition of financial data is also achieved compared with the prior art.
As another embodiment of the present application, as shown in FIG. 5, the method further includes, after S106:
S107: Load a report template, and import each item of the financial data into the corresponding table body according to the headers preset in the report template.
S108: Generate and display a financial data analysis report according to the import result.
In the embodiments of the present application, a pre-generated report template is loaded. The report template contains a number of headers, each corresponding to one table body; each header describes the field attribute of a field value in the table, and each table body records a field value. For each header preset in the report template, the field value corresponding to the field attribute described by that header is selected from the data of the text document generated in S106 and imported into the table body corresponding to that header. From the field values imported for each field attribute, the statistical values are calculated through preset calculation formulas, the statistical results are imported into the footer of the report template, and the financial data analysis report is output and displayed.
In the embodiments of the present application, importing the field values from the text document into the pre-generated report template allows the displayed financial data analysis report to list each field value used in the analysis in detail. This makes it easy for users to check whether the analysis of the financial data contains errors, and thus further improves the reliability and accuracy of the report.
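Steps S107 and S108 can be sketched as filling a template keyed by header. This is a minimal sketch under assumptions: the dictionary-based template, the `fill_report` name, and the use of a plain sum as the "preset calculation formula" are illustrative choices, since the source does not name the formulas or the template format.

```python
from typing import Dict, List


def fill_report(
    headers: List[str], records: Dict[str, List[float]]
) -> Dict[str, dict]:
    """Import field values under each preset header and compute a footer value.

    records maps a field attribute to the values parsed from the text document.
    Attributes without a matching header are ignored.
    """
    report: Dict[str, dict] = {}
    for header in headers:
        values = records.get(header, [])
        report[header] = {
            "body": values,  # one table body per header
            "footer": sum(values),  # assumed statistic: the total
        }
    return report


report = fill_report(
    ["Revenue", "Cost"],
    {"Revenue": [100.0, 250.0], "Other": [1.0]},
)
```

Keeping the imported values in the body next to the computed footer reflects the point made above: a reader of the report can check each statistic against the field values it was derived from.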
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Corresponding to the method for acquiring financial data described in the above embodiments, FIG. 6 shows a structural block diagram of the apparatus for acquiring financial data provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 6, the apparatus includes:
a first obtaining unit 61, configured to obtain a pre-published text to be analyzed, where the initial format of the text to be analyzed is the portable document pdf format;
a conversion unit 62, configured to convert the text format of the text to be analyzed from the pdf format to the document doc format through a preset text conversion tool;
a second obtaining unit 63, configured to obtain, based on the text to be analyzed in the doc format, the text encoding corresponding to the text to be analyzed, where the text encoding contains multiple types of page tags;
a searching unit 64, configured to find the table tags among the page tags, and locate the tables present in the text to be analyzed according to the text positions of the table tags;
an extraction unit 65, configured to extract the field values and the table description information associated with each table;
an output unit 66, configured to output the table description information and each field value to a pre-created text document, so that a business system, after identifying and processing the text document, obtains the financial data associated with the text to be analyzed.
Optionally, the search unit 64 includes:
A traversal subunit, configured to traverse the coding blocks in the text encoding in sequence.
A judging subunit, configured to determine, for each coding block, whether the page tag type corresponding to that coding block is the table type.
A marking subunit, configured to, if the page tag type corresponding to the coding block is the table type, set the attribute value of a built-in flag to logical true, so as to mark the text position corresponding to the coding block as the start position of the table.
A return subunit, configured to return to the operation of traversing the coding blocks in the text encoding in sequence until the page tag type corresponding to the fetched coding block is a non-table type and is not null, and then mark the text position corresponding to that coding block as the end position of the table.
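The flag-based traversal performed by these subunits can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Block` tuple, the tag names `"table"` and `"paragraph"`, and the function `locate_tables` are hypothetical, and a null tag type is modeled as `None`.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of a coding block: (page tag type, text).
# The tag type is None for an empty (null) block.
Block = Tuple[Optional[str], str]


def locate_tables(blocks: List[Block]) -> List[Tuple[int, int]]:
    """Return (start, end) block indices for each table.

    start is the first block whose tag type is "table"; end is the first
    following block whose tag type is neither "table" nor None.
    """
    spans: List[Tuple[int, int]] = []
    in_table = False  # plays the role of the built-in flag
    start = -1
    for i, (tag_type, _text) in enumerate(blocks):
        if tag_type == "table":
            if not in_table:
                in_table = True  # flag set to logical true: table starts here
                start = i
        elif in_table and tag_type is not None:
            spans.append((start, i))  # non-table, non-null block ends the table
            in_table = False
    if in_table:
        # Assumption: a table still open at the end of the text ends there.
        spans.append((start, len(blocks)))
    return spans
```

Null blocks inside a table are skipped rather than treated as the end position, which matches the condition that the terminating block must be both non-table and non-null.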
Optionally, the extraction unit 65 includes:
A creating subunit, configured to create a first-in-first-out (FIFO) queue.
An obtaining subunit, configured to traverse the coding blocks in the text encoding in sequence, and obtain the page tag type corresponding to the coding block currently being traversed.
A storage subunit, configured to, if the page tag type corresponding to the coding block is the paragraph type, store the characters contained in the coding block into the FIFO queue in order, and read the real-time queue length of the FIFO queue.
A removing subunit, configured to, if the real-time queue length of the FIFO queue is greater than a preset threshold, remove a plurality of the characters at the bottom of the FIFO queue, and return to the operation of traversing the coding blocks in the text encoding in sequence and obtaining the page tag type corresponding to the coding block currently being traversed.
A splicing subunit, configured to, if the page tag type corresponding to the coding block is the table type, concatenate the characters in the FIFO queue and output the concatenation result as the table description information associated with the table.
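The bounded FIFO queue used by these subunits can be sketched as follows. This is an illustration under stated assumptions: the "bottom" of the queue is taken to mean its oldest end, the queue is cleared after each table, and consecutive table blocks are treated as one table. None of these points is specified in the text above, and `collect_descriptions` and `max_len` are hypothetical names.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional, Tuple

# Hypothetical representation of a coding block: (page tag type, text).
Block = Tuple[Optional[str], str]


def collect_descriptions(blocks: Iterable[Block], max_len: int = 50) -> List[str]:
    """Return one description string per table, built from the text before it."""
    queue: Deque[str] = deque()
    descriptions: List[str] = []
    prev_was_table = False
    for tag_type, text in blocks:
        if tag_type == "paragraph":
            queue.extend(text)  # store the characters in order
            while len(queue) > max_len:
                queue.popleft()  # assumption: drop the oldest characters
            prev_was_table = False
        elif tag_type == "table":
            if not prev_was_table:
                descriptions.append("".join(queue))  # concatenate the queue
                queue.clear()  # assumption: start fresh for the next table
            prev_was_table = True
        else:
            prev_was_table = False
    return descriptions
```

Capping the queue at a threshold keeps only the text closest to the table, which is where a caption or unit note would normally sit.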
Optionally, the splicing subunit is specifically configured to: if the page tag type corresponding to the coding block is the table type, obtain a regular expression associated with a preset keyword;
check each character string in the FIFO queue against the regular expression;
if a character string matching the regular expression exists in the FIFO queue, output that character string as the table description information associated with the table;
if no character string matching the regular expression exists in the FIFO queue, calculate, for each character string in the FIFO queue, the tag distance value between that character string and the table tag in the coding block to which it belongs;
output the character string with the smallest tag distance value as the table description information associated with the table.
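The selection rule above, regular expression first and smallest tag distance as a fallback, can be sketched as follows. The sketch assumes that each candidate string already carries a precomputed tag distance (how that distance is measured is not defined in this passage) and that the first match wins when several strings match. The name `pick_description` and the sample pattern are hypothetical.

```python
import re
from typing import List, Tuple


def pick_description(candidates: List[Tuple[str, int]], pattern: str) -> str:
    """Choose the table description from (string, tag distance) pairs."""
    if not candidates:
        return ""
    regex = re.compile(pattern)
    for text, _distance in candidates:
        if regex.search(text):
            return text  # assumption: the first matching string is used
    # No match: fall back to the string closest to the table tag.
    return min(candidates, key=lambda c: c[1])[0]
```

For example, with a pattern such as `r"^Table\s+\d+"`, a caption line beginning with "Table 3" is preferred over a unit note even if the unit note sits closer to the table.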
Optionally, the apparatus for acquiring financial data further includes: a loading unit, configured to load a report template and, according to the headers preset in the report template, import each item of the financial data into the corresponding table body.
A generating unit, configured to generate and display a financial data analysis report according to the import result.
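As a rough sketch of the loading and generating units, the template can be modeled as an ordered list of headers and each item of financial data as a dictionary keyed by header. This representation, and the names `fill_report` and `render`, are assumptions made for illustration only.

```python
from typing import Dict, List


def fill_report(headers: List[str], records: List[Dict[str, str]]) -> List[List[str]]:
    """Place each record's values under the matching template header."""
    body = [[rec.get(h, "") for h in headers] for rec in records]
    return [headers] + body


def render(table: List[List[str]]) -> str:
    """Render the filled table as tab-separated text for display."""
    return "\n".join("\t".join(row) for row in table)
```

Fields that the template does not list are ignored, and headers with no matching value are left blank.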
FIG. 7 is a schematic diagram of a terminal device provided by an embodiment of this application. As shown in FIG. 7, the terminal device 7 of this embodiment includes a processor 70 and a memory 71. The memory 71 stores computer-readable instructions 72 that can run on the processor 70, such as a program for acquiring financial data. When the processor 70 executes the computer-readable instructions 72, the steps in the foregoing embodiments of the method for acquiring financial data are implemented, for example steps 101 to 106 shown in FIG. 1. Alternatively, when the processor 70 executes the computer-readable instructions 72, the functions of the modules/units in the foregoing apparatus embodiments are implemented, for example the functions of units 61 to 66 shown in FIG. 6.
Exemplarily, the computer-readable instructions 72 may be divided into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to carry out this application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 72 in the terminal device 7.
The terminal device 7 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 70 and the memory 71. Those skilled in the art will understand that FIG. 7 is only an example of the terminal device 7 and does not limit it. The terminal device may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device may further include input/output devices, a network access device, a bus, and the like.
The processor 70 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the terminal device 7. Further, the memory 71 may include both an internal storage unit of the terminal device 7 and an external storage device. The memory 71 is used to store the computer-readable instructions and other programs and data required by the terminal device. The memory 71 may also be used to temporarily store data that has been output or is to be output.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In summary, the above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.
Claims (20)
- A method for acquiring financial data, comprising: obtaining a pre-published text to be analyzed, wherein the initial format of the text to be analyzed is the Portable Document Format (PDF); converting the text to be analyzed from the PDF format to the document (doc) format by using a preset text conversion tool; obtaining, based on the text to be analyzed in the doc format, the text encoding corresponding to the text to be analyzed, wherein the text encoding contains multiple types of page tags; searching the page tags for table tags and, according to the text position to which each table tag belongs, locating the tables present in the text to be analyzed; extracting the field values and the table description information associated with the table; and outputting the table description information and each field value to a pre-created text document, so that a business system obtains the financial data associated with the text to be analyzed after recognizing and processing the text document.
- The method for acquiring financial data according to claim 1, wherein searching the page tags for table tags and, according to the text position to which each table tag belongs, locating the tables present in the text to be analyzed comprises: traversing the coding blocks in the text encoding in sequence; determining, for each coding block, whether the page tag type corresponding to that coding block is the table type; if the page tag type corresponding to the coding block is the table type, setting the attribute value of a built-in flag to logical true, so as to mark the text position corresponding to the coding block as the start position of the table; and returning to the operation of traversing the coding blocks in the text encoding in sequence until the page tag type corresponding to the fetched coding block is a non-table type and is not null, and then marking the text position corresponding to that coding block as the end position of the table.
- The method for acquiring financial data according to claim 1, wherein extracting the field values and the table description information associated with the table comprises: creating a first-in-first-out (FIFO) queue; traversing the coding blocks in the text encoding in sequence, and obtaining the page tag type corresponding to the coding block currently being traversed; if the page tag type corresponding to the coding block is the paragraph type, storing the characters contained in the coding block into the FIFO queue in order, and reading the real-time queue length of the FIFO queue; if the real-time queue length of the FIFO queue is greater than a preset threshold, removing a plurality of the characters at the bottom of the FIFO queue, and returning to the operation of traversing the coding blocks in the text encoding in sequence and obtaining the page tag type corresponding to the coding block currently being traversed; and if the page tag type corresponding to the coding block is the table type, concatenating the characters in the FIFO queue and outputting the concatenation result as the table description information associated with the table.
- The method for acquiring financial data according to claim 3, wherein, if the page tag type corresponding to the coding block is the table type, concatenating the characters in the FIFO queue and outputting the concatenation result as the table description information associated with the table comprises: if the page tag type corresponding to the coding block is the table type, obtaining a regular expression associated with a preset keyword; checking each character string in the FIFO queue against the regular expression; if a character string matching the regular expression exists in the FIFO queue, outputting that character string as the table description information associated with the table; if no character string matching the regular expression exists in the FIFO queue, calculating, for each character string in the FIFO queue, the tag distance value between that character string and the table tag in the coding block to which it belongs; and outputting the character string with the smallest tag distance value as the table description information associated with the table.
- The method for acquiring financial data according to claim 1, wherein, after outputting the table description information and each field value to the pre-created text document, so that the business system obtains the financial data associated with the text to be analyzed after recognizing and processing the text document, the method further comprises: loading a report template and, according to the headers preset in the report template, importing each item of the financial data into the corresponding table body; and generating and displaying a financial data analysis report according to the import result.
- An apparatus for acquiring financial data, comprising: a first obtaining unit, configured to obtain a pre-published text to be analyzed, wherein the initial format of the text to be analyzed is the Portable Document Format (PDF); a conversion unit, configured to convert the text to be analyzed from the PDF format to the document (doc) format by using a preset text conversion tool; a second obtaining unit, configured to obtain, based on the text to be analyzed in the doc format, the text encoding corresponding to the text to be analyzed, wherein the text encoding contains multiple types of page tags; a search unit, configured to search the page tags for table tags and, according to the text position to which each table tag belongs, locate the tables present in the text to be analyzed; an extraction unit, configured to extract the field values and the table description information associated with the table; and an output unit, configured to output the table description information and each field value to a pre-created text document, so that a business system obtains the financial data associated with the text to be analyzed after recognizing and processing the text document.
- The apparatus for acquiring financial data according to claim 6, wherein the search unit comprises: a traversal subunit, configured to traverse the coding blocks in the text encoding in sequence; a judging subunit, configured to determine, for each coding block, whether the page tag type corresponding to that coding block is the table type; a marking subunit, configured to, if the page tag type corresponding to the coding block is the table type, set the attribute value of a built-in flag to logical true, so as to mark the text position corresponding to the coding block as the start position of the table; and a return subunit, configured to return to the operation of traversing the coding blocks in the text encoding in sequence until the page tag type corresponding to the fetched coding block is a non-table type and is not null, and then mark the text position corresponding to that coding block as the end position of the table.
- The apparatus for acquiring financial data according to claim 6, wherein the extraction unit comprises: a creating subunit, configured to create a first-in-first-out (FIFO) queue; an obtaining subunit, configured to traverse the coding blocks in the text encoding in sequence, and obtain the page tag type corresponding to the coding block currently being traversed; a storage subunit, configured to, if the page tag type corresponding to the coding block is the paragraph type, store the characters contained in the coding block into the FIFO queue in order, and read the real-time queue length of the FIFO queue; a removing subunit, configured to, if the real-time queue length of the FIFO queue is greater than a preset threshold, remove a plurality of the characters at the bottom of the FIFO queue, and return to the operation of traversing the coding blocks in the text encoding in sequence and obtaining the page tag type corresponding to the coding block currently being traversed; and a splicing subunit, configured to, if the page tag type corresponding to the coding block is the table type, concatenate the characters in the FIFO queue and output the concatenation result as the table description information associated with the table.
- The apparatus for acquiring financial data according to claim 8, wherein the splicing subunit is specifically configured to: if the page tag type corresponding to the coding block is the table type, obtain a regular expression associated with a preset keyword; check each character string in the FIFO queue against the regular expression; if a character string matching the regular expression exists in the FIFO queue, output that character string as the table description information associated with the table; if no character string matching the regular expression exists in the FIFO queue, calculate, for each character string in the FIFO queue, the tag distance value between that character string and the table tag in the coding block to which it belongs; and output the character string with the smallest tag distance value as the table description information associated with the table.
- The apparatus for acquiring financial data according to claim 6, further comprising: a loading unit, configured to load a report template and, according to the headers preset in the report template, import each item of the financial data into the corresponding table body; and a generating unit, configured to generate and display a financial data analysis report according to the import result.
- A terminal device, comprising a memory and a processor, wherein the memory stores computer-readable instructions executable on the processor, and the processor implements the following steps when executing the computer-readable instructions: obtaining a pre-published text to be analyzed, wherein the initial format of the text to be analyzed is the Portable Document Format (PDF); converting the text to be analyzed from the PDF format to the document (doc) format by using a preset text conversion tool; obtaining, based on the text to be analyzed in the doc format, the text encoding corresponding to the text to be analyzed, wherein the text encoding contains multiple types of page tags; searching the page tags for table tags and, according to the text position to which each table tag belongs, locating the tables present in the text to be analyzed; extracting the field values and the table description information associated with the table; and outputting the table description information and each field value to a pre-created text document, so that a business system obtains the financial data associated with the text to be analyzed after recognizing and processing the text document.
- The terminal device according to claim 11, wherein searching the page tags for table tags and, according to the text position to which each table tag belongs, locating the tables present in the text to be analyzed comprises: traversing the coding blocks in the text encoding in sequence; determining, for each coding block, whether the page tag type corresponding to that coding block is the table type; if the page tag type corresponding to the coding block is the table type, setting the attribute value of a built-in flag to logical true, so as to mark the text position corresponding to the coding block as the start position of the table; and returning to the operation of traversing the coding blocks in the text encoding in sequence until the page tag type corresponding to the fetched coding block is a non-table type and is not null, and then marking the text position corresponding to that coding block as the end position of the table.
- The terminal device according to claim 11, wherein extracting the field values and the table description information associated with the table comprises: creating a first-in-first-out (FIFO) queue; traversing the coding blocks in the text encoding in sequence, and obtaining the page tag type corresponding to the coding block currently being traversed; if the page tag type corresponding to the coding block is the paragraph type, storing the characters contained in the coding block into the FIFO queue in order, and reading the real-time queue length of the FIFO queue; if the real-time queue length of the FIFO queue is greater than a preset threshold, removing a plurality of the characters at the bottom of the FIFO queue, and returning to the operation of traversing the coding blocks in the text encoding in sequence and obtaining the page tag type corresponding to the coding block currently being traversed; and if the page tag type corresponding to the coding block is the table type, concatenating the characters in the FIFO queue and outputting the concatenation result as the table description information associated with the table.
- The terminal device according to claim 13, wherein, if the page tag type corresponding to the coding block is the table type, concatenating the characters in the FIFO queue and outputting the concatenation result as the table description information associated with the table comprises: if the page tag type corresponding to the coding block is the table type, obtaining a regular expression associated with a preset keyword; checking each character string in the FIFO queue against the regular expression; if a character string matching the regular expression exists in the FIFO queue, outputting that character string as the table description information associated with the table; if no character string matching the regular expression exists in the FIFO queue, calculating, for each character string in the FIFO queue, the tag distance value between that character string and the table tag in the coding block to which it belongs; and outputting the character string with the smallest tag distance value as the table description information associated with the table.
- The terminal device according to claim 11, wherein the processor further implements the following steps when executing the computer-readable instructions: loading a report template and, according to the headers preset in the report template, importing each item of the financial data into the corresponding table body; and generating and displaying a financial data analysis report according to the import result.
- A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by at least one processor, implement the following steps: obtaining a pre-published text to be analyzed, wherein the initial format of the text to be analyzed is the Portable Document Format (PDF); converting the text to be analyzed from the PDF format to the document (doc) format by using a preset text conversion tool; obtaining, based on the text to be analyzed in the doc format, the text encoding corresponding to the text to be analyzed, wherein the text encoding contains multiple types of page tags; searching the page tags for table tags and, according to the text position to which each table tag belongs, locating the tables present in the text to be analyzed; extracting the field values and the table description information associated with the table; and outputting the table description information and each field value to a pre-created text document, so that a business system obtains the financial data associated with the text to be analyzed after recognizing and processing the text document.
- The computer-readable storage medium according to claim 16, wherein searching the page tags for table tags and, according to the text position to which each table tag belongs, locating the tables present in the text to be analyzed comprises: traversing the coding blocks in the text encoding in sequence; determining, for each coding block, whether the page tag type corresponding to that coding block is the table type; if the page tag type corresponding to the coding block is the table type, setting the attribute value of a built-in flag to logical true, so as to mark the text position corresponding to the coding block as the start position of the table; and returning to the operation of traversing the coding blocks in the text encoding in sequence until the page tag type corresponding to the fetched coding block is a non-table type and is not null, and then marking the text position corresponding to that coding block as the end position of the table.
- The computer-readable storage medium according to claim 16, wherein extracting each field value and the table description information associated with the table comprises:
  creating a first-in-first-out (FIFO) queue;
  sequentially traversing each coding block in the text encoding, and obtaining the page tag type corresponding to the coding block currently being traversed;
  if the page tag type corresponding to the coding block is a paragraph type, storing the characters contained in the coding block into the FIFO queue in order, and reading the real-time queue length of the FIFO queue;
  if the real-time queue length of the FIFO queue is greater than a preset threshold, removing a plurality of the characters at the bottom of the FIFO queue, and returning to the operation of sequentially traversing each coding block in the text encoding and obtaining the page tag type corresponding to the coding block currently being traversed;
  if the page tag type corresponding to the coding block is a table type, concatenating the characters in the FIFO queue, and outputting the concatenation result as the table description information associated with the table.
- The computer-readable storage medium according to claim 18, wherein, if the page tag type corresponding to the coding block is a table type, concatenating the characters in the FIFO queue and outputting the concatenation result as the table description information associated with the table comprises:
  if the page tag type corresponding to the coding block is a table type, obtaining a regular expression associated with a preset keyword;
  checking each character string in the FIFO queue against the regular expression;
  if a character string matching the regular expression exists in the FIFO queue, outputting that character string as the table description information associated with the table;
  if no character string matching the regular expression exists in the FIFO queue, calculating, for each character string in the FIFO queue, a tag distance value between that character string and the table tag in the coding block to which it belongs;
  outputting the character string with the smallest tag distance value as the table description information associated with the table.
- The computer-readable storage medium according to claim 16, wherein the computer-readable instructions, when executed by at least one processor, further implement the following steps:
  loading a report template, and importing each item of the financial data into the corresponding table body according to the headers preset in the report template;
  generating and displaying a financial data analysis report according to the import result.
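The traversal steps recited in the claims above (locating a table via a flag bit, buffering preceding paragraph characters in a FIFO queue, and choosing a description by regular expression with a nearest-string fallback) can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The block model (`(tag_type, text)` tuples with the tags `"p"`, `"tbl"`, and `None`), all function names, the default pattern, and the threshold value are hypothetical. The sketch also makes three assumptions the claims leave open: the "bottom" of the queue is its oldest end, the queue is cleared once a description is emitted, and the string with the smallest tag distance is the one queued last.

```python
import re
from collections import deque

# Hypothetical block model: each coding block is (tag_type, text), where
# tag_type is "p" (paragraph), "tbl" (table), or None (null tag).

def locate_tables(blocks):
    """Return (start, end) block indices of each table; end is exclusive."""
    spans = []
    in_table = False  # plays the role of the built-in flag bit
    start = None
    for i, (tag, _text) in enumerate(blocks):
        if tag == "tbl":
            if not in_table:
                in_table, start = True, i  # mark the start position
        elif in_table and tag is not None:
            # First non-table, non-null block marks the end position.
            spans.append((start, i))
            in_table = False
    if in_table:  # assumption: a table may run to the end of the text
        spans.append((start, len(blocks)))
    return spans

def table_descriptions(blocks, max_len=24):
    """Collect the text buffered before each table as its description."""
    queue = deque()
    descriptions = []
    for tag, text in blocks:
        if tag == "p":
            queue.extend(text)  # store the characters in order
            while len(queue) > max_len:
                queue.popleft()  # assumption: drop the oldest characters
        elif tag == "tbl" and queue:
            descriptions.append("".join(queue))
            queue.clear()  # assumption: start fresh for the next table
    return descriptions

def pick_description(strings, pattern=r"Table\s*\d+"):
    """Prefer a string matching the keyword pattern, else the nearest one."""
    for s in strings:
        if re.search(pattern, s):
            return s
    # Assumption: the last queued string has the smallest tag distance.
    return strings[-1] if strings else None

blocks = [
    ("p", "Intro"),
    ("p", "Table 1: Revenue by year"),
    ("tbl", "..."),
    ("tbl", "..."),
    (None, ""),
    ("p", "After"),
    ("tbl", "x"),
]
spans = locate_tables(blocks)
descriptions = table_descriptions(blocks)
```

With the sample `blocks`, adjacent table blocks are treated as one table and the null block does not end it. A real converter would expose page tags in its own format, so the tag names would need to be mapped accordingly.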
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810600697.4A CN109062874B (en) | 2018-06-12 | 2018-06-12 | Financial data acquisition method, terminal device and medium |
CN201810600697.4 | 2018-06-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019237540A1 (en) | 2019-12-19 |
Family
ID=64820303
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2018/105532 WO2019237540A1 (en) | 2018-06-12 | 2018-09-13 | Method and device for acquiring financial data, terminal device, and medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109062874B (en) |
WO (1) | WO2019237540A1 (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401058A (en) * | 2020-03-12 | 2020-07-10 | 广州大学 | Attribute value extraction method and device based on named entity recognition tool |
CN111476015A (en) * | 2020-04-10 | 2020-07-31 | 北京字节跳动网络技术有限公司 | A document processing method, device, electronic device and storage medium |
CN111538750A (en) * | 2020-06-24 | 2020-08-14 | 深圳壹账通智能科技有限公司 | Information restoration method and device, computer system and readable storage medium |
CN111562965A (en) * | 2020-04-27 | 2020-08-21 | 深圳木成林科技有限公司 | Page data verification method and device based on decision tree |
CN112100366A (en) * | 2020-09-17 | 2020-12-18 | 广联达科技股份有限公司 | Pavement structure layer display method and device, computer equipment and storage medium |
CN112214987A (en) * | 2020-09-08 | 2021-01-12 | 深圳价值在线信息科技股份有限公司 | Information extraction method, extraction device, terminal equipment and readable storage medium |
CN112434096A (en) * | 2020-11-30 | 2021-03-02 | 上海天旦网络科技发展有限公司 | Service analysis system and method based on intelligent label |
CN112597353A (en) * | 2020-12-18 | 2021-04-02 | 武汉大学 | Automatic text information extraction method |
CN113312053A (en) * | 2020-02-27 | 2021-08-27 | 北京沃东天骏信息技术有限公司 | Data processing method and device |
CN113342811A (en) * | 2021-05-31 | 2021-09-03 | 中国工商银行股份有限公司 | HBase table data processing method and device |
CN113761044A (en) * | 2021-08-30 | 2021-12-07 | 上海快确信息科技有限公司 | Labeling system method for labeling text into table |
CN113822030A (en) * | 2020-10-19 | 2021-12-21 | 北京沃东天骏信息技术有限公司 | Data export method, device and storage medium |
CN113872963A (en) * | 2021-09-26 | 2021-12-31 | 中水北方勘测设计研究有限责任公司 | Message protocol rapid analysis method and system based on free label splicing technology |
CN113962328A (en) * | 2021-11-12 | 2022-01-21 | 上海冰鉴信息科技有限公司 | Data comparison analysis method, device and equipment |
CN114692792A (en) * | 2022-03-22 | 2022-07-01 | 深圳市利和兴股份有限公司 | Makeup radio frequency identification testing platform |
CN115545008A (en) * | 2022-11-29 | 2022-12-30 | 明度智云(浙江)科技有限公司 | Spectrogram file analyzing method, device, equipment and storage medium |
CN117010349A (en) * | 2023-09-28 | 2023-11-07 | 杭州今元标矩科技有限公司 | Form filling method, system and storage medium based on neural network model |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109871524B (en) * | 2019-02-21 | 2023-06-09 | 腾讯科技(深圳)有限公司 | Chart generation method and device |
CN110263311B (en) * | 2019-05-22 | 2024-07-05 | 中国平安财产保险股份有限公司 | Method and device for generating network page |
CN110334331A (en) * | 2019-05-30 | 2019-10-15 | 重庆金融资产交易所有限责任公司 | Method, apparatus and computer equipment based on order models screening table |
CN110188107B (en) * | 2019-06-05 | 2020-05-01 | 中科鼎富(北京)科技发展有限公司 | Method and device for extracting information from table |
CN110297905A (en) * | 2019-06-27 | 2019-10-01 | 郑州铁路职业技术学院 | A kind of computer system for economic management analysis data |
CN110909112B (en) * | 2019-10-18 | 2022-08-26 | 深圳价值在线信息科技股份有限公司 | Data extraction method, device, terminal equipment and medium |
CN110909123B (en) * | 2019-10-23 | 2023-08-25 | 深圳价值在线信息科技股份有限公司 | Data extraction method and device, terminal equipment and storage medium |
CN112287660B (en) * | 2019-12-04 | 2024-05-31 | 上海柯林布瑞信息技术有限公司 | Table analysis method and device in PDF file, computing equipment and storage medium |
CN111027285B (en) * | 2019-12-17 | 2023-06-16 | 南京上游软件有限公司 | Method and system for automatically extracting order information from pdf format order |
CN111367988A (en) * | 2020-03-31 | 2020-07-03 | 中国建设银行股份有限公司 | Data import method and device |
CN112035412B (en) * | 2020-08-31 | 2024-10-29 | 三六零数字安全科技集团有限公司 | Data file importing method, device, storage medium and apparatus |
CN112699637B (en) * | 2021-01-08 | 2024-04-12 | 中南大学 | Paragraph type recognition method and system and document structure recognition method and system |
CN112949476B (en) * | 2021-03-01 | 2023-09-29 | 苏州美能华智能科技有限公司 | Text relation detection method, device and storage medium based on graph convolution neural network |
CN113988011A (en) * | 2021-08-19 | 2022-01-28 | 中核核电运行管理有限公司 | Document content identification method and device |
CN113946664A (en) * | 2021-09-03 | 2022-01-18 | 杭州费尔斯通科技有限公司 | Method, system, apparatus and medium for generating table representation based on fields |
CN113963367B (en) * | 2021-10-22 | 2024-05-28 | 深圳前海环融联易信息科技服务有限公司 | Model-based financial transaction file and money extraction method |
CN114428839A (en) * | 2022-01-27 | 2022-05-03 | 北京百度网讯科技有限公司 | Data processing method, paragraph text determination device and electronic equipment |
CN117350264B (en) * | 2023-12-04 | 2024-02-23 | 税友软件集团股份有限公司 | PPT file generation method, device, equipment and storage medium |
CN117593752B (en) * | 2024-01-18 | 2024-04-09 | 星云海数字科技股份有限公司 | PDF document input method, PDF document input system, storage medium and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101976232A (en) * | 2010-09-19 | 2011-02-16 | 深圳市万兴软件有限公司 | Method for identifying data form in document and device thereof |
CN102855243A (en) * | 2011-06-28 | 2013-01-02 | 北大方正集团有限公司 | Method and device for extracting document structure |
CN106484663A (en) * | 2016-10-12 | 2017-03-08 | 天闻数媒科技(湖南)有限公司 | A kind of extracting method of document content and device |
CN106897690A (en) * | 2017-02-22 | 2017-06-27 | 南京述酷信息技术有限公司 | PDF table extracting methods |
CN107818075A (en) * | 2017-10-16 | 2018-03-20 | 平安科技(深圳)有限公司 | Form data structuring extracting method, electronic equipment and computer-readable recording medium |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1289994C (en) * | 2003-11-04 | 2006-12-13 | 北京华安天诚科技有限公司 | Handwritten flying data displaying and inputting apparatus and method for air communication control |
CN101360100B (en) * | 2008-09-16 | 2011-08-17 | 浙江汇信科技有限公司 | Digital signing, sealing and authenticating method for PDF document |
CN103198069A (en) * | 2012-01-06 | 2013-07-10 | 株式会社理光 | Method and device for extracting relational table |
US9536141B2 (en) * | 2012-06-29 | 2017-01-03 | Palo Alto Research Center Incorporated | System and method for forms recognition by synthesizing corrected localization of data fields |
CN103605349B (en) * | 2013-11-26 | 2017-11-14 | 厦门雅迅网络股份有限公司 | A kind of remote real-time data collection and analytic statistics system and method based on CAN bus |
CN104199975A (en) * | 2014-09-23 | 2014-12-10 | 中国南方电网有限责任公司 | Configurable WORD file structured extraction method |
CN105589841B (en) * | 2016-01-15 | 2018-03-30 | 同方知网(北京)技术有限公司 | A kind of method of PDF document Table recognition |
CN107689070B (en) * | 2017-08-31 | 2021-06-04 | 平安科技(深圳)有限公司 | Chart data structured extraction method, electronic device and computer-readable storage medium |
2018
- 2018-06-12 CN CN201810600697.4A patent/CN109062874B/en active Active
- 2018-09-13 WO PCT/CN2018/105532 patent/WO2019237540A1/en active Application Filing
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113312053A (en) * | 2020-02-27 | 2021-08-27 | 北京沃东天骏信息技术有限公司 | Data processing method and device |
CN111401058B (en) * | 2020-03-12 | 2023-05-02 | 广州大学 | Attribute value extraction method and device based on named entity recognition tool |
CN111401058A (en) * | 2020-03-12 | 2020-07-10 | 广州大学 | Attribute value extraction method and device based on named entity recognition tool |
CN111476015A (en) * | 2020-04-10 | 2020-07-31 | 北京字节跳动网络技术有限公司 | A document processing method, device, electronic device and storage medium |
CN111476015B (en) * | 2020-04-10 | 2024-01-05 | 北京字节跳动网络技术有限公司 | Document processing method and device, electronic equipment and storage medium |
CN111562965B (en) * | 2020-04-27 | 2024-01-05 | 深圳手回科技集团有限公司 | Page data verification method and device based on decision tree |
CN111562965A (en) * | 2020-04-27 | 2020-08-21 | 深圳木成林科技有限公司 | Page data verification method and device based on decision tree |
CN111538750A (en) * | 2020-06-24 | 2020-08-14 | 深圳壹账通智能科技有限公司 | Information restoration method and device, computer system and readable storage medium |
CN112214987A (en) * | 2020-09-08 | 2021-01-12 | 深圳价值在线信息科技股份有限公司 | Information extraction method, extraction device, terminal equipment and readable storage medium |
CN112214987B (en) * | 2020-09-08 | 2023-02-03 | 深圳价值在线信息科技股份有限公司 | Information extraction method, extraction device, terminal equipment and readable storage medium |
CN112100366A (en) * | 2020-09-17 | 2020-12-18 | 广联达科技股份有限公司 | Pavement structure layer display method and device, computer equipment and storage medium |
CN112100366B (en) * | 2020-09-17 | 2023-10-27 | 广联达科技股份有限公司 | Pavement structure layer display method and device, computer equipment and storage medium |
CN113822030A (en) * | 2020-10-19 | 2021-12-21 | 北京沃东天骏信息技术有限公司 | Data export method, device and storage medium |
CN112434096A (en) * | 2020-11-30 | 2021-03-02 | 上海天旦网络科技发展有限公司 | Service analysis system and method based on intelligent label |
CN112434096B (en) * | 2020-11-30 | 2023-05-23 | 上海天旦网络科技发展有限公司 | Intelligent tag-based service analysis system and method |
CN112597353B (en) * | 2020-12-18 | 2024-03-08 | 武汉大学 | Text information automatic extraction method |
CN112597353A (en) * | 2020-12-18 | 2021-04-02 | 武汉大学 | Automatic text information extraction method |
CN113342811A (en) * | 2021-05-31 | 2021-09-03 | 中国工商银行股份有限公司 | HBase table data processing method and device |
CN113761044A (en) * | 2021-08-30 | 2021-12-07 | 上海快确信息科技有限公司 | Labeling system method for labeling text into table |
CN113872963B (en) * | 2021-09-26 | 2023-09-29 | 中水北方勘测设计研究有限责任公司 | Method and system for rapidly analyzing message protocol based on free label splicing technology |
CN113872963A (en) * | 2021-09-26 | 2021-12-31 | 中水北方勘测设计研究有限责任公司 | Message protocol rapid analysis method and system based on free label splicing technology |
CN113962328A (en) * | 2021-11-12 | 2022-01-21 | 上海冰鉴信息科技有限公司 | Data comparison analysis method, device and equipment |
CN114692792B (en) * | 2022-03-22 | 2022-11-04 | 深圳市利和兴股份有限公司 | Makeup radio frequency identification testing platform |
CN114692792A (en) * | 2022-03-22 | 2022-07-01 | 深圳市利和兴股份有限公司 | Makeup radio frequency identification testing platform |
CN115545008B (en) * | 2022-11-29 | 2023-04-07 | 明度智云(浙江)科技有限公司 | Spectrogram file analyzing method, device, equipment and storage medium |
CN115545008A (en) * | 2022-11-29 | 2022-12-30 | 明度智云(浙江)科技有限公司 | Spectrogram file analyzing method, device, equipment and storage medium |
CN117010349A (en) * | 2023-09-28 | 2023-11-07 | 杭州今元标矩科技有限公司 | Form filling method, system and storage medium based on neural network model |
CN117010349B (en) * | 2023-09-28 | 2023-12-19 | 杭州今元标矩科技有限公司 | Form filling method, system and storage medium based on neural network model |
Also Published As
Publication number | Publication date |
---|---|
CN109062874A (en) | 2018-12-21 |
CN109062874B (en) | 2022-03-04 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
WO2019237540A1 (en) | Method and device for acquiring financial data, terminal device, and medium | |
CN108874928B (en) | Resume data information analysis processing method, device, equipment and storage medium | |
CN108932294B (en) | Resume data processing method, device, equipment and storage medium based on index | |
WO2019080402A1 (en) | Text information extraction method for structured text, storage medium and server | |
CN110909123B (en) | Data extraction method and device, terminal equipment and storage medium | |
US9817875B2 (en) | Methods and systems for automated data characterization and extraction | |
CN110851598A (en) | Text classification method and device, terminal equipment and storage medium | |
WO2019242125A1 (en) | Method and apparatus for acquiring upstream and downstream relationships between companies, terminal device and medium | |
CN104699785A (en) | Paper similarity detection method | |
CN115687655A (en) | PDF document-based knowledge graph construction method, system, equipment and storage medium | |
CN116127105A (en) | Data collection method and device for a big data platform | |
US20220215274A1 (en) | Explainable unsupervised vector representation of multi-section documents | |
CN110532449B (en) | Method, device, equipment and storage medium for processing service document | |
CN114743012A (en) | Text recognition method and device | |
US8977635B2 (en) | Device, method of processing data, and computer-readable recording medium | |
CN113033177B (en) | Method and device for analyzing electronic medical record data | |
CN113255369A (en) | Text similarity analysis method and device and storage medium | |
CN109740130B (en) | Method and device for generating file | |
CN111428497A (en) | A method, device and equipment for automatically extracting investment information | |
KR20200036333A (en) | Document analysis-based key element extraction system and method | |
CN110909112B (en) | Data extraction method, device, terminal equipment and medium | |
CN114115831A (en) | Data processing method, device, equipment and storage medium | |
CN110909538B (en) | Question and answer content identification method and device, terminal equipment and medium | |
CN114495138A (en) | Intelligent document identification and feature extraction method, device platform and storage medium | |
CN113722278A (en) | PDF file-based method, device and medium for extracting knowledge elements |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 18922559; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 18922559; Country of ref document: EP; Kind code of ref document: A1 |