CN114201620A - Method, apparatus and medium for mining PDF tables in PDF file - Google Patents

Info

Publication number
CN114201620A
Authority
CN
China
Prior art keywords
text information
pdf
target keyword
determining
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111554602.8A
Other languages
Chinese (zh)
Inventor
殷佳春
徐正昀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Suntime Information Technology Co ltd
Original Assignee
Shanghai Suntime Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Suntime Information Technology Co ltd
Priority to CN202111554602.8A
Publication of CN114201620A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure relate to a method, apparatus, and medium for mining PDF tables in a PDF file. The method includes: setting, for a PDF table, a target keyword and configuration information associated with the target keyword; parsing the PDF file to obtain text information in the PDF file; determining first text information based on the configuration information and the obtained text information; determining second text information based on the first text information and a predefined position of the target keyword in the first text information; verifying, based on extracted features of the PDF table, whether the second text information belongs to a feature row of the PDF table, so as to determine the feature row of the PDF table; determining a feature column of the PDF table based on the target keyword and the first text information; and obtaining the text information of the cells of the PDF table according to the determined feature row and feature column.

Description

Method, apparatus and medium for mining PDF tables in PDF file
Technical Field
Embodiments of the present disclosure relate generally to the field of data processing, and more particularly, to a method, computing device, and computer-readable storage medium for mining PDF tables in a PDF file.
Background
PDF (Portable Document Format) is an electronic document format developed by Adobe that is independent of the operating system platform. PDF is a fixed-layout format in which pages are relatively independent, so it can describe and display a document's layout accurately. However, PDF does not record the logical structure of the document. A solution for mining PDF tables in a PDF file is therefore required. Mining the PDF tables in a PDF file includes identifying the table structure of each PDF table and extracting table data from the identified table structure.
Conventional schemes for mining PDF tables include: (1) separately identifying the table lines and the table contents in a PDF table; and (2) extracting the table by an image processing method. In the first scheme, table line segments may be drawn by individual path operators and may appear as elements such as formula lines, vector diagrams, and turning characters, while the table contents include various types of characters that are often mixed with other contents of the layout and are not easy to mine. In the second scheme, table line segments must be recognized in an image in order to determine the table frame and extract the area inside the frame, and OCR is finally performed on the image of that area.
In conventional approaches for mining PDF tables, table identification relies on visible table line segments. If a table hides its line segments, or the line segments are rendered in an irregular manner, the conventional scheme often cannot identify the table accurately. Moreover, directly reading table contents is not very accurate when characters are mixed, superimposed, or offset.
In summary, conventional solutions for mining PDF tables in a PDF file have the following disadvantages: table identification must rely on line segments to identify cells, and the contents of the cells of the PDF table cannot be extracted accurately.
Disclosure of Invention
In view of the above problems, the present disclosure provides a method, a computing device, and a computer-readable storage medium for mining PDF tables in a PDF file, which can accurately extract complex table contents without identifying table line segments.
According to a first aspect of the present disclosure, there is provided a method for mining PDF tables in a PDF file, comprising: setting, for a PDF table, a target keyword and configuration information associated with the target keyword; parsing the PDF file to obtain text information in the PDF file; determining first text information based on the configuration information and the obtained text information; determining second text information based on the first text information and a predefined position of the target keyword in the first text information; verifying, based on extracted features of the PDF table, whether the second text information belongs to a feature row of the PDF table, so as to determine the feature row of the PDF table; determining a feature column of the PDF table based on the target keyword and the first text information; and obtaining the text information of the cells of the PDF table according to the determined feature row and feature column.
According to a second aspect of the present disclosure, there is provided a computing device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the disclosure.
In a third aspect of the present disclosure, a non-transitory computer readable storage medium is provided having stored thereon computer instructions for causing a computer to perform the method of the first aspect of the present disclosure.
In some embodiments, verifying whether the second text information belongs to a feature row of the PDF table comprises: if the second text information conforms to the features of the PDF table, determining that the second text information belongs to the feature row of the PDF table; and if the second text information does not conform to the features of the PDF table, adjusting the predefined position of the target keyword in the first text information to re-determine the second text information.
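The verify-then-adjust behaviour described above can be pictured as a simple retry loop. The following is a minimal sketch, not the disclosed implementation: the function name `find_feature_row`, the idea of passing each candidate keyword position as a callable, and the toy year check are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional


def find_feature_row(
    strategies: Iterable[Callable[[], list[str]]],
    is_feature_row: Callable[[list[str]], bool],
) -> Optional[list[str]]:
    """Try each candidate keyword position in turn until one yields a valid feature row."""
    for extract in strategies:
        candidate = extract()  # second text information for this assumed position
        if candidate and is_feature_row(candidate):
            return candidate
    return None  # no position produced a row that matches the table features


# Hypothetical check: every token starts with a four-digit year.
def looks_like_years(row: list[str]) -> bool:
    return all(token[:4].isdigit() for token in row)


# Two assumed positions: the first yields a unit note, the second the year row.
attempts = [lambda: ["unit: million"], lambda: ["2019A", "2020E"]]
found = find_feature_row(attempts, looks_like_years)
```

Passing the extraction strategies as callables keeps the loop independent of how each position is searched.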
In some embodiments, determining the second text information includes: determining a right abscissa, an upper ordinate, and a lower ordinate of the target keyword in a pixel coordinate system of the PDF file based on a predefined position of the target keyword in the first text information; determining, in the first text information, text information satisfying at least one of the following items as candidate text information: being located to the right of the right abscissa of the target keyword; having an upper ordinate that differs from the upper ordinate of the target keyword by a first threshold; having a lower ordinate that differs from the lower ordinate of the target keyword by a first threshold; and extracting, from the determined candidate text information, text information that is on the same page as the target keyword and is different from the target keyword as the second text information.
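As an illustration of searching to the right of the keyword, here is a minimal sketch. The `Box` type, the function name, and the tolerance value are assumptions. The sketch also treats the first threshold as a maximum allowed difference and requires a box to be both to the right and vertically aligned, which is only one possible reading of the "at least one" wording.

```python
from typing import NamedTuple


class Box(NamedTuple):
    text: str
    page: int
    x0: float      # left abscissa
    x1: float      # right abscissa
    top: float     # upper ordinate (y grows downward)
    bottom: float  # lower ordinate


def right_of_keyword(keyword: Box, boxes: list[Box], tol: float = 2.0) -> list[Box]:
    """Boxes on the keyword's page, to its right, whose top or bottom edge aligns within tol."""
    hits = []
    for b in boxes:
        if b.page != keyword.page or b.text == keyword.text:
            continue
        is_right = b.x0 >= keyword.x1
        aligned = abs(b.top - keyword.top) <= tol or abs(b.bottom - keyword.bottom) <= tol
        if is_right and aligned:
            hits.append(b)
    return sorted(hits, key=lambda b: b.x0)  # left-to-right reading order


# Illustrative coordinates only; they are not taken from fig. 1.
kw = Box("profit sheet", 1, 10.0, 60.0, 100.0, 112.0)
boxes = [
    Box("2020E", 1, 120.0, 150.0, 100.0, 112.4),
    Box("2019A", 1, 80.0, 110.0, 100.5, 112.0),
    Box("business income", 1, 10.0, 70.0, 120.0, 132.0),
    Box("2019A", 2, 80.0, 110.0, 100.0, 112.0),
]
row = [b.text for b in right_of_keyword(kw, boxes)]
```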
In some embodiments, determining the second text information includes: determining a left abscissa, a right abscissa, and a lower ordinate of the target keyword in a pixel coordinate system of the PDF file based on a predefined position of the target keyword in the first text information; determining, in the first text information, text information satisfying at least one of the following items as candidate text information: being located below the lower ordinate of the target keyword; having an upper ordinate that differs from the lower ordinate of the target keyword by a second threshold; having a left abscissa and a right abscissa whose span intersects the region from the left abscissa to the right abscissa of the target keyword; extracting, from the determined candidate text information, text information that is on the same page as the target keyword, is different from the target keyword, and does not conform to the features of the PDF table as intermediate characters; and determining the second text information based on the extracted intermediate characters.
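The downward search relies on two geometric tests: a vertical gap bounded by a threshold and an intersection of horizontal spans. A minimal sketch follows, with a hypothetical `Box` type and `max_gap` parameter. It combines the listed conditions with a logical AND, which is an assumption about how they are meant to interact.

```python
from typing import NamedTuple


class Box(NamedTuple):
    text: str
    page: int
    x0: float      # left abscissa
    x1: float      # right abscissa
    top: float     # upper ordinate (y grows downward)
    bottom: float  # lower ordinate


def overlaps_horizontally(a: Box, b: Box) -> bool:
    """True when the spans [x0, x1] of the two boxes intersect."""
    return a.x0 < b.x1 and b.x0 < a.x1


def below_keyword(keyword: Box, boxes: list[Box], max_gap: float = 20.0) -> list[Box]:
    """Boxes directly under the keyword: same page, small vertical gap, overlapping span."""
    hits = [
        b for b in boxes
        if b.page == keyword.page
        and b.text != keyword.text
        and 0 <= b.top - keyword.bottom <= max_gap
        and overlaps_horizontally(keyword, b)
    ]
    return sorted(hits, key=lambda b: b.top)


# Illustrative coordinates only.
kw = Box("balance sheet", 1, 10.0, 70.0, 200.0, 212.0)
boxes = [
    Box("unit: million", 1, 12.0, 68.0, 216.0, 226.0),  # just below, overlapping
    Box("2019A", 1, 90.0, 120.0, 216.0, 226.0),         # below, but no overlap
    Box("cash", 1, 10.0, 40.0, 300.0, 312.0),           # overlapping, but too far down
]
under = [b.text for b in below_keyword(kw, boxes)]
```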
In some embodiments, determining the second text information includes: determining the length, the width, the lower ordinate, and the right abscissa of the target keyword in a pixel coordinate system of the PDF file based on the predefined position of the target keyword in the first text information; determining, in the first text information, text information satisfying at least one of the following items as candidate text information: being located below the lower ordinate of the target keyword; being located to the right of the right abscissa of the target keyword; having an abscissa and an ordinate lying within a range formed by a third threshold multiple of the width of the target keyword and a fourth threshold multiple of the length of the target keyword; and extracting, from the determined candidate text information, text information that is on the same page as the target keyword, is different from the target keyword, and conforms to the features of the PDF table as the second text information.
In some embodiments, determining the second text information includes: determining an upper ordinate and a left abscissa of the target keyword in a pixel coordinate system of the PDF file based on a predefined position of the target keyword in the first text information; determining, in the first text information, text information satisfying at least one of the following items as candidate text information: being located above the upper ordinate of the target keyword; being located to the right of the right abscissa of the target keyword; having a left abscissa that differs from the left abscissa of the target keyword within a fifth threshold range; and extracting, from the determined candidate text information, text information that is on the same page as the target keyword, is different from the target keyword, conforms to the features of the PDF table, and has a field length within a sixth threshold range as the second text information.
In some embodiments, determining the second text information includes: sorting the determined second text information; verifying whether the sorted second text information contains a text character string; and if the sorted second text information contains a text character string, removing that text character string and all text located after it along the abscissa.
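The sorting and truncation step can be sketched as follows. The source does not define what counts as a "text character string", so this sketch assumes, purely for illustration, that it is any token containing no digit. The function name and the `(x, text)` input shape are likewise hypothetical.

```python
def truncate_at_text(items: list[tuple[float, str]]) -> list[str]:
    """Sort tokens by abscissa and cut the row at the first token that contains no digit."""
    ordered = sorted(items, key=lambda item: item[0])
    kept = []
    for _, text in ordered:
        if not any(ch.isdigit() for ch in text):
            break  # assumed "text character string": drop it and everything after it
        kept.append(text)
    return kept


# Illustrative input: a second table's name and year follow the first year row.
tokens = [(120.0, "2020E"), (80.0, "2019A"), (160.0, "balance sheet"), (200.0, "2019A")]
cleaned = truncate_at_text(tokens)
```

Cutting at the first non-numeric token keeps a neighbouring table's header from leaking into the row.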
In some embodiments, verifying whether the second text information belongs to a feature row of the PDF table comprises: determining, based on the features in the feature row of the PDF table, a regular expression expressing those features; verifying whether the second text information conforms to the regular expression; and if the second text information conforms to the regular expression, determining that the row in which the second text information is located is the feature row of the PDF table.
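For a year information row such as "2019A 2020E 2021E 2022E", the regular-expression check might look like the sketch below. The pattern itself is an assumption (a four-digit year with an optional A or E suffix for actual and estimated years); the disclosure does not give a concrete expression.

```python
import re

# Assumed pattern: years 1900-2099, optionally suffixed with A (actual) or E (estimated).
YEAR_RE = re.compile(r"^(?:19|20)\d{2}[AE]?$")


def is_year_row(tokens: list[str]) -> bool:
    """True when the row is non-empty and every token looks like a year label."""
    return bool(tokens) and all(YEAR_RE.match(t) for t in tokens)


ok = is_year_row(["2019A", "2020E", "2021E", "2022E"])
bad = is_year_row(["business income", "1,064"])
```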
In some embodiments, determining the feature column of the PDF table further comprises: determining a left abscissa and a right abscissa of the target keyword in a pixel coordinate system of the PDF file based on a predefined position of the target keyword in the first text information; determining, in the first text information, text information whose span from left abscissa to right abscissa intersects the region from the left abscissa to the right abscissa of the target keyword; extracting, from the determined text information, text information that is on the same page as the target keyword, is different from the target keyword, and does not include a preset special character; and determining the extracted text information as the feature column of the PDF table.
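A minimal sketch of the feature column step is given below. The `Box` type, the function name, and the set of special characters are assumptions chosen for illustration; the disclosure only says that the special characters are preset.

```python
from typing import NamedTuple


class Box(NamedTuple):
    text: str
    page: int
    x0: float      # left abscissa
    x1: float      # right abscissa
    top: float     # upper ordinate (y grows downward)
    bottom: float  # lower ordinate


SPECIAL_CHARS = set("%:*")  # hypothetical preset special characters


def feature_column(keyword: Box, boxes: list[Box]) -> list[Box]:
    """Boxes sharing the keyword's horizontal span, excluding those with special characters."""
    hits = [
        b for b in boxes
        if b.page == keyword.page
        and b.text != keyword.text
        and b.x0 < keyword.x1 and keyword.x0 < b.x1  # horizontal spans intersect
        and not (set(b.text) & SPECIAL_CHARS)
    ]
    return sorted(hits, key=lambda b: b.top)  # top-to-bottom order


# Illustrative coordinates only.
kw = Box("profit sheet", 1, 10.0, 60.0, 100.0, 112.0)
boxes = [
    Box("business cost", 1, 10.0, 65.0, 136.0, 148.0),
    Box("business income", 1, 10.0, 70.0, 120.0, 132.0),
    Box("growth %", 1, 10.0, 50.0, 152.0, 164.0),
    Box("1,064", 1, 80.0, 110.0, 120.0, 132.0),
]
labels = [b.text for b in feature_column(kw, boxes)]
```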
In some embodiments, obtaining the text information of the cells of the PDF table according to the determined feature row and feature column further comprises: determining vertical axis coordinates and horizontal axis coordinates of the PDF table based on the determined left abscissa and right abscissa of the feature row of the PDF table and the upper ordinate and lower ordinate of the feature column of the PDF table; and obtaining the cells of the PDF table based on the vertical axis coordinates and horizontal axis coordinates of the PDF table.
In some embodiments, obtaining the cells of the PDF table based on its vertical axis coordinates and horizontal axis coordinates comprises: determining the left abscissas, right abscissas, upper ordinates, and lower ordinates of all text information in the feature row and the feature column of the PDF table; and obtaining the text information of a cell of the PDF table, wherein the text information satisfies at least one of the following items: having a left abscissa that differs from the left abscissas of the text information of the feature row by a seventh threshold; having a right abscissa that differs from the right abscissas of the text information of the feature row by a seventh threshold; having an upper ordinate that differs from the upper ordinates of the text information of the feature column by an eighth threshold; having a lower ordinate that differs from the lower ordinates of the text information of the feature row by an eighth threshold.
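The cell lookup can be illustrated as matching a value box against the feature row horizontally and against the feature column vertically. In this sketch the `Box` type, the function name, and both tolerances are assumptions. The thresholds are treated as maximum allowed differences, and the sketch simplifies by comparing vertical positions with the feature column only.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    text: str
    page: int
    x0: float      # left abscissa
    x1: float      # right abscissa
    top: float     # upper ordinate (y grows downward)
    bottom: float  # lower ordinate


def locate_cell(box: Box, header_row: list[Box], label_col: list[Box],
                x_tol: float = 3.0, y_tol: float = 3.0) -> Optional[tuple[str, str]]:
    """Return (row label, column header) for a value box, or None if it fits no cell."""
    col = next((h.text for h in header_row
                if abs(box.x0 - h.x0) <= x_tol or abs(box.x1 - h.x1) <= x_tol), None)
    row = next((l.text for l in label_col
                if abs(box.top - l.top) <= y_tol or abs(box.bottom - l.bottom) <= y_tol), None)
    return (row, col) if row is not None and col is not None else None


# Illustrative coordinates only.
header_row = [Box("2019A", 1, 80.0, 110.0, 100.0, 112.0),
              Box("2020E", 1, 120.0, 150.0, 100.0, 112.0)]
label_col = [Box("business income", 1, 10.0, 70.0, 120.0, 132.0)]
value = Box("1,064", 1, 85.0, 110.0, 120.5, 132.0)   # right-aligned under 2019A
stray = Box("note", 1, 300.0, 330.0, 400.0, 412.0)   # outside the table
cell = locate_cell(value, header_row, label_col)
```

Matching on either edge lets the sketch handle both left-aligned and right-aligned columns.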
In some embodiments, the above method further comprises: determining the feature row as a year information row; determining the feature column as an index identification column; and determining text information that is located to the right of the index identification column and below the year information row, and that has the same row position information as an index identification, as the numerical value associated with the year information and that index identification.
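Once values are aligned with the year information row and the index identification column, they can be stored keyed by both. The sketch below assumes the alignment has already been done and that cells are numeric strings with thousands separators. The function name and the second sample value are invented for illustration.

```python
def to_records(years: list[str], rows: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Map (index identification, year) to a numeric value, given already-aligned rows."""
    records = {}
    for indicator, cells in rows.items():
        for year, cell in zip(years, cells):
            # Strip thousands separators and stray spaces before converting.
            records[(indicator, year)] = float(cell.replace(",", "").replace(" ", ""))
    return records


# "1, 064" mirrors the cell text quoted in the description; "1,200" is made up.
records = to_records(["2019A", "2020E"], {"business income": ["1, 064", "1,200"]})
```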
In some embodiments, the above method further comprises: constructing an organization key feature array for a plurality of organizations associated with PDF tables, the organization key feature array comprising: the number of key features associated with each organization, the key features, and the weights corresponding to the key features; searching, based on the organization key feature array, the text information extracted from the PDF table so as to determine the number of occurrences of the key features associated with each organization; and generating, based on the determined number of occurrences of the key features associated with each organization, an organization weight sequence for determining a target associated organization of the PDF table.
In some embodiments, determining the target associated organization of the PDF table further comprises: determining the organization corresponding to the maximum value in the organization weight sequence; determining whether the number of organizations corresponding to the maximum value is 1; in response to determining that the number of organizations corresponding to the maximum value is 1, determining that the organization corresponding to the maximum value is the target associated organization of the PDF table; and in response to determining that the number of organizations corresponding to the maximum value is greater than 1, determining that the target associated organization is not identified.
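The organization identification step can be sketched as a weighted keyword count followed by a unique-maximum check. The scoring rule (weight multiplied by occurrence count, summed per organization) is an assumption, since the text only says the sequence is generated from the occurrence counts. The organization names and key features below are hypothetical.

```python
from typing import Optional


def identify_org(text: str, org_features: dict[str, list[tuple[str, float]]]) -> Optional[str]:
    """Score each organization by weighted key feature occurrences; accept only a unique maximum."""
    scores = {
        org: sum(weight * text.count(feature) for feature, weight in features)
        for org, features in org_features.items()
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [org for org, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None  # a tie means "not identified"


# Hypothetical key feature array: organization -> [(key feature, weight), ...]
org_features = {
    "Broker A": [("Alpha Securities", 3.0), ("alpha.example", 1.0)],
    "Broker B": [("Beta Capital", 3.0)],
}
clear = identify_org("Alpha Securities research report, Alpha Securities", org_features)
tied = identify_org("Alpha Securities and Beta Capital", org_features)
```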
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements.
Fig. 1 shows an example diagram of a PDF form used according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a system 200 for implementing a method for mining PDF tables in a PDF file according to an embodiment of the present disclosure.
Fig. 3 shows a flow diagram of a method 300 for mining PDF tables in a PDF file according to an embodiment of the present disclosure.
Fig. 4 shows a flowchart of a method 400 of verifying whether the second text information belongs to a feature row of a PDF table according to an embodiment of the present disclosure.
Fig. 5 shows a flow diagram of a method 500 of determining second textual information, in accordance with an embodiment of the present disclosure.
Fig. 6 illustrates a schematic diagram of first text information including a target keyword according to an embodiment of the present disclosure.
Fig. 7 shows a flowchart of a method 700 of determining second textual information, according to an embodiment of the present disclosure.
Fig. 8 illustrates a schematic diagram of first text information including a target keyword according to an embodiment of the present disclosure.
Fig. 9 shows a flowchart of a method 900 of determining second textual information, according to an embodiment of the present disclosure.
Fig. 10 illustrates a schematic diagram of first text information including a target keyword according to an embodiment of the present disclosure.
Fig. 11 shows a flow diagram of a method 1100 of determining second textual information, in accordance with an embodiment of the present disclosure.
Fig. 12 illustrates a schematic diagram of first text information including a target keyword according to an embodiment of the present disclosure.
Fig. 13 shows a flow diagram of a method 1300 of determining second textual information, according to an embodiment of the disclosure.
Fig. 14 shows a flowchart of a method 1400 of verifying whether the second text information belongs to a feature row of a PDF table according to an embodiment of the present disclosure.
Fig. 15 shows a flow diagram of a method 1500 of determining a feature column of a PDF table according to an embodiment of the present disclosure.
Fig. 16 shows a flowchart of a method 1600 of obtaining cell information of a PDF table according to determined feature rows and feature columns according to an embodiment of the present disclosure.
Fig. 17 shows a flowchart of a method 1700 of obtaining cells of a PDF table based on the vertical axis coordinates and horizontal axis coordinates of the PDF table according to an embodiment of the present disclosure.
Fig. 18 shows a flow diagram of a method 1800 for mining PDF tables in a PDF file according to an embodiment of the present disclosure.
Fig. 19 shows a flowchart of a method 1900 for mining PDF tables in a PDF file according to an embodiment of the present disclosure.
Fig. 20 illustrates a flow diagram of a method 2000 for determining a target associated organization of a PDF table according to an embodiment of the present disclosure.
Fig. 21 shows a block diagram of an electronic device 2100, in accordance with an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As described above, conventional methods for mining PDF tables in a PDF file cannot mine tables whose line segments are hidden, or tables constructed in an irregular manner such as with fractal lines, vector diagrams, or turning characters. Conventional table mining schemes may also fail to extract table contents accurately when the table line segments are not accurately identified. Because PDF tables have no uniform format, table contents may contain mixed, overlapping, or offset characters, and reading the contents directly may scramble data across cells. Moreover, misread data in several cells may propagate to the mining of adjacent cells, ultimately causing mining errors or invalid processing of the entire table. This is particularly true of tables in the financial field, which contain large numbers of digits and whose cell separations are not apparent (e.g., profit sheets, balance sheets, cash flow sheets).
Fig. 1 shows an example diagram of a PDF table used according to an embodiment of the present disclosure. As shown in fig. 1, PDF tables encountered in routine processing often omit separator line segments for aesthetic or other reasons, and their data is not divided by distinct cell separator lines. In this case, although some existing PDF parsing tools can parse a PDF file into readable data, they cannot map the various types of data in a table onto the table's logical structure, so the data read out is scrambled. This is even more pronounced in data-heavy financial tables (profit sheets, balance sheets, cash flow sheets, etc.).
Taking the profit sheet and balance sheet of fig. 1 as an example, parsing them directly may present a number of problems. First, the table name (header) may appear in different locations, which may confuse table identification. In fig. 1, the table name of the profit sheet (the "profit sheet" text information) is next to the year information row, whereas the table name of the balance sheet (the "balance sheet" text information) lies below, one table away from the year information row. As a result, the profit sheet and the balance sheet, which are actually two separate tables, may be merged into one combined table. Second, the data of the profit sheet or balance sheet may not be matched to the corresponding actual year (2019A) and predicted years (2020E, 2021E, 2022E). Some PDF parsing tools can parse the table into cells such as "business cost", "2019A", and "1, 064", but they still cannot associate the cell "1, 064" with the year cell "2019A" above it and with the corresponding index cell "business income".
Because the upper and lower cells are not matched to each other, the data loses its associated context and becomes meaningless junk data. This situation may be exacerbated in tables that include more complex strings. For example, in a cell containing long data that wraps onto a new line, only part of the data may be read while the rest is lost.
To at least partially address one or more of the above problems, as well as other potential problems, example embodiments of the present disclosure propose a scheme for mining PDF tables in PDF files in which a plurality of table keywords at different positions are defined, so that a feature row of a PDF table is determined from the defined table keywords. The feature column may further be determined based on the feature row of the PDF table. After the feature row and feature column of the PDF table are determined, they may be combined into a table structure to extract the data in the table while preserving the actual meaning of each data item.
The present disclosure can locate each data cell from the determined feature row and feature column without relying on an explicit table structure or logical framework, thereby enabling PDF tables to be mined accurately into structured data, e.g., Excel spreadsheets, XML files, YAML files, etc.
In addition, the present disclosure also provides corresponding methods for further processing the mined data (e.g., deep data mining, organization identification, etc.), thereby improving the granularity of the mined data.
Fig. 2 shows a schematic diagram of a system 200 for implementing a method for mining PDF tables in a PDF file according to an embodiment of the present disclosure. As shown in fig. 2, the system 200 includes a computing device 210, a PDF file management device 230, and a network 240. The computing device 210 and the PDF file management device 230 may exchange data through the network 240 (e.g., the internet).
The PDF file management device 230 may perform, for example, general management of PDF files, such as collecting and storing PDF files. The PDF file management device 230 may also send the managed PDF files to the computing device 210. The PDF file management device 230 may be, for example and without limitation, a desktop computer, laptop computer, netbook computer, tablet computer, web browser, e-book reader, Personal Digital Assistant (PDA), wearable computer (such as a smart watch or activity tracker device), or the like that can read and modify PDF files. The PDF file management device 230 may be configured to store PDF files, send PDF files to the computing device 210 via the network 240, and receive PDF files processed by the computing device 210.
The computing device 210 is used, for example, to receive PDF files from the PDF file management device 230 via the network 240 and to mine PDF tables from the received PDF files. The computing device 210 may also determine a target associated organization of a PDF table based on the mined PDF table and determine a numerical value associated with the year data and the index identification. The computing device 210 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, as well as general-purpose processing units such as CPUs. Additionally, one or more virtual machines may run on each computing device 210. In some embodiments, the computing device 210 and the PDF file management device 230 may be integrated or may be provided separately from each other. In some embodiments, the computing device 210 includes, for example, a keyword setting unit 212, a PDF parsing unit 214, a first text determining unit 216, a second text determining unit 218, a second text verifying unit 220, a feature column determining unit 222, a text information acquiring unit 224, and an additional operating unit 226.
The keyword setting unit 212 may be configured to set a target keyword and configuration information associated with the target keyword with respect to the PDF table.
The PDF parsing unit 214 may be configured to parse a PDF file to obtain text information in the PDF file.
The first text determination unit 216 may be configured to determine the first text information based on the configuration information and the acquired text information.
The second text determining unit 218 may be configured to determine the second text information based on the predefined position of the target keyword in the first text information and the first text information.
The second text verification unit 220 may be configured to verify whether the second text information belongs to a feature row of the PDF table based on the extracted features of the PDF table to determine the feature row of the PDF table.
The feature column determination unit 222 may be configured to determine a feature column of the PDF table based on the target keyword.
The text information obtaining unit 224 may be configured to obtain the text information of the cells of the PDF table according to the determined feature rows and feature columns.
The additional operation unit 226 may be configured to perform additional operations such as determining a target associated organization of the PDF table and determining a numerical value associated with the year data and the index identification.
Units 212 to 224 may retrieve table text information in the PDF file based on the table keywords. After the associated table text information is preliminarily parsed, the coordinates of each piece of text information are determined in a plane coordinate system. The feature row of the table, such as the year information row, is determined based on the position of the table keyword. If the feature row is not found from the position of the table keyword, the position of the table keyword is adjusted until the feature row of the table is determined. Meanwhile, the feature column of the table can be determined according to the position of the table keyword. Once the feature row and feature column are obtained, they can be combined to obtain the position of each piece of text information, so that the PDF table in the PDF file is mined accurately.
The additional operation unit 226 may also perform various processing on the PDF tables mined by the units 212 to 224. The processing includes, but is not limited to, determining a target associated organization of the PDF table, determining a numerical value associated with the year data and the index identification, and the like. After the above processing is completed for the PDF file, the processed PDF table may be transmitted to the PDF file management device 230 via the network 240.
Some examples in the following will use the PDF form shown in fig. 1 as an example to illustrate the working principle of the technical solution of the present disclosure, however, it is understood that the PDF form to which the present disclosure is applicable may be represented in many different forms, and is not limited to the representation form of the form in fig. 1.
A method 300 for mining PDF tables in a PDF file is described below with reference to fig. 1 and 2. Fig. 3 illustrates various paths and orders for the purpose of presenting the working principle of the method for mining PDF tables in a PDF file as a whole, but some of the paths and orders are not necessary for implementing the following examples, and various methods according to the technical solution of the present disclosure may be performed in different orders and along different paths.
In the context of the present disclosure, text information may also be referred to as text blocks, text boxes, and the like, which represent characters or continuous strings of characters parsed by PDF text parsing tools commonly used in the art. The cell refers to a minimum constituent unit in a table structure of the PDF table, which includes text information.
In the context of the present disclosure, text information in a PDF file is located by pixel coordinates of the PDF file. The pixel coordinate values of the text information may be lateral coordinate values and longitudinal coordinate values in a coordinate system established with the pixels of the PDF file. In the field of PDF file processing, the vertex at the top left of a PDF file is usually used as the origin, the rightward direction is used as the horizontal axis, and the downward direction is used as the vertical axis.
It is understood that a similar plane coordinate system established with other points as origins on the basis of the corresponding coordinate conversion can also be applied to the method for mining the PDF table in the PDF file provided by the present disclosure.
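As an illustration of the coordinate conversion mentioned above, the following minimal sketch converts a bounding box expressed with a lower-left origin (vertical axis pointing up) into the top-left-origin convention used in this description. The function name `to_top_left` and the page height of 792 are assumptions made only for this example.

```python
def to_top_left(x0, y0, x1, y1, page_height):
    """Convert a box given with a lower-left origin (y grows upward)
    into (left, top, right, bottom) with a top-left origin (y grows downward)."""
    return (x0, page_height - y1, x1, page_height - y0)


# Hypothetical box on a page that is 792 units high.
box = to_top_left(72.0, 700.0, 200.0, 712.0, page_height=792.0)
print(box)  # (72.0, 80.0, 200.0, 92.0)
```

The horizontal coordinates are unchanged; only the vertical coordinates are mirrored about the page height, and the upper and lower edges swap roles.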
Fig. 3 shows a flow diagram of a method 300 for mining PDF tables in a PDF file according to an embodiment of the present disclosure. The method 300 may be performed by the computing device 210 as shown in FIG. 2, or may be performed at the electronic device 2100 as shown in FIG. 21. It should be understood that method 300 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the disclosure is not limited in this respect.
At step 302, the computing device 210 sets a target keyword and configuration information associated with the target keyword for a PDF table.
In some examples, the target keywords may be different types of keywords set by the user as desired. Taking financial forms as an example, the target keyword may be a form name such as "balance sheet", "profit sheet", "cash flow sheet", and the like. These keywords are typically associated with the PDF form desired by the user. However, the relative positions of these target keywords and the PDF tables are not fixed.
In some examples, the target keyword (table name) is typically to the relative left of the feature row (e.g., year information row) of the PDF table. In some examples, the target keyword (table name) is typically relatively above or relatively to the left of a feature row (e.g., year information row) of the PDF table. In some examples, the target keyword (table name) is typically relatively below or relatively to the left of a feature row (e.g., year information row) of the PDF table. Hereinafter, the above cases will be specifically described separately.
After setting the target keyword for the PDF table, the computing device 210 will set the associated configuration information for the target keyword. The configuration information is used for screening the PDF file, so that the omission of PDF tables is avoided, and meanwhile, the acquisition of noise information about the tables is avoided. The configuration information associated with the target keyword includes, but is not limited to, a positive filter name of the form name, a negative filter name, and other PDF form filter conditions. The positive filter name may be a name that should be retained similar to a PDF table name, while the negative filter name is a name that is similar to a PDF table name but should be culled. Taking the target keyword "balance sheet" as an example, the positive screening names may be "balance profile", "balance summary table", etc., and the negative screening names may be "balance sheet", "balance rate", etc. A variety of different configuration information may be set for one target keyword.
By setting the target keyword and the configuration information associated with the target keyword, the table area of the PDF table can be accurately identified in the following method and steps, and the table information in the PDF text can be retained as much as possible while table noise is eliminated.
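One possible way to represent a target keyword together with its configuration information is sketched below. The class name `KeywordConfig`, its fields, and the simple substring matching are assumptions for illustration only; the disclosure does not prescribe a data structure or a matching rule.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordConfig:
    """Hypothetical container for a target keyword and its screening names."""
    target_keyword: str
    positive_names: list = field(default_factory=list)
    negative_names: list = field(default_factory=list)

    def accepts(self, text: str) -> bool:
        # Negative screening names take priority: matching text is culled.
        if any(name in text for name in self.negative_names):
            return False
        # Otherwise keep text containing the keyword or a positive name.
        names = [self.target_keyword, *self.positive_names]
        return any(name in text for name in names)


config = KeywordConfig(
    "balance sheet",
    positive_names=["balance summary table"],
    negative_names=["balance rate"],
)
print(config.accepts("balance sheet (million yuan)"))  # True
print(config.accepts("balance rate"))                  # False
```

Several such configurations could be kept for one target keyword, for example one per report layout.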
At step 304, the computing device 210 parses the PDF file to obtain the text information in the PDF file.
In some examples, the computing device 210 may directly use or indirectly invoke PDF processing tools commonly used in the art to parse all the text information of the PDF file in which the PDF form is located for subsequent processing into the corresponding PDF form and cell content.
In some examples, to speed up parsing and retrieval of the text information in the PDF file, the computing device 210 may instead parse only the text information on the page of the PDF file where the target keyword is located, or only the text information within a predetermined range around the target keyword (e.g., within 300 pixels above and below the target keyword). The predetermined range may be adjusted by the user based on experience with table processing.
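The restriction to a range around the target keyword can be sketched as follows, assuming each parsed text block is a dictionary with hypothetical keys `page_number` and `top` (upper ordinate, top-left origin). The key names and the function `within_window` are illustrative assumptions, not part of any parsing library.

```python
def within_window(blocks, keyword_block, max_distance=300):
    """Keep blocks on the keyword's page whose upper edge lies within
    max_distance pixels above or below the keyword's upper edge."""
    return [
        b for b in blocks
        if b["page_number"] == keyword_block["page_number"]
        and abs(b["top"] - keyword_block["top"]) <= max_distance
    ]


keyword_block = {"text": "balance sheet", "page_number": 3, "top": 400}
blocks = [
    {"text": "2019A", "page_number": 3, "top": 420},
    {"text": "footer", "page_number": 3, "top": 780},
    {"text": "2019A", "page_number": 4, "top": 410},
]
print([b["text"] for b in within_window(blocks, keyword_block)])  # ['2019A']
```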
Common PDF processing tools used in the present disclosure may include any code, software, library files that can parse PDF text, such as software packages or software libraries invokable by Python, Java, etc. programming languages, including but not limited to PDFminer, Camelot, etc. software libraries.
Note that acquiring a plurality of text blocks within a table area using a processing tool commonly used in the field of PDF processing simply acquires the text content of the text blocks, i.e., characters or character strings recognized as available for processing. The processing tool does not identify the form logic of the PDF form, e.g., cannot identify associations between text content of multiple text blocks in the form.
By parsing the PDF file, the text information in the PDF file can be preliminarily obtained, which improves the speed and accuracy of extracting the feature rows and feature columns of the table and the text information of the corresponding cells in the subsequent steps.
In step 306, the computing device 210 determines first textual information based on the configuration information and the obtained textual information.
In some examples, based on the target keywords and configuration information determined in step 302 and the text information in the PDF file parsed in step 304, computing device 210 performs data pre-processing and extraction to determine the first text information.
The data preprocessing may be data cleaning based on user-defined preset conditions and preset characters. The preset conditions include blank line merging, same-page retention, and other text processing conditions. The preset characters may include preset exclusion characters and preset reserved characters. The preset exclusion characters include, for example, line breaks, spaces, and other meaningless characters. The preset reserved characters include, for example, numeric characters, alphabetic characters, and other characters having practical meaning.
The data preprocessing further includes retrieving the target keyword from the text information in the PDF file according to the configuration information. The retrieval may optionally include screening and retaining similar target keywords that match the positive screening names in the configuration information, while rejecting similar target keywords that match the negative screening names. After the processing in step 306, the computing device 210 obtains the first text information, which includes the target keyword (and similar target keywords).
By determining the first text information, preprocessing of the PDF table in the PDF file can be realized, so that the second text information of the PDF table can be determined more quickly and accurately.
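The cleaning part of the preprocessing can be sketched as below, assuming that the preset exclusion characters are all whitespace characters and that each text block is a dictionary with a `text` key. Both are assumptions for illustration; an actual configuration may exclude other characters.

```python
import re

# Assumed exclusion set: line breaks, spaces and other whitespace.
EXCLUDED = re.compile(r"\s+")


def clean_blocks(blocks):
    """Strip excluded characters and drop blocks that become empty."""
    cleaned = []
    for block in blocks:
        text = EXCLUDED.sub("", block["text"])
        if text:
            cleaned.append({**block, "text": text})
    return cleaned


raw = [{"text": " 2019A\n"}, {"text": "\n   "}, {"text": "1,234"}]
print(clean_blocks(raw))  # [{'text': '2019A'}, {'text': '1,234'}]
```

Removing every space suits text without word spacing; for text that relies on spaces between words, collapsing runs of whitespace to a single space may be a better choice.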
In step 308, the computing device 210 determines second textual information based on the predefined location of the target keyword in the first textual information and the first textual information.
In some examples, the user may set a default predefined location and a plurality of predefined locations for the target keyword in the first textual information. For example, the user may set the default predefined position of the target keyword at the upper left relative to the feature row (e.g., year information row) of the PDF table and set a plurality of predefined positions. The predefined locations include, but are not limited to, the front left of the target keyword relative to the feature row of the PDF table, the lower left of the target keyword relative to the feature row of the PDF table, the upper left of the feature row of the PDF table, and so on. The predefined location may be flexibly set by the user according to the characteristics of the PDF table and is not limited to the above-mentioned location.
The computing device 210 may retrieve the target keyword in the first text information to determine the second text information of the PDF form according to a possible predefined location of the target keyword. The second text information should be text information belonging to a feature line of the PDF table. For example, if the feature line of the PDF table is a year information line, the second text information should be one or more year characters or year character strings, e.g., 2020A, etc.
This step may have a variety of implementations depending on the variety of possible predefined locations of the target keyword. The manner of determining the second text information of the PDF table will be specifically described below by way of example.
By determining the second text information of the PDF form, possible feature rows of the PDF form can be found, so that the respective longitudinal position coordinates of the cells in the PDF form can be determined based on the feature rows.
In step 310, the computing device 210 verifies whether the second text information belongs to a feature row of the PDF table based on the extracted features of the PDF table to determine the feature row of the PDF table.
In some examples, the user may set different features for the extracted PDF form for verifying the feature rows of the PDF form. For example, in the case where the feature row of the PDF table is a year feature row, a regular expression of the year may be set to identify the year feature.
Based on the features of the PDF table, the computing device 210 may verify whether the second text information belongs to a feature row of the PDF table. For example, the computing device 210 may verify whether the extracted year character or character string (second text information) belongs to the year information row (feature row) of the PDF table. If the second text information belongs to a feature row of the PDF table, the row in which the second text information is located may be determined as the feature row of the PDF table. Otherwise, the computing device 210 returns to step 308 to readjust the predefined location of the target keyword and re-determine the second text information.
The manner of verifying whether the second text information belongs to the feature line of the PDF table will be specifically described below by way of example.
At step 312, the computing device 210 determines a feature column of the PDF table based on the target keyword and the first text information.
In some examples, since the predefined position of the target keyword is related to the feature column of the PDF table, the feature column of the PDF table is determined according to the determined predefined position of the target keyword and the acquired first text information. The feature column of the PDF table may be, for example, the first column of the table, i.e., the index identification column.
Based on the determined feature columns, respective lateral position coordinates of the cells in the PDF table may be determined.
The manner in which the feature columns of the PDF table are determined will be described in detail below.
At step 314, the computing device 210 obtains text information for the cells of the PDF table based on the determined feature rows and feature columns.
In some examples, based on the feature row determined in step 310 and the feature column determined in step 312, the horizontal and vertical coordinates of the cells of the PDF table may be determined.
By combining the horizontal coordinates and the vertical coordinates, and optionally adding a threshold adjustment range, the position and the size of the text information corresponding to each cell of the PDF table can be determined, so that the text information of the cell of the PDF table can be accurately acquired by the PDF file parsing tool.
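One possible way to combine the two sets of coordinates is sketched below. It assumes that the blocks of the feature row (e.g., year headers) supply the horizontal extent of each table column, that the blocks of the feature column (e.g., index identifications) supply the vertical extent of each table row, and that a block belongs to a cell when its center falls inside both extents, widened by a tolerance. The function `build_cells` and the keys `left`, `right`, `top`, `bottom` are hypothetical.

```python
def build_cells(header_blocks, label_blocks, blocks, tolerance=3):
    """Map (row label, column header) to the text of the block whose
    center lies in the header's x-range and the label's y-range."""
    cells = {}
    for label in label_blocks:
        for header in header_blocks:
            for b in blocks:
                cx = (b["left"] + b["right"]) / 2
                cy = (b["top"] + b["bottom"]) / 2
                in_column = header["left"] - tolerance <= cx <= header["right"] + tolerance
                in_row = label["top"] - tolerance <= cy <= label["bottom"] + tolerance
                if in_column and in_row:
                    cells[(label["text"], header["text"])] = b["text"]
    return cells


header = {"text": "2019A", "left": 100, "right": 130, "top": 10, "bottom": 20}
label = {"text": "Revenue", "left": 10, "right": 60, "top": 30, "bottom": 40}
value = {"text": "1,234", "left": 105, "right": 128, "top": 30, "bottom": 40}
print(build_cells([header], [label], [header, label, value]))
# {('Revenue', '2019A'): '1,234'}
```

The tolerance plays the role of the optional threshold adjustment range: it absorbs small misalignments between a header and the values beneath it.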
The manner of acquiring the text information of the cells of the PDF table will be described in detail below.
Fig. 4 shows a flow chart of a method 400 of verifying whether the second text information belongs to a feature line of a PDF table according to an embodiment of the present disclosure. The method 400 may be performed by the computing device 210 as shown in FIG. 2, or may be performed at the electronic device 2100 as shown in FIG. 21. Method 400 may correspond to step 310 in method 300.
At step 402, if the second text information conforms to a feature of the PDF table, the computing device 210 determines that the second text information belongs to a feature row of the PDF table.
In some examples, the computing device 210 checks the second text information against the features of the PDF table. If the second text information conforms to those features, it is determined to belong to the feature row of the PDF table. The features of the PDF table may be determined by the user from experience with PDF table processing. For example, for a financial table, the features of the PDF table may include a year information row. Thus, a feature may be defined for the year information row in a financial table, e.g., a regular expression for year information.
The user may design a regular expression for the year (e.g., the regular expression "^(19|20)\d{2}$", which matches every four-digit year from 1900 to 2099). Special regular expressions can also be configured to match the cells. A special year regular expression may match a year with a suffix letter, e.g., "2022E", where the letter "E" identifies 2022 as a predicted year.
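The year feature can be sketched with Python's `re` module as follows. The suffix pattern `[AE]?` is an assumption based on the examples "2019A" and "2022E" in this description; other suffix letters may be needed for other tables.

```python
import re

# Plain four-digit year beginning with 19 or 20.
YEAR = re.compile(r"^(19|20)\d{2}$")
# Assumed variant that also accepts an "A" or "E" suffix.
YEAR_WITH_SUFFIX = re.compile(r"^(19|20)\d{2}[AE]?$")


def is_year_cell(text):
    """Return True if the text looks like a year header cell."""
    return bool(YEAR_WITH_SUFFIX.match(text.strip()))


for sample in ["2020", "2019A", "2022E", "1899", "20221", "Revenue"]:
    print(sample, is_year_cell(sample))
```

Anchoring the pattern with `^` and `$` prevents longer numbers such as "20221" from being accepted as a year.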
If the second text information conforms to the feature (regular expression) of the PDF table, it is determined in step 310 of the method 300 that the second text information belongs to a feature row of the PDF table, and the method 300 proceeds to step 312.
At step 404, if the second text information does not conform to the features of the PDF table, the computing device 210 adjusts the predefined location of the target keyword in the first text information to re-determine the second text information.
In some examples, if the second text information does not conform to the features of the PDF table described above (e.g., the regular expression), the method 300 returns from step 310 to step 308 and readjusts the predefined location of the target keyword in the first text information. Since the target keyword has multiple different predefined locations in the first text information, the computing device 210 re-determines the second text information in step 308 for each predefined location in turn, iterating through the predefined locations until the second text information satisfies the features of the PDF table.
If none of the second text information obtained for the different predefined locations of the target keyword satisfies the features of the PDF table, the table is abandoned, and an error is reported or the table is handed over to manual processing.
By combining steps 402 and 404 of method 400 with step 308 of method 300, the position of the target keyword in the first text information may be iteratively determined, thereby determining the accurate second text information according to the determined position. By means of the second text information, the feature lines of the PDF form can be located.
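The interplay between step 308 and the verification in steps 402 and 404 can be sketched as a loop over candidate-finding strategies, one per predefined location. The function `find_feature_row`, the two toy strategies, and the block keys are all hypothetical and serve only to show the control flow.

```python
import re

YEAR = re.compile(r"^(19|20)\d{2}[AE]?$")  # assumed year feature


def find_feature_row(keyword, blocks, strategies):
    """Try each predefined location in order; return the first strategy
    name whose candidates pass the feature check, with those candidates."""
    for name, strategy in strategies:
        verified = [b for b in strategy(keyword, blocks) if YEAR.match(b["text"])]
        if verified:
            return name, verified
    return None, []  # no location worked: report an error or hand over


def same_row(k, bs):
    return [b for b in bs if abs(b["top"] - k["top"]) <= 5 and b["left"] > k["right"]]


def below(k, bs):
    return [b for b in bs if b["top"] > k["bottom"]]


keyword = {"text": "balance sheet", "left": 10, "right": 80, "top": 100, "bottom": 110}
blocks = [
    {"text": "2019A", "left": 120, "right": 150, "top": 130, "bottom": 140},
    {"text": "Revenue", "left": 10, "right": 60, "top": 150, "bottom": 160},
]
name, row = find_feature_row(keyword, blocks, [("same_row", same_row), ("below", below)])
print(name, [b["text"] for b in row])  # below ['2019A']
```

Returning the strategy name alongside the verified blocks lets later steps know which predefined location finally matched.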
As described in method 300, there are multiple different predefined locations of the target keyword in the first textual information, so in the next method 500-method 1100, a method is provided how to determine the second textual information by iterating through the different predefined locations of the target keyword.
Fig. 5 shows a flow diagram of a method 500 of determining second textual information, in accordance with an embodiment of the present disclosure. The method 500 may be performed by the computing device 210 as shown in FIG. 2, or may be performed at the electronic device 2100 as shown in FIG. 21. The method 500 may correspond to step 308 in the method 300.
In some examples, the position of the target keyword in the first text information may be predefined at the leftmost side with respect to the undetermined second text information or feature row, i.e., the second text information is on the same line as the target keyword.
In step 502, the computing device 210 determines a right abscissa, an upper ordinate, and a lower ordinate of the target keyword in a pixel coordinate system of the PDF file based on a predefined location of the target keyword in the first text information.
Fig. 6 illustrates a schematic diagram of first text information including a target keyword according to an embodiment of the present disclosure. As shown in fig. 6, in such a PDF table, the target keyword ("financial statement (million yuan)") is located at the leftmost side of the feature row (year information row) of the PDF table. The feature (year information) of the PDF table can be found to the right of the target keyword. By means of the features of the PDF table, the second text information of the PDF table (e.g., "2019A") may be determined. The computing device 210 may therefore determine the right abscissa, the upper ordinate, and the lower ordinate of the target keyword ("financial statement (million yuan)") in the pixel coordinate system of the PDF file. The second text information to the right of the target keyword can be determined by means of these three coordinates.
In step 504, the computing device 210 determines, among the first text information, text information satisfying at least one of the following as candidate text information: being located to the right of the right abscissa of the target keyword; having an upper ordinate that differs from the upper ordinate of the target keyword by no more than a first threshold; having a lower ordinate that differs from the lower ordinate of the target keyword by no more than a first threshold.
In some examples, since the target keyword is leftmost in the first text information, the computing device 210 may determine the second text information in the area to the right of the right abscissa of the target keyword. Since the second text information (e.g., year information) may be offset from the target keyword by a few pixels in the vertical axis direction, a pixel threshold range may be set on the vertical axis when determining the second text information. For example, the threshold range may be set to 5 pixels in the vertical axis direction. In that case, all text information whose upper ordinate is within 5 pixels of the upper ordinate of the target keyword and whose lower ordinate is within 5 pixels of the lower ordinate of the target keyword may be determined as candidate second text information and verified in the next step.
Note that the threshold range may be adjusted according to the size of the text information in the PDF table. For example, a pixel interval of 1 pixel, 3 pixels, or 1-10 pixels may be used as the threshold range.
In step 506, the computing device 210 extracts, as the second text information, the text information that is on the same page as the target keyword and is different from the target keyword among the determined candidate text information.
In some examples, computing device 210 filters the textual information determined at step 504 according to some other additional criteria. Other additional conditions may include that the second text information should be on the same page as the target keyword, that the text information in the second text information is different from the target keyword, etc.
Exemplary pseudo code for determining second textual information is provided herein.
if abs(pi['y0'] - table_key['y0']) <= 5 and abs(pi['y1'] - table_key['y1']) <= 5 \
        and pi['page_number'] == table_key['page_number'] \
        and pi['number'] != table_key['number'] \
        and pi['x1'] > table_key['x0']:
    candidates.append(pi)  # pi is kept as candidate second text information
In the pseudo code, parameters y0 and y1 represent the upper and lower vertical coordinates, parameters x0 and x1 represent the right and left horizontal coordinates, page_number represents the page number, number represents the length of text, table_key represents the target keyword, and pi represents the second text information to be matched.
The conditions in the pseudo code therefore correspond, respectively, to the following: the upper and lower ordinates of the second text information differ from the upper and lower ordinates of the target keyword by no more than the threshold; the second text information and the target keyword are on the same page; the second text information is different from the target keyword; and the right abscissa of the target keyword is smaller than the left abscissa of the second text information.
Note that the above is only pseudo code of the traversal operation, on the basis of which the traversal operation can be implemented in any programming language.
By performing the above steps and the conditions corresponding to the above steps, respectively, the computing device 210 may determine the second text information to the right of the target keyword by the position of the target keyword, and may then determine the feature line of the PDF table. The second text information is to be verified in the above verification step.
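A runnable variant of the same-line search is sketched below. To avoid ambiguity it uses the hypothetical keys `left`, `right`, `top`, and `bottom` instead of x0, x1, y0, and y1, and it treats `number` as a per-block identifier, which is an assumption made for this example.

```python
def same_row_candidates(keyword, blocks, tolerance=5):
    """Return blocks on the keyword's line that start to the right of it."""
    return [
        b for b in blocks
        if abs(b["top"] - keyword["top"]) <= tolerance
        and abs(b["bottom"] - keyword["bottom"]) <= tolerance
        and b["page_number"] == keyword["page_number"]
        and b["number"] != keyword["number"]
        and b["left"] > keyword["right"]
    ]


keyword = {"text": "financial statement (million yuan)", "number": 0, "page_number": 1,
           "left": 40, "right": 160, "top": 200, "bottom": 212}
blocks = [
    keyword,
    {"text": "2019A", "number": 1, "page_number": 1,
     "left": 200, "right": 230, "top": 201, "bottom": 213},
    {"text": "Revenue", "number": 2, "page_number": 1,
     "left": 40, "right": 90, "top": 220, "bottom": 232},
]
print([b["text"] for b in same_row_candidates(keyword, blocks)])  # ['2019A']
```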
Fig. 7 shows a flowchart of a method 700 of determining second textual information, according to an embodiment of the present disclosure. The method 700 may be performed by the computing device 210 as shown in FIG. 2, or may be performed at the electronic device 2100 as shown in FIG. 21. Method 700 may correspond to step 308 in method 300.
In some examples, the position of the target keyword in the first text information may be predefined above the undetermined second text information or feature row, i.e., the second text information is on a line below the target keyword.
In step 702, the computing device 210 determines the left abscissa, the right abscissa, and the lower ordinate of the target keyword in the pixel coordinate system of the PDF file based on the predefined location of the target keyword in the first text information.
Fig. 8 illustrates a schematic diagram of first text information including a target keyword according to an embodiment of the present disclosure. As shown in fig. 8, in such a PDF table, the target keyword ("financial statement and main financial ratio") is located above the PDF table feature row (year information row). The feature row of the PDF table can be found below the target keyword. By means of the features of the PDF table, it can be determined that the second text information (e.g., "2019A") of the PDF table is preceded by an intermediate character, also referred to below as the middle character ("financial statement (million yuan)"). The computing device 210 may therefore determine the left abscissa, right abscissa, and lower ordinate of the target keyword ("financial statement and main financial ratio") in the pixel coordinate system of the PDF file. By means of these three coordinates, the middle character below the target keyword can be determined, and the second text information following the middle character can then be found.
In step 704, the computing device 210 determines, among the first text information, text information satisfying at least one of the following as candidate text information: being located below the lower ordinate of the target keyword; having an upper ordinate that differs from the lower ordinate of the target keyword by no more than a second threshold; having a left-to-right abscissa range that intersects the left-to-right abscissa range of the target keyword.
In some examples, because the target keyword is above the feature row in the first text information, the computing device 210 may determine the middle character in the area below the lower ordinate of the target keyword, and then determine the second text information to the right of the middle character. In the horizontal axis direction, the abscissa range of the middle character (e.g., "financial statement (million yuan)") overlaps the abscissa range of the target keyword, while in the vertical axis direction the middle character may be offset from the target keyword by a certain number of pixels. The position of the middle character can therefore be determined from the horizontal axis coordinates together with a pixel threshold on the vertical axis coordinates. For example, the abscissa range of the middle character may intersect the abscissa range of the target keyword, and the upper ordinate of the middle character may lie within a certain threshold range of the lower ordinate of the target keyword, e.g., the two differ by less than 100 pixels. Once the middle character of the feature row is determined, the text information to the right of the middle character can be determined as the second text information and verified in the next step.
Note that the threshold range may be adjusted according to the size of the text information in the PDF table. For example, a difference of less than 50 pixels, a difference of less than 150 pixels, and a difference of less than 200 pixels may be used as the threshold range.
In step 706, the computing device 210 extracts, as intermediate characters, text information that is on the same page as the target keyword, that is different from the target keyword, and that does not conform to the characteristics of the PDF table, among the determined candidate text information.
In some examples, computing device 210 filters the textual information determined at step 704 according to some other additional criteria. Other additional conditions may include being on the same page as the target keyword, being different from the target keyword, and not conforming to a feature of the PDF table (e.g., a regular expression of year information), etc.
Exemplary pseudo code for determining an intermediate character is provided herein.
[Figure: pseudo code for determining the intermediate character]
In the pseudo code, parameters y0 and y1 represent the upper and lower vertical coordinates, parameters x0 and x1 represent the right and left horizontal coordinates, page_number represents the page number, number represents the length of text, and text represents the text content.
The conditions in the pseudo code therefore correspond, respectively, to the following: the interval between the left and right abscissas of the middle character intersects the interval between the left and right abscissas of the target keyword; the middle character and the target keyword are on the same page; the middle character is different from the target keyword; the middle character does not conform to the regular expression of year information (the feature of the PDF table); and the difference between the lower ordinate of the target keyword and the upper ordinate of the middle character is less than 100 pixels.
Note that the above is only pseudo code of the traversal operation, on the basis of which the traversal operation can be implemented in any programming language.
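The conditions listed above can be sketched in runnable form as follows. This is a reconstruction from the prose description only, not the pseudo code of the original publication. The function name `find_middle_blocks`, the keys `left`, `right`, `top`, `bottom`, the use of `number` as a block identifier, and the year pattern are all assumptions.

```python
import re

YEAR = re.compile(r"^(19|20)\d{2}[AE]?$")  # assumed year feature


def find_middle_blocks(keyword, blocks, max_gap=100):
    """Return blocks just below the keyword that overlap it horizontally
    and are not themselves year cells."""
    result = []
    for b in blocks:
        overlaps = b["left"] <= keyword["right"] and b["right"] >= keyword["left"]
        gap = b["top"] - keyword["bottom"]  # vertical distance below the keyword
        if (overlaps and 0 <= gap < max_gap
                and b["page_number"] == keyword["page_number"]
                and b["number"] != keyword["number"]
                and not YEAR.match(b["text"])):
            result.append(b)
    return result


keyword = {"text": "financial statement and main financial ratio", "number": 0,
           "page_number": 1, "left": 40, "right": 260, "top": 200, "bottom": 212}
blocks = [
    keyword,
    {"text": "financial statement (million yuan)", "number": 1, "page_number": 1,
     "left": 40, "right": 160, "top": 230, "bottom": 242},
    {"text": "2019A", "number": 2, "page_number": 1,
     "left": 200, "right": 230, "top": 230, "bottom": 242},
]
print([b["text"] for b in find_middle_blocks(keyword, blocks)])
# ['financial statement (million yuan)']
```

In this toy layout "2019A" also overlaps the keyword horizontally, and it is the year check that excludes it from the middle-character candidates.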
At step 708, the computing device 210 determines second text information based on the determined candidate intermediate characters.
By means of the middle character, the feature row of the PDF table can be located: the upper ordinate and the lower ordinate of the middle character delimit the feature row. The text information lying within the upper-to-lower ordinate interval of the middle character and to the right of its right abscissa may be determined as the second text information.
By performing the above steps under the corresponding conditions, the computing device 210 may use the position of the target keyword to determine the middle character below the target keyword, and then use the middle character to determine the second text information, thereby determining the feature row of the PDF table. The second text information is then verified in the verification step described above.
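The step from the middle character to the second text information can be sketched as follows, assuming blocks with the hypothetical keys `left`, `right`, `top`, `bottom`, and `page_number`. The small `tolerance` parameter is an added assumption to absorb slight vertical misalignment.

```python
def right_of_middle(middle, blocks, tolerance=2):
    """Return blocks inside the middle character's vertical band
    that start to the right of its right edge."""
    return [
        b for b in blocks
        if b["page_number"] == middle["page_number"]
        and b["left"] > middle["right"]
        and b["top"] >= middle["top"] - tolerance
        and b["bottom"] <= middle["bottom"] + tolerance
    ]


middle = {"text": "financial statement (million yuan)", "page_number": 1,
          "left": 40, "right": 160, "top": 230, "bottom": 242}
blocks = [
    middle,
    {"text": "2019A", "page_number": 1, "left": 200, "right": 230, "top": 230, "bottom": 242},
    {"text": "1,234", "page_number": 1, "left": 200, "right": 230, "top": 250, "bottom": 262},
]
print([b["text"] for b in right_of_middle(middle, blocks)])  # ['2019A']
```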
Fig. 9 shows a flowchart of a method 900 of determining second textual information, according to an embodiment of the present disclosure. The method 900 may be performed by the computing device 210 as shown in FIG. 2, or may be performed at the electronic device 2100 as shown in FIG. 21. Method 900 may correspond to step 308 in method 300.
In some examples, the position of the target keyword in the first text information may be predefined at the upper left relative to the undetermined second text information or feature line, i.e. the second text information is at the lower right position of the target keyword. In this case, however, the second text information can be determined directly without the aid of intermediate characters.
In step 902, the computing device 210 determines the length, width, lower ordinate, and right abscissa of the target keyword in the pixel coordinate system of the PDF file based on the predefined location of the target keyword in the first text information.
Fig. 10 illustrates a schematic diagram of first text information including a target keyword according to an embodiment of the present disclosure. As shown in fig. 10, in such a PDF table, the target keyword ("balance sheet (million yuan)") is located at the upper left of the PDF table feature row (year information row). The feature (year information) of the PDF table can be found at the lower right of the target keyword. By means of the features of the PDF table, the second text information of the PDF table (e.g., "2019A") may be determined. The computing device 210 may therefore determine the length, width, lower ordinate, and right abscissa of the target keyword ("balance sheet (million yuan)") in the pixel coordinate system of the PDF file. With these four values, a range for searching for the second text information at the lower right of the target keyword can be determined.
In step 904, the computing device 210 determines, among the first text information, text information satisfying at least one of the following as candidate text information: being located below the lower ordinate of the target keyword; being located to the right of the right abscissa of the target keyword; having an abscissa and an ordinate within a range defined by a third threshold multiple of the width of the target keyword and a fourth threshold multiple of the length of the target keyword.
In some examples, because the target keyword is at the top left in the first textual information, the computing device 210 may determine the second textual information in an area that is relatively below with respect to the lower ordinate of the target keyword and relatively to the right with respect to the right abscissa of the target keyword. The determination region may be determined according to a certain threshold multiple of the length and width of the target keyword. For example, the determination width of the determination region range may be a threshold of 2 times the width of the target keyword, and the determination length may be a threshold of 3 times the length of the target keyword. Taking the box in fig. 10 as an example, the determination region range is a rectangle centered on the target keyword, having a width 2 times the width of the target keyword and a length 3 times the length of the target keyword.
Note that the threshold determination range may be adjusted according to the size of the text information in the PDF table. For example, 2, 3, 5 different threshold multiples and combinations of threshold values may be used to determine the region range.
In step 906, the computing device 210 extracts, as the second text information, the text information of the regular expression that is on the same page as the target keyword, is different from the target keyword, and satisfies the characteristics of the PDF table, among the determined candidate text information.
In some examples, computing device 210 filters the textual information determined at step 904 according to some other additional criteria. Other additional conditions may include that the second text information should be on the same page as the target keyword, that the text information in the second text information is different from the target keyword, etc.
Exemplary pseudo code for determining second textual information is provided herein.
if pi['y0'] > pi3['y1'] \
    and pi['x1'] < pi3['x0'] \
    and pi['page_number'] == pi3['page_number'] \
    and pi['number'] != pi3['number'] \
    and pi['y0'] - pi3['y0'] <= (pi['y1'] - pi['y0']) * 2 \
    and pi3['x0'] - pi['x0'] <= (pi['x1'] - pi['x0']) * 3 \
    and len(re.findall(r'20[0-2][0-9]', pi3['text'].replace(' ', ''))) > 0
In the pseudo code, pi denotes the target keyword and pi3 denotes a candidate text item. The parameters y0 and y1 represent the lower and upper vertical coordinates, the parameters x0 and x1 represent the left and right horizontal coordinates, page_number represents the page number, number represents the sequence number of the text item, and text represents the text content.
Therefore, the conditions in the pseudo code respectively correspond to the following: the lower ordinate of the target keyword is larger than the upper ordinate of the second text information, and the right abscissa of the target keyword is smaller than the left abscissa of the second text information; the second text information and the target keyword are on the same page; the second text information is different from the target keyword; the second text information conforms to the regular expression of the feature of the PDF table; and the second text information is located within a range of 2 times the width and 3 times the length of the target keyword (the box in fig. 10).
Note that what has been shown above is only pseudo code of the traversal operation, on the basis of which the traversal operation can be implemented in any programming language.
By performing the above steps and the conditions corresponding to the above steps, respectively, the computing device 210 may determine the second text information at the lower right of the target keyword by the position of the target keyword, and may then determine the feature line of the PDF table. The second text information is to be verified in the above verification step.
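The conditions of method 900 can be sketched as a small runnable Python function. This is only an illustrative sketch, not the implementation of the disclosure: the function name `find_lower_right`, the parameter names, and the sample data are hypothetical, and the sketch assumes that each text item is a dictionary with the keys used in the pseudo code, with y increasing upward so that y0 is the lower ordinate and y1 the upper ordinate.

```python
import re

# Year pattern taken from the pseudo code above (matches 2000 to 2029).
YEAR_PATTERN = re.compile(r"20[0-2][0-9]")


def find_lower_right(keyword, items, height_multiple=2, width_multiple=3):
    """Return the items at the lower right of `keyword` that contain a year."""
    kw_height = keyword["y1"] - keyword["y0"]
    kw_width = keyword["x1"] - keyword["x0"]
    matches = []
    for item in items:
        text = item["text"].replace(" ", "")
        if (
            keyword["y0"] > item["y1"]      # item lies below the keyword
            and keyword["x1"] < item["x0"]  # item lies to the right of the keyword
            and item["page_number"] == keyword["page_number"]
            and item["number"] != keyword["number"]
            and keyword["y0"] - item["y0"] <= kw_height * height_multiple
            and item["x0"] - keyword["x0"] <= kw_width * width_multiple
            and YEAR_PATTERN.search(text)
        ):
            matches.append(item)
    return matches


# Made-up sample data for illustration only.
keyword = {"text": "balance sheet", "x0": 50, "x1": 150, "y0": 700, "y1": 712,
           "page_number": 1, "number": 0}
items = [
    {"text": "2019A", "x0": 200, "x1": 240, "y0": 680, "y1": 692,
     "page_number": 1, "number": 1},
    {"text": "revenue", "x0": 50, "x1": 90, "y0": 660, "y1": 672,
     "page_number": 1, "number": 2},
    {"text": "2020E", "x0": 260, "x1": 300, "y0": 680, "y1": 692,
     "page_number": 2, "number": 3},
]
found = find_lower_right(keyword, items)
```

In this sample only the first item is kept: the second is not to the right of the keyword and contains no year, and the third is on a different page.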
Fig. 11 shows a flow diagram of a method 1100 of determining second textual information, in accordance with an embodiment of the present disclosure. The method 1100 may be performed by the computing device 210 as shown in fig. 2, or may be performed at the electronic device 2100 as shown in fig. 21. Method 1100 may correspond to step 308 of method 300.
In some examples, the position of the target keyword in the first text information may be predefined at the lower left relative to the undetermined second text information or feature line, i.e. the second text information is at the upper right position of the target keyword.
In step 1102, the computing device 210 determines the upper ordinate and the right abscissa of the target keyword in the pixel coordinate system of the PDF file based on the predefined location of the target keyword in the first text information.
Fig. 12 illustrates a schematic diagram of first text information including a target keyword according to an embodiment of the present disclosure. As shown in fig. 12, in such a PDF table, the target keyword ("balance sheet") is located at the lower left of the feature row of the PDF table (the year information row). The feature of the PDF table (year information) can be found at the upper right of the target keyword. By means of this feature, the second text information of the PDF table (e.g., "2019A") may be determined. The computing device 210 may therefore determine the upper ordinate, left abscissa, and right abscissa of the target keyword ("balance sheet") in the pixel coordinate system of the PDF file. With these three values, the second text information can be determined at the upper right of the target keyword.
In step 1104, the computing device 210 determines, among the first text information, text information satisfying at least one of the following conditions: being located above the upper ordinate of the target keyword; being located to the right of the right abscissa of the target keyword; and having a left abscissa that differs from the left abscissa of the target keyword by no more than a fifth threshold range.
In some examples, because the target keyword is at the lower left in the first text information, the computing device 210 may determine the second text information in an area that is above the upper ordinate of the target keyword and to the right of the right abscissa of the target keyword. The left abscissa of the second text information differs from the left abscissa of the target keyword by no more than a threshold range of 0.4 times the width of the PDF file.
Note that the threshold range may be adjusted according to the size of the text information in the PDF table. For example, 0.1 times, 0.5 times, 1 times the PDF file width may be used as the threshold range.
In step 1106, the computing device 210 extracts, as the second text information, the text information which is on the same page as the target keyword, different from the target keyword, satisfies the characteristics of the PDF table, and has a field length within a sixth threshold range, from among the determined candidate text information.
In some examples, the computing device 210 filters the text information determined at step 1104 according to additional conditions. The additional conditions may include that the second text information should be on the same page as the target keyword, that the feature of the PDF table is satisfied (e.g., that a regular expression is satisfied), that the second text information is different from the target keyword, that the field length is within 8 characters, etc. Note that the sixth threshold may be adjusted according to the length of the text information. For example, 5 characters, representing the length of the year, or another number of characters, such as 6 characters, may be used as the sixth threshold.
Exemplary pseudo code for determining second textual information is provided herein.
if pi4['y0'] > pi['y1'] \
    and pi['x1'] < pi4['x0'] \
    and pi4['x0'] - pi['x0'] <= pi['page_width'] * 0.4 \
    and pi['page_number'] == pi4['page_number'] \
    and pi['number'] != pi4['number'] \
    and len(pi4['text']) <= 8 \
    and len(re.findall(r'20[0-2][0-9]', pi4['text'].replace(' ', ''))) > 0
In the pseudo code, pi denotes the target keyword and pi4 denotes a candidate text item. The parameters y0 and y1 represent the lower and upper vertical coordinates, the parameters x0 and x1 represent the left and right horizontal coordinates, page_number represents the page number, number represents the sequence number of the text item, and text represents the text content.
Therefore, the conditions in the pseudo code respectively correspond to the following: the lower ordinate of the second text information is larger than the upper ordinate of the target keyword; the right abscissa of the target keyword is smaller than the left abscissa of the second text information; the difference between the left abscissa of the second text information and the left abscissa of the target keyword is no more than 0.4 times the width of the PDF file; the second text information and the target keyword are on the same page; the second text information is different from the target keyword; the text length is no more than 8 characters; and the second text information conforms to the regular expression of the feature of the PDF table.
Note that what has been shown above is only pseudo code of the traversal operation, on the basis of which the traversal operation can be implemented in any programming language.
By performing the above steps and the conditions corresponding to the above steps, respectively, the computing device 210 may determine the second text information at the upper right of the target keyword by the position of the target keyword, and may then determine the feature line of the PDF table. The second text information is to be verified in the above verification step.
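The upper-right search of method 1100 can be sketched in the same way. The following is a hedged illustration rather than the implementation of the disclosure: the function name `find_upper_right`, its default parameters, and the sample data are hypothetical, and the sketch assumes dictionary-shaped text items with y increasing upward, as the comparisons in the pseudo code suggest.

```python
import re

# Year pattern taken from the pseudo code above (matches 2000 to 2029).
YEAR_PATTERN = re.compile(r"20[0-2][0-9]")


def find_upper_right(keyword, items, offset_ratio=0.4, max_length=8):
    """Return short year-like items located at the upper right of `keyword`."""
    max_offset = keyword["page_width"] * offset_ratio
    matches = []
    for item in items:
        if (
            item["y0"] > keyword["y1"]          # item lies above the keyword
            and keyword["x1"] < item["x0"]      # item lies to the right of the keyword
            and item["x0"] - keyword["x0"] <= max_offset
            and item["page_number"] == keyword["page_number"]
            and item["number"] != keyword["number"]
            and len(item["text"]) <= max_length
            and YEAR_PATTERN.search(item["text"].replace(" ", ""))
        ):
            matches.append(item)
    return matches


# Made-up sample data for illustration only.
keyword = {"text": "balance sheet", "x0": 50, "x1": 120, "y0": 600, "y1": 612,
           "page_number": 1, "number": 0, "page_width": 595}
items = [
    {"text": "2019A", "x0": 200, "x1": 240, "y0": 620, "y1": 632,
     "page_number": 1, "number": 1},
    {"text": "2021E", "x0": 400, "x1": 440, "y0": 620, "y1": 632,
     "page_number": 1, "number": 2},  # too far to the right (350 > 0.4 * 595)
    {"text": "Forecast 2020 results", "x0": 150, "x1": 300, "y0": 640, "y1": 652,
     "page_number": 1, "number": 3},  # longer than 8 characters
    {"text": "2018A", "x0": 200, "x1": 240, "y0": 580, "y1": 592,
     "page_number": 1, "number": 4},  # below the keyword
]
found = find_upper_right(keyword, items)
```

The horizontal offset limit and the length limit correspond to the fifth and sixth thresholds and are exposed as parameters so that they can be adjusted as described above.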
Fig. 13 shows a flow diagram of a method 1300 of determining second textual information, according to an embodiment of the disclosure. To avoid mining second text information from more than one table in the above process of determining the second text information, the determined second text information may optionally be processed.
In step 1302, the computing device 210 optionally sorts the determined second text information.
In some examples, the computing device 210 may sort the determined second text information. The sorting may use a sorting algorithm commonly used in the art, such as bubble sort or quicksort.
At step 1304, the computing device 210 verifies whether the sorted second text information contains a text character string.
In some examples, the computing device 210 may search the sorted second text information for a text character or a string of text characters. Generally, if the determined second text information belongs to only one table, each piece of second text information should satisfy the regular expression of the feature of the PDF table (year information), and no text characters such as Chinese characters should occur.
If a text character, such as a Chinese character, appears, the determined second text information is considered to include second text information of another table. The computing device 210 may then remove the text string and all text located after the abscissa of the text string.
In step 1306, if the extracted text information contains a text string, the computing device 210 removes the text string and all text located after the abscissa of the text string.
Note that, in the method 1300, the step 1304 and the step 1306 may be performed on the second text information without performing the step 1302, i.e., the sorting step.
With the method 1300, it can be ensured that the mining method for PDF tables only mines related individual tables and does not involve other tables.
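As an illustration of method 1300, the following Python sketch sorts the candidates and cuts the sequence at the first text string. It rests on assumptions that the disclosure does not state: the candidates are sorted by their left abscissa, and any item that is not a year (optionally followed by one letter) is treated as a text string. The name `keep_first_table` and the sample data are hypothetical.

```python
import re

# Assumed shape of a year cell: four digits, optionally one suffix letter.
YEAR_CELL = re.compile(r"(19|20)\d{2}[A-Za-z]?")


def keep_first_table(candidates):
    """Sort candidates by left abscissa and drop everything from the first
    non-year item onward, since it is assumed to start another table."""
    ordered = sorted(candidates, key=lambda item: item["x0"])
    kept = []
    for item in ordered:
        if not YEAR_CELL.fullmatch(item["text"].replace(" ", "")):
            break  # a text string was found: discard it and all items after it
        kept.append(item)
    return kept


# Made-up sample data: two tables side by side on the same line.
candidates = [
    {"text": "2020E", "x0": 260},
    {"text": "2019A", "x0": 200},
    {"text": "cash flow statement", "x0": 320},
    {"text": "2019A", "x0": 400},
]
first_table_years = keep_first_table(candidates)
```

Here the item at abscissa 400 is discarded together with the text string at abscissa 320, because it lies after the text string.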
Fig. 14 shows a flowchart of a method 1400 of verifying whether the second text information belongs to a feature line of a PDF table according to an embodiment of the present disclosure.
At step 1402, the computing device 210 determines a regular expression expressing features based on features in the feature rows of the PDF table.
In some examples, the feature row of the PDF table may be a year information row, so the feature may simply be year information. A regular expression can be set based on the year information. For example, the regular expression "^(19|20)\d{2}$", which matches all years from 1900 to 2099, may be used.
Special regular expressions can also be configured to match the cells. A special year regular expression may match a year with a suffix letter, e.g., "2022E" with the letter "E" is identified as predicted year 2022.
At step 1404, computing device 210 verifies whether the second text information conforms to a regular expression.
In some examples, computing device 210 may verify whether the second textual information belongs to a feature line according to the regular expression determined in step 1402.
In step 1406, if the second text information conforms to the regular expression, the row in which the second text information is located is determined to be the feature row of the PDF table.
If the second text information belongs to the feature row, the row in which the second text information is located is determined to be the feature row (year information row) of the PDF table. If the regular expression is not satisfied, the computing device 210 re-determines the second text information according to steps 308 and 312 of the method 300.
Through the method 1400, it may be verified whether the second text information determined in the previous step conforms to the feature of the PDF table, thereby determining the feature row of the PDF table.
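The verification of method 1400 can be illustrated with Python's standard `re` module. In this sketch the plain year expression follows the example given above, while the suffix expression (one uppercase letter after the year) is only an assumed form of the "special" regular expression; the function name `parse_year_cell` is hypothetical.

```python
import re

# Plain year, as in the example expression above (1900 to 2099).
YEAR_ONLY = re.compile(r"(19|20)\d{2}")
# Assumed form of a special expression: year followed by one uppercase letter.
YEAR_WITH_SUFFIX = re.compile(r"(?P<year>(19|20)\d{2})(?P<suffix>[A-Z])")


def parse_year_cell(text):
    """Return (year, suffix) if `text` is a year cell, otherwise None."""
    cleaned = text.replace(" ", "")
    if YEAR_ONLY.fullmatch(cleaned):
        return int(cleaned), None
    match = YEAR_WITH_SUFFIX.fullmatch(cleaned)
    if match:
        return int(match.group("year")), match.group("suffix")
    return None


plain = parse_year_cell("2019")
forecast = parse_year_cell("2022E")
not_a_year = parse_year_cell("operating cost")
```

A return value of None corresponds to the case in which the regular expression is not satisfied and the second text information has to be determined again.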
Fig. 15 shows a flow diagram of a method 1500 of determining a feature column of a PDF table according to an embodiment of the present disclosure.
In step 1502, the computing device 210 determines the left and right abscissas of the target keyword in the pixel coordinate system of the PDF file based on the predefined location of the target keyword in the first text information.
In some examples, the computing device 210 may determine the left and right abscissas of the target keyword in a pixel coordinate system of the PDF file based on a predefined location of the target keyword in the first text information. The left and right abscissas will be used to locate the feature columns of the PDF table.
At step 1504, computing device 210 determines, in the first text information, text information having a left abscissa and a right abscissa that intersect an interval from the left abscissa to the right abscissa of the target keyword.
In some examples, since the feature column of the PDF table generally overlaps the target keyword of the PDF table horizontally, text information whose interval from left abscissa to right abscissa intersects the interval from the left abscissa to the right abscissa of the target keyword may be defined as the text information of the feature column.
In step 1506, the computing device 210 extracts, among the determined text information, text information that is on the same page as the target keyword, that is different from the target keyword, and that does not include a predetermined special character.
In some examples, the computing device 210 filters the text information determined in step 1504 according to additional conditions. The additional conditions may include that the text information of the feature column should be on the same page as the target keyword, contain no numbers, contain no specific characters (e.g., "-"), and not be identical to the target keyword, etc.
Exemplary pseudo code for determining a feature column of a PDF table is provided herein.
if pi['page_number'] == col_name['page_number'] \
    and pi['number'] != col_name['number'] \
    and (
        (pi['x0'] <= col_name['x0'] and pi['x1'] >= col_name['x1'])
        or (pi['x0'] >= col_name['x0'] and pi['x1'] <= col_name['x1'])
        or (pi['x0'] <= col_name['x0'] and pi['x1'] < col_name['x1'] and pi['x1'] > col_name['x0'])
        or (pi['x0'] > col_name['x0'] and pi['x0'] < col_name['x1'] and pi['x1'] >= col_name['x1'])
        or (pi['x0'] >= col_name['x0'] and (pi['x1'] - col_name['x1']) <= 50)
    ) \
    and col_name['y0'] > pi['y1'] \
    and pi['text'].replace(' ', '') not in '-' \
    and (check_no_number(get_col_value_after(pi['text']).replace(' ', ''))
         and get_col_value_after(pi['text']).replace(' ', '') != '')
In the pseudo code, col_name denotes the target keyword and pi denotes a candidate text item. The parameters y0 and y1 represent the lower and upper vertical coordinates, the parameters x0 and x1 represent the left and right horizontal coordinates, page_number represents the page number, number represents the sequence number of the text item, and text represents the text content.
Therefore, the conditions in the pseudo code respectively correspond to the following: the abscissa interval of the text information of the feature column intersects the abscissa interval of the target keyword; the lower ordinate of the target keyword is larger than the upper ordinate of the text information of the feature column; the text information of the feature column and the target keyword are on the same page; the text information of the feature column contains no numbers; the text information of the feature column does not contain a special symbol (e.g., "-"); and the text information of the feature column is different from the target keyword.
Note that what has been shown above is only pseudo code of the traversal operation, on the basis of which the traversal operation can be implemented in any programming language.
At step 1508, the computing device 210 determines the extracted text information as a feature column of the PDF table.
In some examples, the computing device 210 determines the text information determined in step 1506 as a feature column of a PDF table in a PDF file. In the diagrams of fig. 4 and 8, the feature column of the PDF table may be the first column of the PDF table, that is, the index identification column. Note that the feature column determined in the method 1500 also includes therein the vertical coordinates of the respective text information.
The determined feature columns may be combined with the feature rows determined in the above steps to determine a PDF table.
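The determination of the feature column in method 1500 can be sketched as follows. This is a simplified illustration under stated assumptions: the four explicit overlap cases of the pseudo code are replaced by a single interval intersection test, the 50-pixel tolerance and the helper get_col_value_after are omitted, and all names (`intervals_intersect`, `contains_digit`, `find_feature_column`) and the sample data are hypothetical.

```python
def intervals_intersect(a0, a1, b0, b1):
    """True if the open intervals (a0, a1) and (b0, b1) overlap."""
    return a0 < b1 and b0 < a1


def contains_digit(text):
    return any(ch.isdigit() for ch in text)


def find_feature_column(keyword, items, special_chars=("-",)):
    """Return items below `keyword` whose abscissa interval intersects it."""
    column = []
    for item in items:
        text = item["text"].replace(" ", "")
        if (
            item["page_number"] == keyword["page_number"]
            and item["number"] != keyword["number"]
            and intervals_intersect(item["x0"], item["x1"],
                                    keyword["x0"], keyword["x1"])
            and keyword["y0"] > item["y1"]  # item lies below the keyword
            and text != ""
            and text not in special_chars
            and not contains_digit(text)
        ):
            column.append(item)
    return column


def make_item(text, x0, x1, y0, y1, number):
    return {"text": text, "x0": x0, "x1": x1, "y0": y0, "y1": y1,
            "page_number": 1, "number": number}


# Made-up sample data for illustration only.
keyword = make_item("balance sheet", 50, 150, 700, 712, 0)
items = [
    make_item("operating revenue", 50, 130, 680, 692, 1),
    make_item("1,064", 200, 240, 680, 692, 2),      # no horizontal overlap
    make_item("-", 60, 70, 660, 672, 3),            # special character
    make_item("net profit", 55, 100, 640, 652, 4),
    make_item("report title", 40, 200, 720, 732, 5),  # above the keyword
    make_item("2019 total", 50, 110, 620, 632, 6),    # contains digits
]
feature_column = find_feature_column(keyword, items)
```

Each returned item keeps its y0 and y1 values, so the vertical coordinates of the feature column remain available for locating the cells.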
Fig. 16 shows a flowchart of a method 1600 of obtaining cell information of a PDF table according to determined feature rows and feature columns according to an embodiment of the present disclosure.
In step 1602, the computing device 210 determines the vertical-axis coordinates and horizontal-axis coordinates of the PDF table based on the determined left and right abscissas of the feature row of the PDF table and the upper and lower ordinates of the feature column of the PDF table.
In some examples, the computing device 210 may obtain the size of the entire PDF table, i.e., the ordinate and abscissa of the PDF table, by calculating the difference between the left abscissa and the right abscissa of the text information in the feature row (e.g., the year information row) and the difference between the upper ordinate and the lower ordinate of the text information in the feature column (e.g., the index identification column).
In step 1604, the computing device 210 obtains cells of the PDF table based on the ordinate and abscissa of the PDF table.
In some examples, the computing device 210 may combine the left and right abscissas of the text information in the determined feature rows (e.g., year information rows) of the PDF table and the upper and lower ordinates of the text information of the feature columns (e.g., index identification columns). By the combination of the four coordinates, the size of each cell in the PDF table, that is, the ordinate of the vertical axis and the abscissa of the cell or text information of the PDF table, can be determined.
Fig. 17 shows a flowchart of a method 1700 of obtaining cells of a PDF table based on the ordinate of the vertical axis and the abscissa of the PDF table according to an embodiment of the present disclosure.
At step 1702, the computing device 210 determines the left abscissa, the right abscissa, the upper ordinate, and the lower ordinate of all text information in the feature rows and feature columns of the PDF table based on the feature rows and feature columns of the PDF table.
In some examples, computing device 210 may determine a left abscissa, a right abscissa, an upper ordinate, and a lower ordinate of the text information therein based on the feature rows (e.g., year information rows) and the feature columns (e.g., index data columns) determined in method 1500 and method 1600.
At step 1704, the computing device 210 obtains, as text information of a cell of the PDF table, text information that satisfies at least one of the following: having a left abscissa that differs from the left abscissas of all text information of the feature row by no more than a seventh threshold; having a right abscissa that differs from the right abscissas of all text information of the feature row by no more than the seventh threshold; having an upper ordinate that differs from the upper ordinates of all text information of the feature column by no more than an eighth threshold; having a lower ordinate that differs from the lower ordinates of all text information of the feature column by no more than the eighth threshold.
In some examples, the computing device 210 may obtain the table information of the PDF table at the left abscissa, right abscissa, upper ordinate, and lower ordinate of the feature row (e.g., year information row) and the feature column (e.g., index data column) and add a certain threshold range thereto in order to ensure that the size of each cell is larger than the size of the text information therein. For example, 15 pixels may be added on the left abscissa, the right abscissa, and 5 pixels may be added on the upper ordinate, the lower ordinate. This means that the text information of 15 pixels outside the left abscissa, 15 pixels outside the right abscissa, 5 pixels outside the upper ordinate, and 5 pixels outside the lower ordinate of the original text information can be determined as the text information of the cell.
Note that the threshold value may be adjusted according to the size of the text information in the PDF table. For example, different thresholds and threshold combinations of 2 pixels, 5 pixels, and 10 pixels may be used as the thresholds on the ordinate and abscissa.
Exemplary pseudo code for obtaining the text information of a cell of a PDF table is provided herein.
if gl['page_number'] == tcl['page_number'] \
    and (abs(year_cell['x0'] - gl['x0']) <= 15 or abs(year_cell['x1'] - gl['x1']) <= 15) \
    and ((abs(tcl['y0'] - gl['y0']) <= 5 and abs(tcl['y1'] - gl['y1']) <= 5) or tcl['y0'] < gl['y1'] < tcl['y1'])
In the pseudo code, the parameters y0 and y1 represent the lower ordinate and the upper ordinate, the parameters x0 and x1 represent the left abscissa and the right abscissa, and page_number represents the page number.
Thus, the conditions in the pseudo code respectively correspond to the following: a threshold of 15 pixels is applied to the left abscissa and the right abscissa; a threshold of 5 pixels is applied to the upper ordinate and the lower ordinate; and the text information of the cell is on the same page as the text information of the feature column.
Note that what has been shown above is only pseudo code of the traversal operation, on the basis of which the traversal operation can be implemented in any programming language.
With the method 1700, the text information in the determined cell of the PDF table and within the cell threshold range may be obtained as the text information of the cell. The cells can also be accurately mapped to the corresponding feature rows and feature columns by means of the abscissa and the ordinate, so that the acquired text information retains its actual meaning.
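The cell matching of method 1700 can be sketched in Python as follows. The sketch mirrors the conditions of the pseudo code with the 15-pixel and 5-pixel thresholds as adjustable parameters; the function names `belongs_to_cell` and `collect_cells`, the interpretation of the variables as feature-row cell, feature-column cell, and candidate item, and the sample data are assumptions made for illustration.

```python
def belongs_to_cell(item, year_cell, column_cell, x_tol=15, y_tol=5):
    """True if `item` sits in the column of `year_cell` and the row of `column_cell`."""
    same_page = item["page_number"] == column_cell["page_number"]
    column_aligned = (
        abs(year_cell["x0"] - item["x0"]) <= x_tol
        or abs(year_cell["x1"] - item["x1"]) <= x_tol
    )
    row_aligned = (
        abs(column_cell["y0"] - item["y0"]) <= y_tol
        and abs(column_cell["y1"] - item["y1"]) <= y_tol
    ) or column_cell["y0"] < item["y1"] < column_cell["y1"]
    return same_page and column_aligned and row_aligned


def collect_cells(items, year_cells, column_cells):
    """Map (feature-column text, feature-row text) to the matching cell text."""
    table = {}
    for column_cell in column_cells:
        for year_cell in year_cells:
            for item in items:
                if belongs_to_cell(item, year_cell, column_cell):
                    table[(column_cell["text"], year_cell["text"])] = item["text"]
    return table


# Made-up sample data for illustration only.
year_cells = [
    {"text": "2019A", "x0": 200, "x1": 240, "y0": 700, "y1": 712, "page_number": 1},
    {"text": "2020E", "x0": 300, "x1": 340, "y0": 700, "y1": 712, "page_number": 1},
]
column_cells = [
    {"text": "operating cost", "x0": 50, "x1": 130, "y0": 680, "y1": 692, "page_number": 1},
]
items = [
    {"text": "1,064", "x0": 205, "x1": 240, "y0": 681, "y1": 693, "page_number": 1},
    {"text": "1,210", "x0": 305, "x1": 340, "y0": 681, "y1": 693, "page_number": 1},
]
cells = collect_cells(items, year_cells, column_cells)
```

Keying the result by the pair of feature-column text and feature-row text keeps each value attached to its row and column, which is what allows the acquired text information to retain its meaning.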
The present disclosure also provides methods for further processing the mined PDF forms.
Fig. 18 shows a flow diagram of a method 1800 for mining PDF tables in a PDF file according to an embodiment of the present disclosure. In the table shown above, the table information tends to have a strong correlation with the year of the row and column in which it is located and the index identification. Year information and index identification information of a table may be mined by the method 1800 for table information mining.
At step 1802, the computing device 210 determines the feature row as a year information row.
Taking fig. 1 as an example, the text information of the year information row may include "financial statements (million yuan)", "2019A", "2020E", "2021E", "2022E", and the like.
At step 1804, computing device 210 determines the feature column as an index identification column.
Taking fig. 1 as an example, the text information of the index identification column may include "profit statement", "revenue", "operating cost", and the like.
At step 1806, the computing device 210 determines, as a numerical value associated with the year information and the index identification, text information that is located to the right of the index identification column in the same row as an index identification and below the year information row in the same column as a piece of year information.
Taking fig. 1 as an example, the first column ("financial statements (million yuan)" and below) may be regarded as the index identification column, the first column to its right is a data column, and the row containing "2019A" is regarded as the year information row. The data "1,064" corresponding to "operating cost" and "2019A" may therefore be regarded as the operating cost of the enterprise in 2019. In this way, each number in the PDF table can be matched with year information and an index identification, so that table mining preserves the actual meaning of the number.
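The association of each number with its year information and index identification can be illustrated by converting matched cells into records. This is a sketch under assumptions that go beyond the disclosure: the cells are assumed to be already matched and keyed by the pair of index identification and year label, numbers are assumed to use a comma as thousands separator, and the name `to_records` and all sample values other than "1,064" are made up.

```python
import re

# Year label: four digits, optionally one uppercase suffix letter (e.g. "2022E").
YEAR_LABEL = re.compile(r"(?P<year>(19|20)\d{2})(?P<suffix>[A-Z]?)")


def to_records(cells):
    """Turn {(index identification, year label): raw text} into typed records."""
    records = []
    for (indicator, year_label), raw in cells.items():
        match = YEAR_LABEL.fullmatch(year_label.replace(" ", ""))
        if match is None:
            continue  # the column header is not year information
        try:
            value = float(raw.replace(",", "").replace(" ", ""))
        except ValueError:
            continue  # the cell does not hold a number
        records.append({
            "indicator": indicator,
            "year": int(match.group("year")),
            "suffix": match.group("suffix"),
            "value": value,
        })
    return records


# Made-up sample cells; only "1,064" for "2019A" follows the example in the text.
cells = {
    ("operating cost", "2019A"): "1, 064",
    ("operating cost", "2022E"): "1,530",
    ("operating cost", "note"): "n/a",
}
records = to_records(cells)
```

Each record states the index identification, the year, and the value together, so a number such as 1064 can be read as the operating cost for 2019 without reference to the original layout.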
Fig. 19 shows a flowchart of a method 1900 for mining PDF tables in a PDF file according to an embodiment of the present disclosure. Because PDF files issued by institutions (e.g., financial institutions) have distinctive characteristics, the institution associated with a PDF table (e.g., the financial institution that wrote the PDF file) can be determined through an institution mining method.
At step 1902, the computing device 210 builds an institution key feature array for a plurality of institutions associated with the PDF table, the institution key feature array comprising: the number of key features associated with the institution, the key features, and the weights corresponding to the key features.
Specifically, the user may preset the number of key features associated with an institution, the key features, and the weights corresponding to the key features. For example, for a certain securities company, the user may set 3 key features, namely the company name, the registered office address of the company, and the company logo, and assign a weight to each feature, for example a weight of 1 for the company name, 3 for the registered office address, and 5 for the company logo. The higher the weight, the more relevant the feature is to the institution.
At step 1904, the computing device 210 searches the text information extracted from the PDF table based on the institution key feature array to determine the number of occurrences of the key features associated with each institution. Once the key features are set, the text information extracted from the PDF file can be searched; the information may be extracted in the manner described above. Through text retrieval, the number of occurrences of the key features associated with an institution may be determined.
At step 1906, the computing device 210 generates an institution weight sequence based on the calculated number of occurrences of the key features associated with each institution, for use in determining a target associated institution of the PDF table. After the key features, the feature weights, and the numbers of occurrences of the features are obtained, the institution weight sequence may be generated. The institution associated with the PDF table is mined by sorting the institution weight sequence. For example, if the first institution in the sorted institution weight sequence is a certain securities company, the PDF file may be considered to be associated with that securities company, for example, the file was written by that securities company.
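Steps 1902 to 1906 can be sketched in Python as follows. The disclosure does not state how weights and occurrence counts are combined, so the sketch assumes that the score of an institution is the sum of weight times number of occurrences over its key features; it also only handles key features that appear as text. The function name `institution_scores`, the institution names, and the sample text are made up.

```python
def institution_scores(text, feature_arrays):
    """Score each institution as the sum of weight * occurrences of its key features.

    `feature_arrays` maps an institution name to a list of (feature, weight) pairs.
    The weighted sum is an assumption; the disclosure does not fix the formula.
    """
    scores = {}
    for institution, features in feature_arrays.items():
        scores[institution] = sum(
            weight * text.count(feature) for feature, weight in features
        )
    return scores


# Made-up key feature arrays and extracted text for illustration only.
feature_arrays = {
    "Alpha Securities": [("Alpha Securities", 1), ("1 Example Road", 3)],
    "Beta Securities": [("Beta Securities", 1), ("9 Sample Street", 3)],
}
text = (
    "Alpha Securities research report. "
    "Alpha Securities, 1 Example Road. "
    "Peer comparison: Beta Securities."
)
scores = institution_scores(text, feature_arrays)
# Sort in descending order of score to obtain the institution weight sequence.
weight_sequence = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

In this sample the first institution scores 1 * 2 + 3 * 1 = 5 and the second scores 1, so the first institution heads the sorted sequence.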
Fig. 20 illustrates a flow diagram of a method 2000 for determining a target associated institution of a PDF table according to an embodiment of the present disclosure. Method 2000 corresponds to step 1906 in method 1900.
In step 2002, the computing device 210 determines the institution corresponding to the maximum value in the institution weight sequence. The institution weight sequence of the PDF file is acquired by the method 1900, and the institution corresponding to the maximum value in the sequence is identified.
In step 2004, the computing device 210 determines whether the number of institutions corresponding to the maximum value is 1, i.e., whether more than one institution corresponds to the maximum value. For example, two or more identical maximum values may exist, corresponding to two or more different institutions.
In step 2006, the computing device 210 determines that the institution corresponding to the maximum value is the target associated institution of the PDF table in response to determining that the number of institutions corresponding to the maximum value is 1. When only one institution corresponds to the maximum value, that institution is the target associated institution of the PDF table.
At step 2008, the computing device 210 determines that the target associated institution is not identified in response to determining that the number of institutions corresponding to the maximum value is greater than 1. If there are multiple identical maximum values and the institutions corresponding to them are different, the target associated institution of the PDF table cannot be determined. Further methods (e.g., manual identification) are then needed to determine the target associated institution of the PDF table.
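The decision logic of method 2000 can be sketched in a few lines of Python. The function name `target_institution` and the sample scores are hypothetical, and returning None to signal "not identified" is a choice made for this sketch only.

```python
def target_institution(scores):
    """Return the single institution with the maximum score, or None if the
    maximum is shared by several institutions (or no scores are given)."""
    if not scores:
        return None
    top = max(scores.values())
    winners = [name for name, score in scores.items() if score == top]
    return winners[0] if len(winners) == 1 else None


# Made-up scores for illustration only.
unique_result = target_institution({"Alpha Securities": 5, "Beta Securities": 1})
tied_result = target_institution({"Alpha Securities": 5, "Beta Securities": 5})
```

A None result corresponds to step 2008, in which a further method such as manual identification is needed.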
Fig. 21 shows a schematic block diagram of an example electronic device 2100 that can be used to implement embodiments of the present disclosure. For example, computing device 210 as shown in fig. 2 may be implemented by electronic device 2100. As shown, the electronic device 2100 includes a Central Processing Unit (CPU)2101 that may perform various suitable acts and processes in accordance with computer program instructions stored in a Read Only Memory (ROM)2102 or loaded from a storage unit 2108 into a Random Access Memory (RAM) 2103. In the random access memory 2103, various programs and data necessary for the operation of the electronic device 2100 may also be stored. The central processing unit 2101, the read-only memory 2102 and the random access memory 2103 are connected to each other via a bus 2104. An input/output (I/O) interface 2105 is also connected to bus 2104.
A plurality of components in the electronic device 2100 are connected to the input/output interface 2105, including: an input unit 2106 such as a keyboard, a mouse, a microphone, and the like; an output unit 2107 such as various types of displays, speakers, and the like; a storage unit 2108 such as a magnetic disk, an optical disk, or the like; and a communication unit 2109 such as a network card, modem, wireless communication transceiver, etc. The communication unit 2109 allows the device 2100 to exchange information/data with other devices over a computer network, such as the internet, and/or various telecommunications networks.
The various procedures and processes described above, such as methods 300, 400, 500, 700, 900, 1100, 1300, 1400, 1500, 1600, 1700, 1800, 1900, and 2000, may be performed by the central processing unit 2101. For example, in some embodiments, methods 300, 400, 500, 700, 900, 1100, 1300, 1400, 1500, 1600, 1700, 1800, 1900, and 2000 can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 2108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 2100 via the read-only memory 2102 and/or the communication unit 2109. When the computer program is loaded into the random access memory 2103 and executed by the central processing unit 2101, one or more of the acts of the methods 300, 400, 500, 700, 900, 1100, 1300, 1400, 1500, 1600, 1700, 1800, 1900, and 2000 described above may be performed.
The present disclosure relates to methods, apparatuses, systems, electronic devices, computer-readable storage media and/or computer program products. The computer program product may include computer-readable program instructions for performing various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge computing devices. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), is personalized by utilizing state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions so as to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be appreciated by persons skilled in the art that the present invention is not limited to the embodiments described above, but that the invention may be embodied in many other forms without departing from the spirit or scope of the invention. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made thereto without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims (16)

1. A method for mining PDF tables in a PDF file, comprising:
setting a target keyword and configuration information associated with the target keyword for a PDF table;
parsing the PDF file to obtain text information in the PDF file;
determining first text information based on the configuration information and the obtained text information;
determining second text information based on the first text information and a predefined position of the target keyword in the first text information;
verifying, based on extracted features of the PDF table, whether the second text information belongs to a feature row of the PDF table, so as to determine the feature row of the PDF table;
determining a feature column of the PDF table based on the target keyword and the first text information; and
acquiring text information of cells of the PDF table according to the determined feature row and feature column.
2. The method of claim 1, wherein verifying whether the second text information belongs to a feature row of the PDF table comprises:
if the second text information conforms to the features of the PDF table, determining that the second text information belongs to the feature row of the PDF table; and
if the second text information does not conform to the features of the PDF table, adjusting the predefined position of the target keyword in the first text information so as to determine the second text information again.
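The verify-and-retry behaviour of claim 2 can be sketched as a simple loop. This is only an illustration under stated assumptions: "adjusting the predefined position" is read here as moving on to the next candidate occurrence of the target keyword, and the names `locate_feature_row` and `is_year_row`, as well as the year pattern, are hypothetical and do not come from the disclosure.

```python
import re

def locate_feature_row(rows_by_position, looks_like_feature_row):
    """Try each candidate keyword position in order.

    rows_by_position[i] is the second text information obtained when the
    keyword is taken at candidate position i (assumed precomputed here).
    Returns (position, row) for the first row that passes verification,
    or None when no candidate position yields a feature row.
    """
    for position, row in enumerate(rows_by_position):
        if looks_like_feature_row(row):
            return position, row
    return None

def is_year_row(row):
    # Hypothetical table feature: every cell is a year such as 2021A or 2022E.
    return bool(row) and all(re.fullmatch(r"(?:19|20)\d{2}[AE]?", c) for c in row)

candidates = [
    ["see", "page", "12"],        # keyword hit inside running text
    ["2020A", "2021A", "2022E"],  # keyword hit inside the table
]
found = locate_feature_row(candidates, is_year_row)
```

In this sketch the first candidate fails verification, so the position is adjusted and the second candidate is accepted.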
3. The method of claim 2, wherein determining the second text information comprises:
determining a right abscissa, an upper ordinate and a lower ordinate of the target keyword in a pixel coordinate system of the PDF file based on the predefined position of the target keyword in the first text information;
determining, in the first text information, text information satisfying at least one of the following items as candidate text information:
being located to the right of the right abscissa of the target keyword;
having an upper ordinate that differs from the upper ordinate of the target keyword by a first threshold;
having a lower ordinate that differs from the lower ordinate of the target keyword by a first threshold; and
extracting, from the determined candidate text information, text information that is on the same page as the target keyword and is different from the target keyword as the second text information.
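As a rough illustration of the candidate filter in claim 3, the following sketch selects text boxes that lie to the right of the keyword on roughly the same line. Several points are assumptions rather than part of the claim: the `TextBox` type and its field names are hypothetical, the ordinate is assumed to grow downward, "differs by a first threshold" is read as "differs by at most the threshold", and the sketch requires all three conditions together, whereas the claim only requires at least one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextBox:
    text: str
    page: int
    left: float    # left abscissa
    right: float   # right abscissa
    top: float     # upper ordinate (assumed to grow downward)
    bottom: float  # lower ordinate

def same_row_candidates(keyword, boxes, first_threshold=2.0):
    """Return boxes right of `keyword` whose top and bottom nearly match it."""
    picked = []
    for box in boxes:
        # Keep only text on the keyword's page that is not the keyword itself.
        if box.page != keyword.page or box.text == keyword.text:
            continue
        right_of = box.left >= keyword.right
        top_close = abs(box.top - keyword.top) <= first_threshold
        bottom_close = abs(box.bottom - keyword.bottom) <= first_threshold
        if right_of and top_close and bottom_close:
            picked.append(box)
    return sorted(picked, key=lambda b: b.left)

keyword = TextBox("Revenue", 1, 10, 60, 100, 112)
boxes = [
    keyword,
    TextBox("2021A", 1, 80, 110, 100.5, 112.5),
    TextBox("2022E", 1, 130, 160, 99.5, 111.5),
    TextBox("Notes", 1, 80, 110, 140, 152),   # different line
    TextBox("2023E", 2, 180, 210, 100, 112),  # different page
]
row = same_row_candidates(keyword, boxes)
```

Requiring all three conditions is a stricter design choice than the claim language; relaxing the final test to `or` would follow the "at least one" wording more literally.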
4. The method of claim 2, wherein determining the second text information comprises:
determining a left abscissa, a right abscissa and a lower ordinate of the target keyword in a pixel coordinate system of the PDF file based on the predefined position of the target keyword in the first text information;
determining, in the first text information, text information satisfying at least one of the following items as candidate text information:
being located below the lower ordinate of the target keyword;
having an upper ordinate that differs from the lower ordinate of the target keyword by a second threshold;
having a left abscissa and a right abscissa that intersect the region from the left abscissa to the right abscissa of the target keyword;
extracting, from the determined candidate text information, text information that is on the same page as the target keyword, is different from the target keyword and does not conform to the features of the PDF table as intermediate characters; and
determining the second text information based on the extracted intermediate characters.
5. The method of claim 2, wherein determining the second text information comprises:
determining the length, the width, the lower ordinate and the right abscissa of the target keyword in a pixel coordinate system of the PDF file based on the predefined position of the target keyword in the first text information;
determining, in the first text information, text information satisfying at least one of the following items as candidate text information:
being located below the lower ordinate of the target keyword;
being located to the right of the right abscissa of the target keyword;
having an abscissa and an ordinate lying within a range defined by a third threshold multiple of the width of the target keyword and a fourth threshold multiple of the length of the target keyword; and
extracting, from the determined candidate text information, text information that is on the same page as the target keyword, is different from the target keyword and conforms to the features of the PDF table as the second text information.
6. The method of claim 2, wherein determining the second text information comprises:
determining an upper ordinate and a left abscissa of the target keyword in a pixel coordinate system of the PDF file based on the predefined position of the target keyword in the first text information;
determining, in the first text information, text information satisfying at least one of the following items as candidate text information:
being located above the upper ordinate of the target keyword;
being located to the right of the right abscissa of the target keyword;
having a left abscissa whose difference from the left abscissa of the target keyword is within a fifth threshold range; and
extracting, from the determined candidate text information, text information that is on the same page as the target keyword, is different from the target keyword, conforms to the features of the PDF table and has a field length within a sixth threshold range as the second text information.
7. The method of any of claims 3-6, wherein determining the second text information comprises:
sorting the determined second text information;
verifying whether the sorted second text information contains a text character string; and
if the sorted second text information contains a text character string, removing the text character string and all text located after the abscissa of the text character string.
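The sort-and-truncate step of claim 7 might look like the sketch below. It assumes that sorting is by left abscissa and that a "text character string" is any item containing no digit; both readings, and the function name `truncate_at_text`, are assumptions made for illustration.

```python
import re

def truncate_at_text(items):
    """items: list of (left_abscissa, text) pairs.

    Sort left to right, then drop the first non-numeric string and
    everything positioned after it.
    """
    ordered = sorted(items, key=lambda item: item[0])
    kept = []
    for left, text in ordered:
        if not re.search(r"\d", text):  # assumed test for a text character string
            break
        kept.append((left, text))
    return kept

cells = [
    (130, "2022E"),
    (80, "2021A"),
    (180, "Source: company filings"),
    (260, "2023E"),
]
result = truncate_at_text(cells)
```

Here the trailing note and the item to its right are both discarded, leaving only the leading run of year-like cells.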
8. The method of any of claims 3-6, wherein verifying whether the second text information belongs to a feature row of the PDF table comprises:
determining, based on the features in the feature row of the PDF table, a regular expression expressing the features;
verifying whether the second text information conforms to the regular expression; and
if the second text information conforms to the regular expression, determining that the row in which the second text information is located is a feature row of the PDF table.
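A minimal sketch of the regular-expression check in claim 8 follows. The pattern is a hypothetical example of a table feature (year labels such as 2021A or 2022E); the disclosure does not fix a particular expression, and requiring every cell to match is likewise an assumption.

```python
import re

# Hypothetical feature: a four-digit year, optionally suffixed with A (actual) or E (estimate).
YEAR_CELL = re.compile(r"(?:19|20)\d{2}[AE]?")

def is_feature_row(texts, pattern=YEAR_CELL):
    """True when the row is non-empty and every cell fully matches the pattern."""
    return bool(texts) and all(pattern.fullmatch(t.strip()) for t in texts)

ok = is_feature_row(["2020A", "2021A", " 2022E "])
not_ok = is_feature_row(["Revenue", "1,024"])
```

Using `fullmatch` rather than `search` keeps values such as "12,020" from being mistaken for a year cell.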
9. The method of claim 8, wherein determining a feature column of the PDF table further comprises:
determining a left abscissa and a right abscissa of the target keyword in a pixel coordinate system of the PDF file based on the predefined position of the target keyword in the first text information;
determining, in the first text information, text information whose left abscissa and right abscissa intersect the region from the left abscissa to the right abscissa of the target keyword;
extracting, from the determined text information, text information that is on the same page as the target keyword, is different from the target keyword and does not include a predetermined special character; and
determining the extracted text information as the feature column of the PDF table.
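The horizontal-overlap test of claim 9 can be illustrated as follows. The `TextBox` type, the contents of `SPECIAL_CHARS` and the top-to-bottom ordering of the result are assumptions introduced for this sketch; the claim itself does not list the predetermined special characters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextBox:
    text: str
    page: int
    left: float   # left abscissa
    right: float  # right abscissa
    top: float    # upper ordinate (assumed to grow downward)

SPECIAL_CHARS = set("%:：()（）")  # hypothetical "predetermined special characters"

def feature_column(keyword, boxes):
    """Collect boxes whose horizontal span intersects the keyword's span."""
    column = []
    for box in boxes:
        overlaps = box.left <= keyword.right and box.right >= keyword.left
        if (overlaps and box.page == keyword.page
                and box.text != keyword.text
                and not SPECIAL_CHARS & set(box.text)):
            column.append(box)
    return sorted(column, key=lambda b: b.top)

keyword = TextBox("Indicator", 1, 10, 70, 100)
boxes = [
    keyword,
    TextBox("Net profit", 1, 12, 68, 140),
    TextBox("Revenue", 1, 12, 60, 120),
    TextBox("YoY (%)", 1, 12, 55, 160),  # contains special characters
    TextBox("2021A", 1, 80, 110, 100),   # no horizontal overlap
    TextBox("EPS", 2, 12, 40, 120),      # different page
]
column = feature_column(keyword, boxes)
```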
10. The method of claim 9, wherein acquiring the text information of the cells of the PDF table according to the determined feature row and feature column further comprises:
determining a vertical-axis coordinate and a horizontal-axis coordinate of the PDF table based on the determined left abscissa and right abscissa of the feature row of the PDF table and the upper ordinate and lower ordinate of the feature column of the PDF table; and
acquiring the cells of the PDF table based on the vertical-axis coordinate and the horizontal-axis coordinate of the PDF table.
11. The method of claim 10, wherein acquiring the cells of the PDF table based on the vertical-axis coordinate and the horizontal-axis coordinate of the PDF table comprises:
determining the left abscissas, right abscissas, upper ordinates and lower ordinates of all text information in the feature row and the feature column of the PDF table based on the feature row and the feature column of the PDF table; and
acquiring text information of a cell of the PDF table, wherein the text information satisfies at least one of the following items:
having a left abscissa that differs from the left abscissas of all text information of the feature row by a seventh threshold;
having a right abscissa that differs from the right abscissas of all text information of the feature row by a seventh threshold;
having an upper ordinate that differs from the upper ordinates of all text information of the feature column by an eighth threshold;
having a lower ordinate that differs from the lower ordinates of all text information of the feature row by an eighth threshold.
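One possible reading of the cell acquisition in claims 10 and 11 is sketched below: a text box is assigned to a column when its left or right abscissa is close to that of a feature-row entry, and to a row when its upper or lower ordinate is close to that of a feature-column entry. The tolerances `x_tol` and `y_tol` stand in for the seventh and eighth thresholds; their values, the `TextBox` type and the "within tolerance" interpretation are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextBox:
    text: str
    left: float
    right: float
    top: float
    bottom: float

def assign_cells(headers, labels, boxes, x_tol=3.0, y_tol=3.0):
    """Map (row label, column header) -> cell text using edge alignment."""
    table = {}
    for box in boxes:
        # Column: left OR right edge aligns with a feature-row entry.
        col = next((h for h in headers
                    if abs(box.left - h.left) <= x_tol
                    or abs(box.right - h.right) <= x_tol), None)
        # Row: top OR bottom edge aligns with a feature-column entry.
        row = next((lab for lab in labels
                    if abs(box.top - lab.top) <= y_tol
                    or abs(box.bottom - lab.bottom) <= y_tol), None)
        if col is not None and row is not None:
            table[(row.text, col.text)] = box.text
    return table

headers = [TextBox("2021A", 80, 110, 100, 112), TextBox("2022E", 130, 160, 100, 112)]
labels = [TextBox("Revenue", 10, 60, 120, 132), TextBox("Net profit", 10, 68, 140, 152)]
boxes = [
    TextBox("1,024", 86, 110, 120, 132),      # right-aligned under 2021A
    TextBox("1,310", 136, 160, 120.5, 132.5),
    TextBox("98", 98, 110, 140, 152),
    TextBox("Footnote", 10, 200, 300, 312),   # aligns with nothing
]
table = assign_cells(headers, labels, boxes)
```

Accepting either edge lets the sketch handle both left-aligned and right-aligned numeric columns.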
12. The method of claim 1, further comprising:
determining the feature row as a year information row;
determining the feature column as an index identification column; and
determining, as the numerical value associated with the year information and the index identification, text information that is located to the right of the index identification column, is below the year information row and has the same row position information as the index identification.
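Once the year information row and the index identification column are known, the pairing described in claim 12 amounts to reading each value together with its year and its index. The sketch below assumes the cells have already been aligned with the year labels; the function name `to_records`, the record layout and the handling of thousands separators are illustrative assumptions.

```python
def to_records(years, rows):
    """years: year labels, left to right.
    rows: (index_name, cell_texts) pairs, cells aligned with `years`.
    Returns one record per cell that parses as a number.
    """
    records = []
    for index_name, cells in rows:
        for year, cell in zip(years, cells):
            cleaned = cell.replace(",", "").strip()
            try:
                value = float(cleaned)
            except ValueError:
                continue  # placeholder such as "-" or an empty cell
            records.append({"index": index_name, "year": year, "value": value})
    return records

years = ["2021A", "2022E"]
rows = [("Revenue", ["1,024", "1,310"]), ("EPS", ["0.52", "-"])]
records = to_records(years, rows)
```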
13. The method of claim 1, further comprising:
constructing an institution key feature array for a plurality of institutions associated with the PDF table, the institution key feature array comprising: the number of key features associated with the institutions, the key features, and the weights corresponding to the key features;
searching, based on the institution key feature array, the text information extracted based on the PDF table so as to determine the number of occurrences of the key features associated with the institutions; and
generating, based on the determined number of occurrences of the key features associated with the institutions, an institution weight sequence for determining a target associated institution of the PDF table.
14. The method of claim 13, wherein determining the target associated institution of the PDF table further comprises:
determining the institution corresponding to the maximum value in the institution weight sequence;
determining whether the number of institutions corresponding to the maximum value is 1;
in response to determining that the number of institutions corresponding to the maximum value is 1, determining that the institution corresponding to the maximum value is the target associated institution of the PDF table; and
in response to determining that the number of institutions corresponding to the maximum value is greater than 1, determining that the target associated institution is not identified.
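The scoring and tie handling of claims 13 and 14 can be sketched as below. The claims do not say how occurrence counts and weights are combined, so the weighted sum used here is an assumption, as are the function name `identify_institution`, the dictionary layout of the key feature array, and the sample institution names.

```python
def identify_institution(text, feature_table):
    """feature_table: {institution: [(key_feature, weight), ...]}.

    Returns (target, scores). `target` is None when the maximum score
    is shared by more than one institution (or the table is empty).
    """
    scores = {
        name: sum(text.count(feature) * weight for feature, weight in features)
        for name, features in feature_table.items()
    }
    best = max(scores.values(), default=0)
    winners = [name for name, score in scores.items() if score == best]
    target = winners[0] if len(winners) == 1 else None
    return target, scores

# Hypothetical institutions and key features, for illustration only.
feature_table = {
    "Alpha Securities": [("Alpha Securities", 3), ("alpha-sec.example", 1)],
    "Beta Capital": [("Beta Capital", 3)],
}
text = "Alpha Securities research. Contact: alpha-sec.example. Peer view from Beta Capital."
target, scores = identify_institution(text, feature_table)
tied, _ = identify_institution("Alpha Securities and Beta Capital", feature_table)
```

Returning "not identified" on a tie, rather than picking arbitrarily, mirrors the last step of claim 14.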
15. A computing device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-14.
16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-14.
CN202111554602.8A 2021-12-17 2021-12-17 Method, apparatus and medium for mining PDF tables in PDF file Pending CN114201620A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111554602.8A CN114201620A (en) 2021-12-17 2021-12-17 Method, apparatus and medium for mining PDF tables in PDF file

Publications (1)

Publication Number Publication Date
CN114201620A true CN114201620A (en) 2022-03-18

Family

ID=80655058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111554602.8A Pending CN114201620A (en) 2021-12-17 2021-12-17 Method, apparatus and medium for mining PDF tables in PDF file

Country Status (1)

Country Link
CN (1) CN114201620A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722475A (en) * 2012-05-09 2012-10-10 深圳市万兴软件有限公司 Method for converting form in portable document format (PDF) document into Excel form
CN108470021A (en) * 2018-03-26 2018-08-31 阿博茨德(北京)科技有限公司 The localization method and device of table in PDF document
US20190294663A1 (en) * 2018-03-26 2019-09-26 Abc Fintech Co., Ltd. Method and device for positioning table in pdf document
CN111062259A (en) * 2019-11-25 2020-04-24 泰康保险集团股份有限公司 Form recognition method and device
CN110968667A (en) * 2019-11-27 2020-04-07 广西大学 Periodical and literature table extraction method based on text state characteristics
CN112287660A (en) * 2019-12-04 2021-01-29 上海柯林布瑞信息技术有限公司 Method and device for analyzing table in PDF file, computing equipment and storage medium
CN112380812A (en) * 2020-10-09 2021-02-19 北京中科凡语科技有限公司 Method, device, equipment and storage medium for extracting incomplete frame line table of PDF (Portable document Format)
CN112509661A (en) * 2021-02-03 2021-03-16 南京吉拉福网络科技有限公司 Methods, computing devices, and media for identifying physical examination reports
CN112990110A (en) * 2021-04-20 2021-06-18 数库(上海)科技有限公司 Method for extracting key information from research report and related equipment
CN113642380A (en) * 2021-06-04 2021-11-12 深度交叉(南京)智能科技有限公司 Identification technology for wireless form
CN114022888A (en) * 2022-01-06 2022-02-08 上海朝阳永续信息技术股份有限公司 Method, apparatus and medium for identifying PDF form
CN114116616A (en) * 2022-01-26 2022-03-01 上海朝阳永续信息技术股份有限公司 Method, apparatus and medium for mining PDF files

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115545008A (en) * 2022-11-29 2022-12-30 明度智云(浙江)科技有限公司 Spectrogram file analyzing method, device, equipment and storage medium
CN115545008B (en) * 2022-11-29 2023-04-07 明度智云(浙江)科技有限公司 Spectrogram file analyzing method, device, equipment and storage medium
CN117454851A (en) * 2023-12-25 2024-01-26 浙江大学 PDF document-oriented form data extraction method and device
CN117454851B (en) * 2023-12-25 2024-03-12 浙江大学 PDF document-oriented form data extraction method and device

Similar Documents

Publication Publication Date Title
CN113807098B (en) Model training method and device, electronic equipment and storage medium
US10592738B2 (en) Cognitive document image digitalization
US7937338B2 (en) System and method for identifying document structure and associated metainformation
US10095780B2 (en) Automatically mining patterns for rule based data standardization systems
US10229154B2 (en) Subject-matter analysis of tabular data
JP2020511726A (en) Data extraction from electronic documents
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN111512315A (en) Block-wise extraction of document metadata
US20150067476A1 (en) Title and body extraction from web page
CN114201620A (en) Method, apparatus and medium for mining PDF tables in PDF file
US11568666B2 (en) Method and system for human-vision-like scans of unstructured text data to detect information-of-interest
US11048934B2 (en) Identifying augmented features based on a bayesian analysis of a text document
CN114022888B (en) Method, apparatus and medium for identifying PDF form
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
CN112541359A (en) Document content identification method and device, electronic equipment and medium
CN114692628A (en) Sample generation method, model training method, text extraction method and text extraction device
US20190303437A1 (en) Status reporting with natural language processing risk assessment
US20210390488A1 (en) Citation and policy based document classification
CN114092948A (en) Bill identification method, device, equipment and storage medium
Khemani et al. A review on reddit news headlines with nltk tool
CN113255369A (en) Text similarity analysis method and device and storage medium
CN114116616B (en) Method, apparatus and medium for mining PDF files
CN115544213B (en) Method, device and storage medium for acquiring information in text
CN111860513A (en) Optical character recognition support system
CN116415562A (en) Method, apparatus and medium for parsing financial data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 201203 Room 501, building 4, No. 690, Bibo Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant after: SHANGHAI SUNTIME INFORMATION TECHNOLOGY CO.,LTD.

Address before: 201203 building 4, No. 690, Bibo Road, Zhangjiang High Tech, Pudong New Area, Shanghai

Applicant before: SHANGHAI SUNTIME INFORMATION TECHNOLOGY CO.,LTD.

CB02 Change of applicant information

Country or region after: China

Address after: Room 201-1 and Room 201-3, Building 4, No. 690 Bibo Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Applicant after: SHANGHAI SUNTIME INFORMATION TECHNOLOGY CO.,LTD.

Address before: 201203 Room 501, building 4, No. 690, Bibo Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant before: SHANGHAI SUNTIME INFORMATION TECHNOLOGY CO.,LTD.

Country or region before: China
