CN114116616A - Method, apparatus and medium for mining PDF files - Google Patents


Info

Publication number
CN114116616A
Authority
CN
China
Prior art keywords
data
pdf file
coordinate
text
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210089715.3A
Other languages
Chinese (zh)
Other versions
CN114116616B (en)
Inventor
郭鹏华
尹扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Suntime Information Technology Co ltd
Original Assignee
Shanghai Suntime Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Suntime Information Technology Co ltd
Priority to CN202210089715.3A
Publication of CN114116616A
Application granted
Publication of CN114116616B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the present disclosure relate to methods, devices, and media for mining PDF files. The method comprises: parsing the text blocks of a PDF file to obtain coordinate information of the text blocks; determining, based on the parsed text blocks, a target association mechanism associated with the PDF file using a mechanism determination algorithm; matching one or more report templates of the target association mechanism with the coordinate information of the text blocks using a matching algorithm, so as to determine matching degree data between the one or more report templates and the PDF file; determining, based on the matching degree data, the report template of the target association mechanism that corresponds to the PDF file; and mining, based on the determined report template, the data in the PDF file that corresponds to that template. The present disclosure thereby enables accurate mining of the data of a PDF file and associates the mined data with its actual meaning.

Description

Method, apparatus and medium for mining PDF files
Technical Field
Embodiments of the present disclosure relate generally to the field of data processing, and more particularly, to a method, computing device, and computer-readable storage medium for mining PDF files.
Background
PDF (Portable Document Format) is an electronic document format developed by Adobe that is independent of the operating system platform. PDF is a fixed-layout format in which pages are relatively independent of one another, so the layout of a document can be described and displayed accurately. However, PDF does not record the frame structure of the file; in other words, a PDF file does not contain the organizational relationships among its contents.
In professional PDF files with diverse content (e.g., commercial, financial, or legal PDF files), a single page often mixes elements such as headings, body text, tables, decorative elements, and institution-specific identifiers, and these elements often contain different kinds of characters, such as digits, letters, and special symbols.
Conventional approaches to mining PDF files directly identify the text blocks in the PDF file and extract the recognized text blocks as editable characters. Although some commonly used schemes or tools in the art can extract the identified text blocks in a simple row/column format and retain part of the frame of the PDF file, most of the frame structure is lost in the process.
At the same time, the frame structure of a PDF file often carries practical meaning. Losing the frame means that the data loses its corresponding actual meaning, and a large amount of time and labor is then needed to label the data afterwards. Such labeling is not feasible when the amount of data is large.
In summary, conventional solutions for mining PDF files have the following disadvantage: the original frame structure of the PDF file is lost when the PDF file is mined into editable data.
Disclosure of Invention
In view of the above, the present disclosure provides a method, computing device and computer-readable storage medium for mining PDF files. The method can accurately mine the data of the PDF file and associate the mined data with the actual meaning of the data.
According to a first aspect of the present disclosure, there is provided a method for mining PDF files, comprising: parsing the text blocks of the PDF file so as to obtain the coordinate information of the text blocks of the PDF file; determining a target association mechanism associated with the PDF file using a mechanism determination algorithm based on the parsed text blocks of the PDF file; matching one or more report templates of the target association mechanism with the coordinate information of the text blocks by using a matching algorithm, so as to determine matching degree data of the one or more report templates and the PDF file; determining a report template of the target association mechanism corresponding to the PDF file based on the acquired matching degree data; and mining data corresponding to the determined report template in the PDF file based on the determined report template.
According to a second aspect of the present disclosure, there is provided a computing device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the disclosure.
In a third aspect of the present disclosure, a non-transitory computer readable storage medium is provided having stored thereon computer instructions for causing a computer to perform the method of the first aspect of the present disclosure.
In one embodiment, determining a target association mechanism associated with the PDF file using a mechanism determination algorithm comprises: constructing a mechanism key feature array for a plurality of mechanisms associated with PDF files, the mechanism key feature array comprising: the number of key features associated with the mechanism, the key features, and the weights corresponding to the key features; searching, based on the mechanism key feature array, the text blocks parsed from the PDF file so as to determine the number of occurrences of the key features associated with each mechanism; and calculating a weight sequence for the mechanisms based on the determined numbers of occurrences, the weight sequence being used to determine the target association mechanism of the PDF file.
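The scoring step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names `KeyFeature`, `INSTITUTION_FEATURES`, and `weight_sequence` are hypothetical, the two institutions are made up, and the score formula (weight multiplied by occurrence count, summed per mechanism) is an assumption, since the text does not state how occurrences and weights are combined.

```python
from dataclasses import dataclass


@dataclass
class KeyFeature:
    text: str      # key feature string, e.g. an institution name or address fragment
    weight: float  # weight assigned to this key feature


# Hypothetical key feature arrays for two made-up mechanisms (institutions).
# The "number of key features" of a mechanism is simply len() of its list.
INSTITUTION_FEATURES = {
    "Institution A": [
        KeyFeature("Institution A Securities", 3.0),
        KeyFeature("100 Example Road", 1.0),
    ],
    "Institution B": [
        KeyFeature("Institution B Research", 3.0),
        KeyFeature("200 Sample Street", 1.0),
    ],
}


def weight_sequence(text_blocks, features_by_institution):
    """Return {mechanism: score}, with score = sum(weight * occurrences).

    The combination rule is an assumption made for this sketch.
    """
    scores = {}
    for institution, features in features_by_institution.items():
        score = 0.0
        for feature in features:
            # Count how often the key feature appears across all parsed text blocks.
            occurrences = sum(block.count(feature.text) for block in text_blocks)
            score += feature.weight * occurrences
        scores[institution] = score
    return scores
```

With three parsed blocks in which the name of Institution A appears twice and its address once, the sketch scores Institution A at 7.0 and Institution B at 0.0.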
In one embodiment, determining the target association mechanism for the PDF file further comprises: determining a mechanism corresponding to a maximum value in the weight sequence; determining whether the number of mechanisms corresponding to the maximum value is 1; in response to determining that the number of mechanisms corresponding to the maximum value is 1, determining that the mechanism corresponding to the maximum value is the target association mechanism of the PDF file; and in response to determining that the number of mechanisms corresponding to the maximum value is greater than 1, determining that the target association mechanism is not identified.
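The unique-maximum rule above can be sketched in a few lines. The function name `pick_target_mechanism` and the use of `None` to signal "not identified" are assumptions made for illustration.

```python
def pick_target_mechanism(scores):
    """Return the mechanism with the unique maximum score, or None on a tie.

    scores: {mechanism name: score}. None means "target association
    mechanism not identified" (an illustrative convention).
    """
    if not scores:
        return None
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None
```

Returning `None` on a tie, rather than picking one of the tied mechanisms arbitrarily, mirrors the text: when more than one mechanism reaches the maximum, no target is identified.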
In one embodiment, matching one or more report templates of the target association mechanism with the coordinate information of the text block using a matching algorithm comprises: defining, for each of the one or more report templates, an identification feature block; acquiring coordinate information of the identification feature block; for each of the one or more report templates, calculating a matching value of the report template and a text block according to a matching function, based on the coordinate information of the text block and the coordinate information of the identification feature block of the report template; and performing an operation on all the calculated matching values, thereby determining the matching degree data of the one or more report templates and the PDF file.
In one embodiment, calculating a matching value of the report template and a text block according to a matching function comprises: the matching value of the matching function is a first predetermined value if it is determined that at least one of the following conditions is met: (1) the abscissa of the upper left coordinate of the text block falls within the abscissa interval spanned by the upper left and lower right coordinates of the identification feature block, and the ordinate of the upper left coordinate of the text block falls within the ordinate interval spanned by the upper left and lower right coordinates of the identification feature block; (2) the abscissa of the lower right coordinate of the text block falls within the abscissa interval spanned by the upper left and lower right coordinates of the identification feature block, and the ordinate of the lower right coordinate of the text block falls within the ordinate interval spanned by the upper left and lower right coordinates of the identification feature block; (3) the abscissa of the upper left coordinate of the identification feature block falls within the abscissa interval spanned by the upper left and lower right coordinates of the text block, and the ordinate of the upper left coordinate of the identification feature block falls within the ordinate interval spanned by the upper left and lower right coordinates of the text block; and (4) the abscissa of the lower right coordinate of the identification feature block falls within the abscissa interval spanned by the upper left and lower right coordinates of the text block, and the ordinate of the lower right coordinate of the identification feature block falls within the ordinate interval spanned by the upper left and lower right coordinates of the text block; and the matching value of the matching function is a second predetermined value if it is determined that none of the above conditions is satisfied.
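The corner conditions of the matching function can be sketched as below. This is an illustration under stated assumptions: the `Box` type, the inclusive interval bounds, a top-left origin with y growing downward, and the concrete values 1 and 0 for the first and second predetermined values are all choices made for the sketch, not values given in the text.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float  # abscissa of the upper left coordinate
    y0: float  # ordinate of the upper left coordinate
    x1: float  # abscissa of the lower right coordinate
    y1: float  # ordinate of the lower right coordinate


FIRST_VALUE = 1   # assumed "first predetermined value" (match)
SECOND_VALUE = 0  # assumed "second predetermined value" (no match)


def corner_inside(x, y, box):
    """True if point (x, y) lies within the box's x and y intervals (inclusive)."""
    return box.x0 <= x <= box.x1 and box.y0 <= y <= box.y1


def match_value(text, feature):
    """Apply the corner conditions to a text block and a feature block."""
    hit = (
        corner_inside(text.x0, text.y0, feature)       # text upper left in feature
        or corner_inside(text.x1, text.y1, feature)    # text lower right in feature
        or corner_inside(feature.x0, feature.y0, text) # feature upper left in text
        or corner_inside(feature.x1, feature.y1, text) # feature lower right in text
    )
    return FIRST_VALUE if hit else SECOND_VALUE
```

Note that a purely corner-based test does not register two boxes that cross without either upper left or lower right corner lying inside the other (for example, a tall narrow block crossing a wide flat one), so it is narrower than a general rectangle-overlap test.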
In one embodiment, mining the data in the PDF file corresponding to the determined report template comprises: for each report template of the one or more report templates, defining a mining feature block; determining coordinate information of the mining feature block; calculating a matching value of the determined report template and the text block according to a mining matching function, based on the coordinate information of the text block and the coordinate information of the mining feature block of the determined report template; and mining the text blocks whose matching value is not the second predetermined value as the data corresponding to the determined report template.
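A minimal sketch of the mining step follows. It assumes that each mining feature block carries a field name (its "actual meaning") and a rectangular region, and that the mining matching function is the same corner test used for template matching; the text does not confirm either point. The field names `rating` and `target_price` and all coordinates are hypothetical.

```python
def overlaps(a, b):
    """Corner-based test on (x0, y0, x1, y1) boxes with a top-left origin."""
    def inside(x, y, box):
        return box[0] <= x <= box[2] and box[1] <= y <= box[3]

    return (
        inside(a[0], a[1], b) or inside(a[2], a[3], b)
        or inside(b[0], b[1], a) or inside(b[2], b[3], a)
    )


def mine(text_blocks, mining_feature_blocks):
    """Label text blocks with the field of the mining feature block they hit.

    text_blocks: list of (text, box) pairs.
    mining_feature_blocks: {field name: box} taken from the chosen template.
    Returns {field name: [matching texts]}.
    """
    mined = {}
    for field, region in mining_feature_blocks.items():
        mined[field] = [text for text, box in text_blocks if overlaps(box, region)]
    return mined
```

Keying the result by field name is what attaches meaning to the extracted strings: a bare "12.50" becomes the value of `target_price` because of where it sits on the page.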
In one embodiment, the method for mining PDF files further comprises: verifying the validity of the mined data based on the data structure of the mining feature blocks of the determined report template; in response to the mined data being valid, performing normalization processing on the mined data in the PDF file; and in response to the mined data being invalid, determining another report template based on the matching degree data and re-mining the PDF file.
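The validate-then-fall-back loop can be sketched as follows. The schema format (`'number'` or `'text'` per field), the regular expression used for numbers, and the function names are assumptions for illustration; the text only says that validity is checked against the data structure of the mining feature blocks.

```python
import re

NUMBER = re.compile(r"^-?\d+(\.\d+)?$")


def is_valid(schema, data):
    """Check mined data against a hypothetical per-field schema.

    schema: {field: 'number' | 'text'}; data: {field: mined string}.
    """
    for field, kind in schema.items():
        value = data.get(field)
        if value is None or value == "":
            return False
        if kind == "number" and not NUMBER.match(value):
            return False
    return True


def mine_with_fallback(ranked_templates, mine):
    """Try templates in descending matching degree until the mined data validates.

    ranked_templates: list of (template name, schema) pairs, best match first.
    mine: callable taking a template name and returning {field: mined string}.
    Returns (template name, data), or None if no template yields valid data.
    """
    for name, schema in ranked_templates:
        data = mine(name)
        if is_valid(schema, data):
            return name, data
    return None
```

A `None` result corresponds to the case where no template produces valid data, which the description elsewhere suggests handing to manual identification.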
In one embodiment, performing a normalization process on the mined data in the PDF file comprises: defining a standard expression of a mining feature block of the report template; defining a plurality of corresponding non-standard expressions based on the standard expressions; and uniformly converting the non-standard expressions in the mined data into the standard expressions.
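The conversion from non-standard to standard expressions amounts to a lookup table, sketched below. The example expressions are hypothetical, and matching after trimming whitespace and case-folding is an assumption, not something the text specifies.

```python
def build_lookup(standard_to_variants):
    """Build {normalized variant: standard expression}.

    standard_to_variants: {standard expression: [non-standard expressions]}.
    """
    lookup = {}
    for standard, variants in standard_to_variants.items():
        for expression in [standard, *variants]:
            lookup[expression.strip().casefold()] = standard
    return lookup


def normalize(value, lookup):
    """Return the standard expression for value, or value unchanged if unknown."""
    return lookup.get(value.strip().casefold(), value)
```

Leaving unknown values untouched, rather than raising, keeps unexpected expressions visible for later review.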
In one embodiment, performing normalization processing on the mined data in the PDF file further comprises: determining, based on the mined data, a closest actual year associated with stock data in the mined data; inquiring real data of the stock data in the closest actual year; and comparing the stock data with the real data to obtain units of the stock data.
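One way to read the unit-recovery step is as an order-of-magnitude comparison, sketched below. Everything concrete here is an assumption: the candidate unit list, the log-scale distance, and the helper names `closest_year` and `infer_unit` are illustrative, and the text does not say how the comparison is actually carried out.

```python
import math

# Hypothetical candidate units and their multipliers relative to the base unit.
CANDIDATE_UNITS = {
    "yuan": 1,
    "thousand yuan": 1_000,
    "ten thousand yuan": 10_000,
    "million yuan": 1_000_000,
    "hundred million yuan": 100_000_000,
}


def closest_year(candidate_year, known_years):
    """Pick the year with real data that is nearest to the mined year."""
    return min(known_years, key=lambda year: abs(year - candidate_year))


def infer_unit(mined_value, real_value, candidates=CANDIDATE_UNITS):
    """Pick the unit whose multiplier best maps the mined figure onto the real one.

    real_value is the known figure expressed in the base unit.
    """
    if mined_value == 0 or real_value == 0:
        raise ValueError("cannot infer a unit from a zero value")
    target = math.log10(abs(real_value))

    def error(unit):
        return abs(target - math.log10(abs(mined_value) * candidates[unit]))

    return min(candidates, key=error)
```

Comparing on a log scale tolerates small differences between the mined and real figures (rounding, restatements) while still separating units that differ by factors of ten or more.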
In one embodiment, the method further comprises: calculating the number of text characters of the acquired text block; determining whether the calculated number of text characters is greater than or equal to a predetermined character number threshold; in response to determining that the calculated number of text characters is greater than or equal to the predetermined character number threshold, calculating a similarity of the acquired text blocks based on a first algorithm; in response to determining that the calculated number of text characters is less than the predetermined character number threshold, calculating a similarity of the acquired text blocks based on a second algorithm; and performing deduplication on the acquired text blocks based on the similarity calculation result.
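The length-dependent choice of similarity algorithm can be sketched as follows. The text does not name the first or second algorithm, so both are stand-ins chosen for this sketch: Jaccard similarity over character 3-grams for long blocks and `difflib.SequenceMatcher` for short ones. The thresholds and the rule of using the shorter block's length for a pair are also assumptions.

```python
from difflib import SequenceMatcher

CHAR_THRESHOLD = 20         # assumed character number threshold
SIMILARITY_THRESHOLD = 0.9  # assumed cut-off above which blocks count as duplicates


def shingles(text, n=3):
    """Set of all character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of the 3-gram sets (stand-in for the first algorithm)."""
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def similarity(a, b):
    """Choose the algorithm by length: set-based for long text, edit-based for short."""
    if min(len(a), len(b)) >= CHAR_THRESHOLD:
        return jaccard(a, b)
    return SequenceMatcher(None, a, b).ratio()  # stand-in for the second algorithm


def deduplicate(blocks):
    """Keep the first occurrence of each group of near-identical text blocks."""
    kept = []
    for block in blocks:
        if all(similarity(block, other) < SIMILARITY_THRESHOLD for other in kept):
            kept.append(block)
    return kept
```

Splitting by length reflects a common trade-off: set-based measures are cheap and stable on long text but unreliable on a handful of characters, where a sequence comparison is more informative.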
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements.
Fig. 1 shows a schematic diagram of a system 100 for mining PDF files according to an embodiment of the present disclosure.
Fig. 2 shows a flow diagram of a method 200 for mining PDF files according to an embodiment of the present disclosure.
FIG. 3 shows a flow diagram of a method 300 for determining a target association mechanism associated with the PDF file using a mechanism determination algorithm, in accordance with an embodiment of the present disclosure.
FIG. 4 shows a flow diagram of a method 400 for determining a target association mechanism for a PDF file according to an embodiment of the present disclosure.
FIG. 5 illustrates a flow diagram of a method 500 for matching one or more report templates of the target affiliate with the coordinate information of the text block using a matching algorithm in accordance with an embodiment of the present disclosure.
Fig. 6 shows a schematic diagram of defining an identification feature block according to an embodiment of the present disclosure.
Fig. 7 shows a flowchart of a method 700 of mining data in the PDF file corresponding to the determined report template according to an embodiment of the present disclosure.
FIG. 8 illustrates a flow diagram of a method 800 of verifying the legitimacy of mined data in accordance with an embodiment of the present disclosure.
FIG. 9 illustrates a flow diagram of a method 900 of performing a normalization process on mined data in the PDF file according to an embodiment of the disclosure.
Fig. 10 shows a flow diagram of a method 1000 of performing a normalization process on mined data in the PDF file according to an embodiment of the present disclosure.
FIG. 11 shows a flow diagram of a method 1100 for deduplication against an acquired text chunk in accordance with an embodiment of the present disclosure.
FIG. 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As described above, PDF files containing complex frame structures cannot be accurately mined or identified by conventional approaches for mining PDF files. Even if data in a PDF file is mined, a lost file frame structure causes the mined data to lose its actual meaning. Because PDF files have no uniform format, it is difficult to mine a large number of PDF files directly using a single uniform format. This problem is particularly acute for professional PDF files, which have a large number of complex characters and a complex file frame structure. If the data is not mined according to the frame structure, the data loses its practical meaning. Such data requires manual re-labeling at a later time, which is difficult to do at large data volumes.
To address at least in part one or more of the above issues and other potential issues, an example embodiment of the present disclosure proposes a scheme for mining PDF files, in which the mechanism to which a PDF file belongs may be determined by a mechanism determination algorithm, with one or more report templates being defined for each mechanism. By employing the template matching algorithm proposed by the present disclosure, a PDF file may be matched to one of the defined report templates. Meanwhile, by adopting a data mining algorithm, data can be mined from the PDF file through the matched report template, so that the PDF file can be processed into data with a regular structure. For example, PDF files are mined into Excel data tables, XML files, YAML files, etc. that are associated with actual meaning.
In addition, the disclosure also provides a corresponding method for further mining (such as year mining, data deep mining and table segmentation) of the mined data, so that the fineness of the mined data is improved.
Fig. 1 shows a schematic diagram of a system 100 for mining PDF files according to an embodiment of the present disclosure. As shown in Fig. 1, the system 100 includes a computing device 110, a PDF file management device 130, and a network 140. The computing device 110 and the PDF file management device 130 may exchange data through the network 140 (e.g., the internet).
The PDF file management device 130 may perform, for example, routine management of PDF files, such as collecting and storing PDF files. The PDF file management device 130 may also send the managed PDF files to the computing device 110. The PDF file management device 130 is, for example and without limitation, any of the following that can read and modify PDF files: desktop computers, laptop computers, netbook computers, tablet computers, web browsers, e-book readers, Personal Digital Assistants (PDAs), wearable computers (such as smart watches and activity tracker devices), and the like. The PDF file management device 130 may be configured to store PDF files, send PDF files to the computing device 110 via the network 140, and receive processed PDF files from the computing device 110.
With respect to the computing device 110, it is used, for example, to receive PDF files from the PDF file management device 130 via the network 140. The computing device 110 may perform mechanism identification on the received PDF file. Based on the identified organization, a template for the organization associated with the PDF file may be matched. Based on the matched template, relevant data can be accurately mined from the PDF file. Computing device 110 may also perform text block deduplication, data validation, and normalization on the mined data. Computing device 110 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, ASICs, and the like, as well as general purpose processing units such as a CPU. Additionally, one or more virtual machines may also be running on each computing device 110. In some embodiments, the computing device 110 and the PDF file management device 130 may be integrated or may be provided separately from each other. In some embodiments, computing device 110 includes, for example, a target table region extraction unit 112, a mechanism determination unit 114, a template matching unit 116, a template determination unit 118, a data mining unit 120, and an additional processing unit 122.
The extraction unit 112 is configured to parse the text blocks of the PDF file so as to acquire coordinate information of the text blocks of the PDF file.
The mechanism determination unit 114 is configured to determine a target association mechanism associated with the PDF file using a mechanism determination algorithm based on the parsed text blocks of the PDF file.
The template matching unit 116 is configured to match one or more report templates of the target association mechanism with the coordinate information of the text blocks by using a matching algorithm, so as to determine matching degree data of the one or more report templates and the PDF file.
The template determination unit 118 is configured to determine a report template of the target association mechanism corresponding to the PDF file based on the acquired matching degree data.
The data mining unit 120 is configured to mine data in the PDF file corresponding to the determined report template based on the determined report template.
The additional processing unit 122 may be configured to perform various operations such as data validation, data normalization, data deduplication, and so on.
Units 112-120 may extract text information in a PDF file. Based on the extracted text information, an association mechanism associated with the PDF file may be determined. After the association mechanism is determined, the report template associated with the PDF file can be determined by means of coordinate matching. Based on the determined report template, the data in the PDF file can be mined by matching, so that the PDF file can be accurately mined and the actual meaning of the data in the PDF file is preserved.
Based on the data mined by units 112-120, the additional processing unit 122 may also perform various operations such as data validation, data normalization, data deduplication, etc. on the mined data. After the above processing is completed for the PDF file, the data in the mined PDF file may be transmitted to the PDF file management apparatus 130 via the network 140.
Note that the scheme of the present disclosure for mining PDF files involves a coordinate system for locating characters in the PDF file. In the art, the coordinate system of a PDF file may take the upper left corner as the origin, with the x axis extending horizontally to the right of the origin and the y axis extending vertically downward from the origin. Based on such a coordinate system, a standard piece of text can be located by its upper left and lower right coordinates. However, different coordinate systems may also be established in other ways. The choice of coordinate system does not affect the technical solution for mining PDF files provided by the present disclosure.
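As an illustration of switching between coordinate systems, the sketch below converts a bounding box given with a lower-left origin (the default user space of the PDF format, and the convention of parser output in the style of `(x0, y0, x1, y1)` tuples) into the upper-left-origin form described above. The function name and the page height of 792 points in the usage note are assumptions for the example.

```python
def to_top_left_origin(bbox, page_height):
    """Convert a lower-left-origin box to upper left / lower right corners.

    bbox: (x0, y0, x1, y1) with y growing upward, where (x0, y0) is the
    lower left corner and (x1, y1) the upper right corner.
    Returns ((x_left, y_top), (x_right, y_bottom)) with y growing downward.
    """
    x0, y0, x1, y1 = bbox
    upper_left = (x0, page_height - y1)
    lower_right = (x1, page_height - y0)
    return upper_left, lower_right
```

For a page 792 points high, the box `(72, 700, 200, 720)` becomes upper left `(72, 72)` and lower right `(200, 92)`.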
A method 200 for mining PDF files is described below in conjunction with fig. 1. Fig. 2 shows various paths and orders in order to present the working principle of the scheme for mining PDF files as a whole, but some of these paths and orders are not necessary for implementing the following examples, and the various methods involved in the technical solution of the present disclosure may be performed in different orders and along different paths.
Fig. 2 shows a flow diagram of a method 200 for mining PDF files according to an embodiment of the present disclosure. The method 200 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 1200 shown in FIG. 12. It should be understood that method 200 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the present disclosure is not limited in this respect.
At step 202, the computing device 110 may parse the text blocks of the PDF file to obtain coordinate information of the text blocks of the PDF file.
In some embodiments, the computing device 110 may parse all or a portion of the text blocks in the PDF file into editable text blocks using processing tools commonly used in the PDF processing field, such as PDFMiner, Camelot, and the like.
Note that parsing a text block using processing tools commonly used in the field of PDF processing involves parsing only the text content in a PDF file, i.e., identifying processable characters or character strings legally defined therein. Here, the processing tool does not recognize the structure of the PDF file. For example, associations between text blocks in a file are not identified at this step.
It should also be noted that the processing tools commonly used in the field of PDF processing may include any code, software, or library files that can parse PDF text, such as software packages or software libraries that can be invoked from programming languages such as Python or Java, including but not limited to PDFMiner, Camelot, etc.
At step 204, the computing device 110 may determine a target association mechanism associated with the PDF file using a mechanism determination algorithm based on the parsed text blocks of the PDF file.
In one embodiment, based on the text blocks of the PDF file parsed in step 202, the computing device 110 may utilize a mechanism determination algorithm to determine a target association mechanism associated with the PDF file. In the context of the present disclosure, a target association mechanism may be any entity (for example, an institution or organization) that has an association with the PDF file, such as the producer or issuer of the PDF file.
Since the target association mechanism typically publishes PDF files using one or more fixed report templates, such PDF files are strongly correlated over time. Using this strong correlation, the report template associated with the PDF file can be determined, and the data of the PDF file can be mined based on that report template.
The principle of the mechanism determination algorithm is to identify the target association mechanism among multiple mechanisms based on strong features (e.g., address, logo) of the parsed text blocks. By defining a feature set for each mechanism, it is possible to calculate how many text blocks are associated with the features of that mechanism. Further, through weight calculation, scores of the PDF file with respect to the plurality of mechanisms can be calculated. Based on the best score, the mechanism associated with the PDF file can be determined.
The mechanism determination algorithm and the mechanism determination step will be described in detail hereinafter.
At step 206, the computing device 110 may determine matching degree data of one or more report templates and the PDF file by matching one or more report templates of the target association mechanism with the coordinate information of the text blocks using a matching algorithm.
In one embodiment, based on the target association mechanism determined in step 204, the computing device 110 may match each of the one or more report templates defined for the target association mechanism against the coordinate information of the text blocks. The computing device 110 matches the text blocks against the features in each report template to determine matching degree data for each of the one or more report templates.
The matching algorithm and the matching step will be described in detail below.
At step 208, the computing device 110 may determine a report template of the target association mechanism corresponding to the PDF file based on the obtained matching degree data.
In one embodiment, based on the matching degree data determined in step 206 for the one or more report templates of the target association mechanism, the computing device 110 may determine the report template with the best matching degree data as the report template of the target association mechanism corresponding to the PDF file. The determined report template may be used in a subsequent step to mine the data of the PDF file.
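Template selection can be sketched as computing one matching degree per template and ranking the templates. How the individual matching values are combined is not stated in the text; the sketch assumes the matching degree is the fraction of a template's identification feature blocks that are hit by at least one text block. The names `matching_degree` and `rank_templates` are hypothetical, and the matching function is passed in as a parameter.

```python
def matching_degree(text_boxes, feature_boxes, match_value):
    """Fraction of identification feature blocks hit by at least one text block.

    match_value: callable(text_box, feature_box) returning a truthy value on a
    match. The aggregation rule is an assumption made for this sketch.
    """
    if not feature_boxes:
        return 0.0
    hits = sum(
        1
        for feature in feature_boxes
        if any(match_value(text, feature) for text in text_boxes)
    )
    return hits / len(feature_boxes)


def rank_templates(degrees):
    """Return template names sorted by descending matching degree."""
    return sorted(degrees, key=degrees.get, reverse=True)
```

Keeping the full ranking, rather than only the best template, makes it straightforward to fall back to the next candidate if the data mined with the first choice fails validation.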
At step 210, the computing device 110 may mine the data in the PDF file corresponding to the determined reporting template based on the determined reporting template.
In one embodiment, the computing device 110 may mine the PDF file using a data mining algorithm based on the report template determined at step 208, thereby mining the data in the PDF file corresponding to the determined report template. In particular, the computing device 110 may match the mining features in the report template against the PDF file again according to the data mining algorithm. If the matching succeeds, the data matched by a mining feature is mined as data associated with that mining feature, given a corresponding identifier, and extracted from the PDF file or stored. If the matching fails, the PDF file may be further processed by methods such as manual identification.
The data mining algorithm and the mining steps will be described in detail below.
FIG. 3 shows a flow diagram of a method 300 for determining the target affiliate associated with the PDF file using an institution-determination algorithm, in accordance with an embodiment of the present disclosure. Method 300 corresponds to step 204 of method 200. The method 300 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 1200 shown in FIG. 12.
As described above, PDF files from producing institutions (e.g., legal institutions, financial institutions) have strong characteristics. The target affiliate with which the PDF file is associated (e.g., the institution that produced the PDF file) can be determined based on the strong characteristics of the institution.
At step 302, the computing device 110 may build an institution key feature array for a plurality of institutions associated with PDF files, the institution key feature array comprising: the number of key features associated with the institution, the key features, and the weights to which the key features correspond.
Specifically, the user may pre-construct the number of key features associated with the institution, the key features, and the weights corresponding to the key features. For example, for a certain securities company, the user may define one or more (e.g., 3) key features, such as the company name, the registered office address of the company, and the company logo, and assign a corresponding weight to each feature, such as a weight of 1 for the company name, a weight of 3 for the registered office address, and a weight of 5 for the company logo. A feature with a higher weight is considered more strongly associated with the institution.
At step 304, the computing device 110 may search the text blocks parsed from the PDF file based on the institution key feature array to determine the number of occurrences of the key features associated with each institution. Given the key features, the parsed text blocks in the PDF file can be searched; the text blocks can be extracted in the manner described above. Through text retrieval, the number of occurrences of the key features associated with an institution may be determined. The number of occurrences of each key feature may be combined with the weights defined in step 302 to calculate the likelihood that the institution is the associated institution.
At step 306, the computing device 110 may calculate a weight sequence over the candidate institutions based on the determined number of occurrences of the key features associated with each institution, for use in determining the target affiliate of the PDF file. After the key features, the feature weights, and the number of occurrences of the features are obtained, a weight sequence for the institutions may be generated. By ranking the weight sequence, the institution ranked first is taken as the target affiliate of the PDF file. For example, if the first entry in the ranked weight sequence is a securities company, the PDF file may be considered to be associated with that securities company, for example, written by it.
With this solution, the degree of correlation (e.g., a weight sequence) of one or more institutions with a PDF file can be calculated from the defined strong features of the institutions. The target affiliate associated with the PDF file may be determined from the weight sequence.
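As a non-limiting illustration, the weight calculation of method 300 can be sketched as follows. The sketch assumes that the score of an institution is the sum, over its key features, of the feature weight multiplied by the number of occurrences, and that a logo has already been reduced to a searchable text token; the disclosure does not fix either choice. The names `KeyFeature`, `institution_scores`, and `rank_institutions` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KeyFeature:
    text: str      # string searched for in the parsed text blocks
    weight: float  # user-assigned weight (step 302)


def institution_scores(text_blocks, feature_arrays):
    """Return {institution: score}; score = sum(weight * occurrences).

    The multiplication of weight and occurrence count is an assumption.
    """
    scores = {}
    for institution, features in feature_arrays.items():
        total = 0.0
        for feature in features:
            # Step 304: count occurrences of the key feature in all blocks.
            occurrences = sum(block.count(feature.text) for block in text_blocks)
            total += feature.weight * occurrences
        scores[institution] = total
    return scores


def rank_institutions(scores):
    """Step 306: weight sequence sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The first entry of the list returned by `rank_institutions` would then be the candidate target affiliate.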
FIG. 4 shows a flow diagram of a method 400 for determining the target affiliate of a PDF file according to an embodiment of the present disclosure. Method 400 corresponds to step 204 in method 200. The method 400 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 1200 shown in FIG. 12.
At step 402, the computing device 110 may determine the institution corresponding to the maximum value in the weight sequence. By the method described in method 300, a weight sequence for the PDF file can be obtained, and the institution corresponding to the maximum value in the sequence can be identified.
At step 404, the computing device 110 may determine whether the number of institutions corresponding to the maximum value is 1, i.e., whether there is more than one institution corresponding to the maximum value. For example, if the same maximum value occurs two or more times, it corresponds to two or more different institutions.
At step 406, the computing device 110 may determine that the institution corresponding to the maximum value is the target affiliate of the PDF file in response to determining that the number of institutions corresponding to the maximum value is 1. If only one maximum value exists, the institution corresponding to the maximum value is the target affiliate of the PDF file.
At step 408, the computing device 110 may determine that the target affiliate is not identified in response to determining that the number of institutions corresponding to the maximum value is greater than 1. If a plurality of identical maximum values exist and the institutions corresponding to them are different, the target affiliate of the PDF file cannot be determined automatically. Further methods such as manual identification are needed to determine the target affiliate of the PDF file.
With this solution, when the weight sequence contains a plurality of identical maximum values, the target affiliate associated with the PDF file can be determined by a further method, such as manual identification.
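The decision of steps 402 to 408 can be sketched as below. This is a minimal illustration only; the function name `pick_target_affiliate` is hypothetical, and returning `None` for an empty score table is an added assumption not discussed in the disclosure.

```python
def pick_target_affiliate(scores):
    """Return the single top-scoring institution, or None if undecided.

    None means the PDF file should be handed to a further method,
    such as manual identification (step 408).
    """
    if not scores:
        return None  # assumption: nothing to choose from
    best = max(scores.values())                              # step 402
    winners = [name for name, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 else None         # steps 404-408
```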
FIG. 5 illustrates a flow diagram of a method 500 for matching one or more report templates of the target affiliate with the coordinate information of the text block using a matching algorithm in accordance with an embodiment of the present disclosure. Method 500 corresponds to step 206 of method 200. The method 500 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 1200 shown in FIG. 12.
As indicated above, since the target affiliate may be provided with different report templates for different reports, matching of one or more report templates of the target affiliate is also required after the target affiliate is determined.
At step 502, the computing device 110 may define an identifying feature block for each of the one or more reporting templates, respectively.
In one embodiment, the computing device 110 may define an identifying feature block separately for each of one or more report templates of the target affiliate.
FIG. 6 shows a schematic diagram of defining identifying feature blocks according to an embodiment of the present disclosure. As shown in FIG. 6, the computing device 110 may define three identifying feature blocks for the report, namely a stock code area, a title area, and a summary area. These three key areas essentially cover the main features of the report. Based on these three features, it can be determined whether the PDF file belongs to this report template. In other embodiments, identifying feature blocks containing other regions may also be defined. For example, the top right corner may also be defined as the identifying feature block of the title area. The more identifying feature blocks are defined, the more accurate the identification match.
At step 504, the computing device 110 may obtain coordinate information of the identifying feature blocks.
In one embodiment, the computing device 110 may obtain coordinate information for the identifying feature blocks defined at step 502. The coordinate information may include an upper left coordinate and a lower right coordinate of the identifying feature block. Alternatively, the upper right coordinate and the lower left coordinate of the identifying feature block, or other coordinates that delimit the identifying feature block, may be used. Note that the coordinate information may be transformed according to the definition of the coordinate system. Such transformed coordinates are all included in the technical solution of the present disclosure.
At step 506, the computing device 110 may calculate, for each of the one or more report templates, a matching value of the report template to the text block according to a matching function based on the coordinate information of the text block and the coordinate information of the identifying feature block of the report template.
In one embodiment, for each of the one or more report templates of the target affiliate, the computing device 110 may calculate a match value of the report template to the text block according to a matching function based on the coordinate information of the text block of the PDF file and the coordinate information of the identifying feature block of the report template acquired in step 504.
Taking the upper left coordinate and the lower right coordinate of the text block and of the identifying feature block as an example, the matching function may be expressed as follows: if any one of the following conditions is satisfied, the matching value of the matching function is a first predetermined value:
the abscissa value of the upper left coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the identifying feature block, and the ordinate value of the upper left coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the identifying feature block;
the abscissa value of the lower right coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the identifying feature block, and the ordinate value of the lower right coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the identifying feature block;
the abscissa value of the upper left coordinate of the identifying feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block, and the ordinate value of the upper left coordinate of the identifying feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block; and
the abscissa value of the lower right coordinate of the identifying feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block, and the ordinate value of the lower right coordinate of the identifying feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block.
In a case where none of the above conditions is satisfied, the matching value of the matching function is a second predetermined value. In an embodiment, the first predetermined value may be 1 and the second predetermined value may be 0. With the above matching function, the computing device 110 may calculate a matching value of the report template to one text block.
In combination with equation (1), the matching function $f(t_i, r_j)$ can be expressed as:
$$
f(t_i, r_j) =
\begin{cases}
1, & p^{\mathrm{ul}}_{t_i} \in r_j \ \lor\ p^{\mathrm{lr}}_{t_i} \in r_j \ \lor\ p^{\mathrm{ul}}_{r_j} \in t_i \ \lor\ p^{\mathrm{lr}}_{r_j} \in t_i \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$
wherein $t_i$ represents a text block, $r_j$ represents an identifying feature block, $p^{\mathrm{ul}}_{t_i}$ represents the upper left coordinate of the text block $t_i$, $p^{\mathrm{lr}}_{t_i}$ represents the lower right coordinate of the text block $t_i$, $p^{\mathrm{lr}}_{r_j}$ represents the lower right coordinate of the identifying feature block $r_j$, and $p^{\mathrm{ul}}_{r_j}$ represents the upper left coordinate of the identifying feature block $r_j$. Here $p \in b$ denotes that both the abscissa value and the ordinate value of the coordinate $p$ fall within the corresponding value intervals of the upper left coordinate and the lower right coordinate of the block $b$, and 1 and 0 are the first and second predetermined values of the embodiment above.
Note that the matching function may be set to other combinations of conditions according to the area of the identifying feature block. For example, the matching value of the matching function may be the first predetermined value only if any one of the following conditions is satisfied:
the abscissa value of the upper left coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the identifying feature block, and the ordinate value of the upper left coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the identifying feature block; and
the abscissa value of the lower right coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the identifying feature block, and the ordinate value of the lower right coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the identifying feature block.
In a case where none of the above conditions is satisfied, the matching value of the matching function is the second predetermined value. The user can flexibly set the matching function to meet different recognition matching requirements.
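By way of illustration only, the coordinate-based matching function described above can be sketched as follows, with the first and second predetermined values taken as 1 and 0. The type `Box`, the helper `contains`, and the flag `both_directions` are hypothetical names; closed intervals are an assumption, and `min`/`max` are used so that the test does not depend on whether the ordinate grows upward or downward in the chosen coordinate system.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x1: float  # abscissa of the upper left coordinate
    y1: float  # ordinate of the upper left coordinate
    x2: float  # abscissa of the lower right coordinate
    y2: float  # ordinate of the lower right coordinate


def contains(box, x, y):
    """True if (x, y) lies within the intervals spanned by the box corners."""
    return (min(box.x1, box.x2) <= x <= max(box.x1, box.x2)
            and min(box.y1, box.y2) <= y <= max(box.y1, box.y2))


def match(text, feature, both_directions=True):
    """Return 1 if any corner condition holds, otherwise 0.

    both_directions=True  -> the four-condition variant;
    both_directions=False -> only the two text-block corner conditions.
    """
    hit = (contains(feature, text.x1, text.y1)
           or contains(feature, text.x2, text.y2))
    if both_directions:
        hit = (hit
               or contains(text, feature.x1, feature.y1)
               or contains(text, feature.x2, feature.y2))
    return 1 if hit else 0
```

The two-condition variant is stricter: a text block that completely surrounds the feature block matches only when the feature block corners are also tested.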
Since the PDF file includes one or more text blocks, the computing device 110 may sequentially and respectively calculate matching values of the one or more text blocks and the report template, thereby obtaining total matching degree data of the PDF file and the report template.
At step 508, the computing device 110 may operate on all of the calculated match values to determine match data for one or more report templates to the PDF file.
In one embodiment, the computing device 110 may operate on the matching values calculated in step 506 for one or more text blocks that respectively match a reporting template to determine the matching data of the reporting template to the PDF file. Here, the operation may be implemented by direct summation.
In combination with equation (2), the matching degree data $S$ can be expressed as:
$$
S = \sum_{i=1}^{n} \sum_{j=1}^{m} f(t_i, r_j) \tag{2}
$$
wherein $f(t_i, r_j)$ represents the matching score given by the matching function $f$ for the i-th text block $t_i$ and the j-th identifying feature block $r_j$, $n$ is the number of text blocks of the PDF file, and $m$ is the number of identifying feature blocks of the report template.
In another embodiment, the operation may be implemented by weighted summation. For example, different weighting factors can be set for the one or more text blocks, i.e., a corresponding weighting factor is assigned to each match. For example, the more important regions may be given a larger weighting factor. This helps make the match data of the report template and the PDF file sufficiently accurate in the weighted sum.
In this way, by calculating the matching degree data of each of the one or more report templates of the target affiliate with the PDF file, the matching degree data of each report template with the PDF file can be acquired. The report template with the highest score can therefore be selected as the report template of the target affiliate that matches the PDF file.
With this technical solution, the report template of the target affiliate that best matches the PDF file can be calculated through the matching function based on the coordinate information. The template may be used in subsequent steps to accurately mine the PDF file.
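The aggregation of step 508 and the template selection of step 208 can be sketched as below, under the assumption that the individual matching values have already been computed. The names `match_degree` and `best_template` are hypothetical, and how ties between templates are resolved is not specified in the disclosure; this sketch simply returns the first template with the highest score.

```python
def match_degree(match_values, weights=None):
    """Direct sum of matching values, or a weighted sum if weights are given."""
    if weights is None:
        return float(sum(match_values))
    if len(weights) != len(match_values):
        raise ValueError("one weight is required per matching value")
    return float(sum(w * v for w, v in zip(weights, match_values)))


def best_template(match_data):
    """Return the template name with the highest match degree, or None."""
    if not match_data:
        return None
    return max(match_data, key=match_data.get)
```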
Fig. 7 shows a flowchart of a method 700 of mining data in the PDF file corresponding to the determined report template according to an embodiment of the present disclosure. Method 700 corresponds to step 210 of method 200. The method 700 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 1200 shown in FIG. 12.
At step 702, the computing device 110 may define a mined feature block for each of the one or more report templates, respectively.
In one embodiment, similar to identifying feature blocks as described above, the computing device 110 may define mined feature blocks separately for each of the one or more report templates. The mining feature block may cover the area where the data that needs to be mined is located. For example, if stock codes and summaries in a PDF file need to be mined, stock code areas and summary areas may be defined as mining feature blocks. In this step, one or more mined feature blocks may be defined.
At step 704, the computing device 110 may obtain coordinate information for the mined feature blocks.
In one embodiment, the computing device 110 may obtain coordinate information of the mined feature blocks. The coordinate information may include an upper left coordinate and a lower right coordinate of the mined feature block. Alternatively, the upper right coordinate and the lower left coordinate of the mined feature block, or other coordinates that delimit the mined feature block, may be used. Note that the coordinate information may be transformed according to the definition of the coordinate system. Such transformed coordinates are all included in the technical solution of the present disclosure.
At step 706, the computing device 110 may calculate a match value of the determined report template to the text block according to a mining match function based on the coordinate information of the text block and the coordinate information of the determined mining feature block of the report template.
In one embodiment, based on the report template of the target affiliate associated with the PDF file determined in the previous step, the computing device 110 may mine the PDF file according to the coordinate information of the mined feature blocks of the report template, the coordinate information of the text blocks, and the mining matching function. The mining matching function in step 706 may be similar to the matching function in step 506. Taking a similar matching function as an example, the mining matching function may be expressed as follows: if any one of the following conditions is satisfied, the matching value of the mining matching function is a first predetermined value:
the abscissa value of the upper left coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the mined feature block, and the ordinate value of the upper left coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the mined feature block;
the abscissa value of the lower right coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the mined feature block, and the ordinate value of the lower right coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the mined feature block;
the abscissa value of the upper left coordinate of the mined feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block, and the ordinate value of the upper left coordinate of the mined feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block; and
the abscissa value of the lower right coordinate of the mined feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block, and the ordinate value of the lower right coordinate of the mined feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block.
In a case where none of the above conditions is satisfied, the matching value of the mining matching function is a second predetermined value. In an embodiment, the first predetermined value may be 1 and the second predetermined value may be 0. With the above mining matching function, the computing device 110 may determine whether a text block of a PDF file matches a report template.
In combination with equation (3), the mining matching function $g(t_i, m_j)$ can be expressed as:
$$
g(t_i, m_j) =
\begin{cases}
1, & p^{\mathrm{ul}}_{t_i} \in m_j \ \lor\ p^{\mathrm{lr}}_{t_i} \in m_j \ \lor\ p^{\mathrm{ul}}_{m_j} \in t_i \ \lor\ p^{\mathrm{lr}}_{m_j} \in t_i \\
0, & \text{otherwise}
\end{cases}
\tag{3}
$$
wherein $t_i$ represents a text block, $m_j$ represents a mined feature block, $p^{\mathrm{ul}}_{t_i}$ represents the upper left coordinate of the text block $t_i$, $p^{\mathrm{lr}}_{t_i}$ represents the lower right coordinate of the text block $t_i$, $p^{\mathrm{lr}}_{m_j}$ represents the lower right coordinate of the mined feature block $m_j$, and $p^{\mathrm{ul}}_{m_j}$ represents the upper left coordinate of the mined feature block $m_j$. Here $p \in b$ denotes that both the abscissa value and the ordinate value of the coordinate $p$ fall within the corresponding value intervals of the upper left coordinate and the lower right coordinate of the block $b$, and 1 and 0 are the first and second predetermined values of the embodiment above.
Note that the matching function used in the mining step may also be different from the matching function used in the matching step. Other combinations of conditions may be set according to the area of the mined feature block. For example, the matching value of the mining matching function may be the first predetermined value only if any one of the following conditions is satisfied:
the abscissa value of the upper left coordinate of the mined feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block, and the ordinate value of the upper left coordinate of the mined feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block; and
the abscissa value of the lower right coordinate of the mined feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block, and the ordinate value of the lower right coordinate of the mined feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block.
In a case where none of the above conditions is satisfied, the matching value of the mining matching function is the second predetermined value. The user can flexibly set the mining matching function to meet different recognition matching requirements.
Since the PDF file includes one or more text blocks, the computing device 110 may, in turn, calculate matching values for the one or more text blocks to mined feature blocks of the report template, respectively. Subsequently, text blocks in the PDF file can be mined according to the matching values.
At step 708, the computing device 110 may mine text blocks for which the match value is not the second predetermined value as data corresponding to the determined report template.
In one embodiment, the computing device 110 may mine the text blocks whose matching value, as computed in step 706 by the mining matching function, is not the second predetermined value (e.g., 0), i.e., is the first predetermined value (e.g., 1), as data corresponding to the determined report template. A matching value of 0 represents a complete mismatch between the text block and the mined feature blocks of the report template.
Mining includes extracting data in the PDF file and/or storing it in a corresponding database according to the names of the mined feature blocks. For example, "600315.SH" in FIG. 6 may be mined as the stock code of the PDF report. At the same time, "600315.SH" is stored in the location or database where the stock code for that PDF file should be stored. In this way, features such as the title, author, writing date, stock name, stock code, and abstract of the PDF file can each be extracted and/or stored in the corresponding database, thereby obtaining the desired original text information data of the PDF file.
In one embodiment, if the matching values of all the mined feature blocks of the matched report template with the text blocks are 0, the target affiliate may be considered to have introduced a new report template. In this case, the PDF file may be transferred to other processing. For example, a new report template is added or defined, and identifying feature blocks and mined feature blocks are defined for the new report template.
With this technical solution, the data in the PDF file corresponding to the matched report template can be mined by the mining matching function. Such data may be extracted and/or stored together with its defined meaning, so that the data of the PDF file is mined quickly and accurately while its actual meaning is retained.
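As an illustrative sketch of steps 706 and 708, the following code assigns each matching text block to the name of the mined feature block it matches. Boxes are written as `(x1, y1, x2, y2)` tuples holding the upper left and lower right coordinates; the names `inside`, `mining_match`, and `mine`, the field names, and all coordinate values are hypothetical. An empty result corresponds to the case where every matching value is 0, which may indicate a new report template.

```python
def inside(box, x, y):
    """True if (x, y) lies within the intervals spanned by the box corners."""
    x1, y1, x2, y2 = box
    return min(x1, x2) <= x <= max(x1, x2) and min(y1, y2) <= y <= max(y1, y2)


def mining_match(text_box, region):
    """Four-condition mining matching function; returns 1 or 0."""
    tx1, ty1, tx2, ty2 = text_box
    rx1, ry1, rx2, ry2 = region
    hit = (inside(region, tx1, ty1) or inside(region, tx2, ty2)
           or inside(text_box, rx1, ry1) or inside(text_box, rx2, ry2))
    return 1 if hit else 0


def mine(text_blocks, regions):
    """Map each mined feature block name to the texts of matching blocks.

    text_blocks: list of (box, text); regions: {name: box}.
    """
    record = {}
    for name, region in regions.items():
        hits = [text for box, text in text_blocks
                if mining_match(box, region) != 0]
        if hits:
            record[name] = hits
    return record
```

Keying the result by feature block name is what preserves the meaning of each mined value, for example that "600315.SH" is a stock code rather than arbitrary text.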
FIG. 8 illustrates a flow diagram of a method 800 of verifying the legitimacy of mined data in accordance with an embodiment of the present disclosure.
In step 802, the computing device 110 may verify the validity of the mined data based on the determined data structure of the mined feature blocks of the report template.
In one embodiment, the computing device 110 may define a legal data structure of mined feature blocks based on the determined data structure of the mined feature blocks of the report template. For example, a stock code may be defined as a structural form of "numeric. Note that because the data representation is different, multiple legal data structures may be defined for mining feature blocks. For example, a stock code may be defined as "number," "number | english character," "number english character," and so forth. Data conforming to such a structure can be determined to be legitimate and normalized in subsequent steps. Through the defined data structure, the computing device 110 may verify the legitimacy of the data mined in the above-described method.
In step 804, the computing device 110 may perform a normalization process on the mined data in the PDF file in response to the legitimacy of the mined data being legitimate.
In one embodiment, the computing device 110 may perform a normalization process on the mined data in the PDF file in response to the validity of the mined data in the above-described method being legal. Computing device 110 may normalize the mined legitimate one or more different forms of data into a standard data format. For example, stock codes expressed in the format of "numeral", "numeral | english character", and "numeral english character" are normalized to "numeral.
In step 806, the computing device 110 may, in response to the mined data being invalid, determine another report template based on the match data and re-mine the PDF file.
In one embodiment, the computing device 110 may determine that the mined data is erroneous in response to the mined data being invalid. For example, if the data mined for the stock code consists of Chinese characters, the report template is determined to be incorrect. The computing device 110 may then determine another report template using the match data calculated in the above-described method and re-mine the PDF file.
With this technical solution, data represented in a plurality of different data structures can be normalized to standard data. At the same time, the method makes it possible to determine whether the mining is correct.
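A minimal sketch of the validity check and normalization for a stock code is given below. It assumes, for illustration only, that a code consists of six digits optionally followed by a two-letter exchange suffix, and that the standard form is digits, a dot, and the upper-case suffix, as in the "600315.SH" example of FIG. 6. The pattern and the name `normalize_stock_code` are hypothetical.

```python
import re

# Assumed legal structures: "600315", "600315|SH", "600315SH", "600315.SH".
_STOCK_CODE = re.compile(r"([0-9]{6})(?:[.|]?([A-Za-z]{2}))?")


def normalize_stock_code(raw, default_suffix=None):
    """Return the code in the assumed standard form, or None if invalid.

    A None result corresponds to invalid mined data (step 806), e.g.
    Chinese characters mined where a stock code was expected.
    """
    m = _STOCK_CODE.fullmatch(raw.strip())
    if m is None:
        return None
    digits, suffix = m.groups()
    suffix = suffix or default_suffix
    return f"{digits}.{suffix.upper()}" if suffix else digits
```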
FIG. 9 illustrates a flow diagram of a method 900 of performing a normalization process on mined data in the PDF file according to an embodiment of the disclosure. Method 900 corresponds to step 804 of method 800.
At step 902, the computing device 110 may define a standard expression of mined feature blocks of the report template.
In one embodiment, as described above, the computing device 110 may define a standard expression for the mined feature blocks of the report template. The same data may be given different names in different reports. For example, "financing cash flow", "financing activity cash flow net amount", etc. may all represent the same data. The standard expression "financing cash flow" may thus be defined.
At step 904, computing device 110 may define a corresponding plurality of non-standard expressions based on the standard expressions.
In one embodiment, computing device 110 may define a corresponding plurality of non-standard expressions based on the standard expressions. For example, multiple non-standard expressions such as "financing activity cash flow", "financing activity cash flow volume", and "financing activity cash flow net amount" may be defined for the standard expression "financing cash flow". This step is equivalent to creating a look-up table that maps the different written forms to the standard expression.
At step 906, the computing device 110 may uniformly convert the non-standard expressions in the mined data to standard expressions.
In one embodiment, the computing device 110 may uniformly convert all defined non-standard expressions in the mined data to standard expressions. For example, the mined "financing activity cash flow" and "financing activity cash flow net amount" can be uniformly converted into "financing cash flow". That is, the plurality of non-standard expressions in the look-up table are normalized into one standard expression.
According to the same principle, it is also possible to convert numerical data into correct floating point numbers, or to convert numbers expressed in different units into a unified unit, and the like.
By using the technical scheme, the data which are obtained by mining and expressed as different expressions can be normalized into the standard data of the unified expression.
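The look-up table of steps 902 to 906 can be sketched as follows. The table contents are taken from the example above; the names `STANDARD_EXPRESSIONS`, `build_lookup`, and `normalize_name` are hypothetical, and folding case and whitespace before the look-up is an added assumption.

```python
# Standard expression -> non-standard expressions (steps 902 and 904).
STANDARD_EXPRESSIONS = {
    "financing cash flow": [
        "financing activity cash flow",
        "financing activity cash flow volume",
        "financing activity cash flow net amount",
    ],
}


def build_lookup(table):
    """Invert the table so every written form maps to its standard form."""
    lookup = {}
    for standard, variants in table.items():
        lookup[standard] = standard
        for variant in variants:
            lookup[variant] = standard
    return lookup


def normalize_name(name, lookup):
    """Step 906: return the standard expression, or the name unchanged."""
    key = " ".join(name.lower().split())  # assumption: ignore case/spacing
    return lookup.get(key, name)
```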
Fig. 10 shows a flow diagram of a method 1000 of performing a normalization process on mined data in the PDF file according to an embodiment of the present disclosure. Method 1000 corresponds to step 804 of method 800.
At step 1002, the computing device 110 may determine, based on the mined data, the closest actual year associated with the stock data in the mined data.
In one embodiment, the computing device 110 may determine, based on the mined data, the closest actual year associated with the stock data in the mined data. For example, if the mined data includes year data such as 2020, 2021, 2022, and 2023, and the current year is assumed to be 2021, the data from 2022 onward are forecast data, so 2021 may be defined as the closest actual year.
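For illustration, the choice of the closest actual year can be sketched as below, assuming that every year up to and including the current year holds actual data and every later year holds forecast data. The function name `closest_actual_year` is hypothetical.

```python
def closest_actual_year(years, current_year):
    """Return the latest year not after current_year, or None if absent."""
    actual = [year for year in years if year <= current_year]
    return max(actual) if actual else None
```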
At step 1004, the computing device 110 may query the actual data of the stock data at the closest actual year.
In one embodiment, the computing device 110 may query, via a database or other means, the actual data of the stock data for the closest actual year. For example, the "financing cash flow" of the stock in 2021 may be queried, and the queried data may be defined as the real data.
At step 1006, the computing device 110 may compare the stock data with the real data to obtain the unit of the stock data.
In one embodiment, the computing device 110 may compare the stock data mined by the method described above with the real data defined in step 1004 to determine whether the mined data is correct.
If the mined stock data differs too much from the real data, that is, if the difference computed by a given operation exceeds a certain threshold, the mining can be considered erroneous, and the matching template needs to be determined again or the mining needs to be performed again.
If the mined data and the real data are similar, that is, if the difference computed by a given operation does not exceed the threshold, the unit of the stock data in the PDF table is calculated from the conversion ratio between the two values.
For example, if the queried real data is "1000000" and the mined data is "100", the ratio between the two values is 10000, so the unit of the data in the PDF table can be calculated as "ten thousand". By contrast, if the queried real data is "5000000" and the mined data is "200", the mining may be considered erroneous, and additional processing may be required.
With this technical solution, mined data with different expressions can be normalized into data with a standard expression.
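The following sketch illustrates one possible reading of steps 1002 to 1006. The disclosure does not specify the comparison operation or the threshold, so the relative tolerance, the candidate units in `CANDIDATE_UNITS`, and the function names are all assumptions made for the example.

```python
def closest_actual_year(years, current_year):
    """Return the latest year in the data that is not after the current year."""
    actual = [year for year in years if year <= current_year]
    return max(actual) if actual else None


# Hypothetical candidate units (multiplier -> unit name).
CANDIDATE_UNITS = {
    1: "one",
    1_000: "thousand",
    10_000: "ten thousand",
    1_000_000: "million",
    100_000_000: "hundred million",
}


def infer_unit(mined_value, real_value, tolerance=0.05):
    """Return the unit whose multiplier explains real/mined, else None.

    None means no candidate unit lies within the relative tolerance,
    which this sketch treats as a mining error.
    """
    if mined_value == 0:
        return None
    ratio = real_value / mined_value
    for factor, name in CANDIDATE_UNITS.items():
        if abs(ratio - factor) / factor <= tolerance:
            return name
    return None
```

With the figures from the example above, `infer_unit(100, 1000000)` yields `"ten thousand"`, while `infer_unit(200, 5000000)` yields `None` because the ratio of 25000 is not close to any candidate multiplier in this sketch.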
FIG. 11 shows a flow diagram of a method 1100 for deduplicating acquired text blocks in accordance with an embodiment of the present disclosure.
At step 1102, the computing device 110 calculates the number of text characters of the acquired text block, that is, the number of characters in the text string within the text block acquired by the method described above. For example, the number of characters in the "balance sheet" text block may be determined to be 5 (the term has five characters in Chinese, 资产负债表).
At step 1104, the computing device 110 determines whether the calculated number of text characters is greater than or equal to a predetermined character number threshold. The user may set a character number threshold for the deduplication algorithm, e.g., 10 characters. The threshold is used to judge whether a text string is long or short, since different deduplication algorithms suit different types of strings.
At step 1106, the computing device 110 calculates a similarity of the acquired text blocks based on a first algorithm in response to determining that the calculated number of text characters is greater than or equal to the predetermined character number threshold. If the number of text characters is greater than or equal to the predetermined character number threshold, the text is determined to be a long string, and the similarity of the text blocks is calculated by applying the first algorithm. The first algorithm for long strings may be any deduplication algorithm that performs well on long strings, such as a simhash deduplication algorithm, a hashmap deduplication algorithm, or the like.
At step 1108, the computing device 110 calculates a similarity of the acquired text blocks based on a second algorithm in response to determining that the calculated number of text characters is less than the predetermined character number threshold. If the number of text characters is less than the predetermined character number threshold, the text is determined to be a short string, and the similarity of the text blocks is calculated by applying the second algorithm. The second algorithm for short strings may be any deduplication algorithm that performs well on short strings, such as a MinHash deduplication algorithm, a set deduplication algorithm, or the like.
At step 1110, the computing device 110 performs deduplication on the acquired text blocks based on the similarity calculation result. The similarity between the text blocks is calculated by the corresponding algorithm determined in step 1106 or 1108, so that text blocks with high similarity can be regarded as repeated text blocks, which are then removed or merged, thereby completing the deduplication.
With this technical solution, it can be determined whether duplicate data or duplicate indexes exist in the table. If they exist, the duplicate cells can be merged or eliminated according to the deduplication algorithm, thereby improving table processing efficiency.
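The length-based dispatch of method 1100 might be sketched as follows. This is not the simhash or MinHash algorithm named above: to keep the example dependent only on the Python standard library, `difflib.SequenceMatcher` stands in for the long-string algorithm and a character-set Jaccard index stands in for the short-string algorithm. The similarity cutoff of 0.9 and the rule of classifying a pair by its shorter member are also assumptions.

```python
from difflib import SequenceMatcher

CHAR_THRESHOLD = 10  # example threshold value given in the description


def long_similarity(a, b):
    """Stand-in for the first algorithm (long strings)."""
    return SequenceMatcher(None, a, b).ratio()


def short_similarity(a, b):
    """Stand-in for the second algorithm: Jaccard index over character sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def similarity(a, b, threshold=CHAR_THRESHOLD):
    """Pick the algorithm by character count (steps 1104 to 1108)."""
    # Assumption: a pair counts as "long" only if both strings reach the threshold.
    if min(len(a), len(b)) >= threshold:
        return long_similarity(a, b)
    return short_similarity(a, b)


def deduplicate(blocks, min_similarity=0.9):
    """Keep the first of each group of near-duplicate blocks (step 1110)."""
    kept = []
    for block in blocks:
        if not any(similarity(block, other) >= min_similarity for other in kept):
            kept.append(block)
    return kept
```

This sketch removes later duplicates; merging repeated cells, which the description also permits, would replace the `kept.append` logic with a merge step.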
FIG. 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. For example, the computing device 110 as shown in fig. 1 may be implemented by the electronic device 1200. As shown, the electronic device 1200 includes a Central Processing Unit (CPU) 1201 that may perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM) 1202 or computer program instructions loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the random access memory 1203, various programs and data necessary for the operation of the electronic apparatus 1200 may also be stored. The central processing unit 1201, the read only memory 1202, and the random access memory 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
A number of components in the electronic device 1200 are connected to the input/output interface 1205, including: an input unit 1206 such as a keyboard, a mouse, a microphone, and the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The various procedures and processes described above, such as methods 200, 300, 400, 500, 700, 800, 900, and 1100, may be performed by the central processing unit 1201. For example, in some embodiments, methods 200, 300, 400, 500, 700, 800, 900, and 1100 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208. In some embodiments, some or all of the computer programs may be loaded and/or installed on the device 1200 via the read only memory 1202 and/or the communication unit 1209. When the computer program is loaded into the random access memory 1203 and executed by the central processing unit 1201, one or more of the actions of the methods 200, 300, 400, 500, 700, 800, 900 and 1100 described above may be performed.
The present disclosure relates to methods, apparatuses, systems, electronic devices, computer-readable storage media and/or computer program products. The computer program product may include computer-readable program instructions for performing various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge computing devices. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), is personalized by utilizing the state information of the computer-readable program instructions, and this electronic circuitry can execute the computer-readable program instructions to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be appreciated by persons skilled in the art that the present invention is not limited to the embodiments described above, but that the invention may be embodied in many other forms without departing from the spirit or scope of the invention. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made thereto without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims (12)

1. A method for mining PDF files, comprising:
analyzing the text block of the PDF file so as to obtain the coordinate information of the text block of the PDF file;
determining a target association mechanism associated with the PDF file using a mechanism determination algorithm based on the parsed text blocks of the PDF file;
matching one or more report templates of the target association mechanism with the coordinate information of the text block by using a matching algorithm, so as to determine matching degree data of the one or more report templates and the PDF file;
determining a report template of the target association mechanism corresponding to the PDF file based on the acquired matching degree data; and
mining data corresponding to the determined report template in the PDF file based on the determined report template.
2. The method of claim 1, wherein determining a target association mechanism associated with the PDF file using a mechanism determination algorithm comprises:
constructing a mechanism key feature array for a plurality of mechanisms associated with a PDF file, the mechanism key feature array comprising: the number of key features associated with the mechanism, the key features, and the weights to which the key features correspond;
searching, based on the mechanism key feature array, the text blocks parsed from the PDF file so as to determine the number of occurrences of the key features associated with the mechanism; and
calculating, based on the determined number of occurrences of the key features associated with the mechanism, a weight sequence for determining the target association mechanism of the PDF file.
3. The method of claim 2, wherein determining a target association mechanism for the PDF file further comprises:
determining a mechanism corresponding to a maximum value in the weight sequence;
determining whether the number of mechanisms corresponding to the maximum value is 1;
in response to determining that the number of mechanisms corresponding to the maximum value is 1, determining that the mechanism corresponding to the maximum value is the target association mechanism of the PDF file; and
in response to determining that the number of mechanisms corresponding to the maximum value is greater than 1, determining that the target association mechanism is not identified.
4. The method of claim 1, wherein matching one or more report templates of the target association mechanism with the coordinate information of the text block using a matching algorithm comprises:
defining, for each of the one or more report templates, an identification feature block;
acquiring coordinate information of the identification feature block;
for each of the one or more report templates, calculating a matching value of the report template and a text block according to a matching function based on coordinate information of the text block and coordinate information of an identification feature block of the report template; and
performing a calculation over all the calculated matching values so as to determine the matching degree data of the one or more report templates and the PDF file.
5. The method of claim 4, wherein calculating a match value of the report template to a text block according to a match function comprises:
the match value of the match function is a first predetermined value if it is determined that at least one of the following conditions is met:
the abscissa value of the upper left coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the identification feature block and the ordinate value of the upper left coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the identification feature block,
the abscissa value of the lower right coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the identification feature block and the ordinate value of the lower right coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the identification feature block,
the abscissa value of the upper left coordinate of the identification feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block and the ordinate value of the upper left coordinate of the identification feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block,
the abscissa value of the lower right coordinate of the identification feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block and the ordinate value of the lower right coordinate of the identification feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block; and
the match value of the match function is a second predetermined value if it is determined that none of the above conditions is met.
6. The method of claim 1 or 5, wherein mining data in the PDF file corresponding to the determined report template comprises:
for each report template of the one or more report templates, respectively defining a mining feature block;
determining coordinate information of the mining feature block;
calculating a matching value of the determined report template and the text block according to a mining matching function based on the coordinate information of the text block and the coordinate information of the mining feature block of the determined report template; and
mining the text blocks with the matching value not being the second predetermined value as the data corresponding to the determined report template.
7. The method of claim 6, further comprising:
verifying the validity of the mined data based on the data structure of the mining feature block of the determined report template;
in response to the mined data being valid, performing normalization processing on the mined data in the PDF file; and
in response to the mined data being invalid, determining another report template based on the matching degree data and mining the PDF file again.
8. The method of claim 7, wherein performing a normalization process on the mined data in the PDF file comprises:
defining a standard expression of a mining feature block of the report template;
defining a plurality of corresponding non-standard expressions based on the standard expressions; and
uniformly converting the non-standard expressions in the mined data into standard expressions.
9. The method of claim 8, wherein performing normalization processing on the mined data in the PDF file further comprises:
determining, based on the mined data, a closest actual year associated with stock data in the mined data;
querying real data of the stock data in the closest actual year; and
comparing the stock data with the real data to obtain the unit of the stock data.
10. The method of claim 1, further comprising:
calculating the number of text characters of the acquired text block;
determining whether the calculated number of text characters is greater than or equal to a predetermined number of characters threshold;
in response to determining that the calculated number of text characters is greater than or equal to a predetermined character number threshold, calculating a similarity of the acquired text blocks based on a first algorithm;
calculating a similarity of the acquired text blocks based on a second algorithm in response to determining that the calculated number of text characters is less than or equal to a predetermined character number threshold; and
performing deduplication on the acquired text block based on the similarity calculation result.
11. A computing device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
CN202210089715.3A 2022-01-26 2022-01-26 Method, apparatus and medium for mining PDF files Active CN114116616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210089715.3A CN114116616B (en) 2022-01-26 2022-01-26 Method, apparatus and medium for mining PDF files

Publications (2)

Publication Number Publication Date
CN114116616A (en) 2022-03-01
CN114116616B (en) 2022-05-17

Family

ID=80361462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210089715.3A Active CN114116616B (en) 2022-01-26 2022-01-26 Method, apparatus and medium for mining PDF files

Country Status (1)

Country Link
CN (1) CN114116616B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201620A (en) * 2021-12-17 2022-03-18 上海朝阳永续信息技术股份有限公司 Method, apparatus and medium for mining PDF tables in PDF file

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040172245A1 (en) * 2003-02-28 2004-09-02 Lee Rosen System and method for structuring speech recognized text into a pre-selected document format
US20060041539A1 (en) * 2004-06-14 2006-02-23 Matchett Douglas K Method and apparatus for organizing, visualizing and using measured or modeled system statistics
US20110311140A1 (en) * 2010-06-18 2011-12-22 Google Inc. Selecting Representative Images for Establishments
US20140280277A1 (en) * 2013-03-15 2014-09-18 Global Precision Solutions, Llp. System and method for integration and correlation of gis data
US20170193608A1 (en) * 2015-11-29 2017-07-06 Vatbox, Ltd. System and method for automatically generating reporting data based on electronic documents
CN107357779A (en) * 2017-06-27 2017-11-17 北京神州泰岳软件股份有限公司 A kind of method and device for obtaining organization names
CN108280173A (en) * 2018-01-22 2018-07-13 深圳市和讯华谷信息技术有限公司 A kind of key message method for digging, medium and the equipment of non-structured text
CN108962347A (en) * 2018-06-23 2018-12-07 北京众信易保科技有限公司 Position algorithm based on JAVA checks UP the resolution system of report
CN111797729A (en) * 2020-06-19 2020-10-20 翰博瑞强(上海)医药科技有限公司 Automatic identification method for assay report
CN112509661A (en) * 2021-02-03 2021-03-16 南京吉拉福网络科技有限公司 Methods, computing devices, and media for identifying physical examination reports
CN112800848A (en) * 2020-12-31 2021-05-14 中电金信软件有限公司 Structured extraction method, device and equipment of information after bill identification
US20210248420A1 (en) * 2020-02-07 2021-08-12 International Business Machines Corporation Automated generation of structured training data from unstructured documents

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DENNAI,A.等: "Relevant XML Documents - Approach Based on Vectors and Weight Calculation of Terms", 《INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCE》 *
杨瑞仙等: "面向知识评价的我国科研机构命名识别方法研究", 《情报杂志》 *

Also Published As

Publication number Publication date
CN114116616B (en) 2022-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder
Address after: 201203 Room 501, building 4, No. 690, Bibo Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai
Patentee after: SHANGHAI SUNTIME INFORMATION TECHNOLOGY CO.,LTD.
Address before: 201203 building 4, No. 690, Bibo Road, Zhangjiang Gaoke, Pudong New Area, Shanghai
Patentee before: SHANGHAI SUNTIME INFORMATION TECHNOLOGY CO.,LTD.
CP03 Change of name, title or address
Address after: Room 201-1 and Room 201-3, Building 4, No. 690 Bibo Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai, 201203
Patentee after: SHANGHAI SUNTIME INFORMATION TECHNOLOGY CO.,LTD.
Country or region after: China
Address before: 201203 Room 501, building 4, No. 690, Bibo Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai
Patentee before: SHANGHAI SUNTIME INFORMATION TECHNOLOGY CO.,LTD.
Country or region before: China