CN115422340A - Numerical value extraction method and device, electronic equipment and storage medium
- Publication number
- CN115422340A CN202211141324.8A
- Authority
- CN
- China
- Prior art keywords
- determining
- word
- sentence
- tree structure
- extracted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The embodiment of the invention provides a numerical value extraction method and device, electronic equipment, and a storage medium, and relates to the technical field of data processing. The method comprises: obtaining a PDF file to be extracted; determining coordinates of each element in the PDF file to be extracted; determining sentences and/or tables in the PDF file to be extracted based on the coordinates of the elements; analyzing the sentences to obtain each word and the parameter information corresponding to each word; creating a tree structure of each sentence based on the words and their parameter information; obtaining a keyword input by a user; and determining a numerical value corresponding to the keyword from the tree structures and/or the tables based on the keyword. All data corresponding to the keyword can be found quickly and accurately in a large number of PDF files, the data extraction cost can be reduced, and the inaccurate data extraction and low efficiency caused by manual statistics are avoided.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a numerical value extraction method and device, electronic equipment and a storage medium.
Background
Scientific and technical literature is the accumulated output of scientific and technological development, and is essential technical reference material for modern enterprises carrying out research, organizing production, and improving product quality. As science and technology advance and this literature accumulates, a large number of public PDF resources have become available on the internet, and they contain much valuable information worth mining. However, compared with traditional web pages, Word documents, TXT files, and the like, scientific literature in PDF form comes in varied layouts and lacks structural information, which makes table data, image data, and similar content very difficult to extract.
At present, numerical values are generally extracted from PDF files manually, and when there are many PDF files, a great deal of manpower is required to extract the needed values. This manner of extraction is error-prone and inefficient.
Disclosure of Invention
The invention aims to provide a numerical value extraction method, a numerical value extraction device, electronic equipment and a storage medium, which can improve the efficiency of extracting numerical values in PDF.
In order to achieve the above object, the embodiments of the present application adopt the following technical solutions:
in a first aspect, an embodiment of the present application provides a method for extracting a value, where the method includes:
acquiring a PDF file to be extracted;
determining coordinates of each element in the PDF file to be extracted, wherein each element comprises characters and lines;
determining sentences and/or tables in the PDF file to be extracted based on the coordinates of each element;
analyzing the sentences to obtain each word and parameter information corresponding to each word, wherein the parameter information comprises word attributes and upper-lower word relations;
creating a tree structure of each sentence based on each word and parameter information corresponding to each word;
acquiring a keyword input by a user;
and determining a numerical value corresponding to the keyword from each tree structure and/or table based on the keyword.
In an alternative embodiment, the step of determining the sentence in the PDF file to be extracted based on the coordinates of each element includes:
determining a distance between adjacent elements based on the coordinates of each element;
and when the distance between the adjacent elements is smaller than a preset distance, constructing a sentence based on the adjacent elements.
In an alternative embodiment, the method further comprises:
for each sentence, determining a first starting coordinate of the sentence and a second starting coordinate of a sentence next to the sentence;
under the condition that the abscissa of the first starting coordinate is the same as the abscissa of the second starting coordinate, determining whether a line exists in a preset range of the sentence or not;
if so, forming the sentences into a table.
In an optional embodiment, the step of determining, from each tree structure, a numerical value corresponding to the keyword based on the keyword includes:
traversing each tree structure based on the keywords;
determining word attributes of target words under the condition that the target words in a target tree structure are matched with the keywords;
when the word attribute of a target word is a noun, determining an object corresponding to the target word in the target tree structure;
and acquiring a numerical value corresponding to the object in the target tree structure.
In an optional embodiment, the step of traversing each tree structure based on the keyword includes:
and traversing each tree structure by in-order traversal based on the keywords.
In an alternative embodiment, the step of determining, from each table, a numerical value corresponding to the keyword based on the keyword includes:
and determining a numerical value corresponding to the keyword from each table by regular-expression matching based on the keyword.
In an alternative embodiment, the method further comprises:
and saving each tree structure and/or each table based on a protobuf format.
In a second aspect, an embodiment of the present application provides a numerical value extraction apparatus, including:
the first acquisition module is used for acquiring a PDF file to be extracted;
the first determination module is used for extracting coordinates of each element in the PDF file to be extracted, wherein each element comprises characters and lines;
the second determining module is used for determining sentences and/or tables in the PDF file to be extracted based on the coordinates of each element;
the analysis module is used for analyzing the sentences to obtain each word and parameter information corresponding to each word, wherein the parameter information comprises word attributes and the upper-lower word relation;
the creating module is used for creating a tree structure of each sentence based on each word and parameter information corresponding to each word;
the second acquisition module is used for acquiring keywords input by a user;
and the third determining module is used for determining numerical values corresponding to the keywords from each tree structure and/or table based on the keywords.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the numerical value extraction method when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the numerical value extraction method.
The application has the following beneficial effects:
the method comprises the steps of obtaining a PDF file to be extracted, determining coordinates of elements in the PDF file to be extracted, determining sentences and/or tables in the PDF file to be extracted based on the coordinates of the elements, analyzing the sentences to obtain words and parameter information corresponding to the words, creating tree structures of the sentences based on the words and the parameter information corresponding to the words, obtaining keywords input by a user, and determining numerical values corresponding to the keywords from the tree structures and/or the tables based on the keywords. All data corresponding to the keywords can be quickly and accurately found from a large number of PDF files, the data extraction cost can be reduced, and the problems of inaccurate data extraction and low efficiency caused by manual statistics are solved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a schematic block diagram of an electronic device according to an embodiment of the present invention;
FIG. 2 is a first flowchart of numerical value extraction steps according to an embodiment of the present invention;
FIG. 3 is a second flowchart of numerical value extraction steps according to an embodiment of the present invention;
FIG. 4 is a third flowchart of numerical value extraction steps according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a PDF file to be extracted according to an embodiment of the present invention;
FIG. 6 is a fourth flowchart of numerical value extraction steps according to an embodiment of the present invention;
FIG. 7 is a structural block diagram of a numerical value extraction apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that if the terms "upper", "lower", "inside", "outside", etc. indicate an orientation or a positional relationship based on that shown in the drawings or that the product of the present invention is used as it is, this is only for convenience of description and simplification of the description, and it does not indicate or imply that the device or the element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention.
Furthermore, the appearances of the terms "first," "second," and the like, if any, are only used to distinguish one description from another and are not to be construed as indicating or implying relative importance.
In the description of the present application, it is further noted that, unless expressly stated or limited otherwise, the terms "disposed," "mounted," "coupled," and "connected" are to be construed broadly. For example, a connection may be fixed, removable, or integral; it may be mechanical or electrical; it may be direct, or indirect through an intervening medium; or it may be an internal communication between two elements. Those of ordinary skill in the art can understand the specific meaning of the above terms in this application according to the specific case.
Through extensive research, the inventor found that numerical values are currently extracted from PDF files manually, and when there are many PDF files, a great deal of manpower is required to extract the needed values. This manner of extraction is error-prone and inefficient.
In view of the above problems, the present embodiment provides a numerical value extraction method and apparatus, an electronic device, and a storage medium. A PDF file to be extracted is acquired; coordinates of each element in the file are determined; sentences and/or tables in the file are determined based on those coordinates; the sentences are parsed to obtain each word and its parameter information; a tree structure of each sentence is created based on the words and their parameter information; a keyword input by a user is acquired; and a numerical value corresponding to the keyword is determined from the tree structures and/or tables. In this way, all data corresponding to the keyword can be found quickly and accurately across a large number of PDF files, the cost of data extraction is reduced, and the inaccuracy and low efficiency of manual statistics are avoided. The scheme provided by the embodiment is described in detail below.
This embodiment provides an electronic device capable of extracting numerical values. In one possible implementation, the electronic device may be a user terminal; for example, it may be, but is not limited to, a server, a smartphone, a personal computer (PC), a tablet computer, a personal digital assistant (PDA), a mobile internet device (MID), or the like.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present disclosure. The electronic device 100 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
The electronic apparatus 100 includes a value extraction device 110, a memory 120, and a processor 130.
The elements of the memory 120 and the processor 130 are electrically connected to each other directly or indirectly to achieve data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The value extracting device 110 includes at least one software functional module which can be stored in the memory 120 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the electronic device 100. The processor 130 is used for executing executable modules stored in the memory 120, such as software functional modules and computer programs included in the numerical value extracting device 110.
The memory 120 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 120 is used for storing a program, and the processor 130 executes the program after receiving an execution instruction.
Referring to fig. 2, fig. 2 is a flowchart of a method for extracting a value applied to the electronic device 100 of fig. 1, and the method including various steps will be described in detail below.
Step 201: acquiring the PDF file to be extracted.
Step 202: extracting the coordinates of each element in the PDF file to be extracted.
The elements include text and lines.
Step 203: determining sentences and/or tables in the PDF file to be extracted based on the coordinates of each element.
Step 204: analyzing the sentences to obtain each word and the parameter information corresponding to each word.
The parameter information comprises word attributes and upper-lower word relations, that is, which word precedes and which word follows a given word.
Step 205: creating a tree structure of each sentence based on each word and the parameter information corresponding to each word.
Step 206: acquiring the keyword input by the user.
Step 207: determining a numerical value corresponding to the keyword from each tree structure and/or table based on the keyword.
It should be noted that the PDF files to be extracted may be financial data PDF files, experimental data PDF files, research data PDF files, and the like, and the type of the PDF files to be extracted is not specifically limited in the present application.
In order to facilitate the extraction of the numerical values required in the PDF file to be extracted, the coordinates of each element in the PDF file to be extracted need to be extracted first, and since the elements in the PDF file to be extracted include characters and lines, the coordinates of each character and the coordinates of each line in the PDF file to be extracted are determined.
It should be noted that, for the coordinates of text, the lower-left corner of the current page may be taken as the origin, and the abscissa and ordinate of the text in this coordinate system are taken as the coordinates of the text.
The coordinates of lines may likewise take the lower-left corner of the current page as the origin. A line may be parallel to the horizontal axis or parallel to the vertical axis. When a line is parallel to the horizontal axis, its ordinate and the start and end values of its abscissa are determined; when a line is parallel to the vertical axis, its abscissa and the start and end values of its ordinate are determined.
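The coordinate conventions described above can be illustrated with a minimal Python sketch. The class and function names (`TextElement`, `LineElement`, `describe_line`) are hypothetical and do not come from the application; the sketch only assumes that each element has already been read from the page with the lower-left corner as the origin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextElement:
    """A character with its abscissa/ordinate, origin at the page's lower-left corner."""
    char: str
    x: float
    y: float


@dataclass(frozen=True)
class LineElement:
    """A straight line given by its two end points (hypothetical representation)."""
    x0: float
    y0: float
    x1: float
    y1: float


def describe_line(line: LineElement) -> dict:
    """Return the values the description records for an axis-parallel line."""
    if line.y0 == line.y1:
        # Parallel to the horizontal axis: one ordinate, start/end abscissa.
        return {"y": line.y0,
                "x_start": min(line.x0, line.x1),
                "x_end": max(line.x0, line.x1)}
    if line.x0 == line.x1:
        # Parallel to the vertical axis: one abscissa, start/end ordinate.
        return {"x": line.x0,
                "y_start": min(line.y0, line.y1),
                "y_end": max(line.y0, line.y1)}
    raise ValueError("only axis-parallel lines are covered by this sketch")


horizontal = describe_line(LineElement(10, 50, 200, 50))
vertical = describe_line(LineElement(10, 50, 10, 300))
```

Storing only one fixed coordinate plus a start/end range per line keeps the later "is there a line near this sentence" test to simple comparisons.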
Sentences and/or tables in the PDF file to be extracted are then determined based on the relations between the coordinates of the elements.
In an exemplary embodiment, when it is determined that the PDF file to be extracted contains five text elements, the coordinates of each element are determined, and when the distance between the coordinates of every two adjacent text elements is smaller than a preset distance, the five elements are determined to constitute one sentence. The sentence is segmented with jieba to obtain a plurality of words, and is processed by natural language processing (NLP) to determine the word attributes of those words; for example, a word may have a subject attribute, a predicate attribute, or an object attribute, or may be a noun, a verb, and so on. The words also stand in upper-lower word relations to one another. For example, take the sentence "The net profit value is A." After segmentation, the words are "net profit value", "is", and "A". "Net profit value" is the preceding word of "is", "A" is the following word of "is", and "net profit value", "is", and "A" are formed into a tree structure.
When the keyword input by the user is acquired and the keyword is "net profit value", the tree structure of each sentence can be searched by traversal; from the tree structure formed by "net profit value", "is", and "A", the numerical value corresponding to the keyword is determined to be "A", and A is extracted.
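One way to picture the tree for the example sentence is the following Python sketch. It assumes that segmentation and attribute tagging have already been done upstream (for instance by jieba and an NLP tagger) and are passed in as plain tuples; the `Node` class, the `pos`/`role` labels, and the choice to root the tree at the predicate are all illustrative assumptions, since the application does not specify the tree layout.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    word: str
    pos: str                        # part of speech, e.g. "noun" (assumed label)
    role: str                       # sentence role, e.g. "subject" (assumed label)
    left: Optional["Node"] = None   # subtree of preceding words
    right: Optional["Node"] = None  # subtree of following words


Word = Tuple[str, str, str]  # (word, pos, role)


def build_tree(words: List[Word]) -> Optional[Node]:
    """Build a binary tree whose in-order reading reproduces the word order.

    The root is the first word tagged as predicate; if there is none,
    the middle word is used (an assumption made for this sketch).
    """
    if not words:
        return None
    pivot = next((i for i, w in enumerate(words) if w[2] == "predicate"),
                 len(words) // 2)
    word, pos, role = words[pivot]
    return Node(word, pos, role,
                build_tree(words[:pivot]),
                build_tree(words[pivot + 1:]))


tree = build_tree([
    ("net profit value", "noun", "subject"),
    ("is", "verb", "predicate"),
    ("A", "numeral", "object"),
])
```

Here `tree.word` is "is", with "net profit value" as its preceding (left) child and "A" as its following (right) child, matching the relations described for the example sentence.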
In another example, when it is determined that the PDF file to be extracted contains five sentence elements and also contains line elements, and a line is detected around each sentence, the five sentences in the PDF file to be extracted are determined to constitute a table.
In another example, when it is determined that the PDF file to be extracted contains 10 text elements, the 10 text elements are determined to constitute two sentences based on the coordinates of each element, and when the first-row sentence and the second-row sentence are left-aligned, the two sentences are determined to constitute a table.
Each tree structure and/or each table is stored in protobuf format, which avoids occupying excessive storage space and can improve extraction speed.
When the keywords input by the user are obtained, each table can be searched, and numerical values corresponding to the keywords are determined from each table.
The method comprises the steps of obtaining a PDF file to be extracted, determining coordinates of elements in the PDF file to be extracted, determining sentences and/or tables in the PDF file to be extracted based on the coordinates of the elements, analyzing the sentences to obtain words and parameter information corresponding to the words, creating a tree structure of the sentences based on the words and the parameter information corresponding to the words, obtaining keywords input by a user, and determining numerical values corresponding to the keywords from the tree structure and/or the tables based on the keywords. All data corresponding to the keywords can be quickly and accurately found from a large number of PDF files, the data extraction cost can be reduced, and the problems of inaccurate data extraction and low efficiency caused by manual statistics are solved.
As to how to determine the sentences of the PDF file to be extracted, as shown in fig. 3, there is provided a numerical value extraction method including the steps of:
Step 203-1: obtaining the coordinates of each element.
Step 203-2: determining the distance between adjacent elements based on those coordinates.
Step 203-3: when the distance between adjacent elements is smaller than the preset distance, constructing a sentence based on the adjacent elements.
For example, the coordinates of each character in the PDF file to be extracted are determined, and the distance between the coordinates of two adjacent characters is determined. There are various ways to determine the distance between two coordinates; in one example, it may be calculated as the Euclidean distance.
The distance between adjacent characters is compared with the preset distance. When the distance is greater than the preset distance, the adjacent characters cannot form a sentence; when the distance is smaller than the preset distance, a sentence is formed based on the adjacent elements.
It should be noted that the preset distance may be set to be 3mm, 4mm, 5mm, and the like, which is not specifically limited in the present application.
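The grouping rule in steps 203-1 to 203-3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: characters arrive in reading order as hypothetical `(text, x, y)` tuples, `max_gap` is expressed in the same unit as the coordinates, and a gap exactly equal to the preset distance is treated as a break (the description does not say how equality is handled).

```python
import math
from typing import List, Tuple

Char = Tuple[str, float, float]  # (text, x, y), origin at lower-left of page


def group_into_sentences(chars: List[Char], max_gap: float) -> List[str]:
    """Join adjacent characters whose Euclidean distance is below max_gap."""
    sentences: List[str] = []
    current: List[Char] = []
    for ch in chars:
        if current:
            _, prev_x, prev_y = current[-1]
            _, x, y = ch
            # Euclidean distance between the two adjacent character coordinates.
            if math.hypot(x - prev_x, y - prev_y) >= max_gap:
                sentences.append("".join(c[0] for c in current))
                current = []
        current.append(ch)
    if current:
        sentences.append("".join(c[0] for c in current))
    return sentences


chars = [("n", 0, 0), ("e", 2, 0), ("t", 4, 0), ("A", 30, 0), ("B", 32, 0)]
sentences = group_into_sentences(chars, max_gap=5)
```

With the sample input, the first three characters are 2 units apart and are joined, while the 26-unit jump before "A" starts a new sentence.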
As for how to determine the table of the PDF file to be extracted, as shown in fig. 4, there is provided a numerical value extraction method including the steps of:
Step 301: for each sentence, determining a first start coordinate of the sentence and a second start coordinate of the sentence in the next row.
Step 302: when the abscissa of the first start coordinate is the same as the abscissa of the second start coordinate, determining whether a line exists within the preset range of the sentence.
Step 303: if so, forming the sentences into a table.
Illustratively, when determining a table in the PDF file to be extracted, the coordinates of each acquired text element are determined first, and sentences are formed based on those coordinates. For each sentence, the first start coordinate of the sentence and the second start coordinate of the sentence in the next row are determined. When the abscissas of the first and second start coordinates are the same and a line exists within the preset range of the sentence, that is, when the distance between a text element in the sentence and the line meets the preset condition, the sentences meeting the condition are formed into a table. The preset condition is that a line exists within the preset range of the sentence.
Illustratively, FIG. 5 shows a PDF file to be extracted. The sentences determined in the figure are those in the first column. The first start coordinate of the first-column sentence in the first row and the second start coordinate of the first-column sentence in the second row are determined, that is, the first start coordinate of "main accounting data" and the second start coordinate of "business total income", and their abscissas are found to be the same. In the same way, for each sentence, the first start coordinate of the sentence and the second start coordinate of the sentence in the next row are determined, and all sentences whose first and second start coordinates share the same abscissa are identified. It is then determined whether a line exists within the preset range of each such sentence, and the sentences for which a line exists are formed into a table.
In another example of determining the table of the PDF file to be extracted, the first start coordinate of each sentence and the second start coordinate of the sentence in the next row are determined, all sentences whose first and second start coordinates share the same abscissa are identified, and those sentences are formed into a table.
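Steps 301 to 303 can be sketched roughly as below. The `Sentence` and `HLine` classes, the alignment tolerance `tol`, and the way "within the preset range" is tested (a horizontal line that spans the sentence start and lies within `max_dist` vertically) are assumptions made for illustration; the application states only that the abscissas are the same and that a line exists within a preset range.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    text: str
    x: float  # abscissa of the start coordinate
    y: float  # ordinate of the start coordinate


@dataclass
class HLine:
    y: float        # ordinate of a horizontal line
    x_start: float
    x_end: float


def has_line_nearby(s: Sentence, lines: List[HLine], max_dist: float) -> bool:
    """True if some horizontal line spans the sentence start and is close to it."""
    return any(l.x_start <= s.x <= l.x_end and abs(l.y - s.y) <= max_dist
               for l in lines)


def table_rows(sentences: List[Sentence], lines: List[HLine],
               max_dist: float, tol: float = 0.5) -> List[Sentence]:
    """Collect left-aligned consecutive sentences that have a line nearby.

    `sentences` is assumed to be ordered from top row to bottom row.
    """
    rows: List[Sentence] = []
    for cur, nxt in zip(sentences, sentences[1:]):
        if abs(cur.x - nxt.x) <= tol:            # same start abscissa
            for s in (cur, nxt):
                if has_line_nearby(s, lines, max_dist) and s not in rows:
                    rows.append(s)
    return rows


sample = [
    Sentence("main accounting data", 50, 700),
    Sentence("business total income", 50, 680),
    Sentence("footnote text", 90, 600),
]
rules = [HLine(705, 40, 500), HLine(675, 40, 500)]
rows = table_rows(sample, rules, max_dist=10)
```

In the sample, the first two sentences share a start abscissa and each has a ruling line within 10 units, so both become table rows; the third is neither aligned nor near a line.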
For determining a numerical value from a tree structure of sentences based on a keyword input by a user, as shown in fig. 6, there is provided a numerical value extraction method including the steps of:
Step 207-1: traversing each tree structure based on the keywords.
Step 207-2: determining the word attribute of the target word when the target word in the target tree structure matches the keyword.
Step 207-3: when the word attribute of the target word is a noun, determining the object corresponding to the target word in the target tree structure.
Step 207-4: acquiring the numerical value corresponding to the object in the target tree structure.
Traversing the tree structure corresponding to each sentence, determining the matched tree structure as a target tree structure when the tree structure matched with the keyword is detected, determining the word attribute of the target word matched with the keyword from the target tree structure, determining the object of the target word in the target tree structure when the target word is a noun, determining the numerical value after the object in the target tree structure, and outputting the numerical value.
In another example, a plurality of sentences may be formed into a paragraph, a tree structure corresponding to the paragraph is formed based on each word in the paragraph and a word attribute of each word, and when a keyword input by a user is received, the tree structure corresponding to each paragraph is traversed, and a numerical value corresponding to the keyword is extracted from each tree structure.
It should be noted that, when traversing the tree structure based on the keyword, the numerical value corresponding to the keyword may be determined by pre-order traversal, in-order traversal, or post-order traversal; the traversal manner is not specifically limited in the present application.
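The lookup in steps 207-1 to 207-4 might be sketched as follows, here using in-order (middle-order) traversal as one of the permitted orders. The `Node` layout and the `pos`/`role` labels are hypothetical, and treating "the object corresponding to the target word" as the first later node whose role is "object" is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Node:
    word: str
    pos: str                        # part of speech (assumed label)
    role: str                       # sentence role (assumed label)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def in_order(node: Optional[Node]) -> Iterator[Node]:
    """Yield nodes left subtree first, then the node, then the right subtree."""
    if node is None:
        return
    yield from in_order(node.left)
    yield node
    yield from in_order(node.right)


def find_value(trees: List[Node], keyword: str) -> Optional[str]:
    """Return the object word that follows a noun matching the keyword."""
    for tree in trees:
        nodes = list(in_order(tree))
        for i, node in enumerate(nodes):
            if node.word == keyword and node.pos == "noun":
                for later in nodes[i + 1:]:
                    if later.role == "object":
                        return later.word
    return None


tree = Node("is", "verb", "predicate",
            left=Node("net profit value", "noun", "subject"),
            right=Node("A", "numeral", "object"))
order = [n.word for n in in_order(tree)]
value = find_value([tree], "net profit value")
```

For the example sentence, the in-order walk visits "net profit value", "is", "A", and the lookup returns "A" for the keyword "net profit value".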
When the numerical value corresponding to the keyword needs to be determined from a table of the PDF file to be extracted, it can be determined from the table by regular-expression matching.
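A minimal sketch of regular-expression matching over a table is given below. It assumes, purely for illustration, that a table has been materialized as a list of rows of cell strings with the label in the first cell; the `NUMBER` pattern and the function name are hypothetical, as the application does not give the expressions used.

```python
import re
from typing import List

# Matches plain numbers such as 987, 1,234.50, -3.2 or 15% (illustrative pattern).
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?%?")


def values_for_keyword(table: List[List[str]], keyword: str) -> List[str]:
    """Return the numbers found in rows whose first cell contains the keyword."""
    label = re.compile(re.escape(keyword), re.IGNORECASE)
    found: List[str] = []
    for row in table:
        if row and label.search(row[0]):
            for cell in row[1:]:
                found.extend(NUMBER.findall(cell))
    return found


table = [
    ["main accounting data", "2021", "2020"],
    ["business total income", "1,234.50", "987.00"],
]
income = values_for_keyword(table, "business total income")
```

Escaping the keyword with `re.escape` keeps user input such as "rate (%)" from being interpreted as regex syntax.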
Referring to fig. 7, an embodiment of the present application further provides a value extraction apparatus 110 applied to the electronic device 100 shown in fig. 1, where the value extraction apparatus 110 includes:
a first obtaining module 111, configured to obtain a PDF file to be extracted;
a first determining module 112, configured to extract coordinates of each element in the PDF file to be extracted, where each element includes a text and a line;
a second determining module 113, configured to determine, based on coordinates of each element, a sentence and/or a table in the PDF file to be extracted;
an analysis module 114, configured to analyze the sentence to obtain each word and parameter information corresponding to each word, where the parameter information includes word attributes and upper-lower word relations;
a creating module 115, configured to create a tree structure of each sentence based on each word and parameter information corresponding to each word;
a second obtaining module 116, configured to obtain a keyword input by a user;
a third determining module 117, configured to determine, based on the keyword, a numerical value corresponding to the keyword from each tree structure and/or table.
Preferably, the second determining module 113 is further configured to:
determining a distance between adjacent elements based on the coordinates of each element;
and when the distance between the adjacent elements is smaller than the preset distance, constructing a sentence based on the adjacent elements.
Preferably, the apparatus further comprises:
a fourth determining module, configured to determine, for each sentence, a first start coordinate of the sentence and a second start coordinate of a sentence next to the sentence;
a fifth determining module, configured to determine whether a line exists in the preset range of the sentence when the abscissa of the first start coordinate is the same as the abscissa of the second start coordinate;
and a building module, configured to form the sentences into a table if the line exists.
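One possible reading of this table check is sketched below. The `Sentence` and `Line` types, the tolerance `x_tol` used to treat abscissas as the same, and the `margin` standing in for the preset range are hypothetical choices, not details from the application.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    x: float  # abscissa of the start coordinate
    y: float  # ordinate of the start coordinate


@dataclass
class Line:
    x0: float
    y0: float
    x1: float
    y1: float


def has_line_nearby(sentence, lines, margin):
    """True if the sentence start lies within margin of any line's bounding box."""
    for ln in lines:
        in_x = min(ln.x0, ln.x1) - margin <= sentence.x <= max(ln.x0, ln.x1) + margin
        in_y = min(ln.y0, ln.y1) - margin <= sentence.y <= max(ln.y0, ln.y1) + margin
        if in_x and in_y:
            return True
    return False


def collect_table_rows(sentences, lines, margin=5.0, x_tol=0.5):
    """Collect consecutive sentences that share a start abscissa and sit near a line."""
    rows = []
    for cur, nxt in zip(sentences, sentences[1:]):
        aligned = abs(cur.x - nxt.x) <= x_tol
        if aligned and has_line_nearby(cur, lines, margin):
            if not rows or rows[-1] is not cur:
                rows.append(cur)
            rows.append(nxt)
    return [s.text for s in rows]


# Hypothetical layout: three aligned cells with ruling lines, then body text.
sentences = [
    Sentence("Revenue", 50.0, 100.0),
    Sentence("Cost", 50.0, 120.0),
    Sentence("Profit", 50.0, 140.0),
    Sentence("Notes follow.", 20.0, 200.0),
]
lines = [
    Line(40.0, 95.0, 300.0, 95.0),
    Line(40.0, 115.0, 300.0, 115.0),
    Line(40.0, 135.0, 300.0, 135.0),
]

print(collect_table_rows(sentences, lines))
```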
Preferably, the third determining module 117 is further configured to:
traversing each tree structure based on the keyword;
determining word attributes of target words under the condition that the target words in the target tree structure are matched with the keywords;
when the word attribute of a target word is a noun, determining an object corresponding to the target word in the target tree structure;
and acquiring a numerical value corresponding to the object in the target tree structure.
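The lookup can be illustrated with a small sketch. It assumes a dependency-style tree in which each node carries a word attribute (`pos`) and a relation to its parent (`rel`), and it interprets "the object corresponding to the target word" as the `obj` child of the target word's parent. The labels `noun`, `num`, and `obj`, and this interpretation, are assumptions; the application does not fix them. A plain depth-first walk is used here for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    word: str
    pos: str   # word attribute, e.g. "noun", "verb", "num" (assumed labels)
    rel: str   # relation to the parent, e.g. "root", "subj", "obj" (assumed labels)
    children: list = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find_value(root, keyword):
    """Return the number attached to the object related to a matching noun."""
    for node in walk(root):
        if node.word != keyword or node.pos != "noun":
            continue
        head = node.parent
        if head is None:
            continue
        for sibling in head.children:
            if sibling.rel == "obj":
                for candidate in walk(sibling):
                    if candidate.pos == "num":
                        return candidate.word
    return None


# Hypothetical tree for the sentence "revenue reached 1200000 yuan".
root = Node("reached", "verb", "root")
root.add(Node("revenue", "noun", "subj"))
obj = root.add(Node("yuan", "noun", "obj"))
obj.add(Node("1200000", "num", "nummod"))

print(find_value(root, "revenue"))
```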
Preferably, the third determining module 117 is further configured to:
and traversing each tree structure by in-order traversal based on the keyword.
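In-order traversal is only uniquely defined for binary trees, so the sketch below assumes one common generalization for sentence trees: each node stores its word position (`index`, a hypothetical field), children that precede the node in the sentence are visited first, then the node itself, then the children that follow it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    index: int  # position of the word in the sentence (assumed field)
    children: list = field(default_factory=list)


def in_order(node):
    """Visit children left of the node, then the node, then children right of it."""
    left = sorted((c for c in node.children if c.index < node.index),
                  key=lambda c: c.index)
    right = sorted((c for c in node.children if c.index > node.index),
                   key=lambda c: c.index)
    for child in left:
        yield from in_order(child)
    yield node
    for child in right:
        yield from in_order(child)


# Hypothetical tree for "annual revenue reached 1200 yuan".
root = Node("reached", 2, [
    Node("revenue", 1, [Node("annual", 0)]),
    Node("yuan", 4, [Node("1200", 3)]),
])

print([n.word for n in in_order(root)])
```

Under this assumption the traversal visits the words in their original sentence order, which keeps a matched keyword close to the words that surround it in the text.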
Preferably, the third determining module 117 is further configured to:
and determining, based on the keyword, the numerical value corresponding to the keyword from each table by regular expression matching.
Preferably, the tree structures and/or the tables are saved based on a protobuf format.
The present application further provides an electronic device 100, where the electronic device 100 includes a processor 130 and a memory 120. The memory 120 stores computer-executable instructions that, when executed by the processor 130, implement the numerical extraction method.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by the processor 130, the method for extracting the numerical value is implemented.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part. The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part thereof that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program code.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method of numerical extraction, the method comprising:
acquiring a PDF file to be extracted;
determining coordinates of each element in the PDF file to be extracted, wherein each element comprises characters and lines;
determining sentences and/or tables in the PDF file to be extracted based on the coordinates of each element;
analyzing the sentences to obtain each word and parameter information corresponding to each word, wherein the parameter information comprises word attributes and upper-lower word relations;
creating a tree structure of each sentence based on each word and parameter information corresponding to each word;
acquiring a keyword input by a user;
and determining a numerical value corresponding to the keyword from each tree structure and/or table based on the keyword.
2. The method according to claim 1, wherein the step of determining the sentence in the PDF file to be extracted based on the coordinates of each element comprises:
determining, based on the coordinates of each element, a distance between adjacent elements;
and when the distance between the adjacent elements is smaller than a preset distance, constructing a sentence based on the adjacent elements.
3. The method of claim 2, further comprising:
for each sentence, determining a first starting coordinate of the sentence and a second starting coordinate of a sentence next to the sentence;
when the abscissa of the first starting coordinate is the same as the abscissa of the second starting coordinate, determining whether a line exists within a preset range of the sentence;
and if so, forming the sentences into a table.
4. The method according to claim 1, wherein the step of determining a value corresponding to the keyword from each tree structure based on the keyword comprises:
traversing each tree structure based on the keyword;
determining word attributes of target words under the condition that the target words in the target tree structure are matched with the keywords;
when the word attribute of a target word is a noun, determining an object corresponding to the target word in the target tree structure;
and acquiring a numerical value corresponding to the object in the target tree structure.
5. The method of claim 4, wherein traversing each tree structure based on the keyword comprises:
traversing each tree structure by in-order traversal based on the keyword.
6. The method of claim 1, wherein the step of determining a value corresponding to the keyword from each table based on the keyword comprises:
determining, based on the keyword, the numerical value corresponding to the keyword from each table by regular expression matching.
7. The method of claim 1, further comprising:
and saving each tree structure and/or each table based on a protobuf format.
8. A numerical value extraction apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring a PDF file to be extracted;
the first determining module is used for extracting coordinates of each element in the PDF file to be extracted, wherein each element comprises characters and lines;
the second determining module is used for determining sentences and/or tables in the PDF file to be extracted based on the coordinates of each element;
the analysis module is used for analyzing the sentences to obtain each word and parameter information corresponding to each word, wherein the parameter information comprises word attributes and the upper-lower word relation;
the creating module is used for creating a tree structure of each sentence based on each word and parameter information corresponding to each word;
the second acquisition module is used for acquiring keywords input by a user;
and the third determining module is used for determining numerical values corresponding to the keywords from each tree structure and/or table based on the keywords.
9. An electronic device, comprising a memory storing a computer program and a processor implementing the steps of the method according to any of claims 1-7 when the processor executes the computer program.
10. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, performing the steps of the method as set forth in any one of the claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211141324.8A CN115422340A (en) | 2022-09-20 | 2022-09-20 | Numerical value extraction method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115422340A (en) | 2022-12-02 |
Family
ID=84203963
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211141324.8A Pending CN115422340A (en) | 2022-09-20 | 2022-09-20 | Numerical value extraction method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115422340A (en) |
2022-09-20: CN application CN202211141324.8A (patent/CN115422340A/en), status: active, Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||