CN107818075A - Form data structuring extracting method, electronic equipment and computer-readable recording medium - Google Patents
- Publication number
- CN107818075A (application number CN201710962303.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
- G06F40/18—Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
Abstract
The invention discloses a method for structured extraction of table information, comprising the steps of: obtaining the position information and tag information of each line of text in a specified document; identifying, according to the position information and tag information of each line of text, line-wrap cases and cross-page cases in the tables of the specified document; when a line-wrap case is identified in a table of the specified document, storing the table information by row and by column according to a first reshaping rule; and when a cross-page case is identified in a table of the specified document, storing the table information by row and by column according to a second reshaping rule. The invention enables structured data to be extracted and stored.
Description
Technical field
The present invention relates to the field of computer information technology, and in particular to a method for structured extraction of table information, an electronic device, and a computer-readable storage medium.
Background
Existing extraction of table information from PDF annual reports is generally based on OCR. However, when a table contains line wraps, page breaks, or interfering special characters, OCR cannot restore and reshape the original table information or integrate it into a structured form. This makes the result hard for users to understand and hinders later comparison of information. The table information extraction methods of the prior art are therefore inadequately designed and in urgent need of improvement.
Summary of the invention
In view of this, the present invention proposes a method for structured extraction of table information, an electronic device, and a computer-readable storage medium. By analyzing the position information and tag information of the text content of tables in a specified document (such as a PDF document), line-wrap cases and cross-page cases in a table (such as a table in a PDF annual report) can be identified, and structured data can be extracted from such tables and stored.
First, to achieve the above object, the present invention proposes an electronic device comprising a memory and a processor. The memory stores a table information structured extraction system that can run on the processor, and the system, when executed by the processor, implements the following steps:
Obtaining the position information and tag information of each line of text in a specified document;
Identifying, according to the position information and tag information of each line of text, line-wrap cases and cross-page cases in the tables of the specified document;
When a line-wrap case is identified in a table of the specified document, storing the table information by row and by column according to a first reshaping rule; and
When a cross-page case is identified in a table of the specified document, storing the table information by row and by column according to a second reshaping rule.
Preferably, the first reshaping rule comprises: storing text with the same top-edge coordinate as the same row, and storing text with the same left-edge coordinate as the same column;
The second reshaping rule comprises:
Deleting the footer of the previous page, on which the previous table is located, and the header of the next page, on which the next table is located;
Splicing the text content of the previous table after footer deletion with the text content of the next table after header deletion to form a spliced table; and
Storing text with the same top-edge coordinate in the spliced table as the same row, and storing text with the same left-edge coordinate in the spliced table as the same column.
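The first reshaping rule can be illustrated with a minimal Python sketch. The `Word` record and the `reshape` function are hypothetical names introduced here for illustration only; the sketch assumes that each piece of text already carries exact left-edge and top-edge coordinates and that the top-edge coordinate grows down the page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical record for one piece of table text."""
    text: str
    left: float  # left-edge coordinate
    top: float   # top-edge coordinate


def reshape(words):
    """Place words with equal top coordinates in one row and words with
    equal left coordinates in one column (illustrative only)."""
    tops = sorted({w.top for w in words})    # one entry per row
    lefts = sorted({w.left for w in words})  # one entry per column
    grid = [["" for _ in lefts] for _ in tops]
    for w in words:
        grid[tops.index(w.top)][lefts.index(w.left)] = w.text
    return grid


words = [
    Word("Name", 10, 100), Word("Amount", 80, 100),
    Word("ACME", 10, 120), Word("500", 80, 120),
]
grid = reshape(words)
```

Under the second reshaping rule, the same grouping would be applied to the spliced table once the footer and header have been removed.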
Preferably, deleting the footer of the previous page on which the previous table is located and the header of the next page on which the next table is located comprises:
Locating, according to the tag information of the previous page and the next page and the specific rules of the specified document, the footer range of the previous page and the header range of the next page in the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range;
Wherein the rule for determining the footer range of the previous page is: working from bottom to top, select the content within a first proportion of the page length of the previous page, and take the selected content as the footer range of the previous page; and
The rule for determining the header range of the next page is: working from top to bottom, select the content within a second proportion of the page length of the next page, and take the selected content as the header range of the next page.
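The proportion-based header and footer ranges can be sketched as follows. This is only an illustration under stated assumptions: lines are plain dictionaries with a `top` coordinate measured from the top of the page, the function names are hypothetical, and the proportions of 0.1 used in the example are made-up values, since the source does not give concrete numbers.

```python
def strip_footer(lines, page_length, footer_ratio):
    """Drop lines that fall inside the bottom `footer_ratio` of the page."""
    cutoff = page_length * (1 - footer_ratio)
    return [ln for ln in lines if ln["top"] < cutoff]


def strip_header(lines, page_length, header_ratio):
    """Drop lines that fall inside the top `header_ratio` of the page."""
    cutoff = page_length * header_ratio
    return [ln for ln in lines if ln["top"] >= cutoff]


def splice(prev_lines, next_lines, page_length, footer_ratio, header_ratio):
    """Join the previous table (footer removed) with the next table
    (header removed) into one spliced list of lines."""
    return (strip_footer(prev_lines, page_length, footer_ratio)
            + strip_header(next_lines, page_length, header_ratio))


prev_page = [{"text": "row 1", "top": 500}, {"text": "Page 1", "top": 950}]
next_page = [{"text": "Annual Report", "top": 30}, {"text": "row 2", "top": 150}]
spliced = splice(prev_page, next_page, 1000, 0.1, 0.1)
spliced_texts = [ln["text"] for ln in spliced]
```

In practice the two proportions would be tuned to the layout rules of the document type being processed.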
Preferably, the line-wrap cases include in-row wraps and end-of-row wraps;
Wherein identifying an in-row wrap comprises:
Obtaining the text content position information of each cell in the row of text, wherein the text content position information of each cell includes the top-edge coordinate of the text content of that cell; and
Among the cells whose text content has the same top-edge coordinate, obtaining the position of the cell that appears first and the position of the cell that appears last, determining all cells from the first-appearing cell position to the last-appearing cell position to be in the same row, and determining that any cell between those two positions whose text content has a different top-edge coordinate is an in-row wrapped cell.
Preferably, identifying an end-of-row wrap comprises:
If a remaining cell exists in the current row of text after in-row wrap identification, obtaining the text content position information of the remaining cell, wherein the text content position information of the remaining cell includes the top-edge coordinate of the text content of the remaining cell;
Calculating the distance between the top-edge coordinate of the text content of the remaining cell and the top-edge coordinates of the text content of all cells in the current row and the next row, or calculating the distance between the top-edge coordinate of the text content of the remaining cell and the top-edge coordinates of the text content of all cells in the current row and the previous row; and
If the minimum distance occurs in the current row, merging the text content of the remaining cell into the current row and determining the remaining cell to be an end-of-row wrapped cell of the current row.
In addition, to achieve the above object, the present invention also provides a method for structured extraction of table information, applied to an electronic device, the method comprising:
Obtaining the position information and tag information of each line of text in a specified document;
Identifying, according to the position information and tag information of each line of text, line-wrap cases and cross-page cases in the tables of the specified document;
When a line-wrap case is identified in a table of the specified document, storing the table information by row and by column according to a first reshaping rule; and
When a cross-page case is identified in a table of the specified document, storing the table information by row and by column according to a second reshaping rule.
Preferably, the first reshaping rule comprises: storing text with the same top-edge coordinate as the same row, and storing text with the same left-edge coordinate as the same column;
The second reshaping rule comprises:
Deleting the footer of the previous page, on which the previous table is located, and the header of the next page, on which the next table is located;
Splicing the text content of the previous table after footer deletion with the text content of the next table after header deletion to form a spliced table; and
Storing text with the same top-edge coordinate in the spliced table as the same row, and storing text with the same left-edge coordinate in the spliced table as the same column.
Preferably, deleting the footer of the previous page on which the previous table is located and the header of the next page on which the next table is located comprises:
Locating, according to the tag information of the previous page and the next page and the specific rules of the specified document, the footer range of the previous page and the header range of the next page in the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range;
Wherein the rule for determining the footer range of the previous page is: working from bottom to top, select the content within a first proportion of the page length of the previous page, and take the selected content as the footer range of the previous page; and
The rule for determining the header range of the next page is: working from top to bottom, select the content within a second proportion of the page length of the next page, and take the selected content as the header range of the next page.
Preferably, identifying a cross-page case comprises:
For an adjacent previous table and next table in the specified document, obtaining the position information and tag information of the text content of the previous table and the position information and tag information of the text content of the next table;
Comparing the left-edge coordinate of each column of text in the next table with the left-edge coordinate of the corresponding column of text in the previous table;
When the left-edge coordinates of all columns of text in the next table are the same as those of the corresponding columns in the previous table, comparing the page number of each line of text in the next table with the page number of each line of text in the previous table; and
If the page numbers of the lines of text in the next table differ from those of the lines of text in the previous table, determining that the next table and the previous table are one and the same table spanning a page break.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a table information structured extraction system. The system can be executed by at least one processor so that the at least one processor performs the steps of the table information structured extraction method described above.
Compared with the prior art, the electronic device, the table information structured extraction method, and the computer-readable storage medium proposed by the present invention analyze the position information and tag information of the text content of tables in a specified document (such as a PDF document). They can thereby identify line-wrap cases and cross-page cases in a table (such as a table in a PDF annual report) and extract and store structured data from such tables. The method does not need to convert the PDF document into a structured document such as Word or Excel. Data extraction is efficient, recall and accuracy are high on large-scale datasets, and the result supports subsequent horizontal comparison analysis, longitudinal comparison analysis, and data modeling.
Brief description of the drawings
Fig. 1 is a schematic diagram of an optional hardware architecture of the electronic device of the present invention;
Fig. 2 is a schematic diagram of the program modules of an embodiment of the table information structured extraction system in the electronic device of the present invention;
Fig. 3 is a flow diagram of an embodiment of the table information structured extraction method of the present invention.
Reference:
The realization of the object, the functional characteristics, and the advantages of the present invention are further described below with reference to the drawings and in conjunction with the embodiments.
Embodiments
To make the object, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein only explain the present invention and do not limit it. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, provided that the combination can be implemented by those of ordinary skill in the art. When a combination of technical solutions is contradictory or cannot be implemented, the combination shall be deemed not to exist and falls outside the scope of protection claimed by the present invention.
It should further be noted that, herein, the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.
First, the present invention proposes an electronic device 2.
Fig. 1 is a schematic diagram of an optional hardware architecture of the electronic device 2 of the present invention. In this embodiment, the electronic device 2 may include, but is not limited to, a memory 21, a processor 22, and a network interface 23 that are communicatively connected to one another via a system bus. Note that Fig. 1 shows only the electronic device 2 with components 21-23; it should be understood that not all of the components shown are required, and more or fewer components may be implemented instead.
The electronic device 2 may be a computing device such as a rack server, blade server, tower server, or cabinet server, and may be a standalone server or a server cluster composed of multiple servers.
The memory 21 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as its hard disk or main memory. In other embodiments, the memory 21 may be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the electronic device 2. The memory 21 may of course include both an internal storage unit of the electronic device 2 and an external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the electronic device 2, such as the program code of the table information structured extraction system 20. The memory 21 may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the electronic device 2, for example to perform control and processing related to data interaction or communication with the electronic device 2. In this embodiment, the processor 22 is used to run the program code stored in the memory 21 or to process data, for example to run the table information structured extraction system 20.
The network interface 23 may include a wireless network interface or a wired network interface and is generally used to establish a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 connects the electronic device 2 to an external data platform over a network and establishes a data transmission channel and communication connection between them. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
The application environment of the embodiments of the present invention and the hardware structure and functions of the related device have now been described in detail. The embodiments of the present invention are presented below on the basis of this application environment and device.
Fig. 2 is a program module diagram of an embodiment of the table information structured extraction system 20 in the electronic device 2 of the present invention. In this embodiment, the table information structured extraction system 20 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to carry out the present invention. For example, in Fig. 2 the table information structured extraction system 20 is divided into an acquisition module 201, an identification module 202, and a storage module 203. A program module, as the term is used in the present invention, refers to a series of computer program instruction segments capable of performing a specific function, and is better suited than a program for describing how the table information structured extraction system 20 executes in the electronic device 2. The functions of program modules 201-203 are described in detail below.
The acquisition module 201 is configured to obtain the position information and tag information of each line of text in a specified document (such as a PDF document). In this embodiment, a specific text recognition tool (such as the pdf2html tool) may be used to obtain the position information and tag information of each line of text in the specified document. The tool can parse the PDF document into a text file (such as an XML file) while also parsing out the position information and tag information of each line of text in the PDF document.
Preferably, in this embodiment, the position information of each line of text includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of the line. Each row of a table in the specified document is stored in adjacent positions, that is, stored in sequence according to the position information (such as the left-edge coordinate) of each line of text. Further, the tag information of each line of text includes, but is not limited to, the page number of the line in the specified document (the sequence number of the page on which the line is located), the page length, and the page width.
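The position and tag information listed above could be held in a small record such as the one below. This is a hypothetical sketch: the `TextLine` type, its field names, and the `order_row_fragments` helper are assumptions made for illustration and are not the output format of any particular parsing tool. The helper shows one possible reading of "stored in sequence according to the left-edge coordinate" for the fragments of a single table row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextLine:
    """Hypothetical record for one line of text parsed from a document."""
    text: str
    left: float         # position information: left-edge coordinate
    top: float          # position information: top-edge coordinate
    width: float        # position information: text width
    length: float       # position information: text length
    page: int           # tag information: page number
    page_length: float  # tag information: page length
    page_width: float   # tag information: page width


def order_row_fragments(fragments):
    """Order the fragments of one table row by left edge, then top edge,
    so that a wrapped continuation follows the cell it belongs to."""
    return sorted(fragments, key=lambda f: (f.left, f.top))


fragments = [
    TextLine("500", 150, 100, 20, 10, 1, 1000, 700),
    TextLine("Ltd.", 80, 112, 25, 10, 1, 1000, 700),
    TextLine("1", 10, 100, 5, 10, 1, 1000, 700),
    TextLine("ACME Trading Co.,", 80, 100, 60, 10, 1, 1000, 700),
]
ordered_texts = [f.text for f in order_row_fragments(fragments)]
```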
The identification module 202 is configured to identify line-wrap cases and cross-page cases in the tables of the specified document according to the position information and tag information of each line of text.
Specifically, identifying line-wrap cases in the tables of the specified document comprises the following steps A1-A2.
(A1) Locate a specific table in the specified document and obtain the position information of that table, such as its left-edge coordinate, table width (table height), and table length. In this embodiment, the tables in the specified document can be located by means of the specific rules of the document. For example, if the specified document is a PDF annual report, published annual reports follow explicit formatting requirements, so a specific table can be identified by annual-report rules such as the following:
Where major customers and suppliers are introduced, the table title may be set to "Information on major customers and major suppliers", so the table under this title is the specific customer and supplier table. The title keyword of a specific table can therefore be used to locate the table that presents a given kind of content, which simplifies subsequent parsing. Other tables in PDF annual reports follow similar conventions.
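Locating a table by its title keyword can be sketched in a few lines. The function name `locate_table` and the sample lines are hypothetical, and the sketch assumes the document has already been parsed into a list of text lines in reading order.

```python
def locate_table(lines, keyword):
    """Return the index of the first line containing the title keyword,
    or None if no line matches (illustrative only)."""
    for index, text in enumerate(lines):
        if keyword in text:
            return index
    return None


lines = [
    "Section 4 Business overview",
    "Information on major customers and major suppliers",
    "No.  Customer name  Sales amount",
]
title_index = locate_table(lines, "major customers and major suppliers")
missing_index = locate_table(lines, "Cash flow statement")
```

The table body would then be read starting from the line after `title_index`.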
(A2) Read the rows of text in the specific table one at a time according to the position information of the table, and identify in-row wrapped cells in each row of text according to the position information of that row. In this embodiment, reading can start from the left-edge coordinate of the specific table, read the first row according to the table length, and continue according to the table width until the last row of the table has been read.
Preferably, in this embodiment, each row of text in the specific table contains multiple cells, such as a 1st cell, 2nd cell, 3rd cell, and 4th cell. More specifically, the line-wrap cases include in-row wraps and end-of-row wraps. An in-row wrap means that a line wrap exists in an interior cell of a row of the specific table. An end-of-row wrap means that a line wrap exists in the last cell of a row of the specific table.
Preferably, in this embodiment, identifying in-row wrapped cells in a row of text according to the position information of that row comprises the following steps A21-A22.
(A21) Obtain the text content position information of each cell in the row of text, where the text content position information of each cell includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of the text content of that cell.
(A22) Among the cells whose text content has the same top-edge coordinate, obtain the position of the cell that appears first and the position of the cell that appears last (that is, find where a given top-edge coordinate first occurs and where it last occurs). Determine all cells from the first-appearing cell position to the last-appearing cell position to be in the same row, and determine that any cell between those two positions whose text content has a different top-edge coordinate is an in-row wrapped cell.
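Steps A21-A22 can be sketched as follows. The sketch makes assumptions the source does not state: cells are dictionaries given in storage order, the reference top-edge coordinate is taken from the first cell in the sequence, and coordinates are compared for exact equality. The function name `find_in_row_wraps` is hypothetical.

```python
def find_in_row_wraps(cells):
    """Split a sequence of cells into (row, wrapped, remaining).

    row:       all cells from the first to the last occurrence of the
               reference top-edge coordinate
    wrapped:   cells inside that span with a different top-edge coordinate
    remaining: cells after the span, left for end-of-row wrap handling
    """
    ref_top = cells[0]["top"]  # assumption: the first cell defines the row
    hits = [i for i, c in enumerate(cells) if c["top"] == ref_top]
    first, last = hits[0], hits[-1]
    row = cells[first:last + 1]
    wrapped = [c for c in row if c["top"] != ref_top]
    remaining = cells[last + 1:]
    return row, wrapped, remaining


cells = [
    {"text": "1", "top": 100},
    {"text": "ACME Trading Co.,", "top": 100},
    {"text": "Ltd.", "top": 112},       # wrapped inside the row
    {"text": "500", "top": 100},
    {"text": "(audited)", "top": 112},  # left over after the last match
]
row, wrapped, remaining = find_in_row_wraps(cells)
wrapped_texts = [c["text"] for c in wrapped]
remaining_texts = [c["text"] for c in remaining]
```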
Preferably, in other embodiments, the table information line-wrap identification further includes the step: (A3) identify end-of-row wrapped cells in the current row of text according to the position information of the current row.
Specifically, identifying end-of-row wrapped cells in the current row of text according to the position information of the current row comprises the following steps A31-A33.
(A31) If a remaining cell exists in the current row of text after in-row wrap identification, obtain the text content position information of the remaining cell. This information includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of the text content of the remaining cell.
(A32) Calculate the distance between the top-edge coordinate of the text content of the remaining cell and the top-edge coordinates of the text content of all cells in the current row (for example, the first row) and the next row (for example, the second row).
(A33) If the minimum distance occurs in the current row, merge the text content of the remaining cell into the current row and determine the remaining cell to be an end-of-row wrapped cell of the current row.
Further, if the minimum distance occurs in the next row, merge the text content of the remaining cell into the next row and determine the remaining cell to be an end-of-row wrapped cell of the next row.
It should be noted that, in other embodiments, identifying end-of-row wrapped cells in the current row of text according to the position information of the current row may instead comprise the following steps A34-A36.
(A34) If a remaining cell exists in the current row of text after in-row wrap identification, obtain the text content position information of the remaining cell. This information includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of the text content of the remaining cell.
(A35) Calculate the distance between the top-edge coordinate of the text content of the remaining cell and the top-edge coordinates of the text content of all cells in the current row (for example, the second row) and the previous row (for example, the first row).
(A36) If the minimum distance occurs in the current row, merge the text content of the remaining cell into the current row and determine the remaining cell to be an end-of-row wrapped cell of the current row.
Further, if the minimum distance occurs in the previous row, merge the text content of the remaining cell into the previous row and determine the remaining cell to be an end-of-row wrapped cell of the previous row.
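The nearest-row decision used for end-of-row wraps can be sketched as below. The function name `assign_remaining_cell` is hypothetical, distances are taken as absolute differences of top-edge coordinates, and a tie is resolved in favor of the current row; the source does not say how ties are handled, so that choice is an assumption.

```python
def assign_remaining_cell(cell_top, current_row_tops, neighbor_row_tops):
    """Decide which row a remaining cell belongs to by comparing its
    top-edge coordinate with those of the current and neighboring rows.

    Returns "current" or "neighbor". The neighbor is the next row in
    steps A31-A33 and the previous row in steps A34-A36.
    """
    d_current = min(abs(cell_top - t) for t in current_row_tops)
    d_neighbor = min(abs(cell_top - t) for t in neighbor_row_tops)
    return "current" if d_current <= d_neighbor else "neighbor"


# Remaining cell at top=112, current row at 100, next row at 140.
near_current = assign_remaining_cell(112, [100, 100, 100], [140, 140])
# Remaining cell at top=130 lies closer to the next row.
near_neighbor = assign_remaining_cell(130, [100, 100], [140, 140])
```

Because the comparison is symmetric, the same helper covers both variants; only the row passed as `neighbor_row_tops` changes.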
It should be noted that steps A1-A2, A21-A22, and A31-A36 above are described using the example of identifying line-wrap cases in a specific table (such as the customer and supplier table) of a PDF document. Those skilled in the art will understand that, in other embodiments, the above line-wrap identification method can also be applied to all tables of a PDF document, which is not repeated here.
Further, identifying cross-page cases in the tables of the specified document comprises the following steps B1-B3 (Method One).
(B1) For an adjacent previous table and next table in the specified document, obtain the position information and tag information of the text content of the previous table and the position information and tag information of the text content of the next table.
Preferably, in this embodiment, the position information of the text content of the previous table includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of each line of text in the previous table, as well as the left-edge coordinate of each column of text in the previous table. The tag information of the text content of the previous table includes, but is not limited to, the page number of each line of text of the previous table in the specified document (the sequence number of the page on which the line is located), the page length, and the page width.
Further, the position information of the text content of the next table includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of each line of text in the next table, as well as the left-edge coordinate of each column of text in the next table. The tag information of the text content of the next table includes, but is not limited to, the page number of each line of text of the next table in the specified document (the sequence number of the page on which the line is located), the page length, and the page width.
(B2) Compare the left-edge coordinate of each column of text in the next table with the left-edge coordinate of the corresponding column of text in the previous table. For example, compare the left-edge coordinate of column 1 of the next table with that of column 1 of the previous table, compare the left-edge coordinate of column 2 of the next table with that of column 2 of the previous table, and so on.
(B3) When the left-edge coordinates of all columns of text in the next table are the same as those of the corresponding columns in the previous table (which indicates that the next table and the previous table are the same table), compare the page number of each line of text in the next table with the page number of each line of text in the previous table. For example, if the bottom of the first page contains the previous table, the top of the second page contains the next table, and the left-edge coordinates of all columns of the next table are the same as those of the corresponding columns of the previous table, the next table and the previous table are determined to be the same table.
If next form is often composed a piece of writing, often the compose a piece of writing page number of word of the page number of word and previous form exists different, judges next table
Lattice and previous form are the same form for existing cross-page situation.If next form is often composed a piece of writing, the page number of word is often composed a piece of writing with previous form
The page number of word is all identical, then judges next form and the same form that previous form is in the absence of cross-page situation, i.e., next form
It is the same form positioned at the same page with previous form.
Preferably, in the present embodiment, if the left margin coordinate of next form each column word each column corresponding with previous form
Difference between the left margin coordinate of word is both less than predetermined threshold value (such as 2 pixel unit values), then judges next form each column
The left margin coordinate of the left margin coordinate each column word corresponding with previous form of word is all identical.
It should be noted that steps B1-B3 above (method one) are described using the example of identifying a cross-page case across two adjacent tables (a previous table and a next table) in a PDF file. Those skilled in the art will understand that, in other embodiments, the cross-page identification of table information may also be performed on a specific table (such as a financial table) in the PDF file (method two). Method two comprises the following steps B4-B5.
(B4) Locate a specific table in the specified document and obtain the position information and label information of the text content of the specific table. The position information of the text content of the specific table includes, but is not limited to, coordinate information for each line of text in the specific table, such as the left edge coordinate, top edge coordinate, text width, and text size. The label information of the text content of the specific table includes, but is not limited to, the page number (the sequence number of the page on which each line of text is located), page length, and page width of each line of text of the specific table in the specified document (such as a PDF document).
Specifically, the specific table in the specified document can be located according to specific rules of the specified document. For example, if the specified document is a PDF annual report, annual reports are published under explicit format requirements, so the specific table can be identified using annual-report rules such as the following.
For example, where major customers and suppliers are introduced, the table title may be set to "Major Customers and Major Suppliers", so the table under this title is the specific customer and supplier table. The table presenting the specific content can therefore be located by the title keyword of the specific table, which facilitates subsequent parsing. Similarly, other specific tables in PDF annual reports have similar formats.
(B5) Read each line of text of the specific table in sequence according to the position information (such as the top edge coordinate) of the text content of the specific table (for example, text with the same top edge coordinate belongs to the same line), and obtain the page number of each line of text from the label information of the text content of the specific table.
If the page numbers of the lines of text in the specific table differ, the specific table is determined to have a cross-page case (that is, a previous table and a next table of the specific table, located on different pages, are identified). If the page numbers of the lines of text in the specific table are all identical, the specific table is determined to have no cross-page case.
The storage module 203 is configured to, when a line-wrap case is identified in a table of the specified document, store the table information row by row (extract and store the table data by row) and column by column (extract and store the table data by column) according to the first remodeling rule, forming structured table data.
Preferably, in this embodiment, the first remodeling rule includes: storing text with the same top edge coordinate as the same row (row-wise storage), and storing text with the same left edge coordinate as the same column (column-wise storage).
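As a non-limiting illustration, the first remodeling rule can be sketched in Python as follows. The function names `cluster` and `restructure`, the `(left, top, text)` tuple layout, and the tolerance `tol` are assumptions made for this sketch only; the rule as described requires identical coordinates, and the tolerance is borrowed from the 2-pixel threshold mentioned for column comparison.

```python
def cluster(values, tol):
    """Map each coordinate to a group index.

    Coordinates within `tol` of the first value of a group share that group.
    """
    anchors, index = [], {}
    for v in sorted(set(values)):
        if not anchors or v - anchors[-1] > tol:
            anchors.append(v)          # start a new row/column group
        index[v] = len(anchors) - 1
    return index, len(anchors)


def restructure(words, tol=2.0):
    """Build a row/column grid from (left, top, text) tuples.

    Same top edge -> same row; same left edge -> same column.
    """
    row_of, n_rows = cluster([top for _, top, _ in words], tol)
    col_of, n_cols = cluster([left for left, _, _ in words], tol)
    grid = [[""] * n_cols for _ in range(n_rows)]
    for left, top, text in words:
        r, c = row_of[top], col_of[left]
        # Text landing in an occupied cell is appended rather than overwritten.
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


words = [(50, 100, "Customer"), (200, 100, "Amount"),
         (50, 120, "A Co."), (201, 120, "1,000")]
grid = restructure(words)
```

With the sample input, the two words at left edges 200 and 201 fall into the same column because they differ by less than the assumed tolerance.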
The storage module 203 is further configured to, when a cross-page case is identified in a table of the specified document, store the table information row by row and column by column according to the second remodeling rule, forming structured table data.
Preferably, in this embodiment, the second remodeling rule includes:
deleting the footer of the previous page on which the previous table is located and the header of the next page on which the next table is located;
splicing the text content of the previous table after footer deletion with the text content of the next table after header deletion to form a spliced table (that is, a table on a single page); and
storing text with the same top edge coordinate in the spliced table as the same row (row-wise storage), and storing text with the same left edge coordinate in the spliced table as the same column (column-wise storage).
Specifically, deleting the footer of the previous page on which the previous table is located and the header of the next page on which the next table is located includes: locating the footer range of the previous page and the header range of the next page in the specified document according to the label information of the previous page and the next page and the specific rules of the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range.
The label information of the previous page includes, but is not limited to, the page number, page length, and page width of the previous page; the label information of the next page includes, but is not limited to, the page number, page length, and page width of the next page. The specific rules of the specified document include, but are not limited to, a first ratio (for example, 8%) of the page length occupied by the footer of the previous page and a second ratio (for example, 9%) of the page length occupied by the header of the next page. It will be appreciated that the first ratio and the second ratio may also be equal.
Further, the rule for determining the footer range of the previous page is: working from bottom to top, select the portion of the previous page corresponding to the first ratio of its page length, and take the selected portion as the footer range of the previous page. The rule for determining the header range of the next page is: working from top to bottom, select the portion of the next page corresponding to the second ratio of its page length, and take the selected portion as the header range of the next page.
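By way of example only, the footer and header removal and the splicing step of the second remodeling rule might be sketched as follows. The function name `strip_and_splice`, the `(left, top, text)` tuple layout, the convention that the top edge coordinate is measured downward from the top of the page, and the choice to shift the next page's coordinates by the previous page's length are all assumptions of this sketch; the default ratios are the 8% and 9% example values given above.

```python
def strip_and_splice(prev_lines, next_lines, prev_height, next_height,
                     footer_ratio=0.08, header_ratio=0.09):
    """Drop the footer of the previous page and the header of the next page,
    then place both parts in one coordinate space.

    Lines are (left, top, text) tuples; `top` grows downward (assumption).
    """
    footer_start = prev_height * (1.0 - footer_ratio)  # bottom 8% of previous page
    header_end = next_height * header_ratio            # top 9% of next page

    kept_prev = [(l, t, s) for l, t, s in prev_lines if t < footer_start]
    # Shift the next page below the previous one so rows keep their order.
    kept_next = [(l, t + prev_height, s)
                 for l, t, s in next_lines if t >= header_end]
    return kept_prev + kept_next


prev_lines = [(50, 800, "A Co."), (50, 950, "Page 1")]
next_lines = [(50, 40, "Annual Report"), (50, 120, "B Co.")]
spliced = strip_and_splice(prev_lines, next_lines, 1000, 1000)
```

The spliced list can then be stored row by row and column by column in the same way as a table that lies on a single page.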
Preferably, in other embodiments, the table information structured extraction system 20 is further configured to perform horizontal comparison analysis and vertical comparison analysis on the stored structured table data.
The horizontal comparison analysis includes: comparing the structured table data (such as accounts receivable) of different companies in the same industry within the same time range (such as the same year), in order to analyze business information such as the debt and financial condition of the different companies. The vertical comparison analysis includes: comparing the structured table data (such as accounts receivable) of the same company across different time ranges (such as the last 3 years), in order to analyze business information such as the debt and financial condition of that company (for example, changes in accounts receivable).
Through the above program modules 201-203, the table information structured extraction system 20 proposed by the present invention analyzes the position information and label information of the table text content in a specified document (such as a PDF document), can identify line-wrap cases and cross-page cases in tables (such as PDF annual report tables), and extracts and stores structured data from tables in which such cases occur. The method does not require converting the PDF file into a structured document such as Word or Excel, achieves high data extraction efficiency, attains relatively high recall and accuracy on large-scale datasets, and facilitates subsequent horizontal comparison analysis, vertical comparison analysis, and data modeling.
In addition, the present invention also proposes a table information structured extraction method.
Referring to Fig. 3, a flow diagram of an embodiment of the table information structured extraction method of the present invention is shown. In this embodiment, depending on requirements, the execution order of the steps in the flowchart shown in Fig. 3 may be changed, and some steps may be omitted.
Step S31: obtain the position information and label information of each line of text in the specified document (such as a PDF document). In this embodiment, a specific text recognition tool (such as the pdf2html tool) may be used to obtain the position information and label information of each line of text in the specified document. The specific text recognition tool can parse the PDF document into a text file (such as an XML file) while also parsing out the position information and label information of each line of text in the PDF document.
Preferably, in this embodiment, the position information of each line of text includes, but is not limited to, coordinate information such as the left edge coordinate, top edge coordinate, text width, and text size of each line of text. Each row of a table in the specified document is stored at adjacent positions, that is, stored in sequence according to the position information (such as the left edge coordinate) of each line of text. Further, the label information of each line of text includes, but is not limited to, the page number (the sequence number of the page on which each line of text is located), page length, and page width of each line of text in the specified document (such as a PDF document).
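For illustration, the position information and label information of step S31 might be collected as sketched below. The XML layout (a `page` element with `number`, `height`, and `width` attributes containing `text` elements with `top`, `left`, `width`, and `height` attributes) is a hypothetical format assumed for this sketch; the actual output of the text recognition tool is not specified here and may differ. The names `TextLine` and `parse_lines` are likewise illustrative.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class TextLine:
    # Label information (page-level)
    page: int
    page_height: float
    page_width: float
    # Position information (line-level)
    left: float
    top: float
    width: float
    height: float
    text: str


def parse_lines(xml_text):
    """Read every text line together with the attributes of its page."""
    root = ET.fromstring(xml_text)
    lines = []
    for page in root.iter("page"):
        for t in page.iter("text"):
            lines.append(TextLine(
                page=int(page.get("number")),
                page_height=float(page.get("height")),
                page_width=float(page.get("width")),
                left=float(t.get("left")),
                top=float(t.get("top")),
                width=float(t.get("width")),
                height=float(t.get("height")),
                text="".join(t.itertext()).strip(),
            ))
    return lines


# Hypothetical sample input, not taken from a real document.
SAMPLE = """<doc>
  <page number="1" height="1188" width="918">
    <text top="104" left="108" width="177" height="17">Customer</text>
    <text top="104" left="320" width="60" height="17">Amount</text>
  </page>
</doc>"""

lines = parse_lines(SAMPLE)
```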
Step S32: according to the position information and label information of each line of text, identify line-wrap cases and cross-page cases in the tables of the specified document.
Specifically, identifying a line-wrap case in a table of the specified document comprises the following steps A1-A2.
(A1) Locate a specific table in the specified document and obtain the position information of the specific table, such as its left edge coordinate, table width (table height), and table length. In this embodiment, the table in the specified document can be located according to specific rules of the specified document. For example, if the specified document is a PDF annual report, annual reports are published under explicit format requirements, so the specific table can be identified using annual-report rules such as the following:
For example, where major customers and suppliers are introduced, the table title may be set to "Major Customers and Major Suppliers", so the table under this title is the specific customer and supplier table. The table presenting the specific content can therefore be located by the title keyword of the specific table, which facilitates subsequent parsing. Similarly, other tables in PDF annual reports have similar formats.
(A2) Read a line of text in the specific table in sequence according to the position information of the specific table, and identify the cells wrapped within the row from that line of text according to the position information of the line. In this embodiment, reading may start from the left edge coordinate of the specific table, read the first row according to the table length of the specific table, and continue according to the table width of the specific table until the last row of the specific table has been read.
Preferably, in this embodiment, each line of text of the specific table includes multiple cells, such as a 1st cell, a 2nd cell, a 3rd cell, and a 4th cell. More specifically, the line-wrap case includes in-row wrapping and end-of-row wrapping. In-row wrapping means that a wrap occurs in an interior cell of a line of text of the specific table. End-of-row wrapping means that a wrap occurs in the last cell of a line of text of the specific table.
Preferably, in this embodiment, identifying the cells wrapped within the row from the line of text according to the position information of the line comprises the following steps A21-A22.
(A21) Obtain the text content position information of each cell in the line of text. The text content position information of each cell includes, but is not limited to, coordinate information such as the left edge coordinate, top edge coordinate, text width, and text size of the text content of each cell.
(A22) Among the cells whose text content has the same top edge coordinate, obtain the position of the cell that appears first and the position of the cell that appears last (that is, find the first and last cell positions at which the same top edge coordinate appears). Determine all cells from the first-appearing cell position to the last-appearing cell position to be in the same row, and determine any cell between those two positions whose text content has a different top edge coordinate to be a cell wrapped within the row.
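Step A22 can be illustrated by the following minimal sketch. The function name `find_in_row_wraps`, the `(top, text)` tuple layout, and the optional tolerance `tol` are assumptions of the sketch; cells are assumed to be given in the order in which they are stored.

```python
def find_in_row_wraps(cells, row_top, tol=0.0):
    """Return (row_indices, wrapped_indices) for the row whose top edge is `row_top`.

    `cells` is a list of (top, text) tuples in stored order.
    """
    hits = [i for i, (top, _) in enumerate(cells) if abs(top - row_top) <= tol]
    if not hits:
        return [], []
    first, last = hits[0], hits[-1]
    # Everything between the first and last occurrence belongs to the same row.
    row = list(range(first, last + 1))
    # Cells inside that span with a different top edge were wrapped within the row.
    wrapped = [i for i in row if abs(cells[i][0] - row_top) > tol]
    return row, wrapped


cells = [(100, "Name"), (100, "Registered address,"), (112, "continued"),
         (100, "Amount"), (130, "Next row")]
row, wrapped = find_in_row_wraps(cells, row_top=100)
```

In the sample, the cell at top edge 112 lies between the first and last cells at top edge 100, so it is treated as wrapped text inside the row, while the cell at 130 falls outside the span.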
Preferably, in other embodiments, the table information line-wrap identification method further includes step (A3): identify the cells with end-of-row wrapping from the current line of text according to the position information of the current line of text.
Specifically, identifying the cells with end-of-row wrapping from the current line of text according to the position information of the current line of text comprises the following steps A31-A33.
(A31) If cells remain in the current line of text after in-row wrap identification, obtain the text content position information of the remaining cell. The text content position information of the remaining cell includes, but is not limited to, coordinate information such as the left edge coordinate, top edge coordinate, text width, and text size of the text content of the remaining cell.
(A32) Calculate the distance between the top edge coordinate of the text content of the remaining cell and the top edge coordinates of the text content of all cells in the current row (such as the first row) and in the next row (such as the second row).
(A33) If the minimum distance occurs in the current row, merge the text content of the remaining cell into the current row and determine the remaining cell to be an end-of-row wrapped cell of the current row.
Further, if the minimum distance occurs in the next row, merge the text content of the remaining cell into the next row and determine the remaining cell to be an end-of-row wrapped cell of the next row.
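Steps A32-A33 amount to a nearest-row assignment, which might be sketched as follows. The function name `assign_trailing_cell` and the tie-breaking choice (a tie goes to the current row) are assumptions of the sketch, since the behavior for equal distances is not specified.

```python
def assign_trailing_cell(cell_top, current_row_tops, next_row_tops):
    """Decide which row a remaining cell belongs to.

    Compares the cell's top edge with the top edges of all cells in the
    current row and in the next row and picks the row holding the minimum
    distance. Ties go to the current row (assumption).
    """
    d_current = min(abs(cell_top - t) for t in current_row_tops)
    d_next = min(abs(cell_top - t) for t in next_row_tops)
    return "current" if d_current <= d_next else "next"


# Remaining cell at top edge 112: 12 units from the current row, 18 from the next.
choice_a = assign_trailing_cell(112, [100, 100, 100], [130, 130])
# Remaining cell at top edge 125: 25 units from the current row, 5 from the next.
choice_b = assign_trailing_cell(125, [100, 100, 100], [130, 130])
```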
It should be noted that, in other embodiments, identifying the cells with end-of-row wrapping from the current line of text according to the position information of the current line of text may instead comprise the following steps A34-A36.
(A34) If cells remain in the current line of text after in-row wrap identification, obtain the text content position information of the remaining cell. The text content position information of the remaining cell includes, but is not limited to, coordinate information such as the left edge coordinate, top edge coordinate, text width, and text size of the text content of the remaining cell.
(A35) Calculate the distance between the top edge coordinate of the text content of the remaining cell and the top edge coordinates of the text content of all cells in the current row (such as the second row) and in the previous row (such as the first row).
(A36) If the minimum distance occurs in the current row, merge the text content of the remaining cell into the current row and determine the remaining cell to be an end-of-row wrapped cell of the current row.
Further, if the minimum distance occurs in the previous row, merge the text content of the remaining cell into the previous row and determine the remaining cell to be an end-of-row wrapped cell of the previous row.
It should be noted that steps A1-A2, A21-A22, and A31-A36 above are described using the example of identifying line-wrap cases in a specific table (such as a customer and supplier table) of a PDF file. Those skilled in the art will understand that, in other embodiments, the above line-wrap identification method for table information may also be applied to all tables of the PDF file, which is not repeated here.
Further, identifying a cross-page case in the tables of the specified document comprises the following steps B1-B3 (method one).
(B1) For an adjacent previous table and next table in the specified document, obtain the position information and label information of the text content of the previous table and the position information and label information of the text content of the next table.
Preferably, in this embodiment, the position information of the text content of the previous table includes, but is not limited to, coordinate information for each line of text in the previous table, such as the left edge coordinate, top edge coordinate, text width, and text size, as well as the left edge coordinate of each column of text in the previous table. The label information of the text content of the previous table includes, but is not limited to, the page number (the sequence number of the page on which each line of text is located), page length, and page width of each line of text of the previous table in the specified document (such as a PDF document).
Further, the position information of the text content of the next table includes, but is not limited to, coordinate information for each line of text in the next table, such as the left edge coordinate, top edge coordinate, text width, and text size, as well as the left edge coordinate of each column of text in the next table. The label information of the text content of the next table includes, but is not limited to, the page number (the sequence number of the page on which each line of text is located), page length, and page width of each line of text of the next table in the specified document (such as a PDF document).
(B2) Compare the left edge coordinate of each column of text in the next table with the left edge coordinate of the corresponding column of text in the previous table. For example, compare the left edge coordinate of the 1st column of text in the next table with that of the 1st column of text in the previous table, compare the left edge coordinate of the 2nd column of text in the next table with that of the 2nd column of text in the previous table, and so on.
(B3) When the left edge coordinates of all columns of text in the next table are identical to those of the corresponding columns of text in the previous table (indicating that the next table and the previous table are the same table), compare the page number of each line of text in the next table with the page number of each line of text in the previous table. For example, if the bottom of the first page contains the previous table and the top of the second page contains the next table, and the left edge coordinates of all columns of text in the next table are identical to those of the corresponding columns in the previous table, the next table and the previous table are determined to be the same table.
If the page numbers of the lines of text in the next table differ from the page numbers of the lines of text in the previous table, the next table and the previous table are determined to be the same table with a cross-page case. If the page numbers are all identical, the next table and the previous table are determined to be the same table without a cross-page case, that is, the same table located on a single page.
Preferably, in this embodiment, if the difference between the left edge coordinate of each column of text in the next table and the left edge coordinate of the corresponding column of text in the previous table is less than a preset threshold (for example, 2 pixel units) for every column, the left edge coordinates of the columns of the next table are judged to be identical to those of the corresponding columns of the previous table.
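Method one (steps B2-B3 with the threshold comparison) might be sketched as follows. The function names `same_table` and `is_cross_page` and the requirement that both tables have the same number of columns are assumptions of the sketch; the threshold default is the 2-pixel example value given above.

```python
def same_table(prev_cols, next_cols, threshold=2.0):
    """Column-wise comparison of left edge coordinates (step B2).

    Two tables are treated as one table when every pair of corresponding
    columns differs by less than `threshold`. Requiring equal column counts
    is an assumption of this sketch.
    """
    return len(prev_cols) == len(next_cols) and all(
        abs(p - n) < threshold for p, n in zip(prev_cols, next_cols))


def is_cross_page(prev_cols, prev_pages, next_cols, next_pages, threshold=2.0):
    """Step B3: same table and differing page numbers means a cross-page table."""
    if not same_table(prev_cols, next_cols, threshold):
        return False
    return set(prev_pages) != set(next_pages)


prev_cols, next_cols = [50, 200, 350], [51, 200, 349]
crosses = is_cross_page(prev_cols, [1, 1, 1], next_cols, [2, 2])
same_page = is_cross_page(prev_cols, [1, 1, 1], next_cols, [1, 1])
```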
It should be noted that steps B1-B3 above (method one) are described using the example of identifying a cross-page case across two adjacent tables (a previous table and a next table) in a PDF file. Those skilled in the art will understand that, in other embodiments, the above cross-page identification method for table information may also be applied to a specific table (such as a financial table) in the PDF file (method two). Method two comprises the following steps B4-B5.
(B4) Locate a specific table in the specified document and obtain the position information and label information of the text content of the specific table. The position information of the text content of the specific table includes, but is not limited to, coordinate information for each line of text in the specific table, such as the left edge coordinate, top edge coordinate, text width, and text size. The label information of the text content of the specific table includes, but is not limited to, the page number (the sequence number of the page on which each line of text is located), page length, and page width of each line of text of the specific table in the specified document (such as a PDF document).
Specifically, the specific table in the specified document can be located according to specific rules of the specified document. For example, if the specified document is a PDF annual report, annual reports are published under explicit format requirements, so the specific table can be identified using annual-report rules such as the following.
For example, where major customers and suppliers are introduced, the table title may be set to "Major Customers and Major Suppliers", so the table under this title is the specific customer and supplier table. The table presenting the specific content can therefore be located by the title keyword of the specific table, which facilitates subsequent parsing. Similarly, other specific tables in PDF annual reports have similar formats.
(B5) Read each line of text of the specific table in sequence according to the position information (such as the top edge coordinate) of the text content of the specific table (for example, text with the same top edge coordinate belongs to the same line), and obtain the page number of each line of text from the label information of the text content of the specific table.
If the page numbers of the lines of text in the specific table differ, the specific table is determined to have a cross-page case (that is, a previous table and a next table of the specific table, located on different pages, are identified). If the page numbers of the lines of text in the specific table are all identical, the specific table is determined to have no cross-page case.
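Method two (step B5) reduces to checking whether the lines of one located table carry more than one page number, as in the following sketch. The function names `split_by_page` and `has_cross_page` and the `(top, page, text)` tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def split_by_page(lines):
    """Group the lines of one table by page number.

    `lines` is a list of (top, page, text) tuples for a single located table.
    """
    parts = defaultdict(list)
    for top, page, text in lines:
        parts[page].append((top, text))
    return dict(sorted(parts.items()))


def has_cross_page(lines):
    """A table whose lines carry more than one page number spans a page break."""
    return len(split_by_page(lines)) > 1


table = [(900, 1, "A Co."), (930, 1, "B Co."), (110, 2, "C Co.")]
parts = split_by_page(table)
```

When two pages are present, the part on the earlier page plays the role of the previous table and the part on the later page plays the role of the next table.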
Step S33: when a line-wrap case is identified in a table of the specified document, store the table information row by row (extract and store the table data by row) and column by column (extract and store the table data by column) according to the first remodeling rule, forming structured table data.
Preferably, in this embodiment, the first remodeling rule includes: storing text with the same top edge coordinate as the same row (row-wise storage), and storing text with the same left edge coordinate as the same column (column-wise storage).
Step S34: when a cross-page case is identified in a table of the specified document, store the table information row by row and column by column according to the second remodeling rule, forming structured table data.
Preferably, in this embodiment, the second remodeling rule includes:
deleting the footer of the previous page on which the previous table is located and the header of the next page on which the next table is located;
splicing the text content of the previous table after footer deletion with the text content of the next table after header deletion to form a spliced table (that is, a table on a single page); and
storing text with the same top edge coordinate in the spliced table as the same row (row-wise storage), and storing text with the same left edge coordinate in the spliced table as the same column (column-wise storage).
Specifically, deleting the footer of the previous page on which the previous table is located and the header of the next page on which the next table is located includes: locating the footer range of the previous page and the header range of the next page in the specified document according to the label information of the previous page and the next page and the specific rules of the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range.
The label information of the previous page includes, but is not limited to, the page number, page length, and page width of the previous page; the label information of the next page includes, but is not limited to, the page number, page length, and page width of the next page. The specific rules of the specified document include, but are not limited to, a first ratio (for example, 8%) of the page length occupied by the footer of the previous page and a second ratio (for example, 9%) of the page length occupied by the header of the next page. It will be appreciated that the first ratio and the second ratio may also be equal.
Further, the rule for determining the footer range of the previous page is: working from bottom to top, select the portion of the previous page corresponding to the first ratio of its page length, and take the selected portion as the footer range of the previous page. The rule for determining the header range of the next page is: working from top to bottom, select the portion of the next page corresponding to the second ratio of its page length, and take the selected portion as the header range of the next page.
Preferably, in other embodiments, the table information structured extraction method further includes the step of performing horizontal comparison analysis and vertical comparison analysis on the stored structured table data.
The horizontal comparison analysis includes: comparing the structured table data (such as accounts receivable) of different companies in the same industry within the same time range (such as the same year), in order to analyze business information such as the debt and financial condition of the different companies. The vertical comparison analysis includes: comparing the structured table data (such as accounts receivable) of the same company across different time ranges (such as the last 3 years), in order to analyze business information such as the debt and financial condition of that company (for example, changes in accounts receivable).
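As a simple illustration of the two analyses, the stored structured data might be queried as sketched below. The record layout, the field name `accounts_receivable`, the function names, and the sample figures are all hypothetical and serve only to show the shape of the comparison.

```python
def horizontal_comparison(records, year, field):
    """Same time range, different companies: one value per company."""
    return {r["company"]: r[field] for r in records if r["year"] == year}


def vertical_comparison(records, company, field):
    """Same company, different years: year-over-year change of the field."""
    series = sorted((r["year"], r[field])
                    for r in records if r["company"] == company)
    return [(y2, v2 - v1) for (_, v1), (y2, v2) in zip(series, series[1:])]


# Made-up sample records for illustration only.
records = [
    {"company": "A", "year": 2015, "accounts_receivable": 100.0},
    {"company": "A", "year": 2016, "accounts_receivable": 120.0},
    {"company": "B", "year": 2016, "accounts_receivable": 90.0},
]

by_company = horizontal_comparison(records, 2016, "accounts_receivable")
changes = vertical_comparison(records, "A", "accounts_receivable")
```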
Through the above steps S31-S34 and other related steps, the table information structured extraction method proposed by the present invention analyzes the position information and label information of the table text content in a specified document (such as a PDF document), can identify line-wrap cases and cross-page cases in tables (such as PDF annual report tables), and extracts and stores structured data from tables in which such cases occur. The method does not require converting the PDF file into a structured document such as Word or Excel, achieves high data extraction efficiency, attains relatively high recall and accuracy on large-scale datasets, and facilitates subsequent horizontal comparison analysis, vertical comparison analysis, and data modeling.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disc). The computer-readable storage medium stores the table information structured extraction system 20, which can be executed by at least one processor 22 so that the at least one processor 22 performs the steps of the table information structured extraction method described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware, although in many cases the former is the preferable implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to perform the methods described in the embodiments of the present invention.
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and the scope of rights of the present invention is not thereby limited. The serial numbers of the above embodiments are for description only and do not represent the relative merits of the embodiments. In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one shown here.
Those skilled in the art can implement the present invention in various variant ways without departing from its scope and essence; for example, a feature of one embodiment may be used in another embodiment to obtain yet another embodiment. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing a form data structuring extraction system executable on the processor, wherein the form data structuring extraction system, when executed by the processor, implements the following steps:
obtaining position information and label information of each line of text in a specified document;
identifying, according to the position information and label information of each line of text, a line feed situation and a cross-page situation in a form of the specified document;
when a line feed situation is identified in the form of the specified document, storing the form data by row and by column according to a first remodeling rule; and
when a cross-page situation is identified in the form of the specified document, storing the form data by row and by column according to a second remodeling rule.
2. The electronic device according to claim 1, characterized in that the first remodeling rule comprises: storing text with the same upper edge coordinate as the same row, and storing text with the same left margin coordinate as the same column;
the second remodeling rule comprises:
deleting the footer of the previous page, on which the previous form is located, and the header of the next page, on which the next form is located;
splicing the text content of the previous form after the footer is deleted with the text content of the next form after the header is deleted, to form a spliced form; and
storing text with the same upper edge coordinate in the spliced form as the same row, and storing text with the same left margin coordinate in the spliced form as the same column.
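For illustration only, the two remodeling rules recited above might be sketched in Python as follows. The dictionary keys `text`, `top`, and `left`, the function names `reshape` and `splice`, and the idea of offsetting the next page's upper edge coordinates by the previous page's height are all assumptions made for this sketch; the claims do not prescribe a data layout or say how coordinates from two pages are reconciled. Coordinates are assumed to be measured from the top of the page and to match exactly.

```python
def reshape(words):
    """Arrange words into a grid: same 'top' -> same row, same 'left' -> same column."""
    tops = sorted({w["top"] for w in words})
    lefts = sorted({w["left"] for w in words})
    row_of = {top: i for i, top in enumerate(tops)}
    col_of = {left: j for j, left in enumerate(lefts)}
    # Empty string marks a cell for which no word was found.
    grid = [["" for _ in lefts] for _ in tops]
    for w in words:
        grid[row_of[w["top"]]][col_of[w["left"]]] = w["text"]
    return grid


def splice(prev_words, next_words, prev_page_height):
    """Join two form fragments (footer and header already removed).

    Assumption: the next page's 'top' values are shifted by the previous
    page's height so that rows from the two pages cannot collide.
    """
    shifted = [{**w, "top": w["top"] + prev_page_height} for w in next_words]
    return list(prev_words) + shifted
```

Calling `reshape(splice(prev_words, next_words, page_height))` then yields a single grid for a form that was split across two pages. Real extracted coordinates are rarely identical to the last decimal, so a practical version would probably group coordinates within a tolerance rather than by strict equality.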
3. The electronic device according to claim 2, characterized in that deleting the footer of the previous page on which the previous form is located and the header of the next page on which the next form is located comprises:
locating the footer range of the previous page and the header range of the next page in the specified document according to the label information of the previous page and the next page and a specific rule of the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range;
wherein the rule for determining the footer range of the previous page is: in order from bottom to top, select the content within a first ratio of the page length of the previous page, and take the selected content as the footer range of the previous page; and
the rule for determining the header range of the next page is: in order from top to bottom, select the content within a second ratio of the page length of the next page, and take the selected content as the header range of the next page.
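As a rough sketch of the ratio-based rule recited above, the footer and header ranges could be computed from the page length as shown below. The function names, the `top` key, and the example ratio of 0.1 are hypothetical; the claim does not state the ratio values, and this sketch ignores the label information that the claim also uses for positioning. The upper edge coordinate is assumed to be measured downward from the top of the page.

```python
def strip_footer(words, page_height, footer_ratio):
    """Drop words in the bottom `footer_ratio` share of the page (first ratio)."""
    cutoff = page_height * (1 - footer_ratio)
    return [w for w in words if w["top"] < cutoff]


def strip_header(words, page_height, header_ratio):
    """Drop words in the top `header_ratio` share of the page (second ratio)."""
    cutoff = page_height * header_ratio
    return [w for w in words if w["top"] >= cutoff]
```

For example, with a page length of 800 and both ratios set to 0.1, words whose upper edge lies below 720 on the previous page or above 80 on the next page would be discarded before splicing.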
4. The electronic device according to claim 2, characterized in that the line feed situation comprises an in-row line feed and an end-of-row line feed;
wherein identifying the in-row line feed comprises:
obtaining the text content position information of each cell in the line of text, wherein the text content position information of each cell includes the upper edge coordinate of the text content of that cell; and
from the cells whose text content has the same upper edge coordinate, obtaining the position of the first such cell and the position of the last such cell, determining all cells from the first cell position through the last cell position to be in the same row, and determining that any cell between the first cell position and the last cell position whose text content has a different upper edge coordinate is a cell with an in-row line feed.
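The in-row line feed identification described above could be sketched as follows. This is only one possible reading: it assumes the cells arrive in reading order, that the reference upper edge coordinate `row_top` is supplied by the caller, and that coordinates match exactly. The function name `split_row` and the `top` key are hypothetical.

```python
def split_row(cells, row_top):
    """Split cells (in reading order) into the row, its wrapped cells, and the rest.

    Returns (row, wrapped, remaining):
      row       - every cell from the first to the last cell whose top == row_top
      wrapped   - cells inside that span whose top differs (in-row line feed)
      remaining - cells after the span, left for end-of-row line feed handling
    """
    hits = [i for i, c in enumerate(cells) if c["top"] == row_top]
    if not hits:
        return [], [], list(cells)
    first, last = hits[0], hits[-1]
    row = cells[first:last + 1]
    wrapped = [c for c in row if c["top"] != row_top]
    remaining = cells[last + 1:]
    return row, wrapped, remaining
```

Cells that fall after the last matching cell are returned separately because the claim treats them as candidates for the end-of-row case rather than the in-row case.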
5. The electronic device according to claim 4, characterized in that identifying the end-of-row line feed comprises:
if cells remain in the current line of text after the in-row line feed identification, obtaining the text content position information of each remaining cell, wherein the text content position information of the remaining cell includes the upper edge coordinate of the text content of the remaining cell;
calculating the distance between the text content upper edge coordinate of the remaining cell and the text content upper edge coordinates of all cells of the current row and the next row, or calculating the distance between the text content upper edge coordinate of the remaining cell and the text content upper edge coordinates of all cells of the current row and the previous row; and
if the minimum distance occurs in the current row, merging the text content of the remaining cell into the current row, and determining that the remaining cell is an end-of-row line feed cell of the current row.
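The distance comparison recited above can be illustrated with a short sketch. The function name and argument layout are hypothetical, the distance is taken as the absolute difference of upper edge coordinates, and a tie is resolved in favour of the current row; none of these choices is stated in the claim.

```python
def is_end_of_row_wrap(remaining_top, current_row_tops, neighbor_row_tops):
    """Return True if the remaining cell is closest to a cell of the current row.

    `neighbor_row_tops` holds the upper edge coordinates of either the next
    row or the previous row, whichever is being compared against.
    """
    d_current = min((abs(remaining_top - t) for t in current_row_tops),
                    default=float("inf"))
    d_neighbor = min((abs(remaining_top - t) for t in neighbor_row_tops),
                     default=float("inf"))
    # Assumption: on a tie the cell is kept with the current row.
    return d_current <= d_neighbor
```

When the function returns True, the remaining cell's text would be merged into the current row; otherwise it would be left for the neighbouring row.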
6. A form data structuring extraction method, applied to an electronic device, characterized in that the method comprises:
obtaining position information and label information of each line of text in a specified document;
identifying, according to the position information and label information of each line of text, a line feed situation and a cross-page situation in a form of the specified document;
when a line feed situation is identified in the form of the specified document, storing the form data by row and by column according to a first remodeling rule; and
when a cross-page situation is identified in the form of the specified document, storing the form data by row and by column according to a second remodeling rule.
7. The form data structuring extraction method according to claim 6, characterized in that the first remodeling rule comprises: storing text with the same upper edge coordinate as the same row, and storing text with the same left margin coordinate as the same column;
the second remodeling rule comprises:
deleting the footer of the previous page, on which the previous form is located, and the header of the next page, on which the next form is located;
splicing the text content of the previous form after the footer is deleted with the text content of the next form after the header is deleted, to form a spliced form; and
storing text with the same upper edge coordinate in the spliced form as the same row, and storing text with the same left margin coordinate in the spliced form as the same column.
8. The form data structuring extraction method according to claim 7, characterized in that deleting the footer of the previous page on which the previous form is located and the header of the next page on which the next form is located comprises:
locating the footer range of the previous page and the header range of the next page in the specified document according to the label information of the previous page and the next page and a specific rule of the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range;
wherein the rule for determining the footer range of the previous page is: in order from bottom to top, select the content within a first ratio of the page length of the previous page, and take the selected content as the footer range of the previous page; and
the rule for determining the header range of the next page is: in order from top to bottom, select the content within a second ratio of the page length of the next page, and take the selected content as the header range of the next page.
9. The form data structuring extraction method according to claim 7, characterized in that identifying the cross-page situation comprises:
for an adjacent previous form and next form in the specified document, obtaining the position information and label information of the text content of the previous form and the position information and label information of the text content of the next form;
comparing the left margin coordinate of each column of text in the next form with the left margin coordinate of the corresponding column of text in the previous form;
when the left margin coordinates of every column of text in the next form are all the same as those of the corresponding columns in the previous form, comparing the page number of each line of text in the next form with the page number of each line of text in the previous form; and
if the page numbers of the lines of text in the next form differ from those of the lines of text in the previous form, determining that the next form and the previous form are the same form in a cross-page situation.
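The cross-page check recited above might look like the following sketch. The function name `is_cross_page` and the keys `left` and `page` are hypothetical, column correspondence is approximated by comparing the sorted sets of left margin coordinates, and "the page numbers differ" is read as the two forms not covering the same set of pages. These are assumptions for illustration, not the patent's stated implementation.

```python
def is_cross_page(prev_form, next_form):
    """Decide whether two adjacent form fragments are one form split across pages."""
    prev_lefts = sorted({w["left"] for w in prev_form})
    next_lefts = sorted({w["left"] for w in next_form})
    # Step 1: every column must start at the same left margin coordinate.
    if prev_lefts != next_lefts:
        return False
    # Step 2: the two fragments must sit on different pages.
    prev_pages = {w["page"] for w in prev_form}
    next_pages = {w["page"] for w in next_form}
    return prev_pages != next_pages
```

Two fragments with identical column positions on the same page are therefore treated as separate forms, while identical column positions on consecutive pages are treated as one form that continues across the page break.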
10. A computer-readable storage medium, the computer-readable storage medium storing a form data structuring extraction system, the form data structuring extraction system being executable by at least one processor so that the at least one processor performs the steps of the form data structuring extraction method according to any one of claims 6-9.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710962303.5A CN107818075A (en) | 2017-10-16 | 2017-10-16 | Form data structuring extracting method, electronic equipment and computer-readable recording medium |
PCT/CN2018/076167 WO2019075969A1 (en) | 2017-10-16 | 2018-02-10 | Method for extracting form information in a structured manner, electronic device, and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710962303.5A CN107818075A (en) | 2017-10-16 | 2017-10-16 | Form data structuring extracting method, electronic equipment and computer-readable recording medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107818075A (en) | 2018-03-20 |
Family
ID=61608392
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710962303.5A (Pending; published as CN107818075A) | Form data structuring extracting method, electronic equipment and computer-readable recording medium | 2017-10-16 | 2017-10-16 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107818075A (en) |
WO (1) | WO2019075969A1 (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109002425A (en) * | 2018-06-19 | 2018-12-14 | 平安科技(深圳)有限公司 | Acquisition methods, terminal device and the medium of enterprise's upstream-downstream relationship |
CN109062874A (en) * | 2018-06-12 | 2018-12-21 | 平安科技(深圳)有限公司 | Acquisition methods, terminal device and the medium of financial data |
CN109522538A (en) * | 2018-11-28 | 2019-03-26 | 腾讯科技(深圳)有限公司 | Table content divides column method, apparatus, equipment and storage medium automatically |
CN109542898A (en) * | 2018-10-30 | 2019-03-29 | 天津字节跳动科技有限公司 | Date storage method, device, electronic equipment and the storage medium of data bank table |
CN109871524A (en) * | 2019-02-21 | 2019-06-11 | 腾讯科技(深圳)有限公司 | A kind of chart generation method and device |
CN110032718A (en) * | 2019-04-12 | 2019-07-19 | 广州广燃设计有限公司 | A kind of table conversion method, system and storage medium |
CN110489423A (en) * | 2019-08-26 | 2019-11-22 | 北京香侬慧语科技有限责任公司 | A kind of method, apparatus of information extraction, storage medium and electronic equipment |
CN110489424A (en) * | 2019-08-26 | 2019-11-22 | 北京香侬慧语科技有限责任公司 | A kind of method, apparatus, storage medium and the electronic equipment of tabular information extraction |
CN110909123A (en) * | 2019-10-23 | 2020-03-24 | 深圳价值在线信息科技股份有限公司 | Data extraction method and device, terminal equipment and storage medium |
CN112287660A (en) * | 2019-12-04 | 2021-01-29 | 上海柯林布瑞信息技术有限公司 | Method and device for analyzing table in PDF file, computing equipment and storage medium |
CN112380825A (en) * | 2020-11-17 | 2021-02-19 | 平安科技(深圳)有限公司 | PDF document page-crossing table merging method and device, electronic equipment and storage medium |
CN112464626A (en) * | 2020-12-09 | 2021-03-09 | 上海携宁计算机科技股份有限公司 | Graph extraction method of PDF (Portable document Format) document, electronic equipment and storage medium |
CN112632927A (en) * | 2020-12-30 | 2021-04-09 | 上海犀语科技有限公司 | Table fragment link restoration method and system based on semantic processing |
CN112651331A (en) * | 2020-12-24 | 2021-04-13 | 万兴科技集团股份有限公司 | Text table extraction method, system, computer device and storage medium |
CN113111864A (en) * | 2021-05-13 | 2021-07-13 | 上海巽联信息科技有限公司 | Intelligent table extraction algorithm based on multiple modes |
CN113361257A (en) * | 2021-06-29 | 2021-09-07 | 深圳壹账通智能科技有限公司 | PDF document analysis method, system, electronic device and storage medium |
CN113869014A (en) * | 2021-08-25 | 2021-12-31 | 盐城金堤科技有限公司 | Extraction method and device of table data, storage medium and electronic equipment |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11436249B1 (en) | 2021-03-26 | 2022-09-06 | International Business Machines Corporation | Transformation of composite tables into structured database content |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102508826A (en) * | 2011-11-03 | 2012-06-20 | 汉王科技股份有限公司 | Method and device for displaying table in document |
CN102722475A (en) * | 2012-05-09 | 2012-10-10 | 深圳市万兴软件有限公司 | Method for converting form in portable document format (PDF) document into Excel form |
CN102855232A (en) * | 2012-09-14 | 2013-01-02 | 同方光盘股份有限公司 | Table analysis and edit processing method |
CN106951400A (en) * | 2017-02-06 | 2017-07-14 | 北京因果树网络科技有限公司 | The information extraction method and device of a kind of pdf document |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090282009A1 (en) * | 2008-05-09 | 2009-11-12 | Tags Ltd | System, method, and program product for automated grading |
US20120265759A1 (en) * | 2011-04-15 | 2012-10-18 | Xerox Corporation | File processing of native file formats |
CN104268127B (en) * | 2014-09-22 | 2018-02-09 | 同方知网(北京)技术有限公司 | A kind of method of electronics shelves layout files reading order analysis |
2017
- 2017-10-16 CN CN201710962303.5A patent/CN107818075A/en active Pending
2018
- 2018-02-10 WO PCT/CN2018/076167 patent/WO2019075969A1/en active Application Filing
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109062874A (en) * | 2018-06-12 | 2018-12-21 | 平安科技(深圳)有限公司 | Acquisition methods, terminal device and the medium of financial data |
CN109062874B (en) * | 2018-06-12 | 2022-03-04 | 平安科技(深圳)有限公司 | Financial data acquisition method, terminal device and medium |
WO2019237540A1 (en) * | 2018-06-12 | 2019-12-19 | 平安科技(深圳)有限公司 | Method and device for acquiring financial data, terminal device, and medium |
CN109002425A (en) * | 2018-06-19 | 2018-12-14 | 平安科技(深圳)有限公司 | Acquisition methods, terminal device and the medium of enterprise's upstream-downstream relationship |
CN109002425B (en) * | 2018-06-19 | 2022-03-22 | 平安科技(深圳)有限公司 | Method for acquiring upstream and downstream relations of enterprise, terminal device and medium |
CN109542898A (en) * | 2018-10-30 | 2019-03-29 | 天津字节跳动科技有限公司 | Date storage method, device, electronic equipment and the storage medium of data bank table |
US11487935B2 (en) | 2018-11-28 | 2022-11-01 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for automatically splitting table content into columns, computer device, and storage medium |
CN109522538B (en) * | 2018-11-28 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Automatic listing method, device, equipment and storage medium for table contents |
WO2020108257A1 (en) * | 2018-11-28 | 2020-06-04 | 腾讯科技(深圳)有限公司 | Method and device for automatically splitting table content into columns, computer apparatus, and storage medium |
CN109522538A (en) * | 2018-11-28 | 2019-03-26 | 腾讯科技(深圳)有限公司 | Table content divides column method, apparatus, equipment and storage medium automatically |
CN109871524A (en) * | 2019-02-21 | 2019-06-11 | 腾讯科技(深圳)有限公司 | A kind of chart generation method and device |
CN110032718B (en) * | 2019-04-12 | 2023-04-18 | 广州广燃设计有限公司 | Table conversion method, system and storage medium |
CN110032718A (en) * | 2019-04-12 | 2019-07-19 | 广州广燃设计有限公司 | A kind of table conversion method, system and storage medium |
CN110489424A (en) * | 2019-08-26 | 2019-11-22 | 北京香侬慧语科技有限责任公司 | A kind of method, apparatus, storage medium and the electronic equipment of tabular information extraction |
CN110489423B (en) * | 2019-08-26 | 2021-10-08 | 北京香侬慧语科技有限责任公司 | Information extraction method and device, storage medium and electronic equipment |
CN110489423A (en) * | 2019-08-26 | 2019-11-22 | 北京香侬慧语科技有限责任公司 | A kind of method, apparatus of information extraction, storage medium and electronic equipment |
CN110909123A (en) * | 2019-10-23 | 2020-03-24 | 深圳价值在线信息科技股份有限公司 | Data extraction method and device, terminal equipment and storage medium |
CN110909123B (en) * | 2019-10-23 | 2023-08-25 | 深圳价值在线信息科技股份有限公司 | Data extraction method and device, terminal equipment and storage medium |
CN112287660A (en) * | 2019-12-04 | 2021-01-29 | 上海柯林布瑞信息技术有限公司 | Method and device for analyzing table in PDF file, computing equipment and storage medium |
CN112380825A (en) * | 2020-11-17 | 2021-02-19 | 平安科技(深圳)有限公司 | PDF document page-crossing table merging method and device, electronic equipment and storage medium |
WO2022105172A1 (en) * | 2020-11-17 | 2022-05-27 | 平安科技(深圳)有限公司 | Pdf document cross-page table merging method and apparatus, electronic device and storage medium |
CN112380825B (en) * | 2020-11-17 | 2022-07-15 | 平安科技(深圳)有限公司 | PDF document cross-page table merging method and device, electronic equipment and storage medium |
CN112464626A (en) * | 2020-12-09 | 2021-03-09 | 上海携宁计算机科技股份有限公司 | Graph extraction method of PDF (Portable document Format) document, electronic equipment and storage medium |
CN112464626B (en) * | 2020-12-09 | 2022-04-01 | 上海携宁计算机科技股份有限公司 | Graph extraction method of PDF (Portable document Format) document, electronic equipment and storage medium |
CN112651331A (en) * | 2020-12-24 | 2021-04-13 | 万兴科技集团股份有限公司 | Text table extraction method, system, computer device and storage medium |
CN112651331B (en) * | 2020-12-24 | 2024-04-16 | 万兴科技集团股份有限公司 | Text form extraction method, system, computer device and storage medium |
CN112632927A (en) * | 2020-12-30 | 2021-04-09 | 上海犀语科技有限公司 | Table fragment link restoration method and system based on semantic processing |
CN113111864A (en) * | 2021-05-13 | 2021-07-13 | 上海巽联信息科技有限公司 | Intelligent table extraction algorithm based on multiple modes |
CN113361257A (en) * | 2021-06-29 | 2021-09-07 | 深圳壹账通智能科技有限公司 | PDF document analysis method, system, electronic device and storage medium |
CN113869014A (en) * | 2021-08-25 | 2021-12-31 | 盐城金堤科技有限公司 | Extraction method and device of table data, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
WO2019075969A1 (en) | 2019-04-25 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN107818075A (en) | Form data structuring extracting method, electronic equipment and computer-readable recording medium | |
CN107832676A (en) | Form data line feed recognition methods, electronic equipment and computer-readable recording medium | |
CN107844468A (en) | The cross-page recognition methods of form data, electronic equipment and computer-readable recording medium | |
CN106156239B (en) | Table extraction method and device | |
CN107688789A (en) | Document charts abstracting method, electronic equipment and computer-readable recording medium | |
CN102270206A (en) | Method and device for capturing valid web page contents | |
CN107689070A (en) | Chart data structuring extracting method, electronic equipment and computer-readable recording medium | |
CN107679084A (en) | Cluster labels generation method, electronic equipment and computer-readable recording medium | |
CN104239298A (en) | Text message recommendation method, server, browser and system | |
CN108509424A (en) | Institutional information processing method, device, computer equipment and storage medium | |
CN114238575A (en) | Document parsing method, system, computer device and computer-readable storage medium | |
CN108038120A (en) | Collaborative filtering recommending method, electronic equipment and computer-readable recording medium | |
CN111191079A (en) | Document content acquisition method, device, equipment and storage medium | |
CN109828756A (en) | The method and electronic device of the code of insurance page are generated based on wechat small routine | |
CN110020312A (en) | The method and apparatus for extracting Web page text | |
CN105320734A (en) | Web page core content extraction method | |
CN110516048A (en) | The extracting method, equipment and storage medium of list data in pdf document | |
CN111369294B (en) | Software cost estimation method and device | |
CN103942211A (en) | Text page recognition method and device | |
CN107766322A (en) | Entity recognition method, electronic equipment and computer-readable recording medium of the same name | |
CN106777281A (en) | For improving web crawlers stability, the data processing method of availability and device | |
CN109710224A (en) | Page processing method, device, equipment and storage medium | |
CN105589918A (en) | Method and device for extracting page information | |
CN107832374A (en) | Construction method, electronic installation and the storage medium in standard knowledge storehouse | |
CN111679825A (en) | Cascading style sheet generation method and device, computer equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |