CN107818075A - Method for structured extraction of table information, electronic device, and computer-readable storage medium

Publication number
CN107818075A
Authority
CN
China
Prior art keywords
word
page
previous
document
line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710962303.5A
Other languages
Chinese (zh)
Inventor
苏晓明
汪伟
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201710962303.5A (patent CN107818075A)
Priority to PCT/CN2018/076167 (patent WO2019075969A1)
Publication of CN107818075A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets

Abstract

The invention discloses a method for structured extraction of table information, comprising the steps of: obtaining the position information and label information of each line of text in a specified document; identifying line-wrap and cross-page situations in the tables of the specified document according to the position information and label information of each line of text; when a line-wrap situation is identified in a table of the specified document, storing the table information by row and by column according to a first reconstruction rule; and when a cross-page situation is identified in a table of the specified document, storing the table information by row and by column according to a second reconstruction rule. The invention enables structured data to be extracted and stored.

Description

Method for structured extraction of table information, electronic device, and computer-readable storage medium
Technical field
The present invention relates to the field of computer information technology, and in particular to a method for structured extraction of table information, an electronic device, and a computer-readable storage medium.
Background
Existing extraction of table information from PDF annual reports is generally based on OCR. However, when a table contains line wraps, page breaks, or interfering special characters, OCR cannot restore and reconstruct the original table information or integrate it into a structured form. This makes the result hard for users to understand and hinders subsequent comparison of information. Table information extraction methods in the prior art are therefore inadequately designed and in need of improvement.
Summary of the invention
In view of this, the present invention proposes a method for structured extraction of table information, an electronic device, and a computer-readable storage medium. By analyzing the position information and label information of the table text in a specified document (e.g. a PDF document), line-wrap and cross-page situations in a table (e.g. a table in a PDF annual report) can be identified, and structured data can be extracted from tables with such situations and stored.
First, to achieve the above object, the present invention proposes an electronic device comprising a memory and a processor. The memory stores a table information structured extraction system that can run on the processor, and the following steps are carried out when the system is executed by the processor:
Obtaining the position information and label information of each line of text in a specified document;
Identifying line-wrap and cross-page situations in the tables of the specified document according to the position information and label information of each line of text;
When a line-wrap situation is identified in a table of the specified document, storing the table information by row and by column according to a first reconstruction rule; and
When a cross-page situation is identified in a table of the specified document, storing the table information by row and by column according to a second reconstruction rule.
Preferably, the first reconstruction rule includes: storing text with the same top-edge coordinate as the same row, and storing text with the same left-edge coordinate as the same column;
The second reconstruction rule includes:
Deleting the footer of the previous page, on which the previous table is located, and the header of the next page, on which the next table is located;
Splicing the text of the previous table (after footer deletion) and the text of the next table (after header deletion) into a spliced table; and
Storing text with the same top-edge coordinate in the spliced table as the same row, and storing text with the same left-edge coordinate in the spliced table as the same column.
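The row and column grouping that both reconstruction rules rely on can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `(text, left, top)` tuple shape and the function name `rebuild_grid` are assumptions, since the source does not specify a data layout.

```python
def rebuild_grid(words):
    """Group text fragments into a row/column grid.

    words: list of (text, left, top) tuples (hypothetical input shape).
    Fragments sharing a top-edge coordinate go into the same row;
    fragments sharing a left-edge coordinate go into the same column.
    """
    tops = sorted({top for _, _, top in words})
    lefts = sorted({left for _, left, _ in words})
    row_of = {top: i for i, top in enumerate(tops)}
    col_of = {left: j for j, left in enumerate(lefts)}
    # Cells with no text stay as empty strings.
    grid = [["" for _ in lefts] for _ in tops]
    for text, left, top in words:
        grid[row_of[top]][col_of[left]] = text
    return grid


words = [
    ("Customer", 50, 100), ("Amount", 200, 100),
    ("A Co.", 50, 120), ("1,000", 200, 120),
]
grid = rebuild_grid(words)
```

Exact coordinate equality is used here because that is how the rule is worded; a real extractor would probably need a small tolerance.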
Preferably, deleting the footer of the previous page, on which the previous table is located, and the header of the next page, on which the next table is located, includes:
Locating the footer range of the previous page and the header range of the next page in the specified document according to the label information of the previous page and the next page and the specific rules of the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range;
Wherein the rule for determining the footer range of the previous page is: working from bottom to top, select the content within a first ratio of the page length of the previous page, and take the selected content as the footer range of the previous page; and
The rule for determining the header range of the next page is: working from top to bottom, select the content within a second ratio of the page length of the next page, and take the selected content as the header range of the next page.
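The ratio-based footer and header ranges can be sketched as below. The source gives no values for the first and second ratios, so the defaults of 0.08 are purely hypothetical, as are the function name and the `(text, top)` tuple shape. The sketch also assumes that top-edge coordinates increase downward from the top of the page.

```python
def strip_footer_and_header(prev_page, next_page, page_length,
                            footer_ratio=0.08, header_ratio=0.08):
    """Drop footer lines from the previous page and header lines from the next.

    prev_page, next_page: lists of (text, top) tuples (hypothetical shape).
    footer_ratio, header_ratio: the "first ratio" and "second ratio";
    the default values are assumptions, not taken from the source.
    """
    footer_start = page_length * (1 - footer_ratio)  # bottom band of the page
    header_end = page_length * header_ratio          # top band of the page
    prev_kept = [(t, top) for t, top in prev_page if top < footer_start]
    next_kept = [(t, top) for t, top in next_page if top >= header_end]
    return prev_kept, next_kept


prev_page = [("A Co.  1,000", 500), ("Page 3", 950)]
next_page = [("2016 Annual Report", 30), ("B Co.  2,500", 120)]
prev_kept, next_kept = strip_footer_and_header(prev_page, next_page, 1000)
```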
Preferably, the line-wrap situation includes in-row wraps and end-of-row wraps;
Wherein identifying an in-row wrap includes:
Obtaining the text position information of each cell in the row of text, wherein the text position information of each cell includes the top-edge coordinate of the text in that cell; and
Among the cells whose text has the same top-edge coordinate, obtaining the position of the first such cell and the position of the last such cell, treating all cells from the first position to the last position as belonging to the same row, and judging any cell between those two positions whose text has a different top-edge coordinate to be an in-row wrapped cell.
Preferably, identifying an end-of-row wrap includes:
If a remaining cell exists in the current row of text after in-row wrap identification, obtaining the text position information of the remaining cell, wherein the text position information of the remaining cell includes the top-edge coordinate of the text in the remaining cell;
Calculating the distance between the top-edge coordinate of the text in the remaining cell and the top-edge coordinates of the text in all cells of the current row and the next row, or calculating the distance between the top-edge coordinate of the text in the remaining cell and the top-edge coordinates of the text in all cells of the current row and the previous row; and
If the minimum distance occurs in the current row, merging the text of the remaining cell into the current row and judging the remaining cell to be an end-of-row wrapped cell of the current row.
In addition, to achieve the above object, the present invention also provides a method for structured extraction of table information, applied to an electronic device, the method including:
Obtaining the position information and label information of each line of text in a specified document;
Identifying line-wrap and cross-page situations in the tables of the specified document according to the position information and label information of each line of text;
When a line-wrap situation is identified in a table of the specified document, storing the table information by row and by column according to a first reconstruction rule; and
When a cross-page situation is identified in a table of the specified document, storing the table information by row and by column according to a second reconstruction rule.
Preferably, the first reconstruction rule includes: storing text with the same top-edge coordinate as the same row, and storing text with the same left-edge coordinate as the same column;
The second reconstruction rule includes:
Deleting the footer of the previous page, on which the previous table is located, and the header of the next page, on which the next table is located;
Splicing the text of the previous table (after footer deletion) and the text of the next table (after header deletion) into a spliced table; and
Storing text with the same top-edge coordinate in the spliced table as the same row, and storing text with the same left-edge coordinate in the spliced table as the same column.
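The splicing step of the second reconstruction rule might look like the sketch below. The source does not say how coordinates from two different pages are reconciled, so shifting the next page's top-edge coordinates below the previous page's last row is an assumption made here, as are the names `splice_tables` and `to_rows` and the `(text, left, top)` tuple shape.

```python
def splice_tables(prev_words, next_words):
    """Append the next-page fragments below the previous-page fragments.

    Both inputs are lists of (text, left, top) tuples with headers and
    footers already removed. Next-page tops are shifted so that they sort
    after every previous-page row (an assumption; see lead-in).
    """
    offset = max((top for _, _, top in prev_words), default=-1) + 1
    shifted = [(text, left, top + offset) for text, left, top in next_words]
    return list(prev_words) + shifted


def to_rows(words):
    """Same top edge -> same row; cells ordered by left edge within a row."""
    rows = {}
    for text, left, top in words:
        rows.setdefault(top, []).append((left, text))
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items())]


prev_words = [("Name", 50, 800), ("Amount", 200, 800),
              ("A Co.", 50, 820), ("1,000", 200, 820)]
next_words = [("B Co.", 50, 100), ("2,500", 200, 100)]
table = to_rows(splice_tables(prev_words, next_words))
```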
Preferably, deleting the footer of the previous page, on which the previous table is located, and the header of the next page, on which the next table is located, includes:
Locating the footer range of the previous page and the header range of the next page in the specified document according to the label information of the previous page and the next page and the specific rules of the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range;
Wherein the rule for determining the footer range of the previous page is: working from bottom to top, select the content within a first ratio of the page length of the previous page, and take the selected content as the footer range of the previous page; and
The rule for determining the header range of the next page is: working from top to bottom, select the content within a second ratio of the page length of the next page, and take the selected content as the header range of the next page.
Preferably, identifying the cross-page situation includes:
For an adjacent previous table and next table in the specified document, obtaining the position information and label information of the text of the previous table and the position information and label information of the text of the next table;
Comparing the left-edge coordinate of each column of text in the next table with the left-edge coordinate of the corresponding column of text in the previous table;
When the left-edge coordinate of every column of text in the next table is identical to the left-edge coordinate of the corresponding column of text in the previous table, comparing the page number of each line of text in the next table with the page number of each line of text in the previous table; and
If the page numbers of the lines of text in the next table differ from the page numbers of the lines of text in the previous table, judging the next table and the previous table to be the same table with a cross-page situation.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a table information structured extraction system that can be executed by at least one processor, so that the at least one processor carries out the steps of the method for structured extraction of table information described above.
Compared with the prior art, the electronic device, the method for structured extraction of table information, and the computer-readable storage medium proposed by the present invention analyze the position information and label information of the table text in a specified document (e.g. a PDF document), and can thereby identify line-wrap and cross-page situations in a table (e.g. a table in a PDF annual report) and extract and store structured data from tables with such situations. The method does not need to convert the PDF document into a structured document such as Word or Excel. Data extraction is efficient, recall and accuracy are relatively high on large-scale datasets, and the result facilitates subsequent horizontal comparison analysis, longitudinal comparison analysis, and data modeling.
Brief description of the drawings
Fig. 1 is a schematic diagram of an optional hardware architecture of the electronic device of the present invention;
Fig. 2 is a schematic diagram of the program modules of an embodiment of the table information structured extraction system in the electronic device of the present invention;
Fig. 3 is a flow chart of an embodiment of the method for structured extraction of table information of the present invention.
Reference numerals:
The realization of the objects, the functional features, and the advantages of the present invention will be further described with reference to the drawings in conjunction with the embodiments.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention without creative effort fall within the scope of protection of the invention.
It should be noted that descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may thus explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, provided that the combination can be implemented by a person of ordinary skill in the art. Where a combination of technical solutions is contradictory or cannot be implemented, the combination should be regarded as non-existent and outside the scope of protection claimed by the invention.
It should further be noted that, in this document, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
First, the present invention proposes an electronic device 2.
Fig. 1 is a schematic diagram of an optional hardware architecture of the electronic device 2 of the present invention. In this embodiment, the electronic device 2 may include, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with one another through a system bus. It should be pointed out that Fig. 1 shows only the electronic device 2 with components 21-23; not all of the components shown are required, and more or fewer components may be implemented instead.
The electronic device 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server, and may be an independent server or a server cluster composed of multiple servers.
The memory 21 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g. SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as its hard disk or internal memory. In other embodiments, the memory 21 may be an external storage device of the electronic device 2, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, or flash card fitted to the electronic device 2. The memory 21 may of course include both an internal storage unit of the electronic device 2 and an external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the electronic device 2, such as the program code of the table information structured extraction system 20. The memory 21 may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the electronic device 2, for example to perform control and processing related to data interaction or communication with the electronic device 2. In this embodiment, the processor 22 is used to run the program code or process the data stored in the memory 21, for example to run the table information structured extraction system 20.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 is used to connect the electronic device 2 to an external data platform through a network and to establish a data transmission channel and a communication connection between the electronic device 2 and the external data platform. The network may be a wireless or wired network such as an intranet, the Internet, Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
The application environment of the embodiments of the present invention and the hardware structure and functions of the relevant devices have now been described in detail. The embodiments of the present invention are proposed below on the basis of this application environment and these devices.
Fig. 2 is a program module diagram of an embodiment of the table information structured extraction system 20 in the electronic device 2 of the present invention. In this embodiment, the table information structured extraction system 20 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to carry out the present invention. For example, in Fig. 2, the table information structured extraction system 20 may be divided into an acquisition module 201, an identification module 202, and a storage module 203. A program module in the sense of the present invention is a series of computer program instruction segments capable of performing a specific function, and is better suited than a program for describing the execution of the table information structured extraction system 20 in the electronic device 2. The functions of the program modules 201-203 are described in detail below.
The acquisition module 201 is configured to obtain the position information and label information of each line of text in a specified document (e.g. a PDF document). In this embodiment, a specific text recognition tool (e.g. a pdf2html tool) may be used to obtain the position information and label information of each line of text in the specified document. The tool can parse the PDF document into a text file (e.g. an XML file) and at the same time parse out the position information and label information of each line of text in the PDF document.
Preferably, in this embodiment, the position information of each line of text includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text size of that line. The rows of a table in the specified document are stored at adjacent positions, that is, in order of the position information (e.g. the left-edge coordinate) of each line of text. Further, the label information of each line of text includes, but is not limited to, the page number of that line in the specified document (the sequence number of the page on which the line is located), the page length, and the page width.
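A possible in-memory representation of this position and label information is sketched below, together with a small parser for an XML file. The XML layout, its element and attribute names, the `TextLine` fields, and the mapping of "text size" to a `height` field are all assumptions for illustration; they are not the documented output of any particular tool.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class TextLine:
    text: str
    left: int          # left-edge coordinate
    top: int           # top-edge coordinate
    width: int         # text width
    height: int        # text size, modeled here as height (assumption)
    page_number: int   # label information: page on which the line sits
    page_length: int   # label information: page length
    page_width: int    # label information: page width


# Hypothetical XML shape; a real tool's output may differ.
SAMPLE_XML = """
<document>
  <page number="3" height="1000" width="700">
    <text top="100" left="50" width="80" height="12">Customer</text>
    <text top="100" left="200" width="60" height="12">Amount</text>
  </page>
</document>
"""


def parse_lines(xml_text):
    """Turn every <text> element into a TextLine carrying its page's labels."""
    lines = []
    for page in ET.fromstring(xml_text).iter("page"):
        for node in page.iter("text"):
            lines.append(TextLine(
                text="".join(node.itertext()).strip(),
                left=int(node.get("left")),
                top=int(node.get("top")),
                width=int(node.get("width")),
                height=int(node.get("height")),
                page_number=int(page.get("number")),
                page_length=int(page.get("height")),
                page_width=int(page.get("width")),
            ))
    return lines


lines = parse_lines(SAMPLE_XML)
```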
The identification module 202 is configured to identify line-wrap and cross-page situations in the tables of the specified document according to the position information and label information of each line of text.
Specifically, identifying a line-wrap situation in the tables of the specified document comprises the following steps A1-A2.
(A1) Locate a specific table in the specified document and obtain the position information of the specific table, such as its left-edge coordinate, table width (table height), and table length. In this embodiment, the tables in the specified document can be located by means of the specific rules of the specified document. For example, if the specified document is a PDF annual report, the publication of annual reports is subject to clear formatting requirements, and a specific table can be identified using annual report rules such as the following:
For example, when major customers and suppliers are introduced, the table title may be set to "Major customers and major suppliers", so the table under this title is the specific customer and supplier table. The table introducing specific content can then be located from the keywords in its title, which facilitates subsequent parsing. Other tables in PDF annual reports follow a similar format.
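Locating a table by its title keywords, as described in step A1, could be sketched like this. The function name, the list-of-strings input, and the sample keywords are hypothetical; the source only states that the title keyword of a specific table is used to locate it.

```python
def find_table_start(lines, keywords):
    """Return the index of the first line after a title containing all keywords.

    lines: text lines of the document in reading order (hypothetical shape).
    Returns None when no title line matches.
    """
    for index, text in enumerate(lines):
        if all(keyword in text for keyword in keywords):
            return index + 1  # the table body starts after its title
    return None


lines = [
    "Section 4 Business overview",
    "Major customers and major suppliers",
    "Customer  Amount",
    "A Co.  1,000",
]
start = find_table_start(lines, ["customers", "suppliers"])
```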
(A2) Read the text of the specific table one row at a time according to the position information of the table, and identify the in-row wrapped cells in each row of text according to the position information of that row. In this embodiment, reading may start from the left-edge coordinate of the specific table, read the first row according to the table length, and continue according to the table width until the last row of the specific table has been read.
Preferably, in this embodiment, each row of text in the specific table includes multiple cells, such as a 1st cell, 2nd cell, 3rd cell, and 4th cell. More specifically, the line-wrap situation includes in-row wraps and end-of-row wraps. An in-row wrap means that a wrap exists in an interior cell of a row of text in the specific table. An end-of-row wrap means that a wrap exists in the last cell of a row of text in the specific table.
Preferably, in this embodiment, identifying the in-row wrapped cells in a row of text according to the position information of that row comprises the following steps A21-A22.
(A21) Obtain the text position information of each cell in the row of text, wherein the text position information of each cell includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text size of the text in that cell.
(A22) Among the cells whose text has the same top-edge coordinate, obtain the position of the first such cell and the position of the last such cell (that is, find where that top-edge coordinate first occurs and where it last occurs). Treat all cells from the first position to the last position as belonging to the same row, and judge any cell between those two positions whose text has a different top-edge coordinate to be an in-row wrapped cell.
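Steps A21-A22 can be sketched as follows. The source does not say which top-edge coordinate serves as the reference, so using the first cell's coordinate is an assumption here, as are the function name and the `(text, top)` tuple shape.

```python
def find_in_row_wraps(cells):
    """Split one run of cells into the row, its wrapped cells, and leftovers.

    cells: list of (text, top) tuples in reading order, beginning with the
    first cell of a row (hypothetical shape). The reference top edge is taken
    from the first cell, which is an assumption.
    """
    ref_top = cells[0][1]
    matches = [i for i, (_, top) in enumerate(cells) if top == ref_top]
    first, last = matches[0], matches[-1]
    row = cells[first:last + 1]
    # Cells inside the span whose top edge differs are in-row wraps.
    wrapped = [i for i in range(first, last + 1) if cells[i][1] != ref_top]
    # Anything after the last match is left for end-of-row handling.
    remaining = cells[last + 1:]
    return row, wrapped, remaining


cells = [("A Co.", 100), ("Long product", 100), ("name", 112),
         ("1,000", 100), ("(note)", 112)]
row, wrapped, remaining = find_in_row_wraps(cells)
```

In this example the fragment "name" is flagged as an in-row wrap, while "(note)" falls outside the span and is left as a remaining cell.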
Preferably, in other embodiments, table line-wrap identification also includes the step: (A3) identify the end-of-row wrapped cell in the current row of text according to the position information of the current row.
Specifically, identifying the end-of-row wrapped cell in the current row of text according to the position information of the current row comprises the following steps A31-A33.
(A31) If a remaining cell exists in the current row of text after in-row wrap identification, obtain the text position information of the remaining cell. The text position information of the remaining cell includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text size of the text in the remaining cell.
(A32) Calculate the distance between the top-edge coordinate of the text in the remaining cell and the top-edge coordinates of the text in all cells of the current row (e.g. the first row) and the next row (e.g. the second row).
(A33) If the minimum distance occurs in the current row, merge the text of the remaining cell into the current row and judge the remaining cell to be an end-of-row wrapped cell of the current row.
Further, if the minimum distance occurs in the next row, merge the text of the remaining cell into the next row and judge the remaining cell to be an end-of-row wrapped cell of the next row.
It should be noted that, in other embodiments, identifying the end-of-row wrapped cell in the current row of text according to the position information of the current row may instead comprise the following steps A34-A36.
(A34) If a remaining cell exists in the current row of text after in-row wrap identification, obtain the text position information of the remaining cell. The text position information of the remaining cell includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text size of the text in the remaining cell.
(A35) Calculate the distance between the top-edge coordinate of the text in the remaining cell and the top-edge coordinates of the text in all cells of the current row (e.g. the second row) and the previous row (e.g. the first row).
(A36) If the minimum distance occurs in the current row, merge the text of the remaining cell into the current row and judge the remaining cell to be an end-of-row wrapped cell of the current row.
Further, if the minimum distance occurs in the previous row, merge the text of the remaining cell into the previous row and judge the remaining cell to be an end-of-row wrapped cell of the previous row.
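The minimum-distance test used in both variants (A31-A33 and A34-A36) can be sketched with a single function, where the neighboring row is either the next row or the previous row. The function name, the list-of-coordinates inputs, and the choice to favor the current row on a tie are assumptions; the source does not address ties.

```python
def nearest_row(cell_top, current_row_tops, neighbor_row_tops):
    """Decide which row a remaining cell should be merged into.

    cell_top: top-edge coordinate of the remaining cell's text.
    current_row_tops / neighbor_row_tops: top-edge coordinates of all cells
    in the current row and in the neighboring (next or previous) row.
    Ties go to the current row (an assumption).
    """
    d_current = min(abs(cell_top - top) for top in current_row_tops)
    d_neighbor = min(abs(cell_top - top) for top in neighbor_row_tops)
    return "current" if d_current <= d_neighbor else "neighbor"


# A leftover fragment at top=112 sits closer to the current row (tops 100
# and 112) than to the next row (top 140).
choice = nearest_row(112, [100, 100, 112], [140, 140])
```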
It should be noted that above-mentioned steps A1-A2, A21-A22, A31-A36 be with the certain table of pdf document (such as Client supplier form) in identify form data line feed situation exemplified by illustrate, it will be understood by those skilled in the art that In other embodiments, all forms that above table information line feed recognition methods can also be directed to pdf document carry out line feed situation Identification, will not be repeated here.
Further, identify that cross-page situation comprises the following steps B1-B3 (methods in the form for specifying document from this One).
(B1) adjacent previous form and next form in document are specified for this, obtains the position of previous form word content Positional information, the label information of confidence breath, label information and next form word content.
Preferably, in the present embodiment, the positional information of the previous form word content includes, but not limited to previous Form is often composed a piece of writing the coordinate informations such as the left margin coordinate of word, upper edge coordinate, textwidth, text size, and previous form is every Left margin coordinate of row word etc..The label information of the previous form word content includes, but not limited to previous form and often gone Word is wide in the page number (sequence number for the page where word of often composing a piece of writing), page length, the page of the specified document (such as PDF document) Degree etc..
Further, the positional information of next form word content includes, but not limited to next form and often composed a piece of writing word Left margin coordinate, upper edge coordinate, textwidth, the coordinate information such as text size, and the left side of next form each column word Along coordinate etc..The label information of next form word content includes, but not limited to next form and often composes a piece of writing word in the finger Determine the page number (sequence number for the page where word of often composing a piece of writing), page length, the pagewidth of document (such as PDF document).
(B2) Compare the left-edge coordinate of each column of text in the next table with the left-edge coordinate of the corresponding column of text in the previous table. For example, compare the left-edge coordinate of column 1 of the next table with that of column 1 of the previous table, compare the left-edge coordinate of column 2 of the next table with that of column 2 of the previous table, and so on.
(B3) When the left-edge coordinate of every column of text in the next table is the same as that of the corresponding column in the previous table (indicating that the next table and the previous table are the same table), compare the page numbers of the lines of text in the next table with the page numbers of the lines of text in the previous table. For example, if the bottom of the first page contains the previous table, the top of the second page contains the next table, and the left-edge coordinates of all columns of the next table are the same as those of the corresponding columns of the previous table, the next table and the previous table are judged to be the same table.
If the page numbers of the lines of text in the next table differ from those of the lines of text in the previous table, the next table and the previous table are judged to be the same table with a cross-page case. If the page numbers of the lines of text in the next table are all the same as those of the lines of text in the previous table, the next table and the previous table are judged to be the same table without a cross-page case, that is, the same table located on the same page.
Preferably, in this embodiment, if the difference between the left-edge coordinate of each column of text in the next table and that of the corresponding column in the previous table is less than a preset threshold (for example, 2 pixel units), the left-edge coordinates of the columns of the next table are judged to be the same as those of the corresponding columns of the previous table.
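Steps B1-B3 and the threshold comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names, the use of plain lists of column left-edge coordinates, and the treatment of tables with different column counts as different tables are all hypothetical choices made for the example.

```python
def columns_match(prev_lefts, next_lefts, tol=2.0):
    """Step B2/B3: corresponding column left edges count as 'the same'
    when each pair differs by less than the preset threshold `tol`."""
    if len(prev_lefts) != len(next_lefts):
        # Assumption: a different column count means different tables.
        return False
    return all(abs(p - n) < tol for p, n in zip(prev_lefts, next_lefts))


def classify_pair(prev_lefts, prev_pages, next_lefts, next_pages, tol=2.0):
    """Classify two adjacent tables from column left edges and the page
    numbers of their lines of text."""
    if not columns_match(prev_lefts, next_lefts, tol):
        return "different tables"
    if set(prev_pages) != set(next_pages):
        return "same table, cross-page"
    return "same table, same page"


# Previous table at the bottom of page 1, next table at the top of page 2.
result = classify_pair([50.0, 200.0, 350.0], [1, 1],
                       [51.0, 199.5, 350.0], [2, 2])
print(result)
```

A tolerance is used because left edges extracted from a PDF rarely agree to the exact pixel, which is the motivation the text gives for the preset threshold.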
It should be noted that steps B1-B3 above (method one) are described using the identification of the cross-page case between two adjacent tables of a PDF document (the previous table and the next table) as an example. Those skilled in the art will understand that, in other embodiments, cross-page identification can also be performed on a specific table of the PDF document (for example, a financial table) (method two). Method two comprises the following steps B4-B5.
(B4) Locate a specific table in the specified document and obtain the position information and label information of the text content of the specific table. The position information of the text content of the specific table includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of each line of text in the specific table. The label information of the text content of the specific table includes, but is not limited to, the page number of each line of text of the specific table in the specified document (such as a PDF document), that is, the sequence number of the page on which the line is located, together with the page length and page width.
Specifically, the specific table in the specified document can be located by means of the specific rules of the specified document. For example, if the specified document is a PDF annual report, annual report publication has explicit format requirements, and the specific table can be identified according to annual report rules similar to the following.
For example, when major customers and suppliers are introduced, the table title may be set to "Information on major customers and major suppliers", so the table under this title is the specific customer and supplier table. The table that introduces particular content can therefore be located by the title keyword of the specific table, which facilitates subsequent parsing. Similarly, the other specific tables in a PDF annual report follow a similar format.
(B5) Read the lines of text of the specific table one at a time according to the position information (such as the top-edge coordinate) of the text content of the specific table (text with the same top-edge coordinate belongs to the same line), and obtain the page number of each line of text according to the label information of the text content of the specific table.
If the page numbers of the lines of text in the specific table differ, the specific table is judged to have a cross-page case (that is, the specific table is identified as consisting of a previous table and a next table located on different pages). If the page numbers of the lines of text in the specific table are all the same, the specific table is judged to have no cross-page case.
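Method two (steps B4-B5) reduces to checking whether the lines of one located table carry more than one page number. The sketch below assumes each line is represented as a hypothetical `(top, page, text)` tuple; the tuple layout and function names are illustrative only.

```python
def has_cross_page(lines):
    """Step B5: the table spans pages if its lines of text do not all
    share the same page number. `lines` holds (top, page, text) tuples."""
    return len({page for _, page, _ in lines}) > 1


def split_by_page(lines):
    """Split the table into its per-page parts (previous table, next
    table, ...) ordered by page number."""
    parts = {}
    for line in lines:
        parts.setdefault(line[1], []).append(line)
    return [parts[page] for page in sorted(parts)]


table = [
    (900.0, 1, "A Co. 1,000"),
    (930.0, 1, "B Co. 2,000"),
    (120.0, 2, "C Co. 3,000"),
]
previous_part, next_part = split_by_page(table)
```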
The storage module 203 is configured to, when a line-wrap case is identified in a table of the specified document, store the table information row by row (extract and store the table data by row) and column by column (extract and store the table data by column) according to a first remodeling rule, forming structured table data.
Preferably, in this embodiment, the first remodeling rule includes: storing text with the same top-edge coordinate as the same row (row-wise storage), and storing text with the same left-edge coordinate as the same column (column-wise storage).
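The first remodeling rule can be illustrated with a short sketch. It assumes each text fragment is a hypothetical `(left, top, text)` tuple and that the top-edge coordinate increases down the page; the function name and the use of an empty string for a missing cell are choices made for this example, not details from the source.

```python
from collections import defaultdict


def remodel(fragments):
    """Group fragments with the same top edge into one row and align
    fragments with the same left edge into one column."""
    rows = defaultdict(dict)
    for left, top, text in fragments:
        rows[top][left] = text
    column_lefts = sorted({left for left, _, _ in fragments})
    # Rows are emitted top to bottom, cells left to right.
    return [[rows[top].get(left, "") for left in column_lefts]
            for top in sorted(rows)]


fragments = [
    (50.0, 100.0, "Customer"), (200.0, 100.0, "Amount"),
    (50.0, 120.0, "A Co."), (200.0, 120.0, "1,000"),
]
structured = remodel(fragments)
```

The sketch matches coordinates exactly, as the rule is stated; real extraction output would probably need a small tolerance like the preset threshold used in cross-page detection.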
The storage module 203 is further configured to, when a cross-page case is identified in a table of the specified document, store the table information row by row and column by column according to a second remodeling rule, forming structured table data.
Preferably, in this embodiment, the second remodeling rule includes:
deleting the footer of the previous page, on which the previous table is located, and the header of the next page, on which the next table is located;
splicing the text content of the previous table after the footer is deleted with the text content of the next table after the header is deleted, to form a spliced table (a table on a single page); and
storing text with the same top-edge coordinate in the spliced table as the same row (row-wise storage), and storing text with the same left-edge coordinate in the spliced table as the same column (column-wise storage).
Specifically, deleting the footer of the previous page on which the previous table is located and the header of the next page on which the next table is located comprises: locating the footer range of the previous page and the header range of the next page in the specified document according to the label information of the previous page and the next page and the specific rules of the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range.
The label information of the previous page includes, but is not limited to, the page number, page length, and page width of the previous page; the label information of the next page includes, but is not limited to, the page number, page length, and page width of the next page. The specific rules of the specified document include, but are not limited to, a first proportion of the page length occupied by the footer of the previous page (for example, 8%) and a second proportion of the page length occupied by the header of the next page (for example, 9%). It will be understood that the first proportion and the second proportion may also be the same.
Further, the footer range of the previous page is determined as follows: working from bottom to top, select the portion of the previous page that corresponds to the first proportion of its page length, and take the selected portion as the footer range of the previous page. The header range of the next page is determined as follows: working from top to bottom, select the portion of the next page that corresponds to the second proportion of its page length, and take the selected portion as the header range of the next page.
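The footer and header ranges and the splicing step can be sketched as follows. The sketch assumes the top-edge coordinate is measured from the top of the page and increases downward, that both pages have the same length, and that each line is a hypothetical `(top, left, text)` tuple. Shifting the next page's coordinates by one page length is one possible way to form a single-page table and is an assumption of this example.

```python
def footer_start(page_length, ratio=0.08):
    """Top coordinate at which the footer range begins (bottom `ratio`
    of the page, selected from bottom to top)."""
    return page_length - page_length * ratio


def header_end(page_length, ratio=0.09):
    """Top coordinate at which the header range ends (top `ratio` of
    the page, selected from top to bottom)."""
    return page_length * ratio


def splice(prev_lines, next_lines, page_length,
           footer_ratio=0.08, header_ratio=0.09):
    """Drop footer lines of the previous page and header lines of the
    next page, then append the next page below the previous one."""
    f_start = footer_start(page_length, footer_ratio)
    h_end = header_end(page_length, header_ratio)
    kept_prev = [ln for ln in prev_lines if ln[0] < f_start]
    kept_next = [(top + page_length, left, text)
                 for top, left, text in next_lines if top >= h_end]
    return kept_prev + kept_next


prev_page = [(900.0, 50.0, "B Co."), (950.0, 50.0, "Page 1 of 2")]
next_page = [(50.0, 50.0, "Annual Report"), (120.0, 50.0, "C Co.")]
spliced = splice(prev_page, next_page, page_length=1000.0)
```

After splicing, the first remodeling step (same top edge to the same row, same left edge to the same column) can be applied to the combined lines.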
Preferably, in other embodiments, the table information structured extraction system 20 is further configured to perform horizontal comparative analysis and longitudinal comparative analysis on the stored structured table data.
The horizontal comparative analysis includes: comparing the structured table data (such as accounts receivable) of different companies in the same industry within the same time range (for example, the same year), in order to analyze operating information such as the debt situation and financial condition of the different companies. The longitudinal comparative analysis includes: comparing the structured table data (such as accounts receivable) of the same company over different time ranges (for example, the last 3 years), in order to analyze operating information such as the debt situation and financial condition of that company (for example, changes in accounts receivable).
Through the program modules 201-203 described above, the table information structured extraction system 20 proposed by the present invention analyzes the position information and label information of table text content in a specified document (such as a PDF document), can identify line-wrap cases and cross-page cases in a table (such as a PDF annual report table), and extracts and stores structured data from tables in which line-wrap and cross-page cases occur. The method does not need to convert the PDF document into a structured document such as Word or Excel, extracts data efficiently, achieves relatively high recall and accuracy on large-scale data sets, and facilitates subsequent horizontal comparative analysis, longitudinal comparative analysis, and data modeling.
In addition, the present invention also proposes a table information structured extraction method.
Refer to Fig. 3, which is a flowchart of an embodiment of the table information structured extraction method of the present invention. In this embodiment, depending on requirements, the execution order of the steps in the flowchart shown in Fig. 3 may be changed, and some steps may be omitted.
Step S31: obtain the position information and label information of each line of text in a specified document (such as a PDF document). In this embodiment, a specific text recognition tool (such as the pdf2html tool) may be used to obtain the position information and label information of each line of text in the specified document. The tool can parse the PDF document into a text file (such as an XML file) and, at the same time, parse out the position information and label information of each line of text in the PDF document.
Preferably, in this embodiment, the position information of each line of text includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of the line. Each row of a table in the specified document is stored in adjacent positions, that is, stored sequentially according to the position information (such as the left-edge coordinate) of each line of text. Further, the label information of each line of text includes, but is not limited to, the page number of the line in the specified document (such as a PDF document), that is, the sequence number of the page on which the line is located, together with the page length and page width.
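To make step S31 concrete, the sketch below reads position and label information from an XML string. The XML layout (`page` elements with `number`, `height`, and `width` attributes, and `text` elements with `top`, `left`, `width`, and `height` attributes) is an assumption modeled on the XML output of common PDF-to-HTML converters; the source does not specify the format its tool produces, and the sample data and the `parse_lines` name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical converter output; the real file layout may differ.
SAMPLE = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="50" width="80" height="12">Customer</text>
    <text top="100" left="200" width="60" height="12">Amount</text>
  </page>
  <page number="2" height="1188" width="918">
    <text top="60" left="50" width="70" height="12">A Co.</text>
  </page>
</pdf2xml>"""


def parse_lines(xml_text):
    """Return one dict per text fragment with its position information
    (left, top, width, height) and label information (page number,
    page length, page width)."""
    root = ET.fromstring(xml_text)
    lines = []
    for page in root.iter("page"):
        label = {
            "page": int(page.get("number")),
            "page_length": int(page.get("height")),
            "page_width": int(page.get("width")),
        }
        for node in page.iter("text"):
            position = {key: int(node.get(key))
                        for key in ("left", "top", "width", "height")}
            lines.append({"text": "".join(node.itertext()),
                          **position, **label})
    return lines


lines = parse_lines(SAMPLE)
```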
Step S32: according to the position information and label information of each line of text, identify line-wrap cases and cross-page cases in the tables of the specified document.
Specifically, identifying the line-wrap case in a table of the specified document comprises the following steps A1-A2.
(A1) Locate a specific table in the specified document and obtain the position information of the specific table, such as its left-edge coordinate, table width (table height), and table length. In this embodiment, the table in the specified document can be located by means of the specific rules of the specified document. For example, if the specified document is a PDF annual report, annual report publication has explicit format requirements, and the specific table can be identified according to annual report rules similar to the following:
For example, when major customers and suppliers are introduced, the table title may be set to "Information on major customers and major suppliers", so the table under this title is the specific customer and supplier table. The table that introduces particular content can therefore be located by the title keyword of the specific table, which facilitates subsequent parsing. Similarly, the other tables in a PDF annual report follow a similar format.
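Locating a table by its title keyword, as described in step A1, can be sketched as a simple search over the extracted lines of text. The keyword string, the sample lines, and the `find_table_title` name are hypothetical; how the end of the table is determined is not stated in the source and is left out of the sketch.

```python
def find_table_title(text_lines, keyword):
    """Return the index of the first line containing the title keyword,
    or None if no line matches. The table is assumed to start on the
    lines that follow the title."""
    for index, text in enumerate(text_lines):
        if keyword in text:
            return index
    return None


document_lines = [
    "Section 4: Discussion of operations",
    "Information on major customers and major suppliers",
    "Customer  Amount",
    "A Co.  1,000",
]
title_index = find_table_title(document_lines, "major customers")
```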
(A2) Read the lines of text in the specific table one at a time according to the position information of the specific table, and identify in-row wrapped cells in each line of text according to the position information of that line. In this embodiment, reading may start from the left-edge coordinate of the specific table, read the first row according to the table length of the specific table, and continue according to the table width of the specific table until the last row of the specific table has been read.
Preferably, in this embodiment, each line of text of the specific table comprises multiple cells, such as a 1st cell, a 2nd cell, a 3rd cell, and a 4th cell. More specifically, the line-wrap case includes in-row wrap and end-of-row wrap. In-row wrap means that a line wrap exists in an interior cell of a row of the specific table. End-of-row wrap means that a line wrap exists in the tail cell of a row of the specific table.
Preferably, in this embodiment, identifying in-row wrapped cells in a line of text according to the position information of that line comprises the following steps A21-A22.
(A21) Obtain the text content position information of each cell in the line of text, where the text content position information of each cell includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of the text content of the cell.
(A22) Among the cells whose text content has the same top-edge coordinate, obtain the position of the cell that occurs first and the position of the cell that occurs last. Determine all cells from the first-occurring cell position to the last-occurring cell position to be in the same row, and judge any cell between these two positions whose text content has a different top-edge coordinate to be an in-row wrapped cell.
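Step A22 can be sketched as follows, assuming the cells of one row are given in reading order as hypothetical `(top, text)` tuples and that the row's reference top-edge coordinate is taken from its first cell. Both the representation and that choice of reference are assumptions of this example.

```python
def find_in_row_wraps(cells):
    """Return (first, last, wrapped, remaining) index information.

    first/last: first and last positions sharing the row's top edge.
    wrapped:    positions between them with a different top edge
                (in-row wrapped cells).
    remaining:  positions after `last`, left for end-of-row handling."""
    row_top = cells[0][0]
    same_top = [i for i, (top, _) in enumerate(cells) if top == row_top]
    first, last = same_top[0], same_top[-1]
    wrapped = [i for i in range(first, last + 1) if cells[i][0] != row_top]
    remaining = list(range(last + 1, len(cells)))
    return first, last, wrapped, remaining


cells = [
    (100.0, "A Co."),
    (100.0, "Long product"),
    (112.0, "name"),       # wrapped continuation inside the row
    (100.0, "1,000"),
    (112.0, "(note)"),     # left over after the last same-top cell
]
first, last, wrapped, remaining = find_in_row_wraps(cells)
```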
Preferably, in other embodiments, the table line-wrap identification method further comprises the step: (A3) identify end-of-row wrapped cells in the current line of text according to the position information of the current line of text.
Specifically, identifying end-of-row wrapped cells in the current line of text according to the position information of the current line of text comprises the following steps A31-A33.
(A31) If a remaining cell exists in the current line of text after in-row wrap identification, obtain the text content position information of the remaining cell. The text content position information of the remaining cell includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of the text content of the remaining cell.
(A32) Calculate the distance between the top-edge coordinate of the text content of the remaining cell and the top-edge coordinates of the text content of all cells in the current row (for example, the first row) and in the next row (for example, the second row).
(A33) If the minimum distance occurs in the current row, merge the text content of the remaining cell into the current row, and judge the remaining cell to be an end-of-row wrapped cell of the current row.
Further, if the minimum distance occurs in the next row, merge the text content of the remaining cell into the next row, and judge the remaining cell to be an end-of-row wrapped cell of the next row.
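Steps A31-A33 amount to a nearest-row assignment on top-edge coordinates. A minimal sketch, assuming rows are represented by hypothetical lists of cell top-edge coordinates and that a tie is resolved in favor of the current row (the source does not say how ties are handled):

```python
def assign_remaining_cell(cell_top, current_row_tops, next_row_tops):
    """Return 'current' or 'next' depending on which row contains the
    cell whose top edge is closest to the remaining cell's top edge."""
    d_current = min(abs(cell_top - top) for top in current_row_tops)
    d_next = min(abs(cell_top - top) for top in next_row_tops)
    # Assumption: on a tie the cell stays with the current row.
    return "current" if d_current <= d_next else "next"


# Remaining cell at top=112: 12 units from the current row, 18 from the next.
owner = assign_remaining_cell(112.0, [100.0, 100.0], [130.0, 130.0])
```

The variant in steps A34-A36 follows the same pattern with the previous row in place of the next row.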
It should be noted that, in other embodiments, identifying end-of-row wrapped cells in the current line of text according to the position information of the current line of text may instead comprise the following steps A34-A36.
(A34) If a remaining cell exists in the current line of text after in-row wrap identification, obtain the text content position information of the remaining cell. The text content position information of the remaining cell includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of the text content of the remaining cell.
(A35) Calculate the distance between the top-edge coordinate of the text content of the remaining cell and the top-edge coordinates of the text content of all cells in the current row (for example, the second row) and in the previous row (for example, the first row).
(A36) If the minimum distance occurs in the current row, merge the text content of the remaining cell into the current row, and judge the remaining cell to be an end-of-row wrapped cell of the current row.
Further, if the minimum distance occurs in the previous row, merge the text content of the remaining cell into the previous row, and judge the remaining cell to be an end-of-row wrapped cell of the previous row.
It should be noted that steps A1-A2, A21-A22, and A31-A36 above are described using the identification of line-wrap cases in a specific table of a PDF document (for example, a customer and supplier table) as an example. Those skilled in the art will understand that, in other embodiments, the table line-wrap identification method can also be applied to all tables of the PDF document; the details are not repeated here.
Further, identifying the cross-page case in a table of the specified document comprises the following steps B1-B3 (method one).
(B1) For an adjacent previous table and next table in the specified document, obtain the position information and label information of the text content of the previous table, and the position information and label information of the text content of the next table.
Preferably, in this embodiment, the position information of the text content of the previous table includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of each line of text in the previous table, as well as the left-edge coordinate of each column of text in the previous table. The label information of the text content of the previous table includes, but is not limited to, the page number of each line of text of the previous table in the specified document (such as a PDF document), that is, the sequence number of the page on which the line is located, together with the page length and page width.
Further, the position information of the text content of the next table includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of each line of text in the next table, as well as the left-edge coordinate of each column of text in the next table. The label information of the text content of the next table includes, but is not limited to, the page number of each line of text of the next table in the specified document (such as a PDF document), that is, the sequence number of the page on which the line is located, together with the page length and page width.
(B2) Compare the left-edge coordinate of each column of text in the next table with the left-edge coordinate of the corresponding column of text in the previous table. For example, compare the left-edge coordinate of column 1 of the next table with that of column 1 of the previous table, compare the left-edge coordinate of column 2 of the next table with that of column 2 of the previous table, and so on.
(B3) When the left-edge coordinate of every column of text in the next table is the same as that of the corresponding column in the previous table (indicating that the next table and the previous table are the same table), compare the page numbers of the lines of text in the next table with the page numbers of the lines of text in the previous table. For example, if the bottom of the first page contains the previous table, the top of the second page contains the next table, and the left-edge coordinates of all columns of the next table are the same as those of the corresponding columns of the previous table, the next table and the previous table are judged to be the same table.
If the page numbers of the lines of text in the next table differ from those of the lines of text in the previous table, the next table and the previous table are judged to be the same table with a cross-page case. If the page numbers of the lines of text in the next table are all the same as those of the lines of text in the previous table, the next table and the previous table are judged to be the same table without a cross-page case, that is, the same table located on the same page.
Preferably, in this embodiment, if the difference between the left-edge coordinate of each column of text in the next table and that of the corresponding column in the previous table is less than a preset threshold (for example, 2 pixel units), the left-edge coordinates of the columns of the next table are judged to be the same as those of the corresponding columns of the previous table.
It should be noted that steps B1-B3 above (method one) are described using the identification of the cross-page case between two adjacent tables of a PDF document (the previous table and the next table) as an example. Those skilled in the art will understand that, in other embodiments, the table cross-page identification method can also be applied to a specific table of the PDF document (for example, a financial table) (method two). Method two comprises the following steps B4-B5.
(B4) Locate a specific table in the specified document and obtain the position information and label information of the text content of the specific table. The position information of the text content of the specific table includes, but is not limited to, coordinate information such as the left-edge coordinate, top-edge coordinate, text width, and text length of each line of text in the specific table. The label information of the text content of the specific table includes, but is not limited to, the page number of each line of text of the specific table in the specified document (such as a PDF document), that is, the sequence number of the page on which the line is located, together with the page length and page width.
Specifically, the specific table in the specified document can be located by means of the specific rules of the specified document. For example, if the specified document is a PDF annual report, annual report publication has explicit format requirements, and the specific table can be identified according to annual report rules similar to the following.
For example, when major customers and suppliers are introduced, the table title may be set to "Information on major customers and major suppliers", so the table under this title is the specific customer and supplier table. The table that introduces particular content can therefore be located by the title keyword of the specific table, which facilitates subsequent parsing. Similarly, the other specific tables in a PDF annual report follow a similar format.
(B5) Read the lines of text of the specific table one at a time according to the position information (such as the top-edge coordinate) of the text content of the specific table (text with the same top-edge coordinate belongs to the same line), and obtain the page number of each line of text according to the label information of the text content of the specific table.
If the page numbers of the lines of text in the specific table differ, the specific table is judged to have a cross-page case (that is, the specific table is identified as consisting of a previous table and a next table located on different pages). If the page numbers of the lines of text in the specific table are all the same, the specific table is judged to have no cross-page case.
Step S33: when a line-wrap case is identified in a table of the specified document, store the table information row by row (extract and store the table data by row) and column by column (extract and store the table data by column) according to a first remodeling rule, forming structured table data.
Preferably, in this embodiment, the first remodeling rule includes: storing text with the same top-edge coordinate as the same row (row-wise storage), and storing text with the same left-edge coordinate as the same column (column-wise storage).
Step S34: when a cross-page case is identified in a table of the specified document, store the table information row by row and column by column according to a second remodeling rule, forming structured table data.
Preferably, in this embodiment, the second remodeling rule includes:
deleting the footer of the previous page, on which the previous table is located, and the header of the next page, on which the next table is located;
splicing the text content of the previous table after the footer is deleted with the text content of the next table after the header is deleted, to form a spliced table (a table on a single page); and
storing text with the same top-edge coordinate in the spliced table as the same row (row-wise storage), and storing text with the same left-edge coordinate in the spliced table as the same column (column-wise storage).
Specifically, deleting the footer of the previous page on which the previous table is located and the header of the next page on which the next table is located comprises: locating the footer range of the previous page and the header range of the next page in the specified document according to the label information of the previous page and the next page and the specific rules of the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range.
The label information of the previous page includes, but is not limited to, the page number, page length, and page width of the previous page; the label information of the next page includes, but is not limited to, the page number, page length, and page width of the next page. The specific rules of the specified document include, but are not limited to, a first proportion of the page length occupied by the footer of the previous page (for example, 8%) and a second proportion of the page length occupied by the header of the next page (for example, 9%). It will be understood that the first proportion and the second proportion may also be the same.
Further, the footer range of the previous page is determined as follows: working from bottom to top, select the portion of the previous page that corresponds to the first proportion of its page length, and take the selected portion as the footer range of the previous page. The header range of the next page is determined as follows: working from top to bottom, select the portion of the next page that corresponds to the second proportion of its page length, and take the selected portion as the header range of the next page.
Preferably, in other embodiments, the table information structured extraction method further comprises the step of performing horizontal comparative analysis and longitudinal comparative analysis on the stored structured table data.
The horizontal comparative analysis includes: comparing the structured table data (such as accounts receivable) of different companies in the same industry within the same time range (for example, the same year), in order to analyze operating information such as the debt situation and financial condition of the different companies. The longitudinal comparative analysis includes: comparing the structured table data (such as accounts receivable) of the same company over different time ranges (for example, the last 3 years), in order to analyze operating information such as the debt situation and financial condition of that company (for example, changes in accounts receivable).
Through steps S31-S34 above and the other related steps, the table information structured extraction method proposed by the present invention analyzes the position information and label information of table text content in a specified document (such as a PDF document), can identify line-wrap cases and cross-page cases in a table (such as a PDF annual report table), and extracts and stores structured data from tables in which line-wrap and cross-page cases occur. The method does not need to convert the PDF document into a structured document such as Word or Excel, extracts data efficiently, achieves relatively high recall and accuracy on large-scale data sets, and facilitates subsequent horizontal comparative analysis, longitudinal comparative analysis, and data modeling.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium (such as a ROM/RAM, magnetic disk, or optical disc). The computer-readable storage medium stores the table information structured extraction system 20, which can be executed by at least one processor 22 so that the at least one processor 22 performs the steps of the table information structured extraction method described above.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to perform the methods described in the embodiments of the present invention.
Above by reference to the preferred embodiments of the present invention have been illustrated, not thereby limit to the interest field of the present invention.On State that sequence number of the embodiment of the present invention is for illustration only, do not represent the quality of embodiment.Patrolled in addition, though showing in flow charts Order is collected, but in some cases, can be with the step shown or described by being performed different from order herein.
Those skilled in the art can implement the present invention in many variant ways without departing from its scope and essence; for example, a feature of one embodiment can be used in another embodiment to yield a further embodiment. Any equivalent structure or equivalent process transformation made using the description and drawings of the present invention, or any direct or indirect application in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims (10)

1. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing a table data structuring extraction system executable on the processor, and the table data structuring extraction system, when executed by the processor, implementing the following steps:
Obtaining the position information and label information of each line of text in a specified document;
Identifying, according to the position information and label information of each line of text, line-wrap situations and cross-page situations in the tables of the specified document;
When a line-wrap situation is identified in a table of the specified document, storing the table data by row and by column according to a first restructuring rule; and
When a cross-page situation is identified in a table of the specified document, storing the table data by row and by column according to a second restructuring rule.
2. The electronic device of claim 1, characterized in that the first restructuring rule comprises: storing text with identical top-edge coordinates as the same row, and storing text with identical left-edge coordinates as the same column;
The second restructuring rule comprises:
Deleting the footer of the previous page, on which the previous table is located, and the header of the next page, on which the next table is located;
Splicing the text content of the previous table after footer deletion with the text content of the next table after header deletion to form a spliced table; and
Storing text with identical top-edge coordinates in the spliced table as the same row, and storing text with identical left-edge coordinates in the spliced table as the same column.
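The first restructuring rule can be illustrated with a minimal sketch. It is not the patented implementation: the `Word` type, the `restructure` function, and the convention that the top-edge coordinate grows downward on the page are all assumptions introduced for illustration.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical record for one piece of table text."""
    text: str
    top: float   # top-edge coordinate (assumed to increase downward)
    left: float  # left-edge coordinate


def restructure(words):
    """Place words with the same top in one row and the same left in one column."""
    tops = sorted({w.top for w in words})
    lefts = sorted({w.left for w in words})
    row_of = {t: i for i, t in enumerate(tops)}
    col_of = {x: j for j, x in enumerate(lefts)}
    # Cells with no matching word stay empty strings.
    grid = [[""] * len(lefts) for _ in tops]
    for w in words:
        grid[row_of[w.top]][col_of[w.left]] = w.text
    return grid
```

The sketch compares coordinates for exact equality, as the rule is worded; real PDF coordinates may need a small tolerance, which is outside what the claim states.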
3. The electronic device of claim 2, characterized in that deleting the footer of the previous page on which the previous table is located and the header of the next page on which the next table is located comprises:
Locating the footer range of the previous page and the header range of the next page in the specified document according to the label information of the previous page and the next page and the document-specific rules of the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range;
Wherein the rule for determining the footer range of the previous page is: working from bottom to top, select the content within a first ratio of the page length of the previous page, and take the selected content as the footer range of the previous page; and
The rule for determining the header range of the next page is: working from top to bottom, select the content within a second ratio of the page length of the next page, and take the selected content as the header range of the next page.
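The ratio-based header and footer ranges can be sketched as follows. This covers only the ratio rule, not the label information or document-specific rules. The function names, the dictionary layout of a line, and the assumption that `top` is measured downward from the top of the page are hypothetical.

```python
def strip_footer(lines, page_length, first_ratio):
    """Drop lines that fall in the bottom `first_ratio` fraction of the page."""
    cutoff = page_length * (1 - first_ratio)
    return [ln for ln in lines if ln["top"] < cutoff]


def strip_header(lines, page_length, second_ratio):
    """Drop lines that fall in the top `second_ratio` fraction of the page."""
    cutoff = page_length * second_ratio
    return [ln for ln in lines if ln["top"] >= cutoff]
```

For example, with a page length of 1000 and a ratio of 0.1, `strip_footer` discards anything whose top edge lies at 900 or below that point on the page, and `strip_header` discards anything above 100.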
4. The electronic device of claim 2, characterized in that the line-wrap situation comprises in-row line wraps and end-of-row line wraps;
Wherein identifying an in-row line wrap comprises:
Obtaining the text position information of each cell in the line of text, wherein the text position information of each cell includes the top-edge coordinate of the cell's text; and
Among the cells whose text has the same top-edge coordinate, obtaining the position of the first such cell and the position of the last such cell, determining all cells from the first position through the last position to be in the same row, and judging any cell between the first and last positions whose text has a different top-edge coordinate to be an in-row line-wrap cell.
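A minimal sketch of in-row line-wrap identification, under stated assumptions: `cells` is a non-empty list in reading order, each cell is a dictionary with a `top` key, and the first cell's top-edge coordinate is taken as the reference for the row. None of these names or choices come from the source.

```python
def split_row(cells):
    """Return (row, wrapped, remaining) for the row that starts at cells[0]."""
    ref_top = cells[0]["top"]
    # Positions of every cell sharing the reference top-edge coordinate.
    same = [i for i, c in enumerate(cells) if c["top"] == ref_top]
    first, last = same[0], same[-1]
    row = cells[first:last + 1]                        # all cells between first and last
    wrapped = [c for c in row if c["top"] != ref_top]  # in-row line-wrap cells
    remaining = cells[last + 1:]                       # left over for later steps
    return row, wrapped, remaining
```

Cells returned in `remaining` are the candidates for the end-of-row check described in the next claim.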
5. The electronic device of claim 4, characterized in that identifying an end-of-row line wrap comprises:
If a remaining cell exists in the current line of text after in-row line-wrap identification, obtaining the text position information of the remaining cell, wherein that information includes the top-edge coordinate of the remaining cell's text;
Calculating the distances between the top-edge coordinate of the remaining cell's text and the top-edge coordinates of the text of all cells in the current row and the next row, or calculating the distances between the top-edge coordinate of the remaining cell's text and the top-edge coordinates of the text of all cells in the current row and the previous row; and
If the minimum distance occurs in the current row, merging the text of the remaining cell into the current row and judging the remaining cell to be an end-of-row line-wrap cell of the current row.
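The nearest-distance test for an end-of-row line wrap can be sketched as below. The function name and argument layout are hypothetical, and treating a tie as belonging to the current row is an assumption the claim does not settle.

```python
def is_end_of_row_wrap(remaining_top, current_row_tops, neighbor_row_tops):
    """True if the remaining cell is closest to a cell of the current row.

    `neighbor_row_tops` holds the top-edge coordinates of either the next
    row or the previous row, matching the two alternatives in the claim.
    Both lists are assumed to be non-empty.
    """
    d_current = min(abs(remaining_top - t) for t in current_row_tops)
    d_neighbor = min(abs(remaining_top - t) for t in neighbor_row_tops)
    return d_current <= d_neighbor  # tie goes to the current row (assumption)
```

When the function returns true, the remaining cell's text would be merged into the current row; otherwise it is left for the neighboring row.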
6. A table data structured extraction method, applied to an electronic device, characterized in that the method comprises:
Obtaining the position information and label information of each line of text in a specified document;
Identifying, according to the position information and label information of each line of text, line-wrap situations and cross-page situations in the tables of the specified document;
When a line-wrap situation is identified in a table of the specified document, storing the table data by row and by column according to a first restructuring rule; and
When a cross-page situation is identified in a table of the specified document, storing the table data by row and by column according to a second restructuring rule.
7. The table data structured extraction method of claim 6, characterized in that the first restructuring rule comprises: storing text with identical top-edge coordinates as the same row, and storing text with identical left-edge coordinates as the same column;
The second restructuring rule comprises:
Deleting the footer of the previous page, on which the previous table is located, and the header of the next page, on which the next table is located;
Splicing the text content of the previous table after footer deletion with the text content of the next table after header deletion to form a spliced table; and
Storing text with identical top-edge coordinates in the spliced table as the same row, and storing text with identical left-edge coordinates in the spliced table as the same column.
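The splicing step of the second restructuring rule can be sketched as follows, assuming headers and footers have already been removed. One point is an assumption beyond the claim text: because top-edge coordinates restart on each page, the sketch keys rows by a (page order, top) pair so that rows from the two pages cannot collide. The function name and dictionary layout are hypothetical.

```python
def splice_and_restructure(prev_words, next_words):
    """Splice two table fragments, then group by row and by column."""
    tagged = [(0, w) for w in prev_words] + [(1, w) for w in next_words]
    # Row key: (fragment order, top); column key: left-edge coordinate.
    row_keys = sorted({(p, w["top"]) for p, w in tagged})
    col_keys = sorted({w["left"] for _, w in tagged})
    row_of = {k: i for i, k in enumerate(row_keys)}
    col_of = {k: j for j, k in enumerate(col_keys)}
    grid = [[""] * len(col_keys) for _ in row_keys]
    for p, w in tagged:
        grid[row_of[(p, w["top"])]][col_of[w["left"]]] = w["text"]
    return grid
```

An alternative with the same effect would be to add the previous page's length to every top-edge coordinate of the next fragment before grouping.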
8. The table data structured extraction method of claim 7, characterized in that deleting the footer of the previous page on which the previous table is located and the header of the next page on which the next table is located comprises:
Locating the footer range of the previous page and the header range of the next page in the specified document according to the label information of the previous page and the next page and the document-specific rules of the specified document, and deleting the footer of the previous page and the header of the next page according to the located footer range and header range;
Wherein the rule for determining the footer range of the previous page is: working from bottom to top, select the content within a first ratio of the page length of the previous page, and take the selected content as the footer range of the previous page; and
The rule for determining the header range of the next page is: working from top to bottom, select the content within a second ratio of the page length of the next page, and take the selected content as the header range of the next page.
9. The table data structured extraction method of claim 7, characterized in that identifying the cross-page situation comprises:
For an adjacent previous table and next table in the specified document, obtaining the position information and label information of the text of the previous table and the position information and label information of the text of the next table;
Comparing the left-edge coordinate of the text in each column of the next table with the left-edge coordinate of the text in the corresponding column of the previous table;
When the left-edge coordinates of the text in every column of the next table are identical to those of the corresponding columns of the previous table, comparing the page number of each line of text in the next table with the page number of each line of text in the previous table; and
If the page numbers of the lines of text in the next table differ from those in the previous table, judging the next table and the previous table to be one and the same table with a cross-page situation.
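The cross-page check can be sketched as a two-stage comparison. The dictionary keys `column_lefts` and `pages` and the function name are hypothetical, and reading "the page numbers differ" as a difference between the two sets of page numbers is an interpretation rather than something the claim spells out.

```python
def is_cross_page(prev_table, next_table):
    """True if two adjacent table fragments look like one table split across pages."""
    # Stage 1: every column must start at the same left-edge coordinate.
    if prev_table["column_lefts"] != next_table["column_lefts"]:
        return False
    # Stage 2: the fragments must not sit on exactly the same pages.
    return set(prev_table["pages"]) != set(next_table["pages"])
```

A pair that passes both stages would then be handled by the second restructuring rule.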
10. A computer-readable storage medium storing a table data structuring extraction system, the table data structuring extraction system being executable by at least one processor so that the at least one processor performs the steps of the table data structured extraction method of any one of claims 6-9.
CN201710962303.5A 2017-10-16 2017-10-16 Form data structuring extracting method, electronic equipment and computer-readable recording medium Pending CN107818075A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710962303.5A CN107818075A (en) 2017-10-16 2017-10-16 Form data structuring extracting method, electronic equipment and computer-readable recording medium
PCT/CN2018/076167 WO2019075969A1 (en) 2017-10-16 2018-02-10 Method for extracting form information in a structured manner, electronic device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710962303.5A CN107818075A (en) 2017-10-16 2017-10-16 Form data structuring extracting method, electronic equipment and computer-readable recording medium

Publications (1)

Publication Number Publication Date
CN107818075A true CN107818075A (en) 2018-03-20

Family

ID=61608392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710962303.5A Pending CN107818075A (en) 2017-10-16 2017-10-16 Form data structuring extracting method, electronic equipment and computer-readable recording medium

Country Status (2)

Country Link
CN (1) CN107818075A (en)
WO (1) WO2019075969A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11436249B1 (en) 2021-03-26 2022-09-06 International Business Machines Corporation Transformation of composite tables into structured database content

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102508826A (en) * 2011-11-03 2012-06-20 汉王科技股份有限公司 Method and device for displaying table in document
CN102722475A (en) * 2012-05-09 2012-10-10 深圳市万兴软件有限公司 Method for converting form in portable document format (PDF) document into Excel form
CN102855232A (en) * 2012-09-14 2013-01-02 同方光盘股份有限公司 Table analysis and edit processing method
CN106951400A (en) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 The information extraction method and device of a kind of pdf document

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090282009A1 (en) * 2008-05-09 2009-11-12 Tags Ltd System, method, and program product for automated grading
US20120265759A1 (en) * 2011-04-15 2012-10-18 Xerox Corporation File processing of native file formats
CN104268127B (en) * 2014-09-22 2018-02-09 同方知网(北京)技术有限公司 A kind of method of electronics shelves layout files reading order analysis

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062874A (en) * 2018-06-12 2018-12-21 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of financial data
CN109062874B (en) * 2018-06-12 2022-03-04 平安科技(深圳)有限公司 Financial data acquisition method, terminal device and medium
WO2019237540A1 (en) * 2018-06-12 2019-12-19 平安科技(深圳)有限公司 Method and device for acquiring financial data, terminal device, and medium
CN109002425A (en) * 2018-06-19 2018-12-14 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of enterprise's upstream-downstream relationship
CN109002425B (en) * 2018-06-19 2022-03-22 平安科技(深圳)有限公司 Method for acquiring upstream and downstream relations of enterprise, terminal device and medium
CN109542898A (en) * 2018-10-30 2019-03-29 天津字节跳动科技有限公司 Date storage method, device, electronic equipment and the storage medium of data bank table
US11487935B2 (en) 2018-11-28 2022-11-01 Tencent Technology (Shenzhen) Company Limited Method and apparatus for automatically splitting table content into columns, computer device, and storage medium
CN109522538B (en) * 2018-11-28 2021-10-29 腾讯科技(深圳)有限公司 Automatic listing method, device, equipment and storage medium for table contents
WO2020108257A1 (en) * 2018-11-28 2020-06-04 腾讯科技(深圳)有限公司 Method and device for automatically splitting table content into columns, computer apparatus, and storage medium
CN109522538A (en) * 2018-11-28 2019-03-26 腾讯科技(深圳)有限公司 Table content divides column method, apparatus, equipment and storage medium automatically
CN109871524A (en) * 2019-02-21 2019-06-11 腾讯科技(深圳)有限公司 A kind of chart generation method and device
CN110032718B (en) * 2019-04-12 2023-04-18 广州广燃设计有限公司 Table conversion method, system and storage medium
CN110032718A (en) * 2019-04-12 2019-07-19 广州广燃设计有限公司 A kind of table conversion method, system and storage medium
CN110489424A (en) * 2019-08-26 2019-11-22 北京香侬慧语科技有限责任公司 A kind of method, apparatus, storage medium and the electronic equipment of tabular information extraction
CN110489423B (en) * 2019-08-26 2021-10-08 北京香侬慧语科技有限责任公司 Information extraction method and device, storage medium and electronic equipment
CN110489423A (en) * 2019-08-26 2019-11-22 北京香侬慧语科技有限责任公司 A kind of method, apparatus of information extraction, storage medium and electronic equipment
CN110909123A (en) * 2019-10-23 2020-03-24 深圳价值在线信息科技股份有限公司 Data extraction method and device, terminal equipment and storage medium
CN110909123B (en) * 2019-10-23 2023-08-25 深圳价值在线信息科技股份有限公司 Data extraction method and device, terminal equipment and storage medium
CN112287660A (en) * 2019-12-04 2021-01-29 上海柯林布瑞信息技术有限公司 Method and device for analyzing table in PDF file, computing equipment and storage medium
CN112380825A (en) * 2020-11-17 2021-02-19 平安科技(深圳)有限公司 PDF document page-crossing table merging method and device, electronic equipment and storage medium
WO2022105172A1 (en) * 2020-11-17 2022-05-27 平安科技(深圳)有限公司 Pdf document cross-page table merging method and apparatus, electronic device and storage medium
CN112380825B (en) * 2020-11-17 2022-07-15 平安科技(深圳)有限公司 PDF document cross-page table merging method and device, electronic equipment and storage medium
CN112464626A (en) * 2020-12-09 2021-03-09 上海携宁计算机科技股份有限公司 Graph extraction method of PDF (Portable document Format) document, electronic equipment and storage medium
CN112464626B (en) * 2020-12-09 2022-04-01 上海携宁计算机科技股份有限公司 Graph extraction method of PDF (Portable document Format) document, electronic equipment and storage medium
CN112651331A (en) * 2020-12-24 2021-04-13 万兴科技集团股份有限公司 Text table extraction method, system, computer device and storage medium
CN112651331B (en) * 2020-12-24 2024-04-16 万兴科技集团股份有限公司 Text form extraction method, system, computer device and storage medium
CN112632927A (en) * 2020-12-30 2021-04-09 上海犀语科技有限公司 Table fragment link restoration method and system based on semantic processing
CN113111864A (en) * 2021-05-13 2021-07-13 上海巽联信息科技有限公司 Intelligent table extraction algorithm based on multiple modes
CN113361257A (en) * 2021-06-29 2021-09-07 深圳壹账通智能科技有限公司 PDF document analysis method, system, electronic device and storage medium
CN113869014A (en) * 2021-08-25 2021-12-31 盐城金堤科技有限公司 Extraction method and device of table data, storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2019075969A1 (en) 2019-04-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination