CN106709032A - Method and device for extracting structured information from spreadsheet document - Google Patents

Info

Publication number
CN106709032A
Authority
CN
China
Prior art keywords
row
cell
business
electronic form
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611245472.9A
Other languages
Chinese (zh)
Other versions
CN106709032B (en)
Inventor
张军
贾西贝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huaao Data Technology Co Ltd
Original Assignee
Shenzhen Huaao Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huaao Data Technology Co Ltd filed Critical Shenzhen Huaao Data Technology Co Ltd
Priority to CN201611245472.9A priority Critical patent/CN106709032B/en
Publication of CN106709032A publication Critical patent/CN106709032A/en
Application granted granted Critical
Publication of CN106709032B publication Critical patent/CN106709032B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database

Abstract

The invention relates to the technical field of data processing, and in particular to a method and device for extracting structured information from a spreadsheet document. The method comprises: obtaining all business tables in the spreadsheet document by means of an isolated table recognition algorithm; performing layout analysis on each business table; and extracting content from the business table according to the layout analysis result and applying the corresponding conversion to obtain structured information. The method and device automatically obtain all business tables in spreadsheet documents in batches, which improves the efficiency of large-scale data extraction.

Description

Method and device for extracting structured information from a spreadsheet document
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and device for extracting structured information from a spreadsheet document.
Background art
A spreadsheet document, such as an Excel file, is nominally tabular, yet its data is still unstructured or semi-structured. One spreadsheet document may contain several worksheet tabs, each tab may contain several isolated business tables, and the layout of each business table may be quite arbitrary. The data in these tables therefore cannot be used directly; it has to be extracted, processed, and converted into structured data. Existing data extraction algorithms have difficulty handling such complex and variable situations.
Summary of the invention
In view of the defects of the prior art, the method and device provided by the present invention for extracting structured information from a spreadsheet document automatically obtain all business tables in a spreadsheet document in batches and improve the efficiency of large-scale data extraction.
In a first aspect, the present invention provides a method for extracting structured information from a spreadsheet document, comprising: obtaining all business tables in the spreadsheet document by means of an isolated table recognition algorithm; performing layout analysis on the business table; and extracting content from the business table according to the layout analysis result and applying the corresponding conversion processing to obtain structured information.
With the method provided by the present invention, the isolated table recognition algorithm automatically obtains, in batches, all independent business tables in a spreadsheet document, which improves the efficiency of large-scale data extraction. Business data is extracted only after layout analysis has been performed on the business table, which improves the reliability of the extracted data; the approach is particularly effective for recognizing and extracting large-scale semi-structured data.
Preferably, obtaining all business tables in the spreadsheet document by means of the isolated table recognition algorithm comprises: creating two two-dimensional bit arrays of the same size as the spreadsheet document, denoted A and B; traversing all cells in the spreadsheet document and, if a cell has content, marking the corresponding position in A as 1, and otherwise as 0; traversing all cells in the spreadsheet document and marking B according to the border lines of the cells; if a value in B is 1, setting the value at the same position in A to 1; and obtaining the coordinates of the business tables in the spreadsheet document from the updated A.
Preferably, traversing all cells in the spreadsheet document and marking B according to the border lines of the cells comprises: traversing all cells in the spreadsheet document and, if at least one of the four corners of a cell has two border lines, marking the corresponding position in B as 1.
Preferably, after traversing all cells in the spreadsheet document and marking the corresponding position in B as 1 if at least one of the four corners of a cell has two border lines, the method further comprises: step S132, traversing all cells in the spreadsheet document again and, if a cell has a border line, its value in B is 0, and at least one of the four cells adjacent to it (above, below, left, and right) has the value 1 in B, marking the position of the cell in B as 1; step S133, traversing all cells in the spreadsheet document again and, if the value of a cell in B is 0 and, in a 2 × 2 region containing the cell, the values of the other three cells in B are all 1, marking the cell as 1 in B and incrementing a counter by 1; and step S134, if the counter is not 0, resetting the counter to zero and performing step S133 again.
Preferably, obtaining the coordinates of the business tables in the spreadsheet document from the updated A comprises: performing a reduction operation on the updated A to obtain LA; and obtaining the coordinates of the business tables in the spreadsheet document from LA.
Preferably, performing the reduction operation on the updated A to obtain LA comprises: traversing the columns of A starting from the leftmost column and, when a column containing a 1 is found, recording its column coordinate X1 and ending the traversal; traversing the columns of A starting from the rightmost column and, when a column containing a 1 is found, recording its column coordinate X2 and ending the traversal; traversing the rows of A starting from the top row and, when a row containing a 1 is found, recording its row coordinate Y1 and ending the traversal; traversing the rows of A starting from the bottom row and, when a row containing a 1 is found, recording its row coordinate Y2 and ending the traversal; and extracting the data of A at the position [X1, X2, Y1, Y2] to form the two-dimensional bit array LA, and determining the coordinate mapping between LA and A from X1, X2, Y1, and Y2.
Preferably, obtaining the coordinates of the business tables in the spreadsheet document from LA comprises: if all values in LA are 1, the spreadsheet document contains only one table, and the business table coordinates are [X1, X2, Y1, Y2]; otherwise, checking whether the cell at column X1, row Y1 of the spreadsheet document is empty and, if it is not empty, checking the remaining cells to the right one by one until an empty cell is found, and recording the column coordinate of that empty cell as X3; checking the cells of column X1 from top to bottom until an empty cell is found, recording the row coordinate of that empty cell as the maximum row coordinate of column X1, and continuing with the next column until column X3 has been checked; if the largest of all the maximum row coordinates is Y3, the business table coordinates are [X1, X3, Y1, Y3]; setting the content of LA at the position corresponding to [X1, X3, Y1, Y3] to 0 to obtain a new LA; and obtaining business table coordinates from the updated LA until all business tables in the spreadsheet document have been extracted.
Preferably, performing layout analysis on the business table comprises: detecting the title part of the business table; extracting multi-dimensional information from the business table excluding the title part; and judging the table layout according to the extracted multi-dimensional information.
In a second aspect, the present invention provides a device for extracting structured information from a spreadsheet document, comprising: a business table acquisition module for obtaining all business tables in the spreadsheet document by means of an isolated table recognition algorithm; a table layout analysis module for performing layout analysis on the business table; and a table data extraction module for extracting content from the business table according to the layout analysis result and applying the corresponding conversion processing to obtain structured information.
With the device provided by the present invention, the isolated table recognition algorithm automatically obtains, in batches, all independent business tables in a spreadsheet document, which improves the efficiency of large-scale data extraction. Business data is extracted only after layout analysis has been performed on the business table, which improves the reliability of the extracted data; the approach is particularly effective for recognizing and extracting large-scale semi-structured data.
Preferably, the business table acquisition module is specifically configured to: create two two-dimensional bit arrays of the same size as the spreadsheet document, denoted A and B; traverse all cells in the spreadsheet document and, if a cell has content, mark the corresponding position in A as 1, and otherwise as 0; traverse all cells in the spreadsheet document and mark B according to the border lines of the cells; if a value in B is 1, set the value at the same position in A to 1; and obtain the coordinates of the business tables in the spreadsheet document from the updated A.
Brief description of the drawings
Fig. 1 is a flowchart of the method for extracting structured information from a spreadsheet document provided by an embodiment of the present invention;
Fig. 2 shows the layout of the title part, remarks part, and business data part in an example table;
Fig. 3 is an example of a vertical multi-TL layout;
Fig. 4 is an example of a horizontal multi-TL layout;
Fig. 5 is an example of splitting and merging a table with a multi-TL layout;
Fig. 6 is another example of splitting and merging a table with a multi-TL layout;
Fig. 7 is an example of processing a table with a single-TL (multi-level) layout;
Fig. 8 is an example of a spreadsheet document containing several mutually independent business tables;
Fig. 9 shows a table that has only an outer border;
Fig. 10 is a structural block diagram of the device for extracting structured information from a spreadsheet document provided by an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the technical solution of the present invention are described in detail below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution more clearly; they are examples and do not limit the scope of protection of the present invention.
It should be noted that, unless otherwise stated, technical or scientific terms used in this application have the ordinary meaning understood by a person of ordinary skill in the art to which the present invention belongs.
As shown in Fig. 1, this embodiment provides a method for extracting structured information from a spreadsheet document, comprising:
Step S1: obtain all business tables in the spreadsheet document by means of an isolated table recognition algorithm.
Common spreadsheet documents include Excel files, OpenOffice ods files, and the like, but are not limited to these. As shown in Fig. 8, one spreadsheet document may contain several mutually independent business tables; the isolated table recognition algorithm extracts each of them separately. A business table is a table that contains business data.
Step S2: perform layout analysis on the business table.
Step S3: extract content from the business table according to the layout analysis result, and apply the corresponding conversion processing to obtain structured information.
The conversion processing includes splitting merged data blocks, deleting blank rows, replacing special characters, and so on.
With the method provided by this embodiment, the isolated table recognition algorithm automatically obtains, in batches, all independent business tables in a spreadsheet document, which improves the efficiency of large-scale data extraction. Business data is extracted only after layout analysis has been performed on the business table, which improves the reliability of the extracted data; the approach is particularly effective for recognizing and extracting large-scale semi-structured data.
To improve the accuracy of business table extraction, the isolated table recognition algorithm in step S1 specifically comprises the following steps:
Step S11: create two two-dimensional bit arrays of the same size as the spreadsheet document, denoted A and B.
The size of the spreadsheet document is the number of cells it contains: the number of rows of each two-dimensional bit array equals the number of cell rows, and the number of columns equals the number of cell columns. Compared with other data structures, a two-dimensional bit array uses the least space and is convenient to process, which helps improve processing speed.
Step S12: traverse all cells in the spreadsheet document; if a cell has content, mark the corresponding position in A as 1, and otherwise as 0.
The cells of the spreadsheet document correspond one to one with the elements of the two-dimensional bit array A: if the cell in the first row and first column has content, the element in the first row and first column of A is marked 1. The purpose of step S12 is to mark, according to cell content, which cells belong to a business table; a position marked 1 in A corresponds to a cell that belongs to a business table.
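Steps S11 and S12 can be sketched as follows. This is a minimal illustration, assuming the worksheet has already been read into a rectangular list of rows in which an empty cell is `None` or an empty string; the function name `build_bit_arrays` and the sample grid are hypothetical, and plain Python lists of 0/1 integers stand in for a packed bit array.

```python
def build_bit_arrays(grid):
    """Create A and B with the same shape as the sheet (S11) and fill A (S12)."""
    rows, cols = len(grid), len(grid[0])
    # A[r][c] is 1 when the cell has content, otherwise 0.
    A = [[1 if grid[r][c] not in (None, "") else 0 for c in range(cols)]
         for r in range(rows)]
    # B starts out all zero; it is filled later from the border lines (S13).
    B = [[0] * cols for _ in range(rows)]
    return A, B


# Hypothetical 3 x 3 sheet with a 2 x 2 block of content.
grid = [
    ["Name", "Age", None],
    ["Ann", 40, None],
    [None, None, ""],
]
A, B = build_bit_arrays(grid)
```

A production version would likely pack the bits to obtain the space saving the text mentions; the list form is used here only for readability.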
Step S13: traverse all cells in the spreadsheet document and mark B according to the border lines of the cells.
The purpose of step S13 is to mark, according to whether a cell has border lines, which cells belong to a business table; a position marked 1 in B corresponds to a cell that belongs to a business table. When a business table contains empty cells, judging by cell content alone reduces accuracy, so the border-line criterion is added to improve the accuracy of the judgment.
Step S15: if a value in B is 1, set the value at the same position in A to 1. The purpose of this step is to supplement the cells that were missed when A was marked.
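Step S15 amounts to an element-wise OR of B into A. A minimal sketch, assuming A and B are equally sized lists of 0/1 rows (the function name `merge_marks` is hypothetical):

```python
def merge_marks(A, B):
    """Set A to 1 wherever B is 1 (step S15); A is updated in place."""
    for row_a, row_b in zip(A, B):
        for c, bit in enumerate(row_b):
            if bit == 1:
                row_a[c] = 1
    return A


# The empty cell at row 0, column 1 lies inside a bordered table,
# so it is marked in B and gets added to A.
A = [[1, 0], [0, 0]]
B = [[1, 1], [0, 0]]
merge_marks(A, B)
```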
Step S16: obtain the coordinates of the business tables in the spreadsheet document from the updated A.
Further, a preferred implementation of step S13 specifically comprises the following steps:
Step S131: traverse all cells in the spreadsheet document; if at least one of the four corners of a cell has two border lines, that is, both edges meeting at that corner carry a border line and form a shape such as ┌, ┐, └, or ┘, mark the corresponding position in B as 1.
Here a border line is a border that has actually been set for the cell and is displayed, not the gridlines that Excel shows to help users tell cells apart.
Some tables have only an outer border and no inner border lines, as shown in Fig. 9. In that case step S131 can recognize only the cells at the four corners of the table, that is, the four cells marked "1" in Fig. 9.
Step S132: traverse all cells in the spreadsheet document again; if a cell has a border line, its value in B is 0, and at least one of the four cells adjacent to it (above, below, left, and right) has the value 1 in B, mark the position of the cell in B as 1.
Step S132 recognizes the ring of cells along the outer border, that is, the cells marked "2" in Fig. 9.
Step S133: traverse all cells in the spreadsheet document again; if the value of a cell in B is 0 and, in a 2 × 2 region containing the cell, the values of the other three cells in B are all 1, mark the cell as 1 in B and increment the counter by 1.
The counter records how many cells were marked during the current traversal.
Repeating step S133 completes the marking of the table interior, for example the cells marked "3" in Fig. 9.
Step S134: after step S133 has been executed, if the counter is not 0, reset the counter to zero and execute step S133 again.
If the counter is not 0, there may still be unmarked cells that were missed, so the method returns to step S133 and marks again. If the counter is 0, all cells in the document have been marked and the spreadsheet document is not traversed again; at this point every region enclosed by border lines has been marked 1.
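Steps S131 to S134 can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: each cell's borders are represented as a set drawn from `"top"`, `"bottom"`, `"left"`, `"right"`, and the helper names (`has_corner`, `completes_square`, `mark_borders`, `outer_border_table`) are hypothetical. Step S132 is done as a single in-place pass in row-major order, which is enough for a rectangular outer border like the one in Fig. 9.

```python
def has_corner(sides):
    """S131: two border lines meet at a corner (shape like ┌ ┐ └ ┘)."""
    return bool(sides & {"top", "bottom"}) and bool(sides & {"left", "right"})


def completes_square(B, r, c):
    """S133: some 2 x 2 block containing (r, c) has its other three cells set."""
    rows, cols = len(B), len(B[0])
    for dr in (-1, 1):
        for dc in (-1, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                if B[rr][c] and B[r][cc] and B[rr][cc]:
                    return True
    return False


def mark_borders(borders):
    rows, cols = len(borders), len(borders[0])
    B = [[0] * cols for _ in range(rows)]
    for r in range(rows):                      # S131: corner cells
        for c in range(cols):
            if has_corner(borders[r][c]):
                B[r][c] = 1
    for r in range(rows):                      # S132: cells along a border
        for c in range(cols):
            if borders[r][c] and B[r][c] == 0:
                near = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                if any(0 <= i < rows and 0 <= j < cols and B[i][j]
                       for i, j in near):
                    B[r][c] = 1
    while True:                                # S133 + S134: fill the interior
        counter = 0
        for r in range(rows):
            for c in range(cols):
                if B[r][c] == 0 and completes_square(B, r, c):
                    B[r][c] = 1
                    counter += 1
        if counter == 0:
            return B


def outer_border_table(rows, cols, height, width):
    """Hypothetical test data: a table at the top-left with only an outer border."""
    borders = [[set() for _ in range(cols)] for _ in range(rows)]
    for r in range(height):
        for c in range(width):
            if r == 0:
                borders[r][c].add("top")
            if r == height - 1:
                borders[r][c].add("bottom")
            if c == 0:
                borders[r][c].add("left")
            if c == width - 1:
                borders[r][c].add("right")
    return borders


B = mark_borders(outer_border_table(5, 5, 4, 4))
```

For the 4 × 4 table placed in a 5 × 5 sheet, the first pass marks the four corners, the second marks the rest of the outer ring, and the repeated third pass fills the four interior cells while leaving the fifth row and column at 0.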
Specifically, step S16 comprises:
Step S161: perform a reduction operation on the updated A to obtain LA.
The reduction operation removes the large amount of content in A that does not belong to any business table region. This reduces the useless data in A and the amount of data to be processed, which helps improve the efficiency of extracting business table coordinates.
Step S162: obtain the coordinates of the business tables in the spreadsheet document from LA.
Step S161 specifically comprises:
Step S1611: traverse the columns of A starting from the leftmost column; when a column containing a 1 is found, record its column coordinate X1 and end the traversal.
Step S1612: traverse the columns of A starting from the rightmost column; when a column containing a 1 is found, record its column coordinate X2 and end the traversal.
Step S1613: traverse the rows of A starting from the top row; when a row containing a 1 is found, record its row coordinate Y1 and end the traversal.
Step S1614: traverse the rows of A starting from the bottom row; when a row containing a 1 is found, record its row coordinate Y2 and end the traversal.
Step S1615: extract the data of A at the position [X1, X2, Y1, Y2] to form the two-dimensional bit array LA, and determine the coordinate mapping between LA and A from X1, X2, Y1, and Y2: LA(m, n) = A(m+X1-1, n+Y1-1).
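The reduction operation of steps S1611 to S1615 can be sketched as a bounding-box crop. The sketch assumes 0-based list indices, so the mapping becomes `LA[n][m] == A[n + y1][m + x1]` rather than the 1-based formula in the text; the function name `crop` is hypothetical.

```python
def crop(A):
    """Return (LA, (x1, x2, y1, y2)) or None when A contains no 1."""
    rows, cols = len(A), len(A[0])

    def col_has_one(c):
        return any(A[r][c] for r in range(rows))

    def row_has_one(r):
        return any(A[r])

    # Each scan stops at the first hit, as in S1611 to S1614.
    x1 = next((c for c in range(cols) if col_has_one(c)), None)
    if x1 is None:
        return None
    x2 = next(c for c in reversed(range(cols)) if col_has_one(c))
    y1 = next(r for r in range(rows) if row_has_one(r))
    y2 = next(r for r in reversed(range(rows)) if row_has_one(r))
    # S1615: cut out the occupied rectangle.
    LA = [row[x1:x2 + 1] for row in A[y1:y2 + 1]]
    return LA, (x1, x2, y1, y2)


A = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
LA, box = crop(A)
```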
Step S162 specifically comprises:
Step S1621: if all values in LA are 1, the spreadsheet document contains only one table, and the business table coordinates are [X1, X2, Y1, Y2].
Step S1622: if LA contains a 0, check whether the cell at column X1, row Y1 of the spreadsheet document is empty; if it is not empty, check the remaining cells to the right one by one until an empty cell is found, and record the column coordinate of that empty cell as X3.
If LA contains a 0, the spreadsheet document contains several independent business tables; step S1622 onward describes how these independent business tables are extracted.
Step S1623: check the cells of column X1 from top to bottom until an empty cell is found, and record the row coordinate of that empty cell as the maximum row coordinate of column X1; continue with the next column until column X3 has been checked.
Step S1624: if the largest of all the maximum row coordinates is Y3, the business table coordinates are [X1, X3, Y1, Y3]; set the content of LA at the position corresponding to [X1, X3, Y1, Y3] to 0 to obtain a new LA.
After a business table has been extracted, its data must be cleared to zero in LA so that the next table can be found; if it were not cleared, the next round of extraction would again find only the first business table.
Step S163: return to step S161 and obtain business table coordinates from the LA updated in step S1624, until all business tables in the spreadsheet document have been extracted.
To make the business table coordinates obtained in step S1 easier to manage, a List object (referred to as PList) is created in advance. PList stores one-dimensional arrays of length 4, each holding the coordinates of one business table. The four elements of the coordinates give, in order, the position in the spreadsheet document of the first column, last column, first row, and last row of the business table, so the business table can be extracted from the spreadsheet document according to its coordinates.
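The loop formed by steps S161 to S163 can be sketched as follows. Several choices here are assumptions rather than details confirmed by the text: the merged bit array is used as a stand-in for "the cell is empty", indices are 0-based, X3 and Y3 are taken as the last occupied column and row (not the index of the empty cell that ends the scan), and the case where the top-left cell of the cropped region is empty is simply rejected. The names `bounding_box` and `extract_tables` are hypothetical; the returned list plays the role of PList.

```python
def bounding_box(A):
    """Smallest rectangle (x1, x2, y1, y2) containing every 1, or None."""
    rows = [r for r, row in enumerate(A) if any(row)]
    cols = [c for c in range(len(A[0])) if any(row[c] for row in A)]
    return (cols[0], cols[-1], rows[0], rows[-1]) if rows else None


def extract_tables(A):
    A = [row[:] for row in A]          # work on a copy
    plist = []                         # one (x1, x3, y1, y3) entry per table
    while True:
        box = bounding_box(A)          # S161: reduce to the occupied region
        if box is None:
            return plist
        x1, x2, y1, y2 = box
        if not A[y1][x1]:
            raise ValueError("top-left cell is empty; not covered by this sketch")
        x3 = x1                        # S1622: scan right along row y1
        while x3 + 1 <= x2 and A[y1][x3 + 1]:
            x3 += 1
        y3 = y1                        # S1623: scan each column downward
        for c in range(x1, x3 + 1):
            r = y1
            while r + 1 <= y2 and A[r + 1][c]:
                r += 1
            y3 = max(y3, r)
        plist.append((x1, x3, y1, y3))
        for r in range(y1, y3 + 1):    # S1624: clear the extracted table
            for c in range(x1, x3 + 1):
                A[r][c] = 0


# Two hypothetical tables separated by an empty column.
A = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
tables = extract_tables(A)
```

When the region is completely filled with 1s, the scans run to the edges of the bounding box, so the single-table case of step S1621 falls out of the same loop.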
A preferred implementation of step S2 specifically comprises the following steps:
Step S21: detect the title part of the business table.
As shown in Fig. 2, the title part of a table is usually one large merged cell, which may span one row or several rows. A table may also contain a remarks part whose structure is similar to that of the title part. Once the title part and remarks part are removed, what remains is the business data to be extracted. When the table has a remarks part, step S21 also detects it, in the same way as the title part.
Step S22: extract the multi-dimensional information of the business table, excluding the title part.
The multi-dimensional information includes cell content, background color attributes, and so on. Cell content is the text in a cell, such as "name", "age", "Zhang San", or "40". The background color attribute defines the background color of a cell.
Step S23: judge the table layout according to the extracted multi-dimensional information.
Common table layouts are horizontal single-TL, horizontal multi-TL, vertical single-TL, vertical multi-TL, and multi-table combination. A TL (TitleLine) is the header line (or data header part) that gives the name of each item of business data; physically it may span several rows, but logically it is one region. For example, the first row of the business data part in Fig. 2 is the TL. A TL can be horizontal or vertical: Fig. 3 shows a vertical multi-TL layout and Fig. 4 a horizontal multi-TL layout.
The title part and remarks part are generally in the first or second row of the table and each is a single merged cell. A specific implementation of step S21 therefore comprises: checking, row by row, whether each row of the business table is a single merged cell; if so, the row belongs to the title part and the next row is checked; if not, business data begins at that row and detection of the title part stops.
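The row-by-row title check of step S21 can be sketched as below. The representation of merged cells as `(first_row, last_row, first_col, last_col)` tuples with 0-based, table-relative indices is an assumption for illustration, as is the function name `count_title_rows`.

```python
def count_title_rows(n_rows, n_cols, merged_ranges):
    """Count leading rows that consist of one merged cell spanning the full width."""
    title_rows = 0
    for r in range(n_rows):
        spans_row = any(
            r0 <= r <= r1 and c0 == 0 and c1 == n_cols - 1
            for (r0, r1, c0, c1) in merged_ranges
        )
        if not spans_row:
            break                      # business data starts at this row
        title_rows += 1
    return title_rows


# Hypothetical 5 x 4 table: row 0 is a full-width merged title,
# row 2 contains a smaller merge that does not count as a title.
merged = [(0, 0, 0, 3), (2, 2, 0, 1)]
n_title = count_title_rows(5, 4, merged)
```

Because the loop stops at the first row that is not a full-width merged cell, the number of title rows does not have to be known in advance.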
In the prior art, filtering out useless data (such as the title part and remarks part) requires knowing its position in advance and then hard-coding that position in the program so that the first few rows of useless data are skipped. The method of step S21 in this embodiment is more general: however many rows the title part and remarks part occupy, they can be detected accurately and efficiently, which ensures that the business data is extracted accurately.
During extraction, besides reading the information of ordinary cells directly, merged cells need special handling so that the extracted data conforms to the storage format and is convenient for subsequent processing. A preferred implementation of step S22 therefore comprises: extracting the multi-dimensional information of the business table excluding the title part (and the remarks part, if there is one); splitting the merged cells in the extracted information; storing the information of each dimension in its own two-dimensional array; and applying a special mark to the cells produced by splitting.
Merged cells are further divided into horizontal merges, vertical merges, and mixed merges. Splitting a merged cell formed by horizontally merging 5 cells gives the following result:
ABC {←} {←} {←} {←}
The special mark "{←}" is used only when extracting cell content; it indicates that the content of the cell is the same as that of the cell to its left, which provides flexibility for processing the TL and for the final content output. Extraction of the other data needs no special mark.
The extracted background color attributes are:
#F7FBFE #F7FBFE #F7FBFE #F7FBFE #F7FBFE
When a single row contains several horizontal merges, coordinate translation must also be taken into account, as shown below:
ABC {←} DEF {←} {←}
Vertical merges and mixed merges are extracted with a similar method.
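Splitting horizontal merges with the "{←}" mark can be sketched as follows. The input format is an assumption: the row holds content only in the first cell of each merge, and each merge is given as a `(first_col, last_col)` pair; the function name `split_horizontal` is hypothetical.

```python
LEFT_MARK = "{←}"


def split_horizontal(row, merges):
    """Replace every non-leading cell of each horizontal merge with the mark."""
    out = list(row)
    for first_col, last_col in merges:
        for c in range(first_col + 1, last_col + 1):
            out[c] = LEFT_MARK
    return out


# Two merges in one row: columns 0-1 hold "ABC", columns 2-4 hold "DEF".
row = ["ABC", None, "DEF", None, None]
result = split_horizontal(row, [(0, 1), (2, 4)])
print(" ".join(result))
```

Keeping the column ranges explicit is one way to deal with the coordinate translation noted above, since each merge is addressed by its own start column.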
Only when the table layout is known can the business data be extracted accurately and the table converted into structured data according to its layout. Judging the table layout in step S23 includes the following operations:
(1) According to the extracted cell content, exclude the rows and columns that are not a TL.
Exclusion is judged from the data type, length, and keywords of the TL cells. The criteria include: the length of the field name in each TL cell is unlikely to exceed a threshold (such as 50); the number of field names in a TL is unlikely to exceed a threshold (such as 1000); a field name is unlikely to be a purely numeric string; common field names include keywords such as "title", "Name", "address", "Address", "address", "type", and "remarks". A keyword library is built from statistics on common tables, and each row or column is checked for keywords from that library.
The specific steps for judging the table layout from cell content are therefore: check the extracted cell content row by row and column by column; if the data type of a cell's content is a numeric string, the row or column containing that cell is not a TL; if the field length of a cell's content exceeds a first threshold, the row or column containing that cell is not a TL; if several cells in a row or column contain given keywords, that row or column is a TL.
When the keyword-based criterion is used, at least two keywords must appear before the row or column is identified as a TL, so that the judgment is reliable.
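The content-based rules can be sketched as a small classifier for one row or column. The keyword set below is a hypothetical subset chosen for illustration, `str.isdigit` is used as a stand-in for "purely numeric string", and the three return labels are assumptions of this sketch.

```python
KEYWORDS = {"Name", "Address", "type", "remarks"}   # hypothetical keyword library
MAX_FIELD_LENGTH = 50                               # example threshold from the text


def classify_line(cells):
    """Return 'not_tl', 'tl', or 'unknown' for one row or column of cell content."""
    texts = [str(c) for c in cells if c not in (None, "")]
    for text in texts:
        # A numeric string or an over-long field rules the line out as a TL.
        if text.isdigit() or len(text) > MAX_FIELD_LENGTH:
            return "not_tl"
    hits = sum(1 for text in texts if text in KEYWORDS)
    # At least two keyword hits are required before accepting the line as a TL.
    return "tl" if hits >= 2 else "unknown"


header = classify_line(["Name", "Address", "remarks"])
data = classify_line(["Ann", "40", "none"])
```

A line classified as `'unknown'` would be left for the other criteria (background color, data types) to decide.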
(2) According to the extracted background color attributes, judge the table layout.
When a table is displayed, the background color of the TL often differs from that of the data to make reading easier, or the odd and even data rows use alternating background colors. The background color attribute can therefore be used to judge which row or column is likely to be the TL, and hence whether the table layout is horizontal or vertical.
(3) According to whether the data types of the cell contents in the same row or the same column are identical, judge the table layout.
Apart from the TL, in the business data of a table the cells under each TL field name should all have the same type as long as they are not null (the method distinguishes only 'purely numeric string', 'date-time string', and 'string without obvious features'). For example, the table in Fig. 2 has a horizontal layout: in its business data part, excluding the TL in the first row, the cells of each column share one data type. The column under the field name "sequence number" contains only purely numeric strings, the column under "executing court" contains only 'strings without obvious features', and the column under "execution case number" contains only 'strings without obvious features'. In short, apart from the TL row, the data type within each column is uniform.
Based on this property, check whether the data types within each row are identical: if this holds for every row of the table (that is, all cells in the same row are 'purely numeric strings', or all are 'date-time strings', or all are 'strings without obvious features'), the table has a vertical layout. Likewise, check whether the data types within each column are identical: if this holds for every column of the table, the table has a horizontal layout.
Some cells may be null. To keep these cells from affecting the result, cells with empty content are excluded from the check when rows and columns are examined.
The business data part of a table is usually large, and checking every row and column would reduce the efficiency of the judgment. Short-circuit evaluation can therefore be used: as soon as the result for a newly checked row rules out a layout, the check for that layout stops.
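The data-type criterion with short-circuiting can be sketched as below. The classifier `kind` is an assumption: it treats digit-only strings as numeric and accepts only the `YYYY-MM-DD` date format, whereas a real implementation would need broader patterns. The names `kind`, `uniform`, and `judge_layout` are hypothetical, and when both tests pass the sketch arbitrarily reports the horizontal layout first.

```python
from datetime import datetime


def kind(value):
    """Classify a cell as 'numeric', 'datetime', or 'text'."""
    text = str(value)
    if text.isdigit():
        return "numeric"
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return "datetime"
    except ValueError:
        return "text"


def uniform(line):
    """True when all non-empty cells of a row or column share one type."""
    kinds = {kind(v) for v in line if v not in (None, "")}
    return len(kinds) <= 1


def judge_layout(data):
    """data: business rows with the TL already removed."""
    # all() over a generator stops at the first line that breaks uniformity,
    # which gives the short-circuit behaviour described above.
    if all(uniform(col) for col in zip(*data)):
        return "horizontal"
    if all(uniform(row) for row in data):
        return "vertical"
    return "undetermined"


# Hypothetical data: columns are sequence number, court, date (one date missing).
data = [
    ["1", "Court A", "2016-01-02"],
    ["2", "Court B", ""],
    ["3", "Court C", "2016-03-04"],
]
layout = judge_layout(data)
```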
The above methods for judging the table layout can be combined in various ways according to actual needs, which improves the accuracy of the judgment. In addition, the method of this embodiment can recognize tables that contain multiple TLs, which improves the reliability of the extracted data.
When the table layout is vertical, the table formed from the cell contents also needs to be transposed into a horizontal layout.
TLs are further divided into single-level and multi-level TLs, but unless otherwise stated both are referred to simply as TL. In Fig. 2 there is only one TL, and it is single-level. In Fig. 7 there is only one TL, and it is multi-level (it consists of several rows with a parent-child relationship between them); the field names in these rows must be merged so that a single row of field names is output. In Fig. 7 the TL of the original table has two parts: the left part spans several rows (multi-level) and the right part is a single row. The first level of the multi-level part is a merged cell with the field name 'basic information', and the second level contains the fields 'name', 'age', and 'sex'. The final output is a single-level TL with the structure "basic information_name", "basic information_age", "basic information_sex", "other field A", "other field B".
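Flattening a two-level TL like the one described for Fig. 7 can be sketched as follows. The input shape is an assumption: the first-level row has already been split with "{←}" marks, and the second-level row holds `None` under columns that have no second level; the function name `flatten_header` is hypothetical and only two levels are handled.

```python
LEFT_MARK = "{←}"


def flatten_header(top, bottom):
    """Join each second-level name to its first-level parent with '_'."""
    out = []
    parent = None
    for upper, lower in zip(top, bottom):
        if upper != LEFT_MARK:
            parent = upper             # a new first-level field starts here
        out.append(f"{parent}_{lower}" if lower else parent)
    return out


top = ["basic information", LEFT_MARK, LEFT_MARK, "other field A", "other field B"]
bottom = ["name", "age", "sex", None, None]
single_level = flatten_header(top, bottom)
```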
When table-layout is many TL, in addition it is also necessary to do cutting union operation to the form that cell content is formed, The layout of single TL is converted into, to meet the call format of structural data.Cutting union operation includes:Compare multiple TL's Cell content;Content identical TL only retains a line TL, as shown in Figure 5;The different TL of content is spliced into TL in a row, is such as schemed Shown in 6.
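The split-and-merge operation can be sketched in Python as follows. Figs. 5 and 6 are not reproduced here, so the details are assumptions: the table is taken to be already split into (TL, data rows) blocks, identical TLs collapse into one, different TLs are spliced left to right, and fields missing from a block are padded with `None`. The name `merge_blocks` is hypothetical.

```python
from typing import Any, List, Tuple

Block = Tuple[List[str], List[List[Any]]]


def merge_blocks(blocks: List[Block]) -> Block:
    """Convert several (TL, rows) blocks into a single-TL table."""
    merged_header: List[str] = []
    for header, _ in blocks:
        for field in header:
            if field not in merged_header:  # identical fields are kept once
                merged_header.append(field)

    merged_rows: List[List[Any]] = []
    for header, rows in blocks:
        index = {field: i for i, field in enumerate(header)}
        for row in rows:
            merged_rows.append(
                [row[index[f]] if f in index else None for f in merged_header]
            )
    return merged_header, merged_rows


same = merge_blocks([(["name", "age"], [["Ann", 30]]),
                     (["name", "age"], [["Bob", 25]])])
different = merge_blocks([(["name", "age"], [["Ann", 30]]),
                          (["city"], [["Paris"]])])
```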
Finally, for merged cells, the special marks can be corrected according to business needs. For example, the row
ABC {←} {←} {←} {←}
can be adjusted to the following form:
ABC ABC ABC ABC ABC
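A minimal Python sketch of this correction, assuming a row is a list of cell strings and that the `{←}` mark means "same content as the cell to the left". The name `fill_merged` is hypothetical, and only the left-pointing mark shown above is handled.

```python
from typing import List

LEFT_MARK = "{←}"


def fill_merged(row: List[str]) -> List[str]:
    """Replace each left-pointing merge mark with the value to its left."""
    filled: List[str] = []
    for cell in row:
        if cell == LEFT_MARK and filled:
            filled.append(filled[-1])
        else:
            filled.append(cell)  # a leading mark has no source and is kept
    return filled


row = fill_merged(["ABC", "{←}", "{←}", "{←}", "{←}"])
# row == ["ABC", "ABC", "ABC", "ABC", "ABC"]
```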
The above method for extracting structured information from a spreadsheet document also applies when the document contains multiple sheet tabs. Specifically, the sheet tabs in the spreadsheet document are obtained one by one, and the method of steps S1 to S3 is applied to each sheet tab to extract the business tables it contains.
Based on the same inventive concept as the above method for extracting structured information from a spreadsheet document, this embodiment also provides a device for extracting structured information from a spreadsheet document. As shown in Fig. 10, the device includes: a business table acquisition module, for obtaining all business tables in the spreadsheet document; a table layout analysis module, for performing layout analysis on the business tables; and a table information extraction module, for extracting content from the business tables according to the layout analysis result and performing the corresponding conversion processing to obtain structured information.
With the device provided by this embodiment, all independent business tables in a spreadsheet document can be obtained automatically and in batches through the isolated-table recognition algorithm, which improves the efficiency of large-scale data extraction. Business data is extracted only after layout analysis has been performed on the business table, which improves the reliability of the extracted data; this is especially effective for identifying and extracting large-scale semi-structured data.
Specifically, the business table acquisition module is configured to: create two two-dimensional bit arrays of the same size as the spreadsheet document, denoted A and B; traverse all cells in the spreadsheet document, marking the corresponding position in A as 1 if the cell has content and as 0 otherwise; traverse all cells in the spreadsheet document and mark B according to the border lines of the cells; if a value in B is 1, set the value at the corresponding position in A to 1; and obtain the business table coordinates in the spreadsheet document according to the updated A.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they shall all be covered by the scope of the claims and the description of the present invention.
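The marking steps performed by the acquisition module can be sketched in Python as follows. The `Cell` record and its border flags are hypothetical stand-ins for whatever a spreadsheet reader returns, plain lists of 0/1 integers replace real bit arrays, and the border rule shown (a cell is marked in B when at least one of its corners is formed by two border lines) is the one stated in claim 3.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Cell:
    value: Any = None
    top: bool = False
    bottom: bool = False
    left: bool = False
    right: bool = False


def has_corner(cell: Cell) -> bool:
    """True if a horizontal and a vertical border line meet at some corner."""
    return (cell.top or cell.bottom) and (cell.left or cell.right)


def build_mask(sheet: List[List[Cell]]) -> List[List[int]]:
    """Build A (content) and B (borders), then fold B into A."""
    a = [[1 if c.value not in (None, "") else 0 for c in row] for row in sheet]
    b = [[1 if has_corner(c) else 0 for c in row] for row in sheet]
    return [[x | y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


sheet = [
    [Cell("x"), Cell()],
    [Cell(top=True, left=True), Cell(top=True)],
]
mask = build_mask(sheet)
```

In the updated mask, a position is 1 when the cell either has content or is an empty cell drawn as part of a bordered table.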

Claims (10)

1. A method for extracting structured information from a spreadsheet document, characterized by comprising:
obtaining all business tables in the spreadsheet document by an isolated-table recognition algorithm;
performing layout analysis on the business tables;
extracting content from the business tables according to the layout analysis result, and performing corresponding conversion processing to obtain structured information.
2. The method according to claim 1, characterized in that obtaining all business tables in the spreadsheet document by the isolated-table recognition algorithm comprises:
creating two two-dimensional bit arrays of the same size as the spreadsheet document, denoted A and B;
traversing all cells in the spreadsheet document and, if a cell has content, marking the corresponding position in A as 1, and otherwise marking it as 0;
traversing all cells in the spreadsheet document and marking B according to the border lines of the cells;
if a value in B is 1, setting the value at the same position in A to 1;
obtaining the business table coordinates in the spreadsheet document according to the updated A.
3. The method according to claim 2, characterized in that traversing all cells in the spreadsheet document and marking B according to the border lines of the cells comprises:
traversing all cells in the spreadsheet document and, if at least one of the four corners of a cell has two border lines, marking the corresponding position in B as 1.
4. The method according to claim 3, characterized in that, after traversing all cells in the spreadsheet document and marking the corresponding position in B as 1 if at least one of the four corners of a cell has two border lines, the method further comprises:
Step S132, traversing all cells in the spreadsheet document again; if a cell has a border line, its corresponding value in B is 0, and at least one of the four cells adjacent to it (above, below, left and right) is marked as 1 in B, marking the position of the cell in B as 1;
Step S133, traversing all cells in the spreadsheet document again; if the corresponding value of a cell in B is 0 and, in a 2 × 2 region containing the cell, the corresponding values in B of the other three cells are all 1, marking the cell as 1 in B and incrementing a counter by 1;
Step S134, if the counter is not 0, resetting the counter to zero and re-executing step S133.
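For illustration only, steps S132 to S134 can be sketched in Python as below. The grid `has_border` (true where a cell has any border line) and the function names are assumptions; B is a list of lists of 0/1 values and is modified in place.

```python
from typing import List

Grid = List[List[int]]


def propagate_borders(b: Grid, has_border: List[List[bool]]) -> None:
    """Step S132: mark bordered cells that touch an already marked cell."""
    rows, cols = len(b), len(b[0])
    for r in range(rows):
        for c in range(cols):
            if b[r][c] == 0 and has_border[r][c]:
                neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                if any(0 <= nr < rows and 0 <= nc < cols and b[nr][nc] == 1
                       for nr, nc in neighbours):
                    b[r][c] = 1


def completes_square(b: Grid, r: int, c: int) -> bool:
    """True if some 2 x 2 region containing (r, c) has its other three cells set."""
    rows, cols = len(b), len(b[0])
    for dr in (-1, 1):
        for dc in (-1, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                if b[rr][c] and b[r][cc] and b[rr][cc]:
                    return True
    return False


def fill_corners(b: Grid) -> None:
    """Steps S133 and S134: repeat the pass until no cell is changed."""
    while True:
        counter = 0
        for r in range(len(b)):
            for c in range(len(b[0])):
                if b[r][c] == 0 and completes_square(b, r, c):
                    b[r][c] = 1
                    counter += 1
        if counter == 0:  # S134: nothing changed, so stop iterating
            return
```

The loop in `fill_corners` ends because every pass either sets at least one more cell or leaves the counter at zero.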
5. The method according to claim 2, characterized in that obtaining the business table coordinates in the spreadsheet document according to the updated A comprises:
performing a reduction operation on the updated A to obtain LA;
obtaining the business table coordinates in the spreadsheet document according to LA.
6. The method according to claim 5, characterized in that performing the reduction operation on the updated A to obtain LA comprises:
traversing the columns of A starting from the leftmost side of A and, if a column contains a value of 1, recording the column coordinate X1 of that column and ending the traversal;
traversing the columns of A starting from the rightmost side of A and, if a column contains a value of 1, recording the column coordinate X2 of that column and ending the traversal;
traversing the rows of A starting from the top of A and, if a row contains a value of 1, recording the row coordinate Y1 of that row and ending the traversal;
traversing the rows of A starting from the bottom of A and, if a row contains a value of 1, recording the row coordinate Y2 of that row and ending the traversal;
extracting the data at position [X1, X2, Y1, Y2] in A to form a two-dimensional bit array LA, and determining the coordinate mapping between LA and A according to X1, X2, Y1 and Y2.
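For illustration only, the reduction operation can be sketched in Python as below. Coordinates are assumed to be 0-based, with X as the column index and Y as the row index; A is a non-empty list of lists of 0/1 values, and `reduce_mask` is a hypothetical name.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]


def reduce_mask(a: Grid) -> Optional[Tuple[Grid, Tuple[int, int, int, int]]]:
    """Crop A to the smallest rectangle that contains every 1 value."""
    height, width = len(a), len(a[0])

    def col_has_one(x: int) -> bool:
        return any(a[y][x] for y in range(height))

    def row_has_one(y: int) -> bool:
        return any(a[y])

    # Each scan stops at the first column or row that contains a 1.
    x1 = next((x for x in range(width) if col_has_one(x)), None)
    if x1 is None:
        return None  # A contains no 1 at all
    x2 = next(x for x in reversed(range(width)) if col_has_one(x))
    y1 = next(y for y in range(height) if row_has_one(y))
    y2 = next(y for y in reversed(range(height)) if row_has_one(y))

    la = [row[x1:x2 + 1] for row in a[y1:y2 + 1]]
    return la, (x1, x2, y1, y2)


a = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
la, (x1, x2, y1, y2) = reduce_mask(a)
```

Under these assumptions, position (x, y) in LA maps back to position (x + X1, y + Y1) in A.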
7. The method according to claim 6, characterized in that obtaining the business table coordinates in the spreadsheet document according to LA comprises:
if all values in LA are 1, there is only one table in the spreadsheet document, and the business table coordinates are [X1, X2, Y1, Y2];
otherwise, detecting whether the cell at column X1, row Y1 of the spreadsheet document is empty; if the cell is not empty, continuing to detect the remaining cells to the right until an empty cell is detected, and recording the column coordinate of the empty cell as X3;
detecting from top to bottom whether the cells of column X1 are empty until an empty cell is detected, recording the row coordinate of the empty cell as the maximum row coordinate of column X1, and continuing with the next column until column X3 has been detected;
if the maximum of all the maximum row coordinates is Y3, the business table coordinates are [X1, X3, Y1, Y3]; setting the content of LA at the position corresponding to [X1, X3, Y1, Y3] to 0 to obtain a new LA;
obtaining business table coordinates in the spreadsheet document according to the updated LA, until all business tables in the spreadsheet document have been extracted.
8. The method according to claim 1, characterized in that performing layout analysis on the business tables comprises:
detecting the title part of the business table;
extracting multi-dimensional information from the part of the business table other than the title part;
judging the table layout according to the extracted multi-dimensional information.
9. A device for extracting structured information from a spreadsheet document, characterized by comprising:
a business table acquisition module, for obtaining all business tables in the spreadsheet document by an isolated-table recognition algorithm;
a table layout analysis module, for performing layout analysis on the business tables;
a table information extraction module, for extracting content from the business tables according to the layout analysis result, and performing corresponding conversion processing to obtain structured information.
10. The device according to claim 9, characterized in that the business table acquisition module is specifically configured to:
create two two-dimensional bit arrays of the same size as the spreadsheet document, denoted A and B;
traverse all cells in the spreadsheet document and, if a cell has content, mark the corresponding position in A as 1, and otherwise mark it as 0;
traverse all cells in the spreadsheet document and mark B according to the border lines of the cells;
if a value in B is 1, set the value at the same position in A to 1;
obtain the business table coordinates in the spreadsheet document according to the updated A.
CN201611245472.9A 2016-12-29 2016-12-29 Method and device for extracting structured information in electronic form document Active CN106709032B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611245472.9A CN106709032B (en) 2016-12-29 2016-12-29 Method and device for extracting structured information in electronic form document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611245472.9A CN106709032B (en) 2016-12-29 2016-12-29 Method and device for extracting structured information in electronic form document

Publications (2)

Publication Number Publication Date
CN106709032A true CN106709032A (en) 2017-05-24
CN106709032B CN106709032B (en) 2019-12-20

Family

ID=58904022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611245472.9A Active CN106709032B (en) 2016-12-29 2016-12-29 Method and device for extracting structured information in electronic form document

Country Status (1)

Country Link
CN (1) CN106709032B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620593A (en) * 2008-06-30 2010-01-06 国际商业机器公司 Resolve the method and the electronic form server of the content of electronic spreadsheet
CN103198069A (en) * 2012-01-06 2013-07-10 株式会社理光 Method and device for extracting relational table
CN103279455A (en) * 2013-06-28 2013-09-04 中国农业银行股份有限公司 Spreadsheet style processing method and device
CN104731813A (en) * 2013-12-23 2015-06-24 珠海金山办公软件有限公司 Form file display method and system
CN106156239A (en) * 2015-04-27 2016-11-23 中国移动通信集团公司 A kind of form abstracting method and device

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170697B (en) * 2017-07-12 2021-08-20 信号旗智能科技(上海)有限公司 International trade file processing method and system and server
CN108170697A (en) * 2017-07-12 2018-06-15 信号旗智能科技(上海)有限公司 A kind of international trade document handling method, system and a kind of server
CN110889310B (en) * 2018-09-07 2023-05-09 深圳市赢时胜信息技术股份有限公司 Financial document information intelligent extraction system and method
CN110889310A (en) * 2018-09-07 2020-03-17 上海怀若智能科技有限公司 Financial document information intelligent extraction system and method
CN110969000A (en) * 2018-09-30 2020-04-07 北京国双科技有限公司 Data merging processing method and device
CN110377604A (en) * 2019-07-23 2019-10-25 北京小米移动软件有限公司 A kind of method, apparatus and medium for extracting form data
CN110377604B (en) * 2019-07-23 2022-06-24 北京小米移动软件有限公司 Method, device and medium for extracting form information
CN110489423A (en) * 2019-08-26 2019-11-22 北京香侬慧语科技有限责任公司 A kind of method, apparatus of information extraction, storage medium and electronic equipment
CN110489423B (en) * 2019-08-26 2021-10-08 北京香侬慧语科技有限责任公司 Information extraction method and device, storage medium and electronic equipment
CN110516048A (en) * 2019-09-02 2019-11-29 苏州朗动网络科技有限公司 The extracting method, equipment and storage medium of list data in pdf document
CN110888965A (en) * 2019-10-22 2020-03-17 深圳市迪博企业风险管理技术有限公司 Document data extraction method and device
CN110866217A (en) * 2019-10-24 2020-03-06 长城计算机软件与系统有限公司 Cross report recognition method and device, storage medium and electronic equipment
CN110968667A (en) * 2019-11-27 2020-04-07 广西大学 Periodical and literature table extraction method based on text state characteristics
CN110968667B (en) * 2019-11-27 2023-04-18 广西大学 Periodical and literature table extraction method based on text state characteristics
CN111966734A (en) * 2020-03-30 2020-11-20 北京来也网络科技有限公司 Data processing method and electronic equipment of spreadsheet combined with RPA and AI
CN112307030A (en) * 2020-11-05 2021-02-02 金蝶软件(中国)有限公司 Dimension combination obtaining method and related equipment
CN112307030B (en) * 2020-11-05 2023-12-26 金蝶软件(中国)有限公司 Dimension combination acquisition method and related equipment
CN112381143A (en) * 2020-11-13 2021-02-19 长城计算机软件与系统有限公司 Variable automatic classification method and system based on machine learning
CN112381143B (en) * 2020-11-13 2023-12-05 新长城科技有限公司 Automatic variable classification method and system based on machine learning
CN112328589A (en) * 2020-11-28 2021-02-05 河北省科学技术情报研究院(河北省科技创新战略研究院) Electronic form data granulation and index standardization processing method

Also Published As

Publication number Publication date
CN106709032B (en) 2019-12-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 518000 2203/2204, Building 1, Huide Building, Beizhan Community, Minzhi Street, Longhua District, Shenzhen, Guangdong

Patentee after: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.

Address before: 518000 units J and K, 12 / F, block B, building 7, Baoneng Science Park, Qinghu Industrial Zone, Qingxiang Road, Longhua New District, Shenzhen City, Guangdong Province

Patentee before: SHENZHEN AUDAQUE DATA TECHNOLOGY Ltd.
