Detailed Description of the Embodiments
To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
The embodiments of the present application provide a method and a device for extracting information from a table, to solve the problem in the prior art that obtaining information from tables involves a heavy workload and low efficiency.
The following is a method embodiment of the present application.
The method embodiment of the present application provides a method for extracting information from a table. Fig. 1 is a flow chart of this method. The method can be applied to a variety of devices such as servers, personal computers (PCs), tablet computers, and mobile phones.
As shown in Fig. 1, the method includes the following steps:
Step S101: parse the web page source code, and extract the table code in the web page according to the table tags in the web page source code.
Web page source code is usually written in HyperText Markup Language (HTML). HTML is a standard markup language for creating web pages, and many websites use it together with Cascading Style Sheets (CSS) and JavaScript to design the user interfaces of web pages, web applications, and mobile applications. The various tables in web pages are usually implemented in HTML.
In HTML, a table is defined by the <table> tag. Each table has a number of rows (defined by the <tr> tag), and each row is divided into a number of cells (defined by the <td> tag). The letters td stand for table data, that is, the content of a data cell. Therefore, by parsing the HTML code and locating the table tag <table>, the table code can be extracted from the HTML code.
For example, a piece of HTML code containing table code is as follows:
<html>
<body>
<h1>address list</h1>
<table>
<tr>
<th>name</th>
<th colspan="2">phone</th>
</tr>
<tr>
<td>Bill Gates</td>
<td>555 77 854</td>
<td>555 77 855</td>
</tr>
</table>
</body>
</html>
Here, the code between the table tag <table> and its end tag </table> is the table code. The table code above is displayed in the web page as follows:
name | phone (spans two columns)
Bill Gates | 555 77 854 | 555 77 855
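The tag-based extraction of step S101 can be sketched in Python with the standard library's `html.parser`. This is a minimal illustration under stated assumptions, not the implementation of the application: the class name `TableCodeExtractor` and the function `extract_table_code` are hypothetical, and the sketch returns each outermost <table>...</table> block as a string.

```python
from html.parser import HTMLParser


class TableCodeExtractor(HTMLParser):
    """Collects the code of every outermost <table>...</table> block (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished table code strings
        self._depth = 0    # current <table> nesting depth
        self._buf = []     # pieces of the table currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
        if self._depth:
            # Keep the start tag exactly as written, attributes included.
            self._buf.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without an end tag.
        if self._depth:
            self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._depth:
            self._buf.append(f"</{tag}>")
        if tag == "table" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.tables.append("".join(self._buf))
                self._buf = []

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def extract_table_code(html):
    """Return the table code found in the given web page source code."""
    parser = TableCodeExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Applied to the address-list page above, `extract_table_code` would return a single string that starts with <table> and ends with </table>; the heading outside the table is not included.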
In one embodiment, to meet the specific functional or business requirements of a department, institution, or enterprise, technical personnel can use a focused crawler to crawl web page source code from specified web pages. For example, a contractor or construction enterprise in the building industry can deploy a crawler on specified websites, and crawl the web page source code when a content update is detected on a monitored site, or crawl the web page source code at scheduled times.
In one embodiment, the user can manually select which web page's source code to parse. For example, an application implementing this method is installed on the device with which the user browses web pages. When the user wishes to extract information from a table in a certain web page, the user can enter the URL of that web page into the application (input methods include, but are not limited to, keyboard input, copying and pasting text, dragging text, and optical character recognition), so that the application parses the web page source code and extracts the table code from it.
In some scenarios, a table in a web page may consist of multiple nested tables. For example, some web pages are written so that various pieces of information are displayed in the form of multiple nested tables. However, when a web page contains multiple nested tables, usually only the innermost table is the one that publishes valuable data. Therefore, in one embodiment, after step S101 parses the web page source code and finds the table tags, it also determines whether the table tags are nested in multiple layers. If they are, the table code corresponding to the innermost table tag is extracted; if they are not, all table code is extracted directly. This narrows the range from which information is extracted and improves extraction efficiency and accuracy.
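The nesting check described above can be sketched as follows. This is only one possible reading, with hypothetical names (`InnermostTableExtractor`, `extract_innermost_tables`): a table is treated as innermost when no other <table> opens inside it, so a page without nesting yields all of its tables.

```python
from html.parser import HTMLParser


class InnermostTableExtractor(HTMLParser):
    """Keeps only tables that contain no nested table (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.innermost = []
        self._stack = []   # one [pieces, has_child_table] entry per open <table>

    def _emit(self, text):
        # Every currently open table receives the piece of source text.
        for entry in self._stack:
            entry[0].append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._stack:
                self._stack[-1][1] = True   # the enclosing table has a child table
            self._stack.append([[], False])
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if tag == "table" and self._stack:
            pieces, has_child = self._stack.pop()
            if not has_child:
                self.innermost.append("".join(pieces))

    def handle_data(self, data):
        self._emit(data)


def extract_innermost_tables(html):
    parser = InnermostTableExtractor()
    parser.feed(html)
    parser.close()
    return parser.innermost
```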
Step S102: according to the row-span attribute values and column-span attribute values of the cells in the table code, analyze whether the table contains merged cells that span multiple rows or multiple columns; if so, decompose the merged cells into multiple cells.
In HTML, the row-span attribute and column-span attribute of a table cell are defined by the rowspan and colspan attributes, respectively. For example, rowspan="2" means that the row-span attribute value is 2, that is, the cell spans two rows; colspan="2" means that the column-span attribute value is 2, that is, the cell spans two columns.
Because of such cells that span rows or columns, the cell structure of the table becomes irregular, which hinders information extraction and may also reduce the accuracy of the collected information. To ensure the accuracy of the extracted table data, in step S102 of the embodiments of the present application, if the table is determined to contain merged cells according to the row-span and column-span attribute values of the cells in the table code, the merged cells are decomposed so that every cell in the table occupies exactly 1 row and 1 column.
For example, for a table containing merged cells, the table obtained after the merged cells are decomposed is as follows:
Ranking | Organization | Offer by tender | Bottled water | Bottled water | Branch fills water | Branch fills water
Ranking | Organization | Offer by tender | Price with tax invoice | Price without tax invoice | Price with tax invoice | Price without tax invoice
1 | A trade Co., Ltd | 200500 | 17.55 | 15 | 1.25 | 1.07
2 | B Beverage Co., Ltd | 198900 | 17.55 | 15 | 1.17 | 1
3 | C Beverage Co., Ltd | 180000 | 15 | 12.82 | 1.5 | 1.28
In one embodiment, after the merged cells of the table are decomposed, each cell in the table can also be given a position mark, that is, the row coordinate and column coordinate of each cell are recorded, which can form the following table:
Content | Row coordinate | Column coordinate
Ranking | 1 | 1
Organization | 1 | 2
Offer by tender (yuan) | 1 | 3
Bottled water | 1 | 4
Bottled water | 1 | 5
……
In this way, the position of each cell in the table is precisely located by its row coordinate and column coordinate, which makes it easier to analyze the positional relationships between cells in subsequent steps and improves information extraction efficiency.
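The position marking can be illustrated with a short Python sketch. It assumes the decomposed table is already available as a list of rows of cell contents; the function name `mark_positions` and the record keys are hypothetical, and coordinates are 1-based as in the table above.

```python
def mark_positions(grid):
    """Attach 1-based row and column coordinates to every cell of a rectangular grid."""
    return [
        {"content": content, "row": r, "col": c}
        for r, row in enumerate(grid, start=1)
        for c, content in enumerate(row, start=1)
    ]


records = mark_positions([
    ["Ranking", "Organization", "Offer by tender"],
    ["1", "A trade Co., Ltd", "200500"],
])
```

Each record carries the cell content together with its coordinates, so later steps can compare positions without re-reading the table code.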
Step S103: perform rule analysis on each cell in the table to obtain the header cells and their corresponding header categories.
In one embodiment, the content of each cell can be sent to a pre-established analysis model, which analyzes the cell content according to its built-in analysis rules to identify which cells are header cells and which are non-header cells, and to determine the header category corresponding to each header cell. Here, a header cell is a cell that holds header information of the table, and a non-header cell is a cell that holds non-header information of the table.
In one embodiment, a pre-established extraction tree can be used to analyze each cell in the table. For example, an extraction tree established for tables of the "bid publicity" class may include the following structure:
Bid publicity
  + Character analysis
  Table header analysis -- header node --
    Highest bidder's information header -- header category child node --
      Acceptance of the bid serial number -- leaf node --
      Highest bidder/candidate ...
      Middle marked price
      Acceptance of the bid content
    Project information header -- header category child node --
      Bid inviter -- leaf node --
      Project name ...
      Bidding time
      Bidding number
      Tender agent
      Assessment of bids date
      Assessment of bids place
      Bid evaluation committee's composition
      Competitive bid unit contact person
      Competitive bid unit telephone number
      Agency contact person
      Agency contact address
      Agency's telephone number
      The publicity beginning and ending time
      The publicity start time
      Publicity deadline
    + Indirect output header -- header category child node --
    Other headers
As the example above shows, the extraction tree includes a header node, the header node includes multiple header category child nodes, and each header category child node includes multiple leaf nodes.
In one embodiment, at least one expression can be set in a leaf node. The expression is matched against the content of a cell to determine whether the cell is a header cell and, if so, the header category corresponding to that header cell. For example, if an expression in the leaf node "Highest bidder/candidate" matches the cell content "Organization", it can be determined that the cell containing "Organization" is a header cell and that its corresponding header category is "Highest bidder/candidate".
In one embodiment, the header category can further include a header group and a header major class. The header group is the name of the target leaf node to which the expression matched by the cell belongs, and the header major class is the name of the target header category child node to which the target leaf node belongs. For example, the header group corresponding to the header cell "Organization" is "Highest bidder/candidate", and its header major class is "Highest bidder's information header".
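The matching of cell content against leaf-node expressions can be sketched in Python. The sketch assumes that the expressions are ordinary regular expressions and that the tree is held as a nested dictionary; the patterns, the dictionary `EXTRACTION_TREE`, and the function `classify_cell` are all hypothetical and only mirror the node names used in the example.

```python
import re

# Hypothetical extraction tree: header major class -> header group (leaf node) -> patterns.
EXTRACTION_TREE = {
    "Highest bidder's information header": {
        "Acceptance of the bid serial number": [r"^Ranking$", r"^Serial number$"],
        "Highest bidder/candidate": [r"^Organization$", r"^Bidder"],
        "Middle marked price": [r"^Offer by tender"],
    },
    "Project information header": {
        "Project name": [r"^Project name$"],
        "Bidding number": [r"^Bidding number$"],
    },
}


def classify_cell(content):
    """Return whether the cell is a header cell and, if so, its group and major class."""
    text = content.strip()
    for major, leaves in EXTRACTION_TREE.items():
        for group, patterns in leaves.items():
            if any(re.search(p, text) for p in patterns):
                return {"is_header": True, "group": group, "major": major}
    # No leaf expression matched: treat the cell as a non-header cell.
    return {"is_header": False, "group": None, "major": None}
```

With these assumed patterns, `classify_cell("Organization")` reports a header cell in the group "Highest bidder/candidate", while a data value such as "200500" matches nothing and is treated as a non-header cell.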
In one embodiment, the results of the rule analysis of each cell can be organized into a table of the following form:
Cell content | Row coordinate | Column coordinate | Header? | Header category
Ranking | 1 | 1 | Yes | Acceptance of the bid serial number
Organization | 1 | 2 | Yes | Highest bidder/candidate
Offer by tender | 1 | 3 | Yes | Middle marked price
Bottled water | 1 | 4 | Yes | Other
…… | …… | …… | …… | ……
A trade Co., Ltd | 3 | 2 | No |
200500 | 3 | 3 | No |
……
In this way, the row and column coordinates of the header cells are precisely located, which helps determine the governing relationships between header cells and non-header cells in subsequent steps.
Step S104: according to the position of each non-header cell relative to the header cells in the table, determine the governing header cell of the non-header cell.
In general, the governing header cell of a non-header cell in a table is the first header cell to the left of the non-header cell in the same row, or the header cell above it in the same column.
In one embodiment, to determine the governing header cell of a non-header cell, the row and column coordinates of the non-header cell and of the header cells can be obtained first; the header cell to the left of or above the non-header cell is then searched for according to these row and column coordinates, and the header cell found is determined to be the governing header cell of the non-header cell.
For example, the position coordinates of the non-header cell "A trade Co., Ltd" are [3,2]. Since there is no header cell to its left in the same row, header cells are searched for upward along the same column. Above it, two header cells with identical content, "Organization", are found at position coordinates [1,2] and [2,2] (they result from the decomposition of a merged cell), so the header cell "Organization" is the governing header cell of the non-header cell "A trade Co., Ltd". The node path in the extraction tree corresponding to this governing header cell is: Bid publicity - Table header analysis - Highest bidder's information header - Highest bidder/candidate. Therefore, the header group corresponding to this governing header cell is "Highest bidder/candidate", and the corresponding header major class is "Highest bidder's information header".
Following the above method of determining governing header cells, the non-header cells in each row of the table are analyzed in turn to determine the governing header cell of each non-header cell.
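The coordinate-based search can be sketched as follows. The sketch assumes [row, column] coordinates starting at 1 and a dictionary holding only those header cells that carry a usable header category; the function name `find_governing_header` is hypothetical, and searching the nearest cell first is an assumption about the search order.

```python
def find_governing_header(pos, headers):
    """pos: (row, col) of a non-header cell; headers: {(row, col): content} of header cells."""
    row, col = pos
    # First look leftwards along the same row, nearest cell first.
    for c in range(col - 1, 0, -1):
        if (row, c) in headers:
            return (row, c)
    # Otherwise look upwards along the same column, nearest cell first.
    for r in range(row - 1, 0, -1):
        if (r, col) in headers:
            return (r, col)
    return None   # no governing header cell exists


headers = {
    (1, 1): "Ranking", (2, 1): "Ranking",
    (1, 2): "Organization", (2, 2): "Organization",
    (1, 3): "Offer by tender", (2, 3): "Offer by tender",
}
found = find_governing_header((3, 2), headers)
```

For the cell at [3,2] the search returns one of the two "Organization" cells in the same column, which is sufficient because both copies carry the same content after decomposition.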
In one embodiment, the execution result of step S104 can form the following table:
Non-header cell | Governing header cell | Affiliated group | Affiliated major class | Row coordinate
1 | Ranking | Acceptance of the bid serial number | Highest bidder's information header | 3
A trade Co., Ltd | Organization | Highest bidder/candidate | Highest bidder's information header | 3
200500 | Offer by tender | Middle marked price | Highest bidder's information header | 3
17.55 | None | None | None | 3
15 | None | None | None | 3
1.25 | None | None | None | 3
Since there is no header cell to the left of the non-header cell "17.55" in the same row or above it in the same column, the non-header cell "17.55" has no corresponding governing header cell. Therefore, in the table above, the cells where "17.55" intersects "Governing header cell", "Affiliated group", and "Affiliated major class" are filled with "None". Similarly, the cells where "15" and "1.25" intersect "Governing header cell", "Affiliated group", and "Affiliated major class" are also filled with "None".
Step S105: according to the correspondence between non-header cells and their governing header cells, and the output rules of the corresponding header categories, extract and output the content of the header cells and non-header cells.
Specifically, the content of non-header cells and header cells is output in pairs according to their governing relationships. At output time, the output rule of the header category corresponding to the header cell determines which content is output and in what format.
In some embodiments, those skilled in the art can configure the output rules of the header categories according to the specific field in which the embodiments of the present application are applied. An example of configuring output rules for a bid publicity table in the building field is given below, so that those skilled in the art can understand the technical concept of configuring output rules in the embodiments of the present application, and can implement the technical solutions of the embodiments of the present application with reference to this example.
For example, executing steps S101 to S104 on a table from which information is to be extracted can yield the following table:
Cell content | Affiliated header group | Affiliated header major class | Row
Design of the 2019 XX New District Bureau of Education afforestation design improvement project | Project name | Project information header | 1
SF20199999 | Bidding number | Project information header | 1
XX City XX New District Bureau of Education Engineering Management Center | Bid inviter | Project information header | 2
XX Project Management Service Co., Ltd | Tender agent | Project information header | 3
Wang Gong | Contact person | Indirect output header | 4
88888888 | Telephone number | Indirect output header | 4
3rd Floor, No. 357, XX Road, XX New District | Contact address | Indirect output header | 5
100000 | Other headers | Other headers | 5
Competitive bidding | Other headers | Other headers | 6
1 | Acceptance of the bid serial number | Highest bidder's information header | 8
XX landscape industry development Co., Ltd | Highest bidder/candidate | Highest bidder's information header | 8
2 | Acceptance of the bid serial number | Highest bidder's information header | 9
XX Greening Art Development Co., Ltd | Highest bidder/candidate | Highest bidder's information header | 9
Note: the top 2 ranked bidders are the acceptance of the bid candidates. | Acceptance of the bid serial number | Highest bidder's information header | 10
October 8, 2019, China XX Bidding Network | Other headers | Other headers | 11
November 1, 2019 to November 5, 2019 | The publicity beginning and ending time | Project information header | 12
Xu XX, X, Zhou XX, Liu XX, Jiang XX | Other headers | Other headers | 13
For the header major classes and header groups above, the following output rules can be specified:
If the header major class of a cell is "Other headers", the cell is not output;
If the header major class of a cell is "sundry item header", the cell content is output directly;
If the header major class of a cell is "Indirect output header", the nearest "Project information header" located before the cell is found in the row where the cell is located, and the content of the cell is output according to combination relations such as the following:
Contact person + Tender agent = Agency contact person
Contact address + Tender agent = Agency contact address
Telephone number + Tender agent = Agency's telephone number
Contact person + Bid inviter = Competitive bid unit contact person
Contact address + Bid inviter = Competitive bid unit contact address
Telephone number + Bid inviter = Competitive bid unit telephone number
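The output rules above can be sketched in Python. This is a simplified illustration with hypothetical names (`COMBINE`, `apply_output_rules`): it assumes the cells arrive in reading order with their group and major class already determined, and it uses the most recent project information cell in that order instead of restricting the search to the same row.

```python
OTHER = "Other headers"
INDIRECT = "Indirect output header"
PROJECT = "Project information header"

# Combination relations from the text:
# (indirect group, preceding project information group) -> output label.
COMBINE = {
    ("Contact person", "Tender agent"): "Agency contact person",
    ("Contact address", "Tender agent"): "Agency contact address",
    ("Telephone number", "Tender agent"): "Agency's telephone number",
    ("Contact person", "Bid inviter"): "Competitive bid unit contact person",
    ("Contact address", "Bid inviter"): "Competitive bid unit contact address",
    ("Telephone number", "Bid inviter"): "Competitive bid unit telephone number",
}


def apply_output_rules(cells):
    """cells: dicts with 'content', 'group' and 'major' keys, in reading order."""
    out = []
    last_project_group = None
    for cell in cells:
        major = cell["major"]
        if major == OTHER:
            continue                      # rule 1: not output
        if major == INDIRECT:
            # rule 3: combine with the nearest preceding project information group
            label = COMBINE.get((cell["group"], last_project_group))
            if label is not None:
                out.append((label, cell["content"]))
            continue
        if major == PROJECT:
            last_project_group = cell["group"]
        out.append((cell["group"], cell["content"]))   # rule 2: output directly
    return out


result = apply_output_rules([
    {"content": "XX Co., Ltd", "group": "Tender agent", "major": PROJECT},
    {"content": "Wang", "group": "Contact person", "major": INDIRECT},
    {"content": "100000", "group": OTHER, "major": OTHER},
])
```

In this small example the contact person follows a "Tender agent" cell, so it is output under the combined label "Agency contact person", and the "Other headers" cell is dropped.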
According to the above output rules, step S105 can, for example, extract and output the following content from the table from which information is to be extracted:
Project name: Design of the 2019 XX New District Bureau of Education afforestation design improvement project
Bidding number: SF20199999
Bid inviter: XX City XX New District Bureau of Education Engineering Management Center
Tender agent: XX Project Management Service Co., Ltd
Competitive bid unit contact person: Wang Gong
Competitive bid unit telephone number: 88888888
Competitive bid unit contact address: 3rd Floor, No. 357, XX Road, XX New District
The publicity beginning and ending time: November 1, 2019 to November 5, 2019
Highest bidder or acceptance of the bid candidate information:
Acceptance of the bid serial number: 1
Highest bidder/candidate: XX landscape industry development Co., Ltd
Acceptance of the bid serial number: 2
Highest bidder/candidate: XX Greening Art Development Co., Ltd
In this way, valuable information in web page tables is extracted and output according to the output rules, which can replace the existing approach of manual browsing, avoid a huge manual workload, and improve extraction efficiency.
As can be seen from the above technical solutions, the embodiments of the present application provide a method for extracting information from a table. The method includes: parsing web page source code and extracting the table code in the web page according to the table tags in the web page source code; analyzing, according to the row-span and column-span attribute values of the cells in the table code, whether the table contains merged cells spanning multiple rows or columns, and if so, decomposing the merged cells into multiple cells; performing rule analysis on each cell in the table to obtain the header cells and their corresponding header categories; determining the governing header cell of each non-header cell according to the position of the non-header cell relative to the header cells in the table; and extracting and outputting the content of the header cells and non-header cells according to the correspondence between non-header cells and their governing header cells and the output rules of the corresponding header categories. Compared with the prior art, this method does not require manual participation in information extraction: those skilled in the art only need to configure the output rules once, and the method can then automatically extract valuable information from a large number of web page tables, which improves information extraction efficiency.
Fig. 2 is a flow chart of step S102 of the method for extracting information from a table provided by the embodiments of the present application.
In one embodiment, as shown in Fig. 2, step S102 may include the following steps:
Step S201: determine cells whose row-span attribute value or column-span attribute value is greater than or equal to 2 to be merged cells.
Specifically, for HTML code, if the value of either the rowspan or the colspan attribute of a cell is greater than or equal to 2, the cell is a merged cell.
Step S202: determine the target number of cells into which the merged cell should be decomposed according to the row-span attribute value and the column-span attribute value.
In one embodiment, the target number of cells into which a merged cell should be decomposed equals the product of its row-span attribute value and its column-span attribute value. For example, if a cell has rowspan=2 and colspan=3, this merged cell can be decomposed into 2 rows by 3 columns, 6 (2 × 3) cells in total.
Step S203: decompose the merged cell into the target number of cells, and copy the content of the merged cell into each cell obtained after decomposition.
In this way, the position of each cell in the table is precisely located by row and column, which makes it easier to analyze the positional relationships between cells in subsequent steps and improves information extraction efficiency.
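Steps S201 to S203 can be sketched in Python as follows. The sketch assumes that each cell has already been read from the table code as a (content, rowspan, colspan) tuple in source order; the function name `decompose_merged_cells` and the dictionary-based result are hypothetical choices for illustration.

```python
def decompose_merged_cells(rows):
    """rows: list of rows, each a list of (content, rowspan, colspan) tuples.

    Returns {(row, col): content} with 1-based coordinates in which every
    cell occupies exactly one row and one column.
    """
    grid = {}
    for r, row in enumerate(rows, start=1):
        c = 1
        for content, rowspan, colspan in row:
            # Skip positions already filled by a cell spanning down from a row above.
            while (r, c) in grid:
                c += 1
            # A merged cell (span >= 2) yields rowspan * colspan copies of its content;
            # an ordinary cell (both spans 1) yields exactly one.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = content
            c += colspan
    return grid


grid = decompose_merged_cells([
    [("Ranking", 2, 1), ("Organization", 2, 1), ("Bottled water", 1, 2)],
    [("with tax", 1, 1), ("without tax", 1, 1)],
    [("1", 1, 1), ("A trade Co., Ltd", 1, 1), ("17.55", 1, 1), ("15", 1, 1)],
])
```

In this example "Ranking" and "Organization" are each copied into rows 1 and 2, and "Bottled water" is copied into columns 3 and 4 of row 1, which reproduces the repeated header cells shown in the decomposed table.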
In one embodiment, the expression of a leaf node consists of a front-boundary expression, a rear-boundary expression, and an extraction expression located between the front-boundary expression and the rear-boundary expression. The front-boundary expression is associated with multiple concept values, which are matched against the content of a cell to determine whether the cell is a header cell and to determine the header category of the header cell. The extraction expression is used to extract content from the non-header cells corresponding to the header category according to the governing relationship.
An exemplary expression is:
c_assessment of bids date prefix{0,0}@[d::date]+((Beijing time))?@{0,0}c_assessment of bids date suffix
Here, the front-boundary expression is: c_assessment of bids date prefix; the rear-boundary expression is: c_assessment of bids date suffix; and the extraction expression is: @[d::date]+((Beijing time))?@.
Further, "c_assessment of bids date prefix" contains the concept "assessment of bids date prefix", for which concept values such as "public offer time:", "bid opening date:", "consultation date:", and "assessment of bids date:" can be set.
Further, "{0,0}" is a distance condition expression. A distance condition expression is located between the front-boundary expression and the extraction expression, or between the extraction expression and the rear-boundary expression, and consists of a minimum distance and a maximum distance. For example, when the distance condition expression "{0,0}" is located between the front-boundary expression and the extraction expression, it means that the distance between the content matched by the front-boundary expression and the content matched by the extraction expression is 0 (if it were "{0,1}", the distance between the two would be 0 to 1).
Further, "@[d::date]+((Beijing time))?@" means extracting from the cell the content that matches "[d::date]", followed at most once, after the matched "[d::date]", by content containing "(Beijing time)". The above extraction expression is designed based on the matching rules of regular expressions, which are not described further here. Those skilled in the art can also design extraction expressions using other expression rules; any design and concept applicable here does not depart from the scope of protection of the embodiments of the present application.
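Since the expression syntax follows regular-expression matching rules, its general idea can be approximated with Python's `re` module. The sketch below is an analogy under stated assumptions rather than the expression syntax of the application: `build_pattern` is a hypothetical helper, the date pattern `DATE` is an assumed stand-in for "[d::date]", the distance is counted in characters, and the rear-boundary expression is omitted.

```python
import re

# Concept values of the front-boundary concept, taken from the example above.
PREFIX_VALUES = [
    "public offer time:",
    "bid opening date:",
    "consultation date:",
    "assessment of bids date:",
]
# Assumed stand-in for "[d::date]": a YYYY-MM-DD style date.
DATE = r"\d{4}-\d{1,2}-\d{1,2}"


def build_pattern(prefix_values, min_gap, max_gap, body):
    """Front boundary, then min_gap..max_gap arbitrary characters, then the extracted body."""
    prefix = "|".join(re.escape(v) for v in prefix_values)
    gap = f".{{{min_gap},{max_gap}}}"
    return re.compile(f"(?:{prefix}){gap}({body})")


# Distance {0,0}: the date must follow the front boundary immediately.
strict = build_pattern(PREFIX_VALUES, 0, 0, DATE + r"(?:\(Beijing time\))?")
# Distance {0,1}: at most one character may separate them.
loose = build_pattern(PREFIX_VALUES, 0, 1, DATE + r"(?:\(Beijing time\))?")

m = strict.search("assessment of bids date:2019-10-08(Beijing time)")
```

Here `m.group(1)` holds the extracted date together with the optional "(Beijing time)" part, while a text with a space after the colon is matched only by the looser distance condition.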
The following is a device embodiment of the present application.
The device embodiment of the present application provides a device for extracting information from a table. The device can be used to perform the method embodiment of the present application. For technical details not disclosed in the device embodiment, please refer to the method embodiment of the present application.
Fig. 3 is a schematic diagram of the device for extracting information from a table provided by the embodiments of the present application.
As shown in Fig. 3, the device includes:
A parsing module 301, configured to parse web page source code and extract the table code in the web page according to the table tags in the web page source code;
A table processing module 302, configured to analyze, according to the row-span attribute values and column-span attribute values of the cells in the table code, whether the table contains merged cells spanning multiple rows or multiple columns, and if so, to decompose the merged cells into multiple cells;
A header analysis module 303, configured to perform rule analysis on each cell in the table to obtain the header cells and their corresponding header categories;
A governing relationship analysis module 304, configured to determine the governing header cell of each non-header cell according to the position of the non-header cell relative to the header cells in the table;
An extraction module 305, configured to extract and output the content of the header cells and non-header cells according to the correspondence between non-header cells and their governing header cells and the output rules of the corresponding header categories.
As can be seen from the above technical solutions, the embodiments of the present application provide a device for extracting information from a table. The device is configured to: parse web page source code and extract the table code in the web page according to the table tags in the web page source code; analyze, according to the row-span and column-span attribute values of the cells in the table code, whether the table contains merged cells spanning multiple rows or columns, and if so, decompose the merged cells into multiple cells; perform rule analysis on each cell in the table to obtain the header cells and their corresponding header categories; determine the governing header cell of each non-header cell according to the position of the non-header cell relative to the header cells in the table; and extract and output the content of the header cells and non-header cells according to the correspondence between non-header cells and their governing header cells and the output rules of the corresponding header categories. Compared with the prior art, manual participation in information extraction is not required: those skilled in the art only need to configure the output rules once, and the device can then automatically extract valuable information from a large number of web page tables, which improves information extraction efficiency.
Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the application disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present application being indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present application is limited only by the appended claims.