CN110188107A - Method and device for extracting information from a table - Google Patents

Method and device for extracting information from a table

Info

Publication number
CN110188107A
Authority
CN
China
Prior art keywords
header
cell
expression
category
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910486551.6A
Other languages
Chinese (zh)
Other versions
CN110188107B (en)
Inventor
任宁
卢彦博
晋耀红
李德彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingfu Intelligent Technology Co., Ltd
Original Assignee
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201910486551.6A
Publication of CN110188107A
Application granted
Publication of CN110188107B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of this application provide a method and device for extracting information from a table. The method parses web page source code and extracts the table code of the web page according to the table tags in the source code; decomposes merged cells into multiple cells according to the row-span and column-span attribute values of the cells in the table code; obtains the header cells and their corresponding header categories; determines the governing header cell of each non-header cell according to the position of the non-header cell relative to the header cells in the table; and extracts and outputs the content of the header cells and non-header cells according to the correspondence between non-header cells and their governing header cells and the output rule of the corresponding header category. The technical solution does not require manual participation in information extraction: a person skilled in the art only needs to configure the output rules once to extract valuable information automatically from a large number of web page tables, which improves information extraction efficiency.

Description

Method and device for extracting information from a table
Technical field
This application relates to the field of natural language processing, and in particular to a method and device for extracting information from a table.
Background
In information presentation formats such as web pages and electronic documents, some information is presented in tables. A table generally consists of attribute names (headers) and attribute values. The data in each row or column of a table usually has high similarity, a consistent data range, and other shared characteristics. Compared with plain-text presentation, therefore, data shown in a table is usually more strongly structured.
In practice, tables are commonly used by departments, institutions, and enterprises to publish industry data. For example, when a tendering enterprise invites bids for a project, it may use tables on an announcement website to show information such as the tender notice and the bid results. This information is usually valuable to bidding enterprises and to other enterprises in the industry, so bidding enterprises typically need to obtain valuable information from these tables. However, the internet contains a large amount of valuable information presented in tables, and obtaining it by manual browsing would involve a very heavy workload at low efficiency.
Summary of the invention
The embodiments of this application provide a method and device for extracting information from a table, to solve the problem in the prior art that obtaining information from tables involves a heavy workload and low efficiency.
In a first aspect, an embodiment of this application provides a method for extracting information from a table. The method comprises: parsing web page source code and extracting the table code of the web page according to the table tags in the source code; analyzing, according to the row-span and column-span attribute values of the cells in the table code, whether the table contains merged cells spanning multiple rows or columns and, if so, decomposing each merged cell into multiple cells; performing rule analysis on each cell in the table to obtain the header cells and their corresponding header categories; determining the governing header cell of each non-header cell according to the position of the non-header cell relative to the header cells in the table; and extracting and outputting the content of the header cells and non-header cells according to the correspondence between non-header cells and their governing header cells and the output rule of the corresponding header category.
In a second aspect, an embodiment of this application provides a device for extracting information from a table. The device comprises: a parsing module, configured to parse web page source code and extract the table code of the web page according to the table tags in the source code; a table processing module, configured to analyze, according to the row-span and column-span attribute values of the cells in the table code, whether the table contains merged cells spanning multiple rows or columns and, if so, to decompose each merged cell into multiple cells; a header analysis module, configured to perform rule analysis on each cell of the table to obtain the header cells and their corresponding header categories; a governing-relationship analysis module, configured to determine the governing header cell of each non-header cell according to the position of the non-header cell relative to the header cells in the table; and an extraction module, configured to extract and output the content of the header cells and non-header cells according to the correspondence between non-header cells and their governing header cells and the output rule of the corresponding header category.
The technical solution provided by the embodiments of this application parses web page source code and extracts the table code of the web page according to the table tags in the source code; analyzes, according to the row-span and column-span attribute values of the cells in the table code, whether the table contains merged cells spanning multiple rows or columns and, if so, decomposes each merged cell into multiple cells; performs rule analysis on each cell in the table to obtain the header cells and their corresponding header categories; determines the governing header cell of each non-header cell according to the position of the non-header cell relative to the header cells; and extracts and outputs the content of the header cells and non-header cells according to the correspondence between non-header cells and their governing header cells and the output rule of the corresponding header category. Compared with the prior art, the technical solution does not require manual participation in information extraction: a person skilled in the art only needs to configure the output rules once to extract valuable information automatically from a large number of web page tables, which improves information extraction efficiency.
Brief description of the drawings
To illustrate the technical solution of this application more clearly, the drawings needed in the embodiments are briefly introduced below. A person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a method for extracting information from a table provided by an embodiment of this application;
Fig. 2 is a flowchart of step S102 of the method for extracting information from a table provided by an embodiment of this application;
Fig. 3 is a schematic diagram of a device for extracting information from a table provided by an embodiment of this application.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of this application.
The embodiments of this application provide a method and device for extracting information from a table, to solve the problem in the prior art that obtaining information from tables involves a heavy workload and low efficiency.
The method embodiment of this application is described below.
The method embodiment provides a method for extracting information from a table. Fig. 1 is a flowchart of this method. The method can be applied on a variety of devices, such as servers, personal computers (PCs), tablet computers, and mobile phones.
As shown in Fig. 1, the method includes the following steps:
Step S101: parse the web page source code and extract the table code of the web page according to the table tags in the source code.
Web page source code is usually written in HyperText Markup Language (HTML). HTML is a standard markup language for creating web pages; many websites use it together with Cascading Style Sheets (CSS) and JavaScript to design the user interfaces of web pages, web applications, and mobile applications. The tables in a web page are usually implemented in HTML.
In HTML, a table is defined by the <table> tag. Each table has several rows (defined by the <tr> tag), and each row is divided into several cells (defined by the <td> tag). The letters td stand for table data, that is, the content of a data cell. Therefore, by parsing the HTML code and locating the table tag <table>, the table code can be extracted from the HTML code.
By way of example, a piece of HTML code containing table code is as follows:
<html>
<body>
<h1>address list</h1>
<table>
<tr>
<th>name</th>
<th colspan="2">phone</th>
</tr>
<tr>
<td>Bill Gates</td>
<td>555 77 854</td>
<td>555 77 855</td>
</tr>
</table>
</body>
</html>
Here, the code between the table tag <table> and its end tag </table> is the table code. Rendered in a web page, this table code displays as the following table:
name | phone (spans two columns)
Bill Gates | 555 77 854 | 555 77 855
In one embodiment, to meet the specific functional or business needs of a department, institution, or enterprise, technicians can use a targeted crawler to crawl web page source code from specified web pages. For example, a contractor or construction enterprise in the building industry can deploy a crawler on specified websites to crawl the web page source code whenever an update to the site content is detected, or to crawl it on a schedule.
In one embodiment, the user can manually select which web page's source code to parse. For example, an application implementing this method is installed on the device the user browses with. When the user wishes to extract information from a table in some web page, the user can enter the address of that page into the application (input methods include, but are not limited to, keyboard input, copying and pasting text, dragging text, and optical character recognition), so that the application parses the web page source code and extracts the table code from it.
In some scenarios, a table in a web page may be built from several nested tables; for example, some web pages are written so that various information is laid out using multiple nested tables. When a web page contains nested tables, usually only the innermost table is the one that publishes valuable data. Therefore, in one embodiment, after step S101 finds the table tags by parsing the web page source code, it also determines whether the table tags are nested in multiple layers. If they are, the table code corresponding to the innermost table tag is extracted; if they are not, all table code is extracted directly. This narrows the range from which information is extracted and improves extraction efficiency and accuracy.
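The table extraction and innermost-table selection described above can be sketched as follows. This is a minimal illustration using Python's standard html.parser module, not the implementation of the embodiment; the class name InnermostTableFinder and the function name extract_innermost_tables are hypothetical. A table is treated as innermost when no other <table> opens inside it, so a page without nesting yields all of its tables.

```python
from html.parser import HTMLParser


class InnermostTableFinder(HTMLParser):
    """Collects the markup of every <table> that contains no nested <table>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.stack = []   # one entry per open <table>: [markup pieces, has_nested_table]
        self.tables = []  # markup of the innermost tables, in document order

    def _emit(self, text):
        # Append raw markup to every table that is currently open.
        for entry in self.stack:
            entry[0].append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                self.stack[-1][1] = True  # the enclosing table is not innermost
            self.stack.append([[], False])
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if tag == "table" and self.stack:
            pieces, has_nested = self.stack.pop()
            if not has_nested:
                self.tables.append("".join(pieces))

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")


def extract_innermost_tables(html):
    """Return the table code of all innermost tables found in the HTML text."""
    finder = InnermostTableFinder()
    finder.feed(html)
    finder.close()
    return finder.tables
```

For a page whose outer table is used only for layout, the function returns just the inner table's code; for a page with two separate, non-nested tables, it returns both.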
Step S102: analyze, according to the row-span and column-span attribute values of the cells in the table code, whether the table contains merged cells spanning multiple rows or columns; if it does, decompose each merged cell into multiple cells.
In HTML, the row-span and column-span attributes of a table cell are defined by the rowspan and colspan attributes respectively. For example, rowspan="2" means the row-span attribute value is "2", that is, the cell spans two rows; colspan="2" means the column-span attribute value is "2", that is, the cell spans two columns.
Cells that span rows or columns make the cell structure of a table irregular, which hinders information extraction and may reduce the accuracy of the extracted information. To ensure the accuracy of the extracted table information, in step S102, if the table is determined to contain merged cells according to the row-span and column-span attribute values of the cells in the table code, the merged cells are decomposed so that every cell in the table occupies exactly 1 row and 1 column.
By way of example, a table containing merged cells is as follows:
The table obtained after the merged cells are decomposed is:
Ranking | Organization | Bid price | Bottled water | Bottled water | Branch fills water | Branch fills water
Ranking | Organization | Bid price | Price with tax invoice | Price without tax invoice | Price with tax invoice | Price without tax invoice
1 | A Trading Co., Ltd. | 200500 | 17.55 | 15 | 1.25 | 1.07
2 | B Beverage Co., Ltd. | 198900 | 17.55 | 15 | 1.17 | 1
3 | C Beverage Co., Ltd. | 180000 | 15 | 12.82 | 1.5 | 1.28
In one embodiment, after the merged cells of the table are decomposed, each cell in the table can also be labeled with its position, that is, with the row coordinate and column coordinate at which it lies, forming a table such as the following:
Content | Row coordinate | Column coordinate
Ranking | 1 | 1
Organization | 1 | 2
Bid price (yuan) | 1 | 3
Bottled water | 1 | 4
Bottled water | 1 | 5
……
The position of each cell in the table is thus located precisely by its row and column coordinates, which makes it easier to analyze the positional relationships between cells in subsequent steps and improves information extraction efficiency.
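The position labeling described above can be sketched as follows. This is a minimal illustration that assumes the decomposed table is already available as a rectangular list of rows; the function name label_positions is hypothetical, and the coordinates are 1-based to match the tables in this description.

```python
def label_positions(grid):
    """Return (content, row coordinate, column coordinate) records, 1-based."""
    return [
        (text, r, c)
        for r, row in enumerate(grid, start=1)
        for c, text in enumerate(row, start=1)
    ]


# Usage with a small decomposed table (sample values for illustration only).
grid = [
    ["Ranking", "Organization", "Bid price"],
    ["1", "A Trading Co., Ltd.", "200500"],
]
records = label_positions(grid)
```

Here records[1] is ("Organization", 1, 2), and the last record is ("200500", 2, 3).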
Step S103: perform rule analysis on each cell in the table to obtain the header cells and their corresponding header categories.
In one embodiment, the content of each cell can be sent to a pre-established analysis model, which analyzes the cell content according to its built-in analysis rules to identify which cells are header cells and which are non-header cells, and to determine the header category corresponding to each header cell. A header cell is a cell that holds header information in the table; a non-header cell is a cell that holds non-header information.
In one embodiment, a pre-established extraction tree can be used to analyze each cell in the table.
By way of example, an extraction tree established for tables of the "tender announcement" class may include the following structure:
Tender announcement
  + Text analysis
  Table header analysis -- header node --
    Winning bidder information header -- header category child node --
      Winning bid serial number -- leaf node --
      Winning bidder/candidate ...
      Winning bid price
      Winning bid content
    Project information header -- header category child node --
      Tenderee -- leaf node --
      Project name ...
      Tender time
      Tender number
      Tendering agency
      Bid evaluation date
      Bid evaluation location
      Bid evaluation committee members
      Tendering unit contact person
      Tendering unit telephone number
      Agency contact person
      Agency contact address
      Agency telephone number
      Announcement start and end time
      Announcement start time
      Announcement end time
    + Indirect output header -- header category child node --
    Other headers
As the example above shows, the extraction tree includes a header node, the header node includes multiple header category child nodes, and each header category child node includes multiple leaf nodes.
In one embodiment, at least one expression can be set in a leaf node. The expression is matched against cell content to determine whether a cell is a header cell and to determine the header category corresponding to the header cell. For example, if an expression in the leaf node "winning bidder/candidate" matches the cell content "organization", the cell containing "organization" is determined to be a header cell whose header category is "winning bidder/candidate".
In one embodiment, the header category can further include a header group and a header major class. The header group is the name of the target leaf node to which the matched expression belongs, and the header major class is the name of the target header category child node to which the target leaf node belongs. For example, the header group of the header cell "organization" is "winning bidder/candidate", and its header major class is "winning bidder information header".
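The matching of cell content against leaf-node expressions can be sketched as follows. This is a minimal illustration that substitutes ordinary Python regular expressions for the expressions of the embodiment, whose syntax is not fully specified here. The dictionary EXTRACTION_TREE, its patterns, and the function name classify_cell are all hypothetical.

```python
import re

# Hypothetical extraction tree: header major class -> header group (leaf node)
# -> list of regular expressions matched against lower-cased cell content.
EXTRACTION_TREE = {
    "winning bidder information header": {
        "winning bid serial number": [r"^(ranking|rank|serial number)$"],
        "winning bidder/candidate": [r"^(organization|bidder)( name)?$"],
        "winning bid price": [r"^bid price"],
    },
    "project information header": {
        "project name": [r"^project name$"],
        "tender number": [r"^tender (number|no\.?)$"],
    },
}


def classify_cell(content):
    """Return (header group, header major class) for a header cell, else None."""
    text = content.strip().lower()
    for major_class, leaves in EXTRACTION_TREE.items():
        for group, patterns in leaves.items():
            if any(re.search(pattern, text) for pattern in patterns):
                return group, major_class
    return None  # no expression matched: treat the cell as a non-header cell
```

With these sample patterns, classify_cell("Organization") returns ("winning bidder/candidate", "winning bidder information header"), while a data value such as "200500" matches nothing and is treated as a non-header cell.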
In one embodiment, the results of the rule analysis of each cell can be organized into a table of the following form:
Cell content | Row coordinate | Column coordinate | Header? | Header category
Ranking | 1 | 1 | Yes | Winning bid serial number
Organization | 1 | 2 | Yes | Winning bidder/candidate
Bid price | 1 | 3 | Yes | Winning bid price
Bottled water | 1 | 4 | Yes | Other
…… | …… | …… | …… | ……
A Trading Co., Ltd. | 3 | 2 | No |
200500 | 3 | 3 | No |
……
The row and column coordinates of the header cells are thus located precisely, which helps determine the governing relationships between header cells and non-header cells in the subsequent steps.
Step S104: determine the governing header cell of each non-header cell according to the position of the non-header cell relative to the header cells in the table.
In general, within a table, the governing header cell of a non-header cell is the first header cell to its left in the same row or, failing that, the header cell above it in the same column.
In one embodiment, to determine the governing header cell of a non-header cell, the row and column coordinates of the non-header cell and of the header cells are obtained first. The header cell to the left of or above the non-header cell is then looked up according to these coordinates, and the header cell found is determined to be the governing header cell of the non-header cell.
By way of example, the position coordinates of the non-header cell "A Trading Co., Ltd." are [3,2]. Because there is no header cell to its left in the same row, header cells are looked up upward along the same column. Above it are two header cells with identical content "organization" at position coordinates [1,2] and [2,2] (produced by decomposing a merged cell), so the header cell "organization" is the governing header cell of the non-header cell "A Trading Co., Ltd.". The node path of this governing header cell in the extraction tree is: tender announcement - table header analysis - winning bidder information header - winning bidder/candidate. The header group of this governing header cell is therefore "winning bidder/candidate", and its header major class is "winning bidder information header".
Using the method above, the non-header cells of each row in the table are analyzed in turn to determine the governing header cell of every non-header cell.
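The lookup of a governing header cell can be sketched as follows. This is a minimal illustration that assumes the cells are stored in a dictionary keyed by 1-based (row, column) coordinates, with a flag marking header cells. The function name find_governing_header is hypothetical, and choosing the nearest header cell in each direction is an assumption of this sketch.

```python
def find_governing_header(cells, row, col):
    """Find the governing header cell of the non-header cell at (row, col).

    cells maps (row, col) -> (content, is_header), with 1-based coordinates.
    The same row is searched leftward first, then the same column upward.
    Returns the coordinates of the nearest header cell, or None if none exists.
    """
    for c in range(col - 1, 0, -1):          # leftward along the same row
        if cells.get((row, c), ("", False))[1]:
            return (row, c)
    for r in range(row - 1, 0, -1):          # upward along the same column
        if cells.get((r, col), ("", False))[1]:
            return (r, col)
    return None


# Usage: two header rows produced by decomposing merged cells, then a data row.
cells = {
    (1, 1): ("Ranking", True), (1, 2): ("Organization", True),
    (2, 1): ("Ranking", True), (2, 2): ("Organization", True),
    (3, 1): ("1", False), (3, 2): ("A Trading Co., Ltd.", False),
}
governing = find_governing_header(cells, 3, 2)
```

In this sample, the cell at [3,2] has no header cell to its left, so the search moves up the column and stops at the header cell "Organization" at [2,2].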
In one embodiment, the results of step S104 can form the following table:
Non-header cell | Governing header cell | Group | Major class | Row coordinate
1 | Ranking | Winning bid serial number | Winning bidder information header | 3
A Trading Co., Ltd. | Organization | Winning bidder/candidate | Winning bidder information header | 3
200500 | Bid price | Winning bid price | Winning bidder information header | 3
17.55 | None | None | None | 3
15 | None | None | None | 3
1.25 | None | None | None | 3
Because no header cell exists either to the left of the non-header cell "17.55" in the same row or above it in the same column, the non-header cell "17.55" has no governing header cell. In the table above, the cells where the row for "17.55" meets the columns "Governing header cell", "Group", and "Major class" are therefore filled with "None". Likewise, the corresponding cells for "15" and "1.25" are filled with "None".
Step S105: extract and output the content of the header cells and non-header cells according to the correspondence between non-header cells and their governing header cells and the output rule of the corresponding header category.
Specifically, the content of a non-header cell and of its header cell is output as a pair according to their governing relationship. At output time, the output rule of the header category corresponding to the header cell determines which content is output and in what format.
In some embodiments, those skilled in the art can configure the output rules of the header categories according to the specific field in which the embodiments are applied. An example of output rules configured for a tender announcement table in the construction field is given below, so that those skilled in the art can understand the idea behind configuring output rules and implement the technical solution with reference to the example.
By way of example, the table from which information is to be extracted is as follows:
Executing steps S101 to S104 on this table yields the following table:
Cell content | Header group | Header major class | Row
Design of the 2019 XX New District Education Bureau landscaping improvement project | Project name | Project information header | 1
SF20199999 | Tender number | Project information header | 1
Engineering Management Center, XX New District Education Bureau, XX City | Tenderee | Project information header | 2
XX Project Management Service Co., Ltd. | Tendering agency | Project information header | 3
Wang Gong | Contact person | Indirect output header | 4
88888888 | Telephone number | Indirect output header | 4
3rd Floor, No. 357 XX Road, XX New District | Contact address | Indirect output header | 5
100000 | Other headers | Other headers | 5
Competitive bidding | Other headers | Other headers | 6
1 | Winning bid serial number | Winning bidder information header | 8
XX Landscape Industry Development Co., Ltd. | Winning bidder/candidate | Winning bidder information header | 8
2 | Winning bid serial number | Winning bidder information header | 9
XX Greening Art Development Co., Ltd. | Winning bidder/candidate | Winning bidder information header | 9
Note: the top 2 ranked bidders are the winning candidates. | Winning bid serial number | Winning bidder information header | 10
October 8, 2019, China XX Tendering Website | Other headers | Other headers | 11
November 1, 2019 to November 5, 2019 | Announcement start and end time | Project information header | 12
Xu XX, X, Zhou XX, Liu XX, Jiang XX | Other headers | Other headers | 13
For the header major classes and header groups above, the following output rules can be specified:
If the header major class of a cell is "other headers", the cell is not output;
If the header major class of a cell is any other header major class, the cell content is output directly;
If the header major class of a cell is "indirect output header", the nearest "project information header" located before the cell in the cell's row is found, and the cell content is output according to combination relationships such as the following:
Contact person + tendering agency = agency contact person
Contact address + tendering agency = agency contact address
Telephone number + tendering agency = agency telephone number
Contact person + tenderee = tendering unit contact person
Contact address + tenderee = tendering unit contact address
Telephone number + tenderee = tendering unit telephone number
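The output rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the rule engine of the embodiment: records are processed in reading order, "the nearest preceding project information header" is approximated by the most recent project-information record seen so far, and every major class other than "other headers" and "indirect output header" is output directly. The names COMBINED_NAMES and apply_output_rules and the sample records are hypothetical.

```python
# Hypothetical rule table mirroring the combination relationships listed above.
COMBINED_NAMES = {
    ("contact person", "tendering agency"): "agency contact person",
    ("contact address", "tendering agency"): "agency contact address",
    ("telephone number", "tendering agency"): "agency telephone number",
    ("contact person", "tenderee"): "tendering unit contact person",
    ("contact address", "tenderee"): "tendering unit contact address",
    ("telephone number", "tenderee"): "tendering unit telephone number",
}


def apply_output_rules(records):
    """records: (content, header group, header major class) in reading order.

    Returns a list of (output name, content) pairs.
    """
    output = []
    last_project_group = None  # group of the most recent project-information record
    for content, group, major_class in records:
        if major_class == "other headers":
            continue  # rule: not output
        if major_class == "indirect output header":
            name = COMBINED_NAMES.get((group, last_project_group))
            if name is not None:
                output.append((name, content))
            continue
        if major_class == "project information header":
            last_project_group = group
        output.append((group, content))  # rule: output directly
    return output


# Usage with hypothetical sample records.
sample = [
    ("XX Agency Co., Ltd.", "tendering agency", "project information header"),
    ("Mr. Wang", "contact person", "indirect output header"),
    ("100000", "other headers", "other headers"),
    ("1", "winning bid serial number", "winning bidder information header"),
]
result = apply_output_rules(sample)
```

In this sample, the contact person follows a "tendering agency" record and is therefore output under the combined name "agency contact person", while the "other headers" record is dropped.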
According to the output rules above, step S105 can, for example, extract and output the following content from the table of information to be extracted:
Project name: Design of the 2019 XX New District Education Bureau landscaping improvement project
Tender number: SF20199999
Tenderee: Engineering Management Center, XX New District Education Bureau, XX City
Tendering agency: XX Project Management Service Co., Ltd.
Tendering unit contact person: Wang Gong
Tendering unit telephone number: 88888888
Tendering unit contact address: 3rd Floor, No. 357 XX Road, XX New District
Announcement start and end time: November 1, 2019 to November 5, 2019
Winning bidder or winning candidate information:
Winning bid serial number: 1
Winning bidder/candidate: XX Landscape Industry Development Co., Ltd.
Winning bid serial number: 2
Winning bidder/candidate: XX Greening Art Development Co., Ltd.
The valuable information in a web page table is thus extracted and output according to the output rules. This can replace the existing approach of manual browsing, avoids a heavy manual workload, and improves extraction efficiency.
As the technical solution above shows, the embodiments of this application provide a method for extracting information from a table. The method comprises: parsing web page source code and extracting the table code of the web page according to the table tags in the source code; analyzing, according to the row-span and column-span attribute values of the cells in the table code, whether the table contains merged cells spanning multiple rows or columns and, if so, decomposing each merged cell into multiple cells; performing rule analysis on each cell in the table to obtain the header cells and their corresponding header categories; determining the governing header cell of each non-header cell according to the position of the non-header cell relative to the header cells in the table; and extracting and outputting the content of the header cells and non-header cells according to the correspondence between non-header cells and their governing header cells and the output rule of the corresponding header category. Compared with the prior art, the method does not require manual participation in information extraction: a person skilled in the art only needs to configure the output rules once to extract valuable information automatically from a large number of web page tables, which improves information extraction efficiency.
Fig. 2 is a flowchart of step S102 of the method for extracting information from a table provided by an embodiment of the present application.
In one embodiment, as shown in Fig. 2, step S102 may include the following steps:
Step S201: determine a cell whose row-span attribute value or column-span attribute value is greater than or equal to 2 to be a merged cell.
Specifically, for HTML code, if either of a cell's rowspan and colspan attributes has a value greater than or equal to 2, the cell is a merged cell.
Step S202: determine, according to the row-span attribute value and the column-span attribute value, the target number of cells into which the merged cell should be decomposed.
In one embodiment, the target number of cells into which a merged cell should be decomposed equals the product of the row-span attribute value and the column-span attribute value. For example, if a cell has rowspan=2 and colspan=3, the merged cell is decomposed into 2 rows by 3 columns, 6 (2 × 3) cells in total.
Step S203: decompose the merged cell into the target number of cells, and copy the content of the merged cell into each cell obtained by the decomposition.
As a result, every cell in the table can be located precisely by its row and column, which makes it easier to analyze the positional relationships between cells in subsequent steps and improves the efficiency of information extraction.
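Steps S201 to S203 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the `Cell` dataclass and the function `decompose_merged_cells` are hypothetical names, and the input is assumed to be a list of rows already parsed from the table code, each row holding its cells in source order.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    rowspan: int = 1  # row-span attribute value
    colspan: int = 1  # column-span attribute value


def decompose_merged_cells(rows):
    """Expand rows of Cell objects into a rectangular grid of strings."""
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip positions already occupied by a merged cell from an earlier row.
            while (r, c) in grid:
                c += 1
            # A span >= 2 marks a merged cell (S201); it yields
            # rowspan * colspan cells (S202), each holding a copy of
            # the original content (S203).
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    grid[(r + dr, c + dc)] = cell.text
            c += cell.colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell are left as empty strings.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

For the example in the text, `decompose_merged_cells([[Cell("M", rowspan=2, colspan=3)]])` produces a grid of 2 rows and 3 columns in which all 6 cells contain "M".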
In one embodiment, the expression of a leaf node consists of a front-boundary expression, a rear-boundary expression, and an extraction expression located between the front-boundary expression and the rear-boundary expression. The front-boundary expression is associated with multiple concept values; the concept values are matched against the content of a cell to determine whether the cell is a header cell and, if so, its header category. The extraction expression is used to extract content, according to the governing relationship, from the non-header cells corresponding to the header category.
An exemplary expression is:
c_bid evaluation date prefix{0,0}@[d::date]+((Beijing time))?@{0,0}c_bid evaluation date suffix
Here, the front-boundary expression is: c_bid evaluation date prefix; the rear-boundary expression is: c_bid evaluation date suffix; and the extraction expression is: @[d::date]+((Beijing time))?@.
Further, "c_bid evaluation date prefix" contains the concept "bid evaluation date prefix", for which concept values such as "public offer time:", "bid opening date:", "consultation date:" and "bid evaluation date:" can be set.
Further, "{0,0}" is a distance condition expression. A distance condition expression is located between the front-boundary expression and the extraction expression, or between the extraction expression and the rear-boundary expression, and consists of a minimum distance and a maximum distance. For example, when the distance condition expression "{0,0}" is located between the front-boundary expression and the extraction expression, it indicates that the distance between the content matched by the front-boundary expression and the content matched by the extraction expression is 0 (with "{0,1}", the distance is 0 to 1).
Further, "@[d::date]+((Beijing time))?@" indicates that the content matching "[d::date]" is extracted from the cell, together with any "(Beijing time)" text that follows it. Read under regular-expression rules, "+" requires "[d::date]" to match at least once and "?" makes the trailing "(Beijing time)" optional. The above extraction expression is designed on the basis of regular-expression matching rules, which are not described again here. A person skilled in the art may also use other expression rules to design the extraction expression; designs and concepts that can be applied here do not depart from the protection scope of the embodiments of the present application.
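The roles of the front-boundary expression, the distance condition, and the extraction expression can be approximated with an ordinary regular expression. The sketch below is illustrative only and does not reproduce the expression engine of the application: the names `PREFIX_VALUES`, `DATE`, `build_pattern` and `extract` are hypothetical, the concept values are the English renderings used above, the date pattern is an assumed stand-in for "[d::date]", the distance is assumed to be counted in characters, and the rear-boundary expression is omitted for brevity.

```python
import re

# Hypothetical concept values for the front-boundary concept.
PREFIX_VALUES = [
    "public offer time:",
    "bid opening date:",
    "consultation date:",
    "bid evaluation date:",
]

# Assumed stand-in for [d::date]: dates written as YYYY-M-D.
DATE = r"\d{4}-\d{1,2}-\d{1,2}"


def build_pattern(prefix_values, min_dist=0, max_dist=0):
    """Compile: front boundary, distance condition, then extraction group."""
    prefix = "|".join(re.escape(v) for v in prefix_values)
    # {min,max} distance, assumed here to mean a run of arbitrary characters.
    gap = ".{%d,%d}" % (min_dist, max_dist)
    # Extraction part: one or more dates, optionally followed by "(Beijing time)".
    return re.compile(rf"(?:{prefix}){gap}((?:{DATE})+(?:\(Beijing time\))?)")


def extract(text, pattern):
    """Return the extracted content, or None if the boundary does not match."""
    m = pattern.search(text)
    return m.group(1) if m else None
```

With the default distance of "{0,0}", the date must follow the prefix immediately, so a cell text such as "bid opening date: 2019-06-05" (with a space after the colon) only matches once the maximum distance is raised to 1 via `build_pattern(PREFIX_VALUES, 0, 1)`.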
The following are device embodiments of the present application.
The device embodiments of the present application provide a device for extracting information from a table. The device can be used to carry out the method embodiments of the present application; for technical details not disclosed in the device embodiments, refer to the method embodiments of the present application.
Fig. 3 is a schematic diagram of the device for extracting information from a table provided by an embodiment of the present application.
As shown in Fig. 3, the device includes:
a parsing module 301, configured to parse web page source code and extract the table code in the web page according to the table tags in the web page source code;
a table processing module 302, configured to analyze, according to the row-span attribute values and column-span attribute values of the cells in the table code, whether the table contains a merged cell spanning multiple rows or multiple columns, and if so, to decompose the merged cell into multiple cells;
a header analysis module 303, configured to perform rule analysis on each cell of the table to obtain header cells and corresponding header categories;
a governing relationship analysis module 304, configured to determine the governing header cell of a non-header cell according to the position of the non-header cell relative to the header cells in the table;
an extraction module 305, configured to extract and output the content of the header cells and non-header cells according to the correspondence between each non-header cell and its governing header cell and the output rule of the corresponding header category.
As can be seen from the above technical solution, the embodiments of the present application provide a device for extracting information from a table. The device is configured to: parse the web page source code and extract the table code in the web page according to the table tags in the source code; analyze, according to the row-span and column-span attribute values of the cells in the table code, whether the table contains merged cells spanning multiple rows or multiple columns, and if so, decompose each merged cell into multiple cells; perform rule analysis on each cell of the table to obtain the header cells and their corresponding header categories; determine the governing header cell of each non-header cell according to the position of the non-header cell relative to the header cells in the table; and extract and output the content of the header cells and non-header cells according to the correspondence between each non-header cell and its governing header cell and the output rule of the corresponding header category. Compared with the prior art, no manual involvement in information extraction is required: a person skilled in the art only needs to configure the output rules once, after which the device can automatically extract valuable information from a large number of web page tables, which improves the efficiency of information extraction.
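The table-code extraction performed by the parsing module can be sketched as follows, using the rule stated in claim 2: when table tags are nested, only the innermost table is extracted; otherwise every table is extracted. The sketch is an assumption-laden illustration, not the implementation of the application: `innermost_tables` and `TABLE_TAG` are hypothetical names, and tags are located with a simple regular expression that assumes well-formed markup without table tags inside comments or scripts.

```python
import re

# Matches opening and closing table tags; group 1 is "/" for a closing tag.
TABLE_TAG = re.compile(r"<(/?)table\b[^>]*>", re.IGNORECASE)


def innermost_tables(html):
    """Return the source of every table that contains no nested table."""
    stack = []  # one entry per open table: [start_index, contains_nested_table]
    found = []
    for m in TABLE_TAG.finditer(html):
        if not m.group(1):
            # Opening tag: mark the enclosing table, if any, as having a child.
            if stack:
                stack[-1][1] = True
            stack.append([m.start(), False])
        elif stack:
            # Closing tag: keep the table only if nothing was nested inside it.
            start, has_child = stack.pop()
            if not has_child:
                found.append(html[start:m.end()])
    return found
```

For a page with two sibling tables and no nesting, both are returned; for a layout table that wraps a data table, only the inner data table is returned.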
Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the application disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the application that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the application being indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (10)

1. A method for extracting information from a table, characterized by comprising:
parsing web page source code, and extracting the table code in the web page according to the table tags in the web page source code;
analyzing, according to the row-span attribute values and column-span attribute values of the cells in the table code, whether the table contains a merged cell spanning multiple rows or multiple columns, and if so, decomposing the merged cell into multiple cells;
performing rule analysis on each cell of the table to obtain header cells and corresponding header categories;
determining the governing header cell of a non-header cell according to the positional relationship of the non-header cell relative to the header cells in the table;
extracting and outputting the content of the header cells and the non-header cells according to the correspondence between the non-header cell and its governing header cell and the output rule of the corresponding header category.
2. The method according to claim 1, characterized in that parsing the web page source code and extracting the table code in the web page according to the table tags in the web page source code comprises:
searching for the table tags in the web page source code and determining whether the table tags have a multi-level nesting relationship; if a multi-level nesting relationship exists, extracting the table code corresponding to the innermost table tag; if no multi-level nesting relationship exists, extracting all the table code corresponding to the table tags.
3. The method according to claim 1, characterized in that analyzing, according to the row-span attribute values and column-span attribute values of the cells in the table code, whether the table contains a merged cell spanning multiple rows or multiple columns, and decomposing the merged cell into multiple cells, comprises:
determining a cell whose row-span attribute value or column-span attribute value is greater than or equal to 2 to be the merged cell;
determining, according to the row-span attribute value and the column-span attribute value, the target number of cells into which the merged cell should be decomposed;
decomposing the merged cell into the target number of cells, and copying the content of the merged cell into each cell obtained by the decomposition.
4. The method according to claim 1, characterized in that performing rule analysis on each cell of the table to obtain header cells and corresponding header categories comprises:
performing rule analysis on each cell using an extraction tree, wherein the extraction tree includes a header node, the header node includes at least one header-category child node, each header-category child node includes at least one leaf node, and each leaf node is provided with an expression; the expression is matched against the content of a cell to determine whether the cell is a header cell and to determine the header category corresponding to the header cell.
5. The method according to claim 4, characterized in that the header category includes a header minor class and a header major class; the header minor class is the name of the target leaf node to which the expression matched by the cell belongs, and the header major class is the name of the target header-category child node to which the target leaf node belongs.
6. The method according to claim 1, characterized in that determining the governing header cell of a non-header cell according to the positional relationship of the non-header cell relative to the header cells in the table comprises:
obtaining the row coordinates and column coordinates of the non-header cell and the header cells, searching, according to the row coordinates and column coordinates, for the header cell to the left of or above the non-header cell, and determining the header cell found to be the governing header cell of the non-header cell.
7. The method according to claim 5, characterized in that, according to the content output rule, the header major classes include a directly output header major class, an indirectly output header major class, a whole-row output header major class, and a header major class that is not output.
8. The method according to claim 4, characterized in that the expression includes a front-boundary expression, a rear-boundary expression, and an extraction expression located between the front-boundary expression and the rear-boundary expression; the front-boundary expression is associated with multiple concept values, and the concept values are matched against the content of a cell to determine whether the cell is a header cell and to determine the header category of the header cell; the extraction expression is used to extract content from the non-header cells corresponding to the header category.
9. The method according to claim 8, characterized in that the expression further includes a distance condition expression; the distance condition expression is located between the front-boundary expression and the extraction expression, or between the extraction expression and the rear-boundary expression.
10. A device for extracting information from a table, characterized by comprising:
a parsing module, configured to parse web page source code and extract the table code in the web page according to the table tags in the web page source code;
a table processing module, configured to analyze, according to the row-span attribute values and column-span attribute values of the cells in the table code, whether the table contains a merged cell spanning multiple rows or multiple columns, and if so, to decompose the merged cell into multiple cells;
a header analysis module, configured to perform rule analysis on each cell of the table to obtain header cells and corresponding header categories;
a governing relationship analysis module, configured to determine the governing header cell of a non-header cell according to the positional relationship of the non-header cell relative to the header cells in the table;
an extraction module, configured to extract and output the content of the header cells and the non-header cells according to the correspondence between the non-header cell and its governing header cell and the output rule of the corresponding header category.
CN201910486551.6A 2019-06-05 2019-06-05 Method and device for extracting information from table Active CN110188107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910486551.6A CN110188107B (en) 2019-06-05 2019-06-05 Method and device for extracting information from table

Publications (2)

Publication Number Publication Date
CN110188107A (en) 2019-08-30
CN110188107B CN110188107B (en) 2020-05-01

Family

ID=67720498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910486551.6A Active CN110188107B (en) 2019-06-05 2019-06-05 Method and device for extracting information from table

Country Status (1)

Country Link
CN (1) CN110188107B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101515272A (en) * 2008-02-18 2009-08-26 株式会社理光 Method and device for extracting webpage content
CN102254009A (en) * 2011-07-15 2011-11-23 福建星网锐捷通讯股份有限公司 Method for extracting data of webpage table
CN106156239A (en) * 2015-04-27 2016-11-23 中国移动通信集团公司 A kind of form abstracting method and device
CN107992625A (en) * 2017-12-25 2018-05-04 湖南星汉数智科技有限公司 A kind of automatic abstracting method of web page form data and device
CN108959204A (en) * 2018-06-22 2018-12-07 中国科学院计算技术研究所 Internet monetary items information extraction method and system
CN109062874A (en) * 2018-06-12 2018-12-21 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of financial data
CN109710771A (en) * 2018-10-30 2019-05-03 北京百度网讯科技有限公司 Form data extracting method, device and storage medium

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008523A (en) * 2019-11-21 2020-04-14 中科鼎富(北京)科技发展有限公司 Information extraction method and device and server
CN113496119A (en) * 2020-03-20 2021-10-12 北京庖丁科技有限公司 Method, electronic device and computer readable medium for extracting tuple data in table
CN113496119B (en) * 2020-03-20 2024-06-21 北京庖丁科技有限公司 Method, electronic device and computer readable medium for extracting metadata in table
CN111401010A (en) * 2020-03-25 2020-07-10 苏州机数芯微科技有限公司 Form extraction method based on machine learning
CN111401010B (en) * 2020-03-25 2023-07-28 苏州机数芯微科技有限公司 Form extraction method based on machine learning
CN111427973A (en) * 2020-04-21 2020-07-17 贵州新致普惠信息技术有限公司 Planning table data analysis method
CN111913993A (en) * 2020-08-12 2020-11-10 望海康信(北京)科技股份公司 Table data generation method and device, electronic equipment and computer readable storage medium
CN111913993B (en) * 2020-08-12 2024-02-23 望海康信(北京)科技股份公司 Table data generation method, apparatus, electronic device and computer readable storage medium
CN112395418A (en) * 2020-11-26 2021-02-23 上海携宁计算机科技股份有限公司 Method and device for extracting target object in webpage and electronic equipment
CN112395418B (en) * 2020-11-26 2021-09-03 上海携宁计算机科技股份有限公司 Method and device for extracting target object in webpage and electronic equipment
CN112597927A (en) * 2020-12-28 2021-04-02 电子科技大学 Two-dimensional table identification method, device, equipment and system
CN113343658A (en) * 2021-07-01 2021-09-03 湖南四方天箭信息科技有限公司 PDF file information extraction method and device and computer equipment
CN113343658B (en) * 2021-07-01 2024-04-09 湖南四方天箭信息科技有限公司 PDF file information extraction method and device and computer equipment
CN113656592A (en) * 2021-07-22 2021-11-16 北京百度网讯科技有限公司 Data processing method and device based on knowledge graph, electronic equipment and medium
CN113656592B (en) * 2021-07-22 2022-09-27 北京百度网讯科技有限公司 Data processing method and device based on knowledge graph, electronic equipment and medium
CN113987112A (en) * 2021-12-24 2022-01-28 杭州恒生聚源信息技术有限公司 Table information extraction method and device, storage medium and electronic equipment
CN113987112B (en) * 2021-12-24 2022-04-08 杭州恒生聚源信息技术有限公司 Table information extraction method and device, storage medium and electronic equipment
CN115048916A (en) * 2022-05-27 2022-09-13 北京百度网讯科技有限公司 Table processing method and device
CN115730121A (en) * 2022-11-14 2023-03-03 百思特管理咨询有限公司 Bidding information capture method based on software robot

Also Published As

Publication number Publication date
CN110188107B (en) 2020-05-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190906

Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant after: China Science and Technology (Beijing) Co., Ltd.

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A Room 601

Applicant before: Beijing Shenzhou Taiyue Software Co., Ltd.

GR01 Patent grant
CP03 Change of name, title or address

Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Patentee after: Dingfu Intelligent Technology Co., Ltd

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Patentee before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.