CN101515287B - Automatic generating method of wrapper of complex page - Google Patents
- Publication number
- CN101515287B (application number CN2009100295613A)
- Authority
- CN
- China
- Prior art keywords
- html
- wrapper
- data
- relation
- data record
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
Description
Input: the URLs of two original list pages
Output: the forward longest common substring representing the data-rich region
Algorithm steps:
(1) Input the URLs of two pages based on the same module.
(2) Traverse the HTML tag tree of each page depth-first. If a paging-navigation node is found in a subtree, mark its parent node as the start node for step (3); otherwise, use the root node of the HTML tag tree as the start node.
(3) Starting from the node marked in step (2), compare the two HTML tag trees depth-first and judge whether corresponding subtrees are consistent. If a subpath is consistent, mark it as consistent, return to the parent node, and continue with the next subpath. If all subpaths of a parent node are consistent, the path that node represents is a noise branch.
(4) Output the forward longest common substring of the differing subtrees obtained; this is the tree path of the data-rich region.
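The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `Node`, `TreeBuilder`, `differing_paths`, `common_prefix`, and `data_rich_path` are hypothetical; the sketch takes HTML strings rather than fetching URLs; it skips the paging-navigation check of step (2) and always starts from the root; and it interprets the "forward longest common substring" as the longest common prefix of the differing tag paths.

```python
from html.parser import HTMLParser

# Tags without a closing tag (partial list, an assumption for this sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""

class TreeBuilder(HTMLParser):
    """Builds a simple tag tree under a synthetic '#root' node."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def signature(node):
    """Fingerprint of a subtree: tag, own text, and child fingerprints."""
    return (node.tag, node.text, tuple(signature(c) for c in node.children))

def differing_paths(a, b, path=()):
    """Step (3): tag paths to subtrees that differ between the two pages."""
    path = path + (a.tag,)
    if signature(a) == signature(b):
        return []  # consistent subtree: treated as a noise branch
    if a.tag != b.tag or len(a.children) != len(b.children) or not a.children:
        return [path]
    found = []
    for child_a, child_b in zip(a.children, b.children):
        found.extend(differing_paths(child_a, child_b, path))
    return found or [path]  # children match but this node's own text differs

def common_prefix(paths):
    """Step (4), as interpreted here: longest shared leading part of all paths."""
    if not paths:
        return ()
    prefix = paths[0]
    for p in paths[1:]:
        n = 0
        while n < min(len(prefix), len(p)) and prefix[n] == p[n]:
            n += 1
        prefix = prefix[:n]
    return prefix

def data_rich_path(html_a, html_b):
    paths = differing_paths(parse(html_a), parse(html_b))
    return "/".join(common_prefix(paths)[1:])  # drop the synthetic '#root'

page_a = "<html><body><div><a>Home</a></div><ul><li>Item A1</li><li>Item A2</li></ul></body></html>"
page_b = "<html><body><div><a>Home</a></div><ul><li>Item B1</li><li>Item B2</li></ul></body></html>"
print(data_rich_path(page_a, page_b))  # prints: html/body/ul/li
```

In this toy pair, the navigation `div` is identical on both pages and is dropped as noise, while the list items differ and determine the reported path.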
Container tags | Modifier tags |
---|---|
table, tr, td, div, ul, li, etc. | a, strong, font, etc. |
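As a rough illustration of the two tag classes in the table, the following Python sketch classifies tags and keeps only the container tags of a tag path. The sets contain only the tags listed above (the source's "etc." is left open), and the function names `classify` and `container_path`, as well as the idea of filtering a path this way, are assumptions made for illustration rather than something the source specifies.

```python
# Only the tags named in the table; the source's "etc." is not filled in.
CONTAINER_TAGS = {"table", "tr", "td", "div", "ul", "li"}
MODIFIER_TAGS = {"a", "strong", "font"}

def classify(tag):
    """Return 'container', 'modifier', or 'other' for an HTML tag name."""
    tag = tag.lower()
    if tag in CONTAINER_TAGS:
        return "container"
    if tag in MODIFIER_TAGS:
        return "modifier"
    return "other"

def container_path(path):
    """Keep only the container tags of a slash-separated tag path (hypothetical use)."""
    return "/".join(t for t in path.split("/") if classify(t) == "container")

print(classify("TD"))                          # prints: container
print(container_path("table/tr/td/a/strong"))  # prints: table/tr/td
```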
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2009100295613A CN101515287B (en) | 2009-03-24 | 2009-03-24 | Automatic generating method of wrapper of complex page |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2009100295613A CN101515287B (en) | 2009-03-24 | 2009-03-24 | Automatic generating method of wrapper of complex page |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101515287A (en) | 2009-08-26 |
CN101515287B (en) | 2011-01-12 |
Family
ID=41039740
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2009100295613A Expired - Fee Related CN101515287B (en) | 2009-03-24 | 2009-03-24 | Automatic generating method of wrapper of complex page |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101515287B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120101721A1 (en) * | 2010-10-21 | 2012-04-26 | Telenav, Inc. | Navigation system with xpath repetition based field alignment mechanism and method of operation thereof |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130091150A1 (en) * | 2010-06-30 | 2013-04-11 | Jian-Ming Jin | Determiining similarity between elements of an electronic document |
CN102651000A (en) * | 2011-02-28 | 2012-08-29 | 福建星网视易信息系统有限公司 | XML (extensible markup language)-based financial data display method and system |
CN102567530B (en) * | 2011-12-31 | 2014-06-11 | 凤凰在线(北京)信息技术有限公司 | Intelligent extraction system and intelligent extraction method for article type web pages |
US9235803B2 (en) * | 2012-04-19 | 2016-01-12 | Microsoft Technology Licensing, Llc | Linking web extension and content contextually |
CN103778104B (en) * | 2012-10-22 | 2017-05-03 | 富士通株式会社 | Information processing device, information processing method and electronic device |
CN105706078B (en) * | 2013-10-09 | 2021-08-03 | 谷歌有限责任公司 | Automatic definition of entity collections |
CN103761312B (en) * | 2014-01-24 | 2017-02-08 | 福州大学 | Information extraction system and method for multi-recording webpage |
CN105095306B (en) * | 2014-05-20 | 2019-04-09 | 阿里巴巴集团控股有限公司 | The method and device operated based on affiliated partner |
CN106095854B (en) * | 2016-06-02 | 2022-05-17 | 腾讯科技(深圳)有限公司 | Method and device for determining position information of information block |
CN107943929B (en) * | 2017-11-22 | 2021-09-28 | 福州大学 | Wrapper automatic generation method based on DOM tree abstraction |
CN108376153A (en) * | 2018-02-07 | 2018-08-07 | 厦门集微科技有限公司 | A kind of Webpage production method and device |
CN110222251B (en) * | 2019-05-27 | 2022-04-01 | 浙江大学 | Service packaging method based on webpage segmentation and search algorithm |
CN110399529A (en) * | 2019-07-23 | 2019-11-01 | 福建奇点时空数字科技有限公司 | A kind of data entity abstracting method based on depth learning technology |
CN115168714B (en) * | 2022-07-07 | 2023-11-10 | 中国测绘科学研究院 | Web API data extraction method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050273441A1 (en) * | 2004-05-21 | 2005-12-08 | Microsoft Corporation | xParts-schematized data wrapper |
CN101004760A (en) * | 2007-01-10 | 2007-07-25 | 苏州大学 | Method for extracting page query interface based on character of vision |
Non-Patent Citations (1)
Title |
---|
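Li Yaqiao et al. Research on a fully automatic wrapper generation method based on tree structure (基于树结构的包装器全自动生成方法的研究). Journal of Hebei University of Technology (《河北工业大学学报》). 2007, Vol. 36 (No. 6), 41-46. * |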
Also Published As
Publication number | Publication date |
---|---|
CN101515287A (en) | 2009-08-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101515287B (en) | Automatic generating method of wrapper of complex page | |
Gatterbauer et al. | Towards domain-independent information extraction from web tables | |
Liu et al. | Vide: A vision-based approach for deep web data extraction | |
CN103049575B (en) | A kind of academic conference search system of topic adaptation | |
Foley et al. | Learning to extract local events from the web | |
Cafarella et al. | Web-scale extraction of structured data | |
Zheng et al. | Template-independent news extraction based on visual consistency | |
CN101727498A (en) | Automatic extraction method of web page information based on WEB structure | |
Muñoz et al. | Triplifying wikipedia's tables | |
Tao et al. | Automatic hidden-web table interpretation, conceptualization, and semantic annotation | |
Ji et al. | Tag tree template for Web information and schema extraction | |
Senellart et al. | Automatic wrapper induction from hidden-web sources with domain knowledge | |
CN103678412A (en) | Document retrieval method and device | |
Zhao et al. | Mining templates from search result records of search engines | |
Wen et al. | KAT: Keywords-to-SPARQL translation over RDF graphs | |
CN116467278A (en) | MongoDB storage-oriented temporal RDF four-tuple model and redundancy attribute elimination method | |
Weninger et al. | The parallel path framework for entity discovery on the web | |
Wu et al. | Extracting Web news using tag path patterns | |
Qiu et al. | Detection and optimized disposal of near-duplicate pages | |
Devezas et al. | Graph-of-entity: a model for combined data representation and retrieval | |
Zeng et al. | Layout-tree-based approach for identifying visually similar blocks in a web page | |
Deshmukh et al. | An improved approach for deep web data extraction | |
Chuang et al. | Improving the effectiveness of POI search by associated information summarization | |
Zhao | Automatic wrapper generation for the extraction of search result records from search engines | |
Kołaczkowski et al. | Extracting product descriptions from polish e-commerce websites using classification and clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
ASS | Succession or assignment of patent right |
Free format text: FORMER OWNER: FANG WEI; ZHAO PENGPENG; Owner name: SUZHOU PUDA NEW INFORMATION TECHNOLOGY CO., LTD.; Free format text: FORMER OWNER: CUI ZHIMING; Effective date: 20100524 |
|
C41 | Transfer of patent application or patent right or utility model | ||
COR | Change of bibliographic data |
Free format text: CORRECT: ADDRESS; FROM: 215001 ROOM 403, BUILDING 115, SUAN NEW HOUSING ESTATE, SUZHOU CITY, JIANGSU PROVINCE TO: 215021 NO.E101-18, PHASE 2, INTERNATIONAL SCIENCE PARK, NO.1355, JINJIHU AVENUE, SUZHOU INDUSTRY PARK, SUZHOU CITY, JIANGSU PROVINCE |
|
TA01 | Transfer of patent application right |
Effective date of registration: 20100524; Address after: 215021, 1355 international science and Technology Park, Jinji Lake Avenue, Suzhou Industrial Park, Suzhou, Jiangsu, two E101-18; Applicant after: Suzhou Production Information Technology Co., Ltd.; Address before: 215001 room 115, building 403, Su an village, Suzhou, Jiangsu; Applicant before: Cui Zhiming; Co-applicant before: Fang Wei; Co-applicant before: Zhao Pengpeng |
|
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract |
Application publication date: 20090826; Assignee: SUZHOU SOUKE INFORMATION TECHNOLOGY CO., LTD.; Assignor: Suzhou Production Information Technology Co., Ltd.; Contract record no.: 2013320010068; Denomination of invention: Automatic generating method of wrapper of complex page; Granted publication date: 20110112; License type: Exclusive License; Record date: 20130412 |
|
LICC | Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model | ||
C41 | Transfer of patent application or patent right or utility model | ||
TR01 | Transfer of patent right |
Effective date of registration: 20161011; Address after: Canglang District of Suzhou City, Jiangsu province 215021 liberation Village 5 403 room; Patentee after: Shu Lan; Address before: 215021, 1355 international science and Technology Park, Jinji Lake Avenue, Suzhou Industrial Park, Suzhou, Jiangsu, two E101-18; Patentee before: Suzhou Production Information Technology Co., Ltd. |
|
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20110112; Termination date: 20180324 |