CN102156737A - Method for extracting subject content of Chinese webpage
- Publication number
- CN102156737A
- Authority
- CN
- China
- Prior art keywords
- node
- condition
- result
- filter
- dom
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110090737 CN102156737B (en) | 2011-04-12 | 2011-04-12 | Method for extracting subject content of Chinese webpage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110090737 CN102156737B (en) | 2011-04-12 | 2011-04-12 | Method for extracting subject content of Chinese webpage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102156737A (en) | 2011-08-17 |
CN102156737B (en) | 2013-03-20 |
Family
ID=44438236
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110090737 Active CN102156737B (en) | 2011-04-12 | 2011-04-12 | Method for extracting subject content of Chinese webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102156737B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1934807A2 (en) * | 2005-08-09 | 2008-06-25 | Zalag Corporation | Methods and apparatuses to assemble, extract and deploy content from electronic documents |
US7669119B1 (en) * | 2005-07-20 | 2010-02-23 | Alexa Internet | Correlation-based information extraction from markup language documents |
CN101727498A (en) * | 2010-01-15 | 2010-06-09 | 西安交通大学 | Automatic extraction method of web page information based on WEB structure |
CN102004805A (en) * | 2010-12-30 | 2011-04-06 | 上海交通大学 | Webpage denoising system and method based on maximum similarity matching |
2011
- 2011-04-12: CN 201110090737, patent CN102156737B, status Active
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662966A (en) * | 2012-03-08 | 2012-09-12 | 中国科学院计算机网络信息中心 | Method and system for obtaining subject-oriented dynamic page content |
CN103425644B (en) * | 2012-05-14 | 2016-04-06 | 腾讯科技(深圳)有限公司 | Method and device for extracting pictures in webpage text |
CN103425644A (en) * | 2012-05-14 | 2013-12-04 | 腾讯科技(深圳)有限公司 | Method and device for extracting pictures in webpage content |
CN103678335B (en) * | 2012-09-05 | 2017-12-08 | 阿里巴巴集团控股有限公司 | Method and device for identifying commodities with labels, and method for commodity navigation |
CN103678335A (en) * | 2012-09-05 | 2014-03-26 | 阿里巴巴集团控股有限公司 | Method and device for identifying commodity with labels and method for commodity navigation |
CN102955852A (en) * | 2012-11-01 | 2013-03-06 | 北京小米科技有限责任公司 | Method, device and equipment for webpage resource processing |
CN103838792A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Method for determining webpage theme |
CN103838801A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Webpage theme information extraction method |
CN103064966B (en) * | 2012-12-31 | 2016-01-27 | 中国科学院计算技术研究所 | Method for extracting regular noise from single-record web pages |
CN103064966A (en) * | 2012-12-31 | 2013-04-24 | 中国科学院计算技术研究所 | Method for extracting regular noise from single record web pages |
CN103927309A (en) * | 2013-01-14 | 2014-07-16 | 阿里巴巴集团控股有限公司 | Method and device for marking information labels for business objects |
CN103927309B (en) * | 2013-01-14 | 2017-08-11 | 阿里巴巴集团控股有限公司 | Method and device for marking information labels for business objects |
CN103064827A (en) * | 2013-01-16 | 2013-04-24 | 盘古文化传播有限公司 | Method and device for extracting webpage content |
CN103353842A (en) * | 2013-06-20 | 2013-10-16 | 北京小米科技有限责任公司 | Webpage loading method and device |
CN104376061A (en) * | 2014-11-10 | 2015-02-25 | 武汉传神信息技术有限公司 | Webpage text extracting method |
CN104965849A (en) * | 2015-03-31 | 2015-10-07 | 哈尔滨工程大学 | Webpage-undeformed noise filtering method based on similarity of WVP_DOM tree |
CN104965849B (en) * | 2015-03-31 | 2018-12-07 | 哈尔滨工程大学 | Webpage-undeformed noise filtering method based on similarity of WVP_DOM tree |
CN107145591B (en) * | 2017-05-17 | 2020-10-16 | 广州瞬速信息科技有限公司 | Title-based webpage effective metadata content extraction method |
CN107145591A (en) * | 2017-05-17 | 2017-09-08 | 广州瞬速信息科技有限公司 | Method for extracting effective webpage content metadata based on title |
CN107391675A (en) * | 2017-07-21 | 2017-11-24 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating structure information |
CN109325204A (en) * | 2018-09-13 | 2019-02-12 | 武汉伯远生物科技有限公司 | Web page contents extraction method |
CN109325204B (en) * | 2018-09-13 | 2022-01-07 | 武汉伯远生物科技有限公司 | Automatic extraction method of webpage content |
CN110110252A (en) * | 2019-05-17 | 2019-08-09 | 北京市博汇科技股份有限公司 | Audio-visual program identification method, device and storage medium |
CN110110252B (en) * | 2019-05-17 | 2021-01-15 | 北京市博汇科技股份有限公司 | Audio-visual program identification method, device and storage medium |
CN111709230A (en) * | 2020-04-30 | 2020-09-25 | 昆明理工大学 | Short text automatic summarization method based on part-of-speech soft template attention mechanism |
CN111709230B (en) * | 2020-04-30 | 2023-04-07 | 昆明理工大学 | Short text automatic summarization method based on part-of-speech soft template attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN102156737B (en) | 2013-03-20 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN102156737B (en) | Method for extracting subject content of Chinese webpage | |
Hattori et al. | Robust web page segmentation for mobile terminal using content-distances and page layout information | |
CN102663023B (en) | Implementation method for extracting web content | |
CN101515272B (en) | Method and device for extracting webpage content | |
CN102253979B (en) | Vision-based web page extracting method | |
CN102073726B (en) | Structured data import method and device for search engine system | |
CN104598577B (en) | A kind of extracting method of Web page text | |
CN102270206A (en) | Method and device for capturing valid web page contents | |
CN108920434A (en) | A kind of general Web page subject method for extracting content and system | |
CN102279894A (en) | Method for searching, integrating and providing comment information based on semantics and searching system | |
CN109815386B (en) | User portrait-based construction method and device and storage medium | |
CN106503211B (en) | Method for automatically generating mobile version facing information publishing website | |
CN110263248A (en) | A kind of information-pushing method, device, storage medium and server | |
CN103955529A (en) | Internet information searching and aggregating presentation method | |
CN106909663A (en) | Based on tagging user Brang Preference behavior prediction method and its device | |
JP2005063432A (en) | Multimedia object retrieval apparatus and multimedia object retrieval method | |
Ahmadi et al. | User-centric adaptation of Web information for small screens | |
CN110222251A (en) | A kind of Service encapsulating method based on Web-page segmentation and searching algorithm | |
CN110134844A (en) | Subdivision field public sentiment monitoring method, device, computer equipment and storage medium | |
Nyein | Mining contents in Web page using cosine similarity | |
CN114443928B (en) | Web text data crawler method and system | |
JP2008269069A (en) | Information processing system and method | |
CN105243120A (en) | Retrieval method and apparatus | |
Liu et al. | Main content extraction from web pages based on node characteristics | |
CN101593187B (en) | Method and system for managing book marks |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20110817; Assignee: Wuhan Hezhongxing Trading Co.,Ltd.; Assignor: Central China Normal University; Contract record no.: X2023980052458; Denomination of invention: A Method for Extracting Theme Content from Chinese Web Pages; Granted publication date: 20130320; License type: Common License; Record date: 20231219 |
EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20110817; Assignee: Hubei ZHENGBO Xusheng Technology Co.,Ltd.; Assignor: Central China Normal University; Contract record no.: X2024980001275; Denomination of invention: A Method for Extracting Theme Content from Chinese Web Pages; Granted publication date: 20130320; License type: Common License; Record date: 20240124 |
EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20110817; Assignee: Hubei Rongzhi Youan Technology Co.,Ltd.; Assignor: Central China Normal University; Contract record no.: X2024980001548; Denomination of invention: A Method for Extracting Theme Content from Chinese Web Pages; Granted publication date: 20130320; License type: Common License; Record date: 20240126 |