CN103218420A - Method and device for extracting web page titles - Google Patents
Method and device for extracting web page titles
- Publication number
- CN103218420A
- Authority
- CN
- China
- Prior art keywords
- attribute
- classifier
- title
- sequence
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Concepts
Concept | Category | Sections | Count |
---|---|---|---|
method | Methods | title, claims, abstract, description | 53 |
training | Methods | claims, abstract, description | 44 |
vector | Substances | claims, abstract, description | 42 |
processing | Methods | claims, abstract, description | 4 |
extraction | Methods | claims, description | 11 |
decision tree | Methods | claims, description | 7 |
support-vector machine | Methods | claims, description | 7 |
construction | Methods | claims, description | 6 |
design | Methods | description | 2 |
testing | Methods | description | 2 |
bayesian method | Methods | description | 1 |
classification model | Methods | description | 1 |
defect | Effects | description | 1 |
diagram | Methods | description | 1 |
engineering process | Methods | description | 1 |
substitution reaction | Methods | description | 1 |
transformation | Effects | description | 1 |
transformations | Methods | description | 1 |
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310110854.0A CN103218420B (zh) | 2013-04-01 | 2013-04-01 | Method and device for extracting web page titles |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310110854.0A CN103218420B (zh) | 2013-04-01 | 2013-04-01 | Method and device for extracting web page titles |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103218420A (zh) | 2013-07-24 |
CN103218420B (zh) | 2016-12-28 |
Family
ID=48816207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310110854.0A (Expired - Fee Related) CN103218420B (zh) | Method and device for extracting web page titles | 2013-04-01 | 2013-04-01 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103218420B (zh) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
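CN104537028A (zh) * | 2014-12-19 | 2015-04-22 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and device for processing web page information |
CN107506472A (zh) * | 2017-09-05 | 2017-12-22 | Huaiyin Institute of Technology | Method for classifying web pages browsed by students |
CN108509794A (zh) * | 2018-03-09 | 2018-09-07 | Sun Yat-sen University | Malicious web page defense and detection method based on a classification learning algorithm |
CN108829898A (zh) * | 2018-06-29 | 2018-11-16 | Wuma Technology (Hangzhou) Co., Ltd. | Method and system for extracting the publication time of HTML content pages |
CN110555198A (zh) * | 2018-05-31 | 2019-12-10 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method, apparatus, device, and computer-readable storage medium for generating articles |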
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
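CN101079031A (zh) * | 2006-06-15 | 2007-11-28 | Tencent Technology (Shenzhen) Co., Ltd. | System and method for extracting web page topics |
CN101226548A (zh) * | 2008-01-11 | 2008-07-23 | Meng Xiaofeng | Vision-based web data extraction system and method |
US7451395B2 (en) * | 2002-12-16 | 2008-11-11 | Palo Alto Research Center Incorporated | Systems and methods for interactive topic-based text summarization |
CN102193944A (zh) * | 2010-03-12 | 2011-09-21 | Samsung Electronics (China) R&D Center | Method for extracting the main topic content of web pages |
CN102768663A (zh) * | 2011-05-05 | 2012-11-07 | Tencent Technology (Shenzhen) Co., Ltd. | Method and device for extracting web page titles, and information processing system |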
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
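US7451395B2 (en) * | 2002-12-16 | 2008-11-11 | Palo Alto Research Center Incorporated | Systems and methods for interactive topic-based text summarization |
CN101079031A (zh) * | 2006-06-15 | 2007-11-28 | Tencent Technology (Shenzhen) Co., Ltd. | System and method for extracting web page topics |
CN101226548A (zh) * | 2008-01-11 | 2008-07-23 | Meng Xiaofeng | Vision-based web data extraction system and method |
CN102193944A (zh) * | 2010-03-12 | 2011-09-21 | Samsung Electronics (China) R&D Center | Method for extracting the main topic content of web pages |
CN102768663A (zh) * | 2011-05-05 | 2012-11-07 | Tencent Technology (Shenzhen) Co., Ltd. | Method and device for extracting web page titles, and information processing system |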
Non-Patent Citations (2)
Title |
---|
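Wu Yanling: "Research on an SVM-based web page classifier", China Master's Theses Full-text Database, Information Science and Technology * |
Ji Guishu et al.: "A survey of decision tree classification algorithms", Science Mosaic (《科技广场》) * |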
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
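CN104537028A (zh) * | 2014-12-19 | 2015-04-22 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and device for processing web page information |
CN104537028B (zh) * | 2014-12-19 | 2018-06-08 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and device for processing web page information |
CN107506472A (zh) * | 2017-09-05 | 2017-12-22 | Huaiyin Institute of Technology | Method for classifying web pages browsed by students |
CN107506472B (zh) * | 2017-09-05 | 2020-09-08 | Huaiyin Institute of Technology | Method for classifying web pages browsed by students |
CN108509794A (zh) * | 2018-03-09 | 2018-09-07 | Sun Yat-sen University | Malicious web page defense and detection method based on a classification learning algorithm |
CN110555198A (zh) * | 2018-05-31 | 2019-12-10 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method, apparatus, device, and computer-readable storage medium for generating articles |
CN110555198B (zh) * | 2018-05-31 | 2023-05-23 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method, apparatus, device, and computer-readable storage medium for generating articles |
CN108829898A (zh) * | 2018-06-29 | 2018-11-16 | Wuma Technology (Hangzhou) Co., Ltd. | Method and system for extracting the publication time of HTML content pages |
CN108829898B (zh) * | 2018-06-29 | 2020-11-20 | Wuma Technology (Hangzhou) Co., Ltd. | Method and system for extracting the publication time of HTML content pages |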
Also Published As
Publication number | Publication date |
---|---|
CN103218420B (zh) | 2016-12-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10055479B2 (en) | | Joint approach to feature and document labeling |
CN108717408B (zh) | | Real-time sensitive-word monitoring method, electronic device, storage medium, and system |
Stein et al. | | Intrinsic plagiarism analysis |
US9514216B2 (en) | | Automatic classification of segmented portions of web pages |
US8503769B2 (en) | | Matching text to images |
US8630972B2 (en) | | Providing context for web articles |
CN112667940B (zh) | | Web page main-text extraction method based on deep learning |
CN100552673C (zh) | | Open document isomorphism engine system |
CN103299324A (zh) | | Learning tags for video annotation using latent subtags |
Daxenberger et al. | | Argumentext: argument classification and clustering in a generalized search scenario |
CN104679902A (zh) | | Information summary extraction method combining cross-media fusion |
WO2023173555A1 (zh) | | Model training method, text classification method and apparatus, device, and medium |
CN103218420B (zh) | | Method and device for extracting web page titles |
Gopinath et al. | | Supervised and unsupervised methods for robust separation of section titles and prose text in web documents |
Zhang et al. | | Annotating needles in the haystack without looking: Product information extraction from emails |
CN102436512B (zh) | | Preference-based method for managing and controlling web page text content |
CN118377950A (zh) | | Method and device for extracting web page main text |
CN118076982A (zh) | | Information extraction and structuring method |
Luo et al. | | Web article extraction for web printing: a dom+ visual based approach |
Nguyen et al. | | Web document analysis based on visual segmentation and page rendering |
Ferrés et al. | | PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles |
Kempf et al. | | KIETA: Key-insight extraction from scientific tables |
Ayyavaraiah et al. | | Cross media feature retrieval and optimization: A contemporary review of research scope, challenges and objectives |
CN114428847B (zh) | | Training method for a model that screens judgment documents by dispute focus |
Hossain et al. | | An ensemble method-based machine learning approach using text mining to identify semantic fake news |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
ASS | Succession or assignment of patent right | Owner name: BEIJING CHUANGSHI TAIKE TECHNOLOGY CO., LTD.; Free format text: FORMER OWNER: BEIJING PENGYUCHENG SOFTWARE TECHNOLOGY CO., LTD.; Effective date: 20150113 |
C41 | Transfer of patent application or patent right or utility model | |
TA01 | Transfer of patent application right | Effective date of registration: 20150113; Address after: 100088 Beijing City, Haidian District Zhichun Road Jinqiu International Building No. 6 A block 1602; Applicant after: Beijing Chuangshitaike Technology Co.,Ltd.; Address before: 100088 Beijing City, Haidian District Zhichun Road Jinqiu International Building No. 6 A block 1602; Applicant before: BEIJING PYC SOFTWARE Co.,Ltd. |
CB02 | Change of applicant information | Address after: 100088 Beijing City, Haidian District Zhichun Road No. 6 (Jinqiu International Building) A District 1309, 1310, 1601; Applicant after: Beijing Transtec Technology Co.,Ltd.; Address before: 100088 Beijing City, Haidian District Zhichun Road Jinqiu International Building No. 6 A block 1602; Applicant before: Beijing Chuangshitaike Technology Co.,Ltd. |
COR | Change of bibliographic data | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20161228 |