CN106528509A - Web page information extraction method and device - Google Patents
Web page information extraction method and device
- Publication number
- CN106528509A
- Authority
- CN
- China
- Prior art keywords
- punctuation mark
- leaf node
- node
- webpage
- punctuation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/131—Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Description
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610995251.7A CN106528509B (zh) | 2016-11-11 | 2016-11-11 | Web page information extraction method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610995251.7A CN106528509B (zh) | 2016-11-11 | 2016-11-11 | Web page information extraction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106528509A (zh) | 2017-03-22 |
CN106528509B (zh) | 2020-04-03 |
Family
ID=58351328
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610995251.7A Active CN106528509B (zh) | Web page information extraction method and device | 2016-11-11 | 2016-11-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106528509B (zh) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
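CN109409088A (zh) * | 2017-08-18 | 2019-03-01 | 刘俊 | Method and device for extracting web page information |
CN111625749A (zh) * | 2020-06-01 | 2020-09-04 | 深圳市小满科技有限公司 | Method, apparatus, device, and medium for extracting detail-page information from participating companies' websites |
CN111698364A (zh) * | 2020-06-19 | 2020-09-22 | 深圳市小满科技有限公司 | Contact information extraction method and related device |
CN115391711A (zh) * | 2022-10-28 | 2022-11-25 | 中新宽维传媒科技有限公司 | Web page main-text information extraction method, apparatus, device, and medium |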
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060161564A1 (en) * | 2004-12-20 | 2006-07-20 | Samuel Pierre | Method and system for locating information in the invisible or deep world wide web |
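CN102591612A (zh) * | 2011-12-27 | 2012-07-18 | 厦门市美亚柏科信息股份有限公司 | General web page main-text extraction method and system based on punctuation continuity |
CN102663023A (zh) * | 2012-03-22 | 2012-09-12 | 浙江盘石信息技术有限公司 | Implementation method for extracting web page content |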
2016
- 2016-11-11: CN application CN201610995251.7A (patent CN106528509B, zh), status: Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060161564A1 (en) * | 2004-12-20 | 2006-07-20 | Samuel Pierre | Method and system for locating information in the invisible or deep world wide web |
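CN102591612A (zh) * | 2011-12-27 | 2012-07-18 | 厦门市美亚柏科信息股份有限公司 | General web page main-text extraction method and system based on punctuation continuity |
CN102663023A (zh) * | 2012-03-22 | 2012-09-12 | 浙江盘石信息技术有限公司 | Implementation method for extracting web page content |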
Non-Patent Citations (2)
Title |
---|
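安增文, 徐杰锋: "Research on a web page main-text extraction method based on visual features", 《网络与通信》 * |
杨钦, 杨沐昀: "A web page main-text extraction method based on punctuation density", 《智能计算机与应用》 * |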
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
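CN109409088A (zh) * | 2017-08-18 | 2019-03-01 | 刘俊 | Method and device for extracting web page information |
CN111625749A (zh) * | 2020-06-01 | 2020-09-04 | 深圳市小满科技有限公司 | Method, apparatus, device, and medium for extracting detail-page information from participating companies' websites |
CN111625749B (zh) * | 2020-06-01 | 2023-08-11 | 深圳市小满科技有限公司 | Method, apparatus, device, and medium for extracting detail-page information from participating companies' websites |
CN111698364A (zh) * | 2020-06-19 | 2020-09-22 | 深圳市小满科技有限公司 | Contact information extraction method and related device |
CN111698364B (zh) * | 2020-06-19 | 2021-09-21 | 深圳市小满科技有限公司 | Contact information extraction method, related device, and computer-readable storage medium |
CN115391711A (zh) * | 2022-10-28 | 2022-11-25 | 中新宽维传媒科技有限公司 | Web page main-text information extraction method, apparatus, device, and medium |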
Also Published As
Publication number | Publication date |
---|---|
CN106528509B (zh) | 2020-04-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
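CN102184189B (zh) | | Method for determining the core block of a web page based on DOM node text density |
CN102663023B (zh) | | Implementation method for extracting web page content |
CN105630941B (zh) | | Web main-text content extraction method based on statistics and web page structure |
CN102254014B (zh) | | Information extraction method adaptive to web page features |
CN102270206A (zh) | | Method and device for capturing effective web page content |
CN104598577B (zh) | | Method for extracting web page main text |
CN102591612B (zh) | | General web page main-text extraction method and system based on punctuation continuity |
CN102253930B (zh) | | Method and device for text translation |
JP2006004417A (ja) | | Method and device for recognizing a specific type of information file |
CN101727498A (zh) | | Method for automatically extracting web page information based on web structure |
CN109086361B (zh) | | Method and system for automatically extracting web article information based on mutual information between web page nodes |
CN102541874A (zh) | | Method and device for extracting web page main-text content |
CN106528509A (zh) | | Web page information extraction method and device |
RU2003134278A (ru) | | Method and computer-readable medium for importing and exporting hierarchically structured data |
CN102915361B (zh) | | Web page main-text extraction method based on character distribution features |
CN102609427A (zh) | | Public opinion vertical search and analysis system and method |
CN107871002B (zh) | | Cross-language plagiarism detection method based on fingerprint fusion |
CN109165373B (zh) | | Data processing method and device |
CN103810251A (zh) | | Text extraction method and device |
CN109657114B (zh) | | Method for extracting semi-structured data from web pages |
CN107894974A (zh) | | Web page main-text extraction method based on fusion of tag path and text-punctuation ratio features |
CN111190873B (zh) | | Log pattern extraction method and system for cloud-native system log training |
CN108694192B (zh) | | Method and device for determining web page type |
CN104572787A (zh) | | Method and device for identifying pseudo-original websites |
CN106897287B (zh) | | Web page publication time extraction method and device for web page publication time extraction |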
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: Web page information extraction method and device; Effective date of registration: 2022-02-14; Granted publication date: 2020-04-03; Pledgee: Ji'nan rural commercial bank Limited by Share Ltd. high tech branch; Pledgor: ZHENGHE TECHNOLOGY Co.,Ltd.; Registration number: Y2022980001521 |
| PC01 | Cancellation of the registration of the contract for pledge of patent right | Date of cancellation: 2022-12-12; Granted publication date: 2020-04-03; Pledgee: Ji'nan rural commercial bank Limited by Share Ltd. high tech branch; Pledgor: ZHENGHE TECHNOLOGY Co.,Ltd.; Registration number: Y2022980001521 |
| PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: Web page information extraction method and device; Effective date of registration: 2023-02-03; Granted publication date: 2020-04-03; Pledgee: Ji'nan rural commercial bank Limited by Share Ltd. high tech branch; Pledgor: ZHENGHE TECHNOLOGY Co.,Ltd.; Registration number: Y2023980031993 |