CN113392354B - Method, system, medium and electronic equipment for webpage main text analysis - Google Patents
Method, system, medium and electronic equipment for webpage main text analysis
- Publication number
- CN113392354B
- Authority
- CN
- China
- Prior art keywords
- node
- block
- text
- date
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Claims (9)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110719543.9A CN113392354B (zh) | 2021-06-28 | 2021-06-28 | Method, system, medium and electronic equipment for webpage main text analysis |
ZA2021/08738A ZA202108738B (en) | 2021-06-28 | 2021-11-08 | Method, system, medium and electronic equipment for webpage main text analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110719543.9A CN113392354B (zh) | 2021-06-28 | 2021-06-28 | Method, system, medium and electronic equipment for webpage main text analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113392354A (zh) | 2021-09-14 |
CN113392354B (zh) | 2022-09-13 |
Family
ID=77624199
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110719543.9A Active CN113392354B (zh) | 2021-06-28 | 2021-06-28 | Method, system, medium and electronic equipment for webpage main text analysis |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113392354B (zh) |
ZA (1) | ZA202108738B (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115203604A (zh) * | 2022-09-15 | 2022-10-18 | Chengdu Shuzhilian Technology Co., Ltd. | Webpage main text extraction method, system, device and medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
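CN104268148B (zh) * | 2014-08-27 | 2018-02-06 | Institute of Computing Technology, Chinese Academy of Sciences | Method and system for automatic extraction of forum page information based on time strings |
CN108520007B (zh) * | 2018-03-15 | 2021-09-28 | Jianghe Ruitong (Beijing) Technology Co., Ltd. | World Wide Web page information extraction method, storage medium and computer device |
CN108920434B (zh) * | 2018-06-06 | 2022-08-30 | Wuhan Kuquan Data Technology Co., Ltd. | General-purpose method and system for extracting webpage topic content |
CN111966930B (zh) * | 2020-08-17 | 2021-05-04 | Shandong Ecloud Information Technology Co., Ltd. | Webpage list parsing method and system based on XPath sequences |
CN112395860A (zh) * | 2020-11-27 | 2021-02-23 | Shandong Computer Science Center (National Supercomputer Center in Jinan) | Large-scale parallel policy data knowledge extraction method and system |
CN112230989B (zh) * | 2020-12-14 | 2021-03-12 | Beijing Zhihui Xingguang Information Technology Co., Ltd. | Webpage channel navigation bar extraction method, system, electronic device and storage medium |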
2021
- 2021-06-28: CN application CN202110719543.9A (patent CN113392354B, zh), status: Active
- 2021-11-08: ZA application ZA2021/08738A (patent ZA202108738B, en), status: unknown
Also Published As
Publication number | Publication date |
---|---|
CN113392354A (zh) | 2021-09-14 |
ZA202108738B (en) | 2022-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
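CN107145482B (zh) | | Artificial intelligence-based article generation method and apparatus, device and readable medium |
CN107423391B (zh) | | Information extraction method for structured webpage data |
CN101727461A (zh) | | Method for extracting the main text of a webpage |
CN109492177B (zh) | | Webpage segmentation method based on webpage semantic structure |
CN106503211B (zh) | | Method for automatically generating mobile versions of information-publishing websites |
CN105740355B (zh) | | Webpage main text extraction method and apparatus based on aggregated text density |
Cardoso et al. | | An efficient language-independent method to extract content from news webpages |
CN108733813A (zh) | | Information extraction method, system and medium for BBS forum webpage content |
Uzun et al. | | An effective and efficient Web content extractor for optimizing the crawling process |
CN108921184A (zh) | | General-purpose method for determining webpage type |
CN114238575A (zh) | | Document parsing method, system, computer device and computer-readable storage medium |
US20140156799A1 (en) | | Method and System for Extracting Post Contents From Forum Web Page |
CN107145591B (zh) | | Title-based method for extracting valid metadata content from webpages |
CN107436931B (zh) | | Webpage main text extraction method and apparatus |
CN113392354B (zh) | | Method, system, medium and electronic equipment for webpage main text analysis |
CN112818200A (zh) | | Data crawling and event analysis method and system based on static websites |
CN116245177A (zh) | | Method and system for automated construction of a geographic environment knowledge graph, and readable storage medium |
CN106372232B (zh) | | Artificial intelligence-based information mining method and apparatus |
CN104572874B (zh) | | Method and apparatus for extracting webpage information |
CN108694192B (zh) | | Method and apparatus for determining webpage type |
CN106897287B (zh) | | Webpage publication time extraction method and apparatus for webpage publication time extraction |
CN117312711A (zh) | | Search engine optimization method and system based on AI analysis |
CN110083760B (zh) | | Visual block-based method for extracting information from multi-record dynamic webpages |
CN101996190A (zh) | | Method and apparatus for extracting information from webpages |
CN111966930B (zh) | | Webpage list parsing method and system based on XPath sequences |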
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB03 | Change of inventor or designer information | Inventors after: Xin Guomao, Wang Ruishuang, Wu Shiwei, Chen Tong, Lu Feng, Yang Chun. Inventors before: Xin Guomao, Wang Ruishuang, Wu Shiwei, Chen Tong, Lu Feng, Yang Chun. |
GR01 | Patent grant | |
CP03 | Change of name, title or address | Address after: Floor 12, Building 3, Shuntai Plaza, No. 2000 Shunhua Road, High-tech Industrial Development Zone, Jinan City, Shandong Province, 250101. Patentee after: SHANDONG ECLOUD INFORMATION TECHNOLOGY CO.,LTD. Country or region after: China. Address before: 3rd Floor, Block B, Yinhe Building, 2008 Xinluo Street, High-tech Zone, Jinan City, Shandong Province, 250014. Patentee before: SHANDONG ECLOUD INFORMATION TECHNOLOGY CO.,LTD. Country or region before: China. |