CN108255870B - Website data crawling method and device - Google Patents
Website data crawling method and device
- Publication number
- CN108255870B (application number CN201611249114.5A)
- Authority
- CN
- China
- Prior art keywords
- regular expression
- crawled
- uniform resource
- website data
- relation table
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611249114.5A CN108255870B (zh) | 2016-12-29 | 2016-12-29 | Website data crawling method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611249114.5A CN108255870B (zh) | 2016-12-29 | 2016-12-29 | Website data crawling method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108255870A (zh) | 2018-07-06 |
CN108255870B (zh) | 2021-06-01 |
Family
ID=62721393
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611249114.5A Active CN108255870B (zh) | Website data crawling method and device | 2016-12-29 | 2016-12-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108255870B (zh) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
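CN112507341A (zh) * | 2020-12-03 | 2021-03-16 | Guangzhou Wanfang Computer Technology Co., Ltd. | Vulnerability scanning method, apparatus, device and storage medium based on a web crawler |
CN112579934A (zh) * | 2021-02-03 | 2021-03-30 | Hangzhou Pushu Software Co., Ltd. | Method and device for website application redirection and view updating |
CN113656659A (zh) * | 2021-08-31 | 2021-11-16 | Shanghai Guan'an Information Technology Co., Ltd. | Data extraction method, apparatus, system and computer-readable storage medium |
CN114900546B (zh) * | 2022-07-08 | 2022-09-16 | Alipay (Hangzhou) Information Technology Co., Ltd. | Data processing method, apparatus, device and readable storage medium |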
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
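CN101452463A (zh) * | 2007-12-05 | 2009-06-10 | Zhejiang University | Method and apparatus for targeted crawling of page resources |
CN101504673A (zh) * | 2009-03-24 | 2009-08-12 | Alibaba Group Holding Ltd. | Method and system for identifying suspected counterfeit websites |
CN102930059A (zh) * | 2012-11-26 | 2013-02-13 | University of Electronic Science and Technology of China | Design method for a focused crawler |
CN104008213A (zh) * | 2014-06-24 | 2014-08-27 | University of Electronic Science and Technology of China | Method and apparatus for discovering and counting web page information updates |
CN104050037A (zh) * | 2014-06-13 | 2014-09-17 | Huaiyin Institute of Technology | Implementation method for a targeted crawler based on a designated e-commerce website |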
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7814084B2 (en) * | 2007-03-21 | 2010-10-12 | Schmap Inc. | Contact information capture and link redirection |
- 2016-12-29: CN application CN201611249114.5A filed; patent CN108255870B (zh), status Active
Also Published As
Publication number | Publication date |
---|---|
CN108255870A (zh) | 2018-07-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
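CN108255870B (zh) | | Website data crawling method and device |
CN101694658B (zh) | | Method for constructing a web crawler based on news deduplication |
US20170185680A1 (en) | | Chinese website classification method and system based on characteristic analysis of website homepage |
CN105447184B (zh) | | Information crawling method and device |
US9436768B2 (en) | | System and method for pushing and distributing promotion content |
US20200004792A1 (en) | | Automated website data collection method |
CN106250402B (zh) | | Website classification method and device |
CN106250513A (zh) | | Personalized event classification method and system based on event modeling |
US20160188723A1 (en) | | Cloud website recommendation method and system based on terminal access statistics, and related device |
CN104462301B (zh) | | Network data processing method and device |
CN104156490A (zh) | | Method and device for detecting suspicious phishing web pages based on character recognition |
CN106682925A (zh) | | Advertisement content recommendation method and device |
CN102411587A (zh) | | Web page classification method and device |
US10073918B2 (en) | | Classifying URLs |
CN112115266B (zh) | | Malicious URL classification method, apparatus, computer device and readable storage medium |
CN102855309A (zh) | | Information recommendation method and device based on user behavior association analysis |
CN107045507B (zh) | | Web page crawling method and device |
US20090259649A1 (en) | | System and method for detecting templates of a website using hyperlink analysis |
CN105550359B (zh) | | Web page ranking method, device and server based on vertical search |
CN106202349B (zh) | | Method and device for generating a web page classification dictionary |
CN112131507A (zh) | | Website content processing method, apparatus, server and computer-readable storage medium |
CN107193870B (zh) | | Web page content extraction method and system |
CN112989824A (zh) | | Information push method and apparatus, electronic device and storage medium |
CN102902790B (zh) | | Web page classification system and method |
CN106874368B (zh) | | Method and system for analyzing the value of RTB bidding advertisement slots |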
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CP03 | Change of name, title or address | Address after: No. 19, Jiefang East Road, Hangzhou, Zhejiang Province, 310000. Patentees after: CHINA MOBILE GROUP ZHEJIANG Co.,Ltd.; CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd. Address before: No. 19, Jiefang East Road, Hangzhou, Zhejiang Province, 310016. Patentees before: CHINA MOBILE GROUP ZHEJIANG Co.,Ltd.; CHINA MOBILE COMMUNICATIONS Corp. |
TR01 | Transfer of patent right | Effective date of registration: 2023-12-11. Address after: No. 19, Jiefang East Road, Hangzhou, Zhejiang Province, 310000. Patentees after: CHINA MOBILE GROUP ZHEJIANG Co.,Ltd.; China Mobile (Zhejiang) Innovation Research Institute Co.,Ltd.; CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd. Address before: No. 19, Jiefang East Road, Hangzhou, Zhejiang Province, 310000. Patentees before: CHINA MOBILE GROUP ZHEJIANG Co.,Ltd.; CHINA MOBILE COMMUNICATIONS GROUP Co.,Ltd. |