CN104991904A - Page data acquisition method of dynamic webpage - Google Patents
Page data acquisition method of dynamic webpage
- Publication number
- CN104991904A (application number CN201510332025.6A)
- Authority
- CN
- China
- Prior art keywords
- page
- data
- url
- data acquisition
- dynamic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a page data acquisition method for dynamic web pages. The method embeds a script parsing environment in a distributed web crawler and uses the crawler's data mining, indexing, and search functions to acquire data from dynamic pages. Compared with the prior art, the method collects dynamic data in complete form and stores it in a database, so that users can follow activity on the internet in real time. It avoids inaccurate and untimely data, and it overcomes a shortcoming of traditional acquisition methods, in which a page is collected only once rather than on demand. The method greatly improves acquisition accuracy and efficiency, is practical, applies widely, and is easy to popularize.
Description
Technical field
The present invention relates to the field of big data technology, and specifically to a practical page data acquisition method for dynamic web pages.
Background art
With the rapid development of network technology, dynamic pages with embedded JavaScript account for a growing share of the internet, which makes page data acquisition considerably more difficult. In network public opinion analysis and search engine research, static pages are still the main target of page data acquisition, but the need to collect data from dynamic pages is increasingly urgent.
Traditional acquisition methods can obtain only the static data in a web page and cannot handle data that changes dynamically or in real time. They consume a great deal of labor and time, and both the results and the quality of the collected data are poor.
To address this, a page data acquisition method for dynamic web pages is provided. The method builds on the Nutch data acquisition process. Nutch is an open-source search engine implemented in Java that provides all the tools needed to run one's own search engine, including full-text search and a web crawler. Using Nutch's web crawler technology to build an automatic parsing task for dynamic pages addresses the shortcomings of traditional HTML page crawling, improving acquisition efficiency and reducing acquisition cost.
Summary of the invention
The technical task of the present invention is to address the above shortcomings by providing a practical page data acquisition method for dynamic web pages.
In this page data acquisition method for dynamic web pages, a script parsing environment is embedded in a distributed web crawler, and the crawler's data mining, indexing, and search functions are used to acquire data from dynamic pages.
The data acquisition and mining process for dynamic pages is as follows:
First, create the original URL list and inject the original URLs;
Generate the fetch list and fetch page data from the network;
Parse the collected page content with a parser to obtain the relevant page information;
Extract the URL links found during parsing and update the URL database, which completes the data acquisition and mining process;
The indexing process is as follows:
Build an inverted index over the collected pages, deleting redundant content and URLs;
Merge the small indexes into one large index and build the index database;
The search process is as follows:
The user sends a search request through the interactive interface provided by the search engine;
After the search engine completes the search, it returns the results to the user.
The original URL list is initially an empty URL database, and the injected original URLs are the initial root URLs.
The fetch list is generated and fetched as follows: in a newly created segment directory, a fetchlist holding the URLs to be collected is generated from the URL database; the corresponding page data is then collected from the network according to the URLs in the fetchlist.
The search engine is based on the Nutch architecture, which comprises a data acquisition part, an indexing part, and a search part, wherein:
The data acquisition part fetches page data, parses the pages, and starts the next round of fetching based on the URL links obtained;
The indexing part builds an inverted index over the collected data so that it can be searched;
The search part looks up the relevant data based on the query entered in the user interface provided by Nutch.
When a user sends a search request, Nutch converts it into a Lucene query and returns the results to the user.
The page data acquisition method for dynamic web pages of the present invention has the following advantages:
The method collects dynamic data in complete form and stores it in a database, making it easy to follow activity on the internet in real time. It avoids inaccurate and untimely data, and it overcomes a shortcoming of traditional acquisition methods, in which a page is collected only once rather than on demand. It greatly improves acquisition accuracy and efficiency, is practical, applies widely, and is easy to popularize.
Brief description of the drawings
Figure 1 is a schematic diagram of the implementation of the present invention.
Detailed description
The invention is further described below with reference to the drawing and a specific embodiment.
To solve the problems in the prior art that dynamic data in web pages cannot be collected, or is collected at a low rate and a high cost, the present invention provides a page data acquisition method for dynamic web pages. The invention targets the growing volume of dynamic data on the internet, such as news data, BBS data, and network public opinion data, and fetches it dynamically. The scheme embeds a script parsing environment in a distributed web crawler to acquire data from dynamic pages. It uses Nutch's mature data mining and indexing functions and modifies the operating steps so that dynamic data can be fetched efficiently.
As shown in Figure 1, the implementation is as follows: a script parsing environment is embedded in a distributed web crawler, and the crawler's data mining, indexing, and search functions are used to acquire data from dynamic pages.
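The idea of running a script step between fetching and parsing can be illustrated with a small Python sketch. It is only a toy stand-in under loud assumptions: the source does not say which script engine is embedded or how, and this example understands nothing except `document.write("...")` calls with a double-quoted string literal. The names `render_dynamic` and `fetch_and_render` are hypothetical.

```python
import re

# Toy stand-in for a script parsing environment (assumption): it only
# understands document.write("...") calls with a double-quoted literal.
WRITE_CALL = re.compile(r'document\.write\(\s*"((?:[^"\\]|\\.)*)"\s*\)')
SCRIPT_TAG = re.compile(r"<script\b[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)


def render_dynamic(html: str) -> str:
    """Replace each <script> element with the markup its document.write calls emit."""

    def run_script(match: re.Match) -> str:
        source = match.group(1)
        pieces = WRITE_CALL.findall(source)
        # Undo simple backslash escapes such as \" inside the string literal.
        return "".join(re.sub(r"\\(.)", r"\1", piece) for piece in pieces)

    return SCRIPT_TAG.sub(run_script, html)


def fetch_and_render(url: str, fetch) -> str:
    """Fetch a page with the supplied callable, then apply the script step."""
    return render_dynamic(fetch(url))


if __name__ == "__main__":
    page = '<p>static</p><script>document.write("<a href=\\"/news/1\\">news</a>");</script>'
    # The link exists only after the script step has run.
    print(render_dynamic(page))
```

With a step like this in place, links and text produced by scripts become visible to the parser in the same way as static markup. A real deployment would need an actual JavaScript engine rather than a regular expression.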
The data acquisition and mining process for dynamic pages is as follows:
First, create the original URL list and inject the original URLs;
Generate the fetch list and fetch page data from the network;
Parse the collected page content with a parser to obtain the relevant page information;
Extract the URL links found during parsing and update the URL database, which completes one round of data acquisition and mining;
Repeat the above steps until the specified depth is reached.
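The steps above can be sketched as a small in-memory loop. This is a minimal Python illustration, not Nutch's implementation: the `crawl` function, the `href` regular expression, and the dictionary standing in for the network are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List, Set

LINK = re.compile(r'href="([^"]+)"')


def crawl(seeds: List[str], fetch: Callable[[str], str], depth: int) -> Dict[str, str]:
    """Run inject -> generate -> fetch -> parse -> update for up to `depth` rounds."""
    url_db: Set[str] = set()       # the URL database starts empty
    fetched: Dict[str, str] = {}   # url -> page content
    url_db.update(seeds)           # inject the root URLs

    for _ in range(depth):
        # Generate: every known URL that has not been fetched yet.
        fetch_list = sorted(u for u in url_db if u not in fetched)
        if not fetch_list:
            break
        for url in fetch_list:
            html = fetch(url)                   # fetch the page
            fetched[url] = html
            url_db.update(LINK.findall(html))   # parse links, update the URL database
    return fetched


if __name__ == "__main__":
    # A dictionary stands in for the network (assumption for the demo).
    web = {
        "/": '<a href="/a">a</a><a href="/b">b</a>',
        "/a": '<a href="/c">c</a>',
        "/b": "",
        "/c": "",
    }
    pages = crawl(["/"], lambda u: web.get(u, ""), depth=2)
    print(sorted(pages))  # ['/', '/a', '/b']
```

With a depth of 2, `/c` is discovered in the second round but not fetched, which is how the depth limit bounds the crawl.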
The indexing process is as follows:
Build an inverted index over the collected pages, deleting redundant content and URLs;
Merge the small indexes into one large index and build the index database;
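As a rough illustration of these two indexing steps, the Python sketch below removes pages with identical content, builds a small inverted index per batch of pages, and merges the small indexes into one. The function names, the use of a SHA-256 content hash as the redundancy check, and the simple word tokenizer are assumptions for the example, not details taken from the source.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, Set

Index = Dict[str, Set[str]]  # term -> set of URLs containing it


def dedupe(pages: Dict[str, str]) -> Dict[str, str]:
    """Keep one URL per distinct page content (first URL in sorted order wins)."""
    seen: Set[str] = set()
    unique: Dict[str, str] = {}
    for url in sorted(pages):
        digest = hashlib.sha256(pages[url].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[url] = pages[url]
    return unique


def build_index(pages: Dict[str, str]) -> Index:
    """Build a small inverted index for one batch of pages."""
    index: Index = defaultdict(set)
    for url, text in pages.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return dict(index)


def merge_indexes(*parts: Index) -> Index:
    """Merge several small indexes into one large index."""
    merged: Index = defaultdict(set)
    for part in parts:
        for term, urls in part.items():
            merged[term] |= urls
    return dict(merged)


if __name__ == "__main__":
    batch1 = {"/a": "dynamic page", "/a-copy": "dynamic page"}
    batch2 = {"/b": "static page"}
    big = merge_indexes(build_index(dedupe(batch1)), build_index(dedupe(batch2)))
    print(sorted(big["page"]))
```

Building one small index per batch and merging afterwards lets each crawl round be indexed independently before the combined index database is produced.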
The search process is as follows:
The user sends a search request through the interactive interface provided by the search engine;
After the search engine completes the search, it returns the results to the user.
The original URL list is initially an empty URL database, and the injected original URLs are the initial root URLs.
The fetch list is generated and fetched as follows: in a newly created segment directory, a fetchlist holding the URLs to be collected is generated from the URL database; the corresponding page data is then collected from the network according to the URLs in the fetchlist.
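The fetchlist generation step might look roughly like the Python sketch below. The on-disk layout is purely hypothetical: the timestamped directory name, the `fetchlist.txt` file, and the `top_n` cap are assumptions chosen for illustration and do not reproduce Nutch's actual segment format.

```python
import os
import tempfile
import time
from typing import Dict, List


def generate_segment(url_db: Dict[str, bool], segments_root: str, top_n: int) -> str:
    """Create a new segment directory holding a fetchlist of unfetched URLs.

    url_db maps each URL to True once it has been fetched.
    """
    pending: List[str] = sorted(u for u, done in url_db.items() if not done)[:top_n]
    # Hypothetical layout: one timestamped directory per segment.
    segment_dir = os.path.join(segments_root, time.strftime("%Y%m%d%H%M%S"))
    os.makedirs(segment_dir, exist_ok=True)
    with open(os.path.join(segment_dir, "fetchlist.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(pending))
    return segment_dir


def read_fetchlist(segment_dir: str) -> List[str]:
    """Return the URLs that the fetcher should collect for this segment."""
    with open(os.path.join(segment_dir, "fetchlist.txt"), encoding="utf-8") as f:
        return [line for line in f.read().splitlines() if line]


if __name__ == "__main__":
    db = {
        "http://example.com/": True,    # already fetched
        "http://example.com/a": False,
        "http://example.com/b": False,
    }
    with tempfile.TemporaryDirectory() as root:
        segment = generate_segment(db, root, top_n=10)
        for url in read_fetchlist(segment):
            print("to fetch:", url)
```

Keeping each round's fetchlist in its own directory means a fetch round can be rerun or inspected without touching the URL database itself.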
The search engine is based on the Nutch architecture, which comprises a data acquisition part, an indexing part, and a search part, wherein:
The data acquisition part fetches page data, parses the pages, and starts the next round of fetching based on the URL links obtained;
The indexing part builds an inverted index over the collected data so that it can be searched;
The search part looks up the relevant data based on the query entered in the user interface provided by Nutch.
When a user sends a search request, Nutch converts it into a Lucene query and returns the results to the user.
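One way to picture the conversion of a search request into a Lucene query is the Python sketch below, which produces a string in Lucene's classic query syntax (required clauses with `+`, `field:term`, and `^boost`). The source does not describe how Nutch performs this conversion, so the field names, the boosts, and the rule that every term is required are assumptions for illustration only.

```python
import re
from typing import Dict

# Hypothetical field names and boosts, chosen only for illustration.
FIELD_BOOSTS: Dict[str, float] = {"title": 2.0, "url": 1.5, "content": 1.0}

# Characters with special meaning in Lucene's classic query syntax.
SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape(term: str) -> str:
    """Backslash-escape query syntax characters inside a user-supplied term."""
    return SPECIAL.sub(r"\\\1", term)


def to_lucene_query(request: str) -> str:
    """Require every term, matching it in any of the configured fields."""
    clauses = []
    for term in request.lower().split():
        safe = escape(term)
        alternatives = " ".join(
            f"{field}:{safe}^{boost}" for field, boost in FIELD_BOOSTS.items()
        )
        clauses.append(f"+({alternatives})")
    return " ".join(clauses)


if __name__ == "__main__":
    print(to_lucene_query("Dynamic page"))
    # +(title:dynamic^2.0 url:dynamic^1.5 content:dynamic^1.0) +(title:page^2.0 url:page^1.5 content:page^1.0)
```

Escaping the user's terms before building the query keeps characters such as `:` or `(` from being interpreted as query operators.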
The above embodiment is only a specific example of the present invention. The scope of patent protection includes but is not limited to this embodiment; any suitable change or substitution that conforms to the claims of this page data acquisition method for dynamic web pages and that is made by a person of ordinary skill in the art shall fall within the scope of patent protection of the present invention.
Claims (6)
1. A page data acquisition method for dynamic web pages, characterized in that a script parsing environment is embedded in a distributed web crawler, and the data mining, indexing, and search functions of the web crawler are used to acquire data from dynamic pages.
2. The page data acquisition method for dynamic web pages according to claim 1, characterized in that the data acquisition and mining process for dynamic pages is as follows:
First, create the original URL list and inject the original URLs;
Generate the fetch list and fetch page data from the network;
Parse the collected page content with a parser to obtain the relevant page information;
Extract the URL links found during parsing and update the URL database, which completes the data acquisition and mining process;
The indexing process is as follows:
Build an inverted index over the collected pages, deleting redundant content and URLs;
Merge the small indexes into one large index and build the index database;
The search process is as follows:
The user sends a search request through the interactive interface provided by the search engine;
After the search engine completes the search, it returns the results to the user.
3. The page data acquisition method for dynamic web pages according to claim 2, characterized in that the original URL list is initially an empty URL database, and the injected original URLs are the initial root URLs.
4. The page data acquisition method for dynamic web pages according to claim 2, characterized in that the fetch list is generated and fetched as follows: in a newly created segment directory, a fetchlist holding the URLs to be collected is generated from the URL database; the corresponding page data is then collected from the network according to the URLs in the fetchlist.
5. The page data acquisition method for dynamic web pages according to claim 2, characterized in that the search engine is based on the Nutch architecture, which comprises a data acquisition part, an indexing part, and a search part, wherein:
The data acquisition part fetches page data, parses the pages, and starts the next round of fetching based on the URL links obtained;
The indexing part builds an inverted index over the collected data so that it can be searched;
The search part looks up the relevant data based on the query entered in the user interface provided by Nutch.
6. The page data acquisition method for dynamic web pages according to claim 5, characterized in that, when a user sends a search request, Nutch converts the search request into a Lucene query and returns the results to the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510332025.6A CN104991904A (en) | 2015-06-16 | 2015-06-16 | Page data acquisition method of dynamic webpage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510332025.6A CN104991904A (en) | 2015-06-16 | 2015-06-16 | Page data acquisition method of dynamic webpage |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104991904A (en) | 2015-10-21 |
Family
ID=54303720
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510332025.6A Pending CN104991904A (en) | 2015-06-16 | 2015-06-16 | Page data acquisition method of dynamic webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104991904A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105912584A (en) * | 2016-04-01 | 2016-08-31 | 南京奥灵克物联网科技有限公司 | Data index system based on webpage information data |
CN106649353A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Webpage data collection method and apparatus |
CN107066492A (en) * | 2016-12-29 | 2017-08-18 | 百视通网络电视技术发展有限责任公司 | Matchmaker provides metadata acquisition method and system |
CN108153595A (en) * | 2018-01-18 | 2018-06-12 | 成都无糖信息技术有限公司 | A kind of big data distributed task scheduling processing unit based on python |
CN108520043A (en) * | 2018-03-30 | 2018-09-11 | 纳思达股份有限公司 | Data object acquisition method, apparatus and system, computer readable storage medium |
CN108984801A (en) * | 2018-08-22 | 2018-12-11 | 百卓网络科技有限公司 | A kind of search engine optimization method identifying asynchronous loading content based on html tag |
CN110069684A (en) * | 2017-09-30 | 2019-07-30 | 北京国双科技有限公司 | A kind of data crawling method, device, storage medium and processor |
CN116861058A (en) * | 2023-09-04 | 2023-10-10 | 浪潮软件股份有限公司 | Public opinion monitoring system and method applied to government affair field |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040044962A1 (en) * | 2001-05-08 | 2004-03-04 | Green Jacob William | Relevant search rankings using high refresh-rate distributed crawling |
CN104050003A (en) * | 2014-06-27 | 2014-09-17 | 浪潮集团有限公司 | Method for starting Nutch collecting system with shell script |
CN104516982A (en) * | 2015-01-06 | 2015-04-15 | 南通大学 | Method and system for extracting Web information based on Nutch |
Non-Patent Citations (2)
Title |
---|
常智荣: "搜索引擎Nutch在数字图书馆中集成应用的研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
许武权: "基于Web文本信息的智能检索系统的设计与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649353A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Webpage data collection method and apparatus |
CN106649353B (en) * | 2015-10-30 | 2020-05-22 | 北京国双科技有限公司 | Method and device for collecting webpage data |
CN105912584A (en) * | 2016-04-01 | 2016-08-31 | 南京奥灵克物联网科技有限公司 | Data index system based on webpage information data |
CN105912584B (en) * | 2016-04-01 | 2020-07-31 | 南京奥灵克物联网科技有限公司 | Data indexing system based on webpage information data |
CN107066492A (en) * | 2016-12-29 | 2017-08-18 | 百视通网络电视技术发展有限责任公司 | Matchmaker provides metadata acquisition method and system |
CN110069684A (en) * | 2017-09-30 | 2019-07-30 | 北京国双科技有限公司 | A kind of data crawling method, device, storage medium and processor |
CN108153595A (en) * | 2018-01-18 | 2018-06-12 | 成都无糖信息技术有限公司 | A kind of big data distributed task scheduling processing unit based on python |
CN108520043A (en) * | 2018-03-30 | 2018-09-11 | 纳思达股份有限公司 | Data object acquisition method, apparatus and system, computer readable storage medium |
CN108984801A (en) * | 2018-08-22 | 2018-12-11 | 百卓网络科技有限公司 | A kind of search engine optimization method identifying asynchronous loading content based on html tag |
CN116861058A (en) * | 2023-09-04 | 2023-10-10 | 浪潮软件股份有限公司 | Public opinion monitoring system and method applied to government affair field |
CN116861058B (en) * | 2023-09-04 | 2024-04-12 | 浪潮软件股份有限公司 | Public opinion monitoring system and method applied to government affair field |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104991904A (en) | Page data acquisition method of dynamic webpage | |
CN104915398B (en) | A kind of webpage buries method and device a little | |
CN102710795B (en) | Hotspot collecting method and device | |
CN102521232B (en) | Distributed acquisition and processing system and method of internet metadata | |
CN104182506A (en) | Log management method | |
CN102254027A (en) | Method for obtaining webpage contents in batch | |
CN103942335A (en) | Construction method of uninterrupted crawler system oriented to web page structure change | |
CN103309884A (en) | User behavior data collecting method and system | |
CN103699389A (en) | Linux kernel module relation extracting method based on compiling options | |
CN105069087A (en) | Web log data mining based website optimization method | |
JP2009048380A5 (en) | ||
CN105320754A (en) | Data searching system and method | |
CN104615627A (en) | Event public sentiment information extracting method and system based on micro-blog platform | |
CN103927400A (en) | Web site product detailed information classification crawling and product information base establishing method | |
CN104391706A (en) | Reverse engineering based model base structuring method | |
CN104102658A (en) | Method and device for mining text contents | |
CN105162822A (en) | Website log data processing method and device | |
CN101354706A (en) | Method and apparatus for collecting web page information | |
CN102508884A (en) | Method and device for acquiring hotpot events and real-time comments | |
CN104239472A (en) | Method and device for providing object information | |
CN104317857A (en) | House information acquisition service system | |
CN105426407A (en) | Web data acquisition method based on content analysis | |
CN103744944A (en) | Method for re-filtering in webpage or data crawling by web crawler | |
CN103838797A (en) | Method for optimizing mobile search engine | |
CN104317880A (en) | Method special for microblog data acquisition mode |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20151021 |