CN104991904A - Page data acquisition method of dynamic webpage - Google Patents

Info

Publication number
CN104991904A
Authority
CN
China
Prior art keywords
page
data
url
data acquisition
dynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510332025.6A
Other languages
Chinese (zh)
Inventor
焦毓葳
崔乐乐
王贵友
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Group Co Ltd
Original Assignee
Inspur Software Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-06-16
Filing date
2015-06-16
Publication date
2015-10-21
Application filed by Inspur Software Group Co Ltd
Priority to CN201510332025.6A
Publication of CN104991904A
Legal status: Pending

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a page data acquisition method for dynamic web pages, implemented as follows: a script parsing environment is embedded in a distributed web crawler, and the data of dynamic pages is acquired through the crawler's data mining, indexing, and search functions. Compared with the prior art, the method collects all kinds of dynamic data in complete form and stores it in a database, which makes it convenient to follow developments on the Internet in real time and avoids inaccurate or untimely data. It overcomes the shortcoming of traditional acquisition methods, in which a page is fetched only once rather than on demand, and greatly improves acquisition accuracy and efficiency. The method is practical, widely applicable, and easy to popularize.

Description

Page data acquisition method for dynamic web pages
Technical field
The present invention relates to the field of big data technology, and specifically to a practical page data acquisition method for dynamic web pages.
Background
At present, with the rapid development of network technology, dynamic pages with embedded JavaScript make up a growing share of the Internet, which makes page data acquisition considerably harder. In network public opinion analysis and search engine research, static pages are still the main object of page data acquisition, but the demand for acquiring data from dynamic pages is becoming increasingly urgent.
Traditional acquisition methods can only obtain the static data in a web page and cannot handle data that changes dynamically and in real time. Using them not only wastes a great deal of manpower and time but also yields poor acquisition results and poor data quality.
On this basis, a page data acquisition method for dynamic web pages is provided. The method performs data acquisition with Nutch, an open-source search engine implemented in Java. Nutch provides all the tools needed to run one's own search engine, including full-text search and a web crawler. Using Nutch's web crawler technology to build an automatic parsing task for dynamic pages effectively overcomes the shortcomings of traditional HTML page capture, improving acquisition efficiency and reducing acquisition cost.
Summary of the invention
The technical task of the present invention is to address the above deficiencies by providing a practical page data acquisition method for dynamic web pages.
A page data acquisition method for dynamic web pages is implemented as follows: a script parsing environment is embedded in a distributed web crawler, and the data of dynamic pages is acquired through the crawler's data mining, indexing, and search functions.
The data acquisition and mining process for dynamic pages is:
First, create the initial URL list and inject the initial URLs;
Generate the fetch list and fetch page data over the network;
Parse the fetched page content with a parser to obtain the relevant page information;
Extract the URL links found during parsing and update the URL store, which completes the data acquisition and mining process;
The indexing process is:
Build an inverted index over the fetched pages and delete redundant content and URLs;
Merge the small indexes into a large index and build the index store;
The search process is:
The user sends a search request through the interactive interface provided by the search engine;
After the search engine completes the search, it returns the results to the user.
The initial URL list is an empty URL store, and the injected initial URLs are the starting root URLs.
The fetch list generation and fetching process is: in a newly created segment directory, generate a fetchlist from the URL store to hold the URLs to be fetched; then fetch the corresponding page data from the network according to the URL information in the fetchlist.
The search engine has a Nutch architecture, which comprises a data acquisition part, an indexing part, and a search part, wherein:
The data acquisition part is responsible for fetching page data; it parses the pages and starts the next round of page fetching based on the URL links obtained;
The indexing part builds an inverted index over the fetched data so that it can be searched;
The search part looks up the relevant data according to the input entered in the user interface provided by Nutch.
When the user sends a search request, Nutch converts it into a Lucene query and returns the results to the user.
The page data acquisition method for dynamic web pages of the present invention has the following advantages:
The method collects all kinds of dynamic data in complete form and stores it in a database, which makes it convenient to follow developments on the Internet in real time and avoids inaccurate or untimely data. It also overcomes the shortcoming of traditional acquisition methods, in which a page is fetched only once rather than on demand, and greatly improves acquisition accuracy and efficiency. The method is practical, widely applicable, and easy to popularize.
Brief description of the drawings
Figure 1 is a schematic diagram of the implementation of the present invention.
Detailed description
The invention is further described below with reference to the drawing and specific embodiments.
To solve the problems in the prior art that dynamic data in web pages cannot be acquired, or that the acquisition rate for such data is low and the acquisition cost is very high, the present invention provides a page data acquisition method for dynamic web pages. The invention mainly targets the growing volume of dynamic data on the Internet, such as news data, BBS data, and network public opinion data, and captures it dynamically. The scheme embeds a script parsing environment in a distributed web crawler to acquire data from dynamic pages. It uses Nutch's mature data mining and indexing functions, with modified operating steps, to capture dynamic data efficiently.
As shown in Figure 1, the method is implemented as follows: a script parsing environment is embedded in a distributed web crawler, and the data of dynamic pages is acquired through the crawler's data mining, indexing, and search functions.
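The idea of embedding a script parsing environment in the crawler can be pictured as a fetch step that passes the raw HTML through a pluggable renderer before the parser sees it. The following is a minimal Python sketch under stated assumptions, not Nutch's Java API and not the patent's implementation: `ScriptRenderer`, `ToyDocumentWriteRenderer`, `extract_text`, and `acquire` are hypothetical names, and the toy renderer only expands `document.write("...")` calls with a string literal, standing in for a real JavaScript engine.

```python
import re
from typing import Protocol


class ScriptRenderer(Protocol):
    """Hypothetical hook: turns raw HTML into the HTML a browser would show."""

    def render(self, raw_html: str) -> str: ...


class ToyDocumentWriteRenderer:
    """Stand-in for a real script engine (an assumption, not part of Nutch).

    It only understands <script>document.write("literal");</script> and
    replaces the script element with the literal text it would write.
    """

    _pattern = re.compile(
        r'<script>\s*document\.write\("(.*?)"\);?\s*</script>', re.S
    )

    def render(self, raw_html: str) -> str:
        return self._pattern.sub(lambda m: m.group(1), raw_html)


def extract_text(html: str) -> str:
    """Very small parser step: drop tags and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return " ".join(without_tags.split())


def acquire(raw_html: str, renderer: ScriptRenderer) -> str:
    """Render scripts first, then parse, so script-generated content is kept."""
    return extract_text(renderer.render(raw_html))


page = '<p>Static</p><script>document.write("<p>Dynamic news</p>");</script>'

# A traditional static capture discards the script and loses its output.
static_only = extract_text(re.sub(r"<script>.*?</script>", "", page, flags=re.S))

# The crawler with an embedded renderer keeps the generated content.
with_scripts = acquire(page, ToyDocumentWriteRenderer())

print(static_only)
print(with_scripts)
```

Keeping the renderer behind an interface means the rest of the crawl pipeline does not need to know whether a page was static or script-generated.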
The data acquisition and mining process for dynamic pages is:
First, create the initial URL list and inject the initial URLs;
Generate the fetch list and fetch page data over the network;
Parse the fetched page content with a parser to obtain the relevant page information;
Extract the URL links found during parsing and update the URL store, which completes one round of data acquisition and mining;
Repeat the above steps until the specified depth is reached.
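The inject, generate, fetch, parse, and update cycle above can be sketched as a short loop. This is an illustrative Python sketch rather than Nutch code: `FAKE_WEB` is a hypothetical in-memory stand-in for the network so that the example runs offline, and `fetch`, `parse_links`, and `crawl` are names introduced here for illustration.

```python
import re
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/news">news</a> <a href="/bbs">bbs</a>',
    "http://example.test/news": '<a href="/news/1">item</a> <a href="/">home</a>',
    "http://example.test/bbs": "no links here",
    "http://example.test/news/1": '<a href="/news/2">next</a>',
}


def fetch(url):
    """Stand-in for an HTTP fetch; unknown URLs return an empty page."""
    return FAKE_WEB.get(url, "")


def parse_links(base_url, html):
    """Extract href targets and resolve them against the page URL."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]


def crawl(root_urls, depth):
    url_store = {}  # URL -> "unfetched" or "fetched"; starts empty
    for url in root_urls:  # inject the root URLs
        url_store[url] = "unfetched"

    pages = {}
    for _ in range(depth):  # one round per level of depth
        # Generate the fetch list from the URL store.
        fetchlist = sorted(u for u, s in url_store.items() if s == "unfetched")
        if not fetchlist:
            break
        for url in fetchlist:
            html = fetch(url)  # fetch
            pages[url] = html
            url_store[url] = "fetched"
            for link in parse_links(url, html):  # parse
                # Update the URL store with newly discovered links only.
                url_store.setdefault(link, "unfetched")
    return pages, url_store


pages, url_store = crawl(["http://example.test/"], depth=2)
print(sorted(pages))
print(url_store["http://example.test/news/1"])
```

With `depth=2` the loop fetches the root and the two pages it links to; the link found on the second level is recorded in the URL store but left unfetched for a later round.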
The indexing process is:
Build an inverted index over the fetched pages and delete redundant content and URLs;
Merge the small indexes into a large index and build the index store;
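The indexing steps can be illustrated with a small sketch: one inverted index per batch of fetched pages, exact duplicates dropped, and the small indexes merged into one. This is a hedged Python sketch, not the Nutch or Lucene index format; treating "redundant content" as byte-identical text detected by a hash is an assumption, and `deduplicate`, `build_index`, and `merge_indexes` are hypothetical helper names.

```python
import hashlib
from collections import defaultdict


def tokenize(text):
    """Naive whitespace tokenizer, enough for the illustration."""
    return text.lower().split()


def deduplicate(pages):
    """Keep one URL per distinct content (exact duplicates found by hash)."""
    first_url_for = {}
    for url in sorted(pages):
        digest = hashlib.sha1(pages[url].encode("utf-8")).hexdigest()
        first_url_for.setdefault(digest, url)
    return {url: pages[url] for url in first_url_for.values()}


def build_index(pages):
    """Build a small inverted index: term -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def merge_indexes(indexes):
    """Merge several small indexes into one large index with sorted postings."""
    merged = defaultdict(set)
    for index in indexes:
        for term, urls in index.items():
            merged[term] |= urls
    return {term: sorted(urls) for term, urls in merged.items()}


segment_a = {"u1": "dynamic page data", "u2": "dynamic page data"}  # u2 repeats u1
segment_b = {"u3": "page index"}

small_indexes = [build_index(deduplicate(seg)) for seg in (segment_a, segment_b)]
large_index = merge_indexes(small_indexes)
print(large_index)
```

In this sketch deduplication runs per batch; removing duplicates across batches would need the content hashes to be shared between them.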
The search process is:
The user sends a search request through the interactive interface provided by the search engine;
After the search engine completes the search, it returns the results to the user.
The initial URL list is an empty URL store, and the injected initial URLs are the starting root URLs.
The fetch list generation and fetching process is: in a newly created segment directory, generate a fetchlist from the URL store to hold the URLs to be fetched; then fetch the corresponding page data from the network according to the URL information in the fetchlist.
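Generating a fetchlist inside a newly created segment directory might look like the sketch below. It is written in Python under explicit assumptions: naming the segment directory by timestamp, the `fetchlist.json` file name, the JSON format, and the `top_n` limit are all illustrative choices and do not reproduce Nutch's on-disk layout.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def generate_fetchlist(url_store, segments_root, top_n, now=None):
    """Create a new segment directory and write the URLs due for fetching."""
    now = now or datetime.now()
    # Assumption: one directory per crawl round, named by its creation time.
    segment_dir = Path(segments_root) / now.strftime("%Y%m%d%H%M%S")
    segment_dir.mkdir(parents=True)

    due = sorted(u for u, status in url_store.items() if status == "unfetched")
    fetchlist = due[:top_n]  # cap the size of one round

    # Hypothetical file name and format, chosen only for this sketch.
    (segment_dir / "fetchlist.json").write_text(
        json.dumps(fetchlist), encoding="utf-8"
    )
    return segment_dir, fetchlist


url_store = {
    "http://example.test/": "fetched",
    "http://example.test/news": "unfetched",
    "http://example.test/bbs": "unfetched",
}

with tempfile.TemporaryDirectory() as root:
    segment_dir, fetchlist = generate_fetchlist(
        url_store, root, top_n=10, now=datetime(2015, 6, 16, 12, 0, 0)
    )
    segment_name = segment_dir.name
    stored = json.loads((segment_dir / "fetchlist.json").read_text(encoding="utf-8"))

print(segment_name, fetchlist)
```

Writing the fetchlist to its own directory keeps each round's input separate from the URL store, so a fetcher can work through it without touching the store until the update step.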
The search engine has a Nutch architecture, which comprises a data acquisition part, an indexing part, and a search part, wherein:
The data acquisition part is responsible for fetching page data; it parses the pages and starts the next round of page fetching based on the URL links obtained;
The indexing part builds an inverted index over the fetched data so that it can be searched;
The search part looks up the relevant data according to the input entered in the user interface provided by Nutch.
When the user sends a search request, Nutch converts it into a Lucene query and returns the results to the user.
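The conversion of a user's search request into a structured query can be sketched as follows. This is a plain Python illustration, not the Lucene API: the `TermQuery` and `BooleanQuery` classes here are stand-ins defined in the sketch (their names merely echo Lucene's query types), and requiring every term to match is an assumption about the query semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    """Stand-in for a single-term query (not Lucene's class)."""
    term: str


@dataclass(frozen=True)
class BooleanQuery:
    """Stand-in for a conjunction: every clause in `must` has to match."""
    must: tuple


def to_query(user_input):
    """Convert the raw search box text into a structured query."""
    terms = tuple(TermQuery(t) for t in user_input.lower().split())
    return BooleanQuery(must=terms)


def search(query, index):
    """Evaluate the query against an inverted index (term -> list of URLs)."""
    if not query.must:
        return []
    postings = [set(index.get(clause.term, ())) for clause in query.must]
    return sorted(set.intersection(*postings))


index = {
    "dynamic": ["u1", "u2"],
    "page": ["u1", "u3"],
    "data": ["u2"],
}

query = to_query("Dynamic PAGE")
results = search(query, index)
print(query)
print(results)
```

Separating query construction from evaluation mirrors the description above: the front end only translates the request, and the index lookup produces the results returned to the user.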
The above embodiment is only a specific example of the present invention. The scope of patent protection includes but is not limited to this embodiment; any suitable change or substitution made by a person of ordinary skill in the art that is consistent with the claims of the page data acquisition method for dynamic web pages of the present invention shall fall within the scope of patent protection of the present invention.

Claims (6)

1. A page data acquisition method for dynamic web pages, characterized in that it is implemented as follows: a script parsing environment is embedded in a distributed web crawler, and the data of dynamic pages is acquired through the crawler's data mining, indexing, and search functions.
2. The page data acquisition method for dynamic web pages according to claim 1, characterized in that the data acquisition and mining process for dynamic pages is:
First, create the initial URL list and inject the initial URLs;
Generate the fetch list and fetch page data over the network;
Parse the fetched page content with a parser to obtain the relevant page information;
Extract the URL links found during parsing and update the URL store, which completes the data acquisition and mining process;
The indexing process is:
Build an inverted index over the fetched pages and delete redundant content and URLs;
Merge the small indexes into a large index and build the index store;
The search process is:
The user sends a search request through the interactive interface provided by the search engine;
After the search engine completes the search, it returns the results to the user.
3. The page data acquisition method for dynamic web pages according to claim 2, characterized in that the initial URL list is an empty URL store, and the injected initial URLs are the starting root URLs.
4. The page data acquisition method for dynamic web pages according to claim 2, characterized in that the fetch list generation and fetching process is: in a newly created segment directory, generate a fetchlist from the URL store to hold the URLs to be fetched; then fetch the corresponding page data from the network according to the URL information in the fetchlist.
5. The page data acquisition method for dynamic web pages according to claim 2, characterized in that the search engine has a Nutch architecture, which comprises a data acquisition part, an indexing part, and a search part, wherein:
The data acquisition part is responsible for fetching page data; it parses the pages and starts the next round of page fetching based on the URL links obtained;
The indexing part builds an inverted index over the fetched data so that it can be searched;
The search part looks up the relevant data according to the input entered in the user interface provided by Nutch.
6. The page data acquisition method for dynamic web pages according to claim 5, characterized in that, when the user sends a search request, Nutch converts it into a Lucene query and returns the results to the user.
CN201510332025.6A 2015-06-16 2015-06-16 Page data acquisition method of dynamic webpage Pending CN104991904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510332025.6A CN104991904A (en) 2015-06-16 2015-06-16 Page data acquisition method of dynamic webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510332025.6A CN104991904A (en) 2015-06-16 2015-06-16 Page data acquisition method of dynamic webpage

Publications (1)

Publication Number Publication Date
CN104991904A true CN104991904A (en) 2015-10-21

Family

ID=54303720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510332025.6A Pending CN104991904A (en) 2015-06-16 2015-06-16 Page data acquisition method of dynamic webpage

Country Status (1)

Country Link
CN (1) CN104991904A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105912584A (en) * 2016-04-01 2016-08-31 南京奥灵克物联网科技有限公司 Data index system based on webpage information data
CN106649353A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Webpage data collection method and apparatus
CN107066492A (en) * 2016-12-29 2017-08-18 百视通网络电视技术发展有限责任公司 Matchmaker provides metadata acquisition method and system
CN108153595A (en) * 2018-01-18 2018-06-12 成都无糖信息技术有限公司 A kind of big data distributed task scheduling processing unit based on python
CN108520043A (en) * 2018-03-30 2018-09-11 纳思达股份有限公司 Data object acquisition method, apparatus and system, computer readable storage medium
CN108984801A (en) * 2018-08-22 2018-12-11 百卓网络科技有限公司 A kind of search engine optimization method identifying asynchronous loading content based on html tag
CN110069684A (en) * 2017-09-30 2019-07-30 北京国双科技有限公司 A kind of data crawling method, device, storage medium and processor
CN116861058A (en) * 2023-09-04 2023-10-10 浪潮软件股份有限公司 Public opinion monitoring system and method applied to government affair field

Citations (3)

Publication number Priority date Publication date Assignee Title
US20040044962A1 (en) * 2001-05-08 2004-03-04 Green Jacob William Relevant search rankings using high refresh-rate distributed crawling
CN104050003A (en) * 2014-06-27 2014-09-17 浪潮集团有限公司 Method for starting Nutch collecting system with shell script
CN104516982A (en) * 2015-01-06 2015-04-15 南通大学 Method and system for extracting Web information based on Nutch

Non-Patent Citations (2)

Title
常智荣: "搜索引擎Nutch在数字图书馆中集成应用的研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
许武权: "基于Web文本信息的智能检索系统的设计与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (11)

Publication number Priority date Publication date Assignee Title
CN106649353A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Webpage data collection method and apparatus
CN106649353B (en) * 2015-10-30 2020-05-22 北京国双科技有限公司 Method and device for collecting webpage data
CN105912584A (en) * 2016-04-01 2016-08-31 南京奥灵克物联网科技有限公司 Data index system based on webpage information data
CN105912584B (en) * 2016-04-01 2020-07-31 南京奥灵克物联网科技有限公司 Data indexing system based on webpage information data
CN107066492A (en) * 2016-12-29 2017-08-18 百视通网络电视技术发展有限责任公司 Matchmaker provides metadata acquisition method and system
CN110069684A (en) * 2017-09-30 2019-07-30 北京国双科技有限公司 A kind of data crawling method, device, storage medium and processor
CN108153595A (en) * 2018-01-18 2018-06-12 成都无糖信息技术有限公司 A kind of big data distributed task scheduling processing unit based on python
CN108520043A (en) * 2018-03-30 2018-09-11 纳思达股份有限公司 Data object acquisition method, apparatus and system, computer readable storage medium
CN108984801A (en) * 2018-08-22 2018-12-11 百卓网络科技有限公司 A kind of search engine optimization method identifying asynchronous loading content based on html tag
CN116861058A (en) * 2023-09-04 2023-10-10 浪潮软件股份有限公司 Public opinion monitoring system and method applied to government affair field
CN116861058B (en) * 2023-09-04 2024-04-12 浪潮软件股份有限公司 Public opinion monitoring system and method applied to government affair field

Similar Documents

Publication Publication Date Title
CN104991904A (en) Page data acquisition method of dynamic webpage
CN104915398B (en) A kind of webpage buries method and device a little
CN102710795B (en) Hotspot collecting method and device
CN102521232B (en) Distributed acquisition and processing system and method of internet metadata
CN104182506A (en) Log management method
CN102254027A (en) Method for obtaining webpage contents in batch
CN103942335A (en) Construction method of uninterrupted crawler system oriented to web page structure change
CN103309884A (en) User behavior data collecting method and system
CN103699389A (en) Linux kernel module relation extracting method based on compiling options
CN105069087A (en) Web log data mining based website optimization method
JP2009048380A5 (en)
CN105320754A (en) Data searching system and method
CN104615627A (en) Event public sentiment information extracting method and system based on micro-blog platform
CN103927400A (en) Web site product detailed information classification crawling and product information base establishing method
CN104391706A (en) Reverse engineering based model base structuring method
CN104102658A (en) Method and device for mining text contents
CN105162822A (en) Website log data processing method and device
CN101354706A (en) Method and apparatus for collecting web page information
CN102508884A (en) Method and device for acquiring hotpot events and real-time comments
CN104239472A (en) Method and device for providing object information
CN104317857A (en) House information acquisition service system
CN105426407A (en) Web data acquisition method based on content analysis
CN103744944A (en) Method for re-filtering in webpage or data crawling by web crawler
CN103838797A (en) Method for optimizing mobile search engine
CN104317880A (en) Method special for microblog data acquisition mode

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20151021
