CN104991904A

CN104991904A - Page data acquisition method for dynamic web pages

Info

Publication number: CN104991904A
Application number: CN201510332025.6A
Authority: CN
Inventors: 焦毓葳; 崔乐乐; 王贵友
Original assignee: Inspur Software Group Co Ltd
Current assignee: Inspur Software Group Co Ltd
Priority date: 2015-06-16
Filing date: 2015-06-16
Publication date: 2015-10-21

Abstract

The invention discloses a page data acquisition method for dynamic web pages. It is implemented as follows: a script parsing environment is embedded in a distributed web crawler, and data acquisition from dynamic pages is achieved through the crawler's data mining, indexing, and search functions. Compared with the prior art, the method collects various kinds of dynamic data in complete form and stores them in a database, making it convenient to follow developments on the internet in real time. It avoids inaccurate and untimely data collection and overcomes the shortcoming of traditional acquisition methods, in which a page is collected only once rather than on demand. The accuracy and efficiency of acquisition are greatly improved, and the method is practical, widely applicable, and easy to popularize.

Description

A page data acquisition method for dynamic web pages

Technical field

The present invention relates to the field of big data technology, and specifically to a practical page data acquisition method for dynamic web pages.

Background art

At present, with the rapid development of network technology, dynamic pages with embedded JavaScript account for a growing share of the internet, which makes page data collection considerably more difficult. In network public opinion and search engine research, static pages are still the main object of page data collection, but the demand for collecting data from dynamic pages is increasingly urgent.

Traditional data collection methods can only obtain the static data in a web page and are helpless with data that changes dynamically or in real time. Using them not only wastes a great deal of manpower and time, but also yields poor collection results and poor data quality.

In view of this, a page data acquisition method for dynamic web pages is provided. The method performs data acquisition with Nutch, an open-source search engine implemented in Java that provides all the tools needed to run one's own search engine, including full-text search and a web crawler. Using Nutch's web crawler technology to build automatic parsing tasks for dynamic pages effectively overcomes the shortcomings of traditional HTML page capture, improving collection efficiency and lowering acquisition cost.

Summary of the invention

The technical task of the present invention is to address the above shortcomings by providing a practical page data acquisition method for dynamic web pages.

A page data acquisition method for dynamic web pages is implemented as follows: a script parsing environment is embedded in a distributed web crawler, and data acquisition from dynamic pages is achieved through the crawler's data mining, indexing, and search functions.
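As a rough illustration of what embedding a script parsing environment in the crawler could look like, the Python sketch below places a rendering hook between the fetch step and the parse step. The function `render_scripts` is a toy stand-in that only understands inline `document.write("...")` calls; the source does not name a JavaScript engine, so a real system would substitute one here. All names and the in-memory page are hypothetical.

```python
import re
from typing import Callable

# Toy stand-in for a script parsing environment: it only "executes"
# inline document.write("...") calls. A real system would embed an
# actual JavaScript engine here (the source does not say which one).
SCRIPT_RE = re.compile(r"<script>(.*?)</script>", re.DOTALL)
WRITE_RE = re.compile(r'document\.write\("((?:[^"\\]|\\.)*)"\)')


def render_scripts(html: str) -> str:
    """Replace each <script> block with the text it writes to the page."""
    def run(match: re.Match) -> str:
        written = WRITE_RE.findall(match.group(1))
        return "".join(written)
    return SCRIPT_RE.sub(run, html)


def fetch_and_render(url: str,
                     fetch: Callable[[str], str],
                     render: Callable[[str], str] = render_scripts) -> str:
    """Fetch a page, then pass it through the script environment
    before it is handed to the parser."""
    return render(fetch(url))


# Hypothetical in-memory "network" so the sketch runs without I/O.
PAGES = {
    "http://example.com/news": (
        "<html><body><h1>News</h1>"
        '<script>document.write("<p>Latest headline</p>");</script>'
        "</body></html>"
    ),
}

rendered = fetch_and_render("http://example.com/news", PAGES.__getitem__)
print(rendered)
```

Keeping the renderer as a pluggable callable means the rest of the crawl pipeline sees ordinary HTML, whether or not the page was dynamic.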

The data acquisition and mining process for dynamic pages is:

First, create the original URL list and inject the original URLs;

Generate the crawl list and fetch page data from the web over the network;

Parse the collected page content with a parser to obtain the relevant page information;

Extract the URL links found during parsing and update the URL database, which completes the data acquisition and mining process.
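The four steps above can be sketched as one crawl round in Python. This is a minimal single-process illustration, not Nutch's implementation: the URL database is a plain dictionary, the status strings are invented for the example, and the two-page site is served from memory.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags (the 'parse' step)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def inject(url_db: dict, seeds: list) -> None:
    """Step 1: put the root URLs into the (initially empty) URL database."""
    for url in seeds:
        url_db.setdefault(url, "unfetched")


def generate(url_db: dict) -> list:
    """Step 2a: the crawl list is every URL not yet fetched."""
    return sorted(u for u, status in url_db.items() if status == "unfetched")


def crawl_round(url_db: dict, fetch) -> dict:
    """Steps 2b-4: fetch, parse, and update the URL database once."""
    pages = {}
    for url in generate(url_db):
        html = fetch(url)
        url_db[url] = "fetched"
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            # Newly discovered links become candidates for the next round.
            url_db.setdefault(link, "unfetched")
    return pages


# Hypothetical two-page site served from memory.
SITE = {
    "http://example.com/": '<a href="/a">A</a>',
    "http://example.com/a": '<a href="/">home</a>',
}
url_db = {}
inject(url_db, ["http://example.com/"])
pages = crawl_round(url_db, SITE.get)
print(url_db)
```

After one round the root page is marked fetched and the link it contained is waiting in the URL database for the next crawl list.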

The indexing process is:

Build an inverted index over the collected web pages, deleting redundant content and URLs;

Merge the small indexes into a large index and build the index database.
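The indexing steps can be illustrated with a small Python sketch. It assumes whitespace tokenization and exact-match duplicate detection within a segment, both simplifications chosen for the example; the source does not specify how redundancy is detected, and the segment data is hypothetical.

```python
from collections import defaultdict


def build_index(pages: dict) -> dict:
    """Build a small inverted index: term -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return dict(index)


def dedupe(pages: dict) -> dict:
    """Drop pages whose content duplicates an earlier page."""
    seen = set()
    unique = {}
    for url, text in pages.items():
        if text not in seen:
            seen.add(text)
            unique[url] = text
    return unique


def merge_indexes(*indexes: dict) -> dict:
    """Merge several small indexes into one large index."""
    merged = defaultdict(set)
    for index in indexes:
        for term, urls in index.items():
            merged[term] |= urls
    return dict(merged)


# Two hypothetical crawl segments; the second repeats a page body.
segment_1 = {"http://example.com/a": "dynamic page data"}
segment_2 = {
    "http://example.com/b": "page crawler",
    "http://example.com/c": "page crawler",  # redundant content
}
small_1 = build_index(dedupe(segment_1))
small_2 = build_index(dedupe(segment_2))
large = merge_indexes(small_1, small_2)
print(sorted(large["page"]))
```

Building one small index per segment and merging afterwards lets each crawl round be indexed independently before the combined index database is produced.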

The search process is:

The user sends a search request through the interactive interface provided by the search engine;

After the search engine completes the search, the results are returned to the user.

The original URL list is an empty URL database, and the injected original URLs are the initial root URLs.

The crawl list is generated and fetched as follows: in a newly created segment directory, a fetchlist is generated from the URL database to hold the URLs to be collected; the relevant page data is then collected from the network according to the URL information in the fetchlist.
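A minimal Python sketch of this step follows. The timestamp directory name, the `fetchlist.json` file, the status strings, and the `top_n` limit are all assumptions made for illustration; they are not Nutch's on-disk format.

```python
import json
import tempfile
import time
from pathlib import Path


def generate_segment(url_status: dict, segments_root: Path, top_n: int = 100) -> Path:
    """Create a new segment directory and write a fetchlist into it.

    The timestamp directory name and the JSON file format are assumptions
    made for this sketch, not Nutch's on-disk format.
    """
    segment = segments_root / time.strftime("%Y%m%d%H%M%S")
    segment.mkdir(parents=True, exist_ok=False)
    pending = sorted(u for u, s in url_status.items() if s == "unfetched")
    fetchlist = pending[:top_n]
    (segment / "fetchlist.json").write_text(json.dumps(fetchlist))
    return segment


def fetch_segment(segment: Path, fetch) -> dict:
    """Fetch every URL named in the segment's fetchlist."""
    fetchlist = json.loads((segment / "fetchlist.json").read_text())
    return {url: fetch(url) for url in fetchlist}


# Hypothetical URL database: one page already fetched, two pending.
url_status = {
    "http://example.com/": "fetched",
    "http://example.com/a": "unfetched",
    "http://example.com/b": "unfetched",
}
with tempfile.TemporaryDirectory() as root:
    seg = generate_segment(url_status, Path(root))
    fetched = fetch_segment(seg, lambda url: f"<html>{url}</html>")
print(sorted(fetched))
```

Writing the fetchlist to the segment before fetching separates deciding what to collect from the collection itself, so a fetch can be retried from the same list.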

The search engine has the Nutch structure, which comprises a data acquisition part, an indexing part, and a search part, wherein:

The data acquisition part is responsible for fetching page data, parsing the pages, and starting the next round of page fetching based on the URL links obtained;

The indexing part builds an inverted index over the collected data so that it can be searched;

The search part looks up the relevant data according to the query entered in the user interface provided by Nutch.

When the user sends a search request, Nutch converts it into a Lucene query and returns the results to the user.
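To make the conversion step concrete, the Python sketch below turns free-text user input into a query string in Lucene's classic query syntax, requiring every term with `+` and escaping characters that the syntax treats specially. It is not Nutch's translation code; the function names and the default field name `content` are assumptions for this example.

```python
import re

# Characters with special meaning in Lucene's classic query syntax.
_SPECIAL = re.compile(r'([\\+\-!():^\[\]"{}~*?|&/])')


def escape_term(term: str) -> str:
    """Backslash-escape query syntax characters inside a user term."""
    return _SPECIAL.sub(r"\\\1", term)


def to_lucene_query(user_input: str, field: str = "content") -> str:
    """Require every whitespace-separated term to occur in `field`."""
    terms = user_input.split()
    return " ".join(f"+{field}:{escape_term(t)}" for t in terms)


print(to_lucene_query("dynamic page"))
print(to_lucene_query("C++ crawler"))
```

Escaping before assembling the query keeps user input such as `C++` from being read as query operators.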

The page data acquisition method for dynamic web pages of the present invention has the following advantages:

The method collects various kinds of dynamic data in complete form and stores them in a database, making it convenient to follow developments on the internet in real time. It avoids inaccurate and untimely data collection and remedies the shortcoming of traditional acquisition methods, in which a page is collected only once rather than on demand. The accuracy and efficiency of acquisition are greatly improved, and the method is practical, widely applicable, and easy to popularize.

Brief description of the drawings

Fig. 1 is a schematic diagram of the implementation of the present invention.

Detailed description of the embodiments

The invention is further described below with reference to the drawing and specific embodiments.

To solve the problems in the prior art that dynamic data in web pages cannot be collected, or that its collection rate is low and its acquisition cost high, the present invention provides a page data acquisition method for dynamic web pages. The invention mainly targets the growing volume of dynamic data on the internet, such as news data, BBS data, and network public opinion data, and fetches it dynamically. The scheme embeds a script parsing environment in a distributed web crawler to achieve data acquisition from dynamic pages. It uses Nutch's mature data mining and indexing functions, with modified operating steps, to capture dynamic data efficiently.

As shown in Fig. 1, the method is implemented as follows: a script parsing environment is embedded in a distributed web crawler, and data acquisition from dynamic pages is achieved through the crawler's data mining, indexing, and search functions.

The data acquisition and mining process for dynamic pages is:

First, create the original URL list and inject the original URLs;

Generate the crawl list and fetch page data from the web over the network;

Repeat the above steps until the specified depth is reached.
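The repetition to a specified depth can be sketched as a loop over crawl rounds. The Python example below is an illustrative single-process version with a hypothetical chain of in-memory pages; link extraction is reduced to a regular expression over absolute `href` values to keep the focus on the depth limit.

```python
import re

HREF_RE = re.compile(r'href="([^"]+)"')


def crawl_to_depth(seeds, fetch, depth: int) -> dict:
    """Repeat the generate/fetch/update cycle `depth` times.

    Returns a mapping of URL -> the round in which it was fetched.
    """
    fetched_in_round = {}
    frontier = list(seeds)          # URLs waiting to be fetched
    for round_no in range(1, depth + 1):
        if not frontier:
            break                   # nothing new was discovered
        next_frontier = []
        for url in frontier:
            if url in fetched_in_round:
                continue
            fetched_in_round[url] = round_no
            html = fetch(url) or ""
            for link in HREF_RE.findall(html):
                if link not in fetched_in_round and link not in next_frontier:
                    next_frontier.append(link)
        frontier = next_frontier
    return fetched_in_round


# Hypothetical chain of pages: root -> a -> b -> c.
CHAIN = {
    "http://example.com/": '<a href="http://example.com/a">a</a>',
    "http://example.com/a": '<a href="http://example.com/b">b</a>',
    "http://example.com/b": '<a href="http://example.com/c">c</a>',
    "http://example.com/c": "",
}
result = crawl_to_depth(["http://example.com/"], CHAIN.get, depth=2)
print(result)
```

With `depth=2` only the root and the page it links to are fetched; pages further down the chain stay uncollected until the depth is raised.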

The indexing process is:

Merge the small indexes into a large index and build the index database;

The search process is:

After the search engine completes the search, the results are returned to the user.

The above embodiment is only a specific example of the present invention. The scope of patent protection includes, but is not limited to, the above embodiment; any suitable change or substitution made by a person of ordinary skill in the art that is consistent with the claims of the page data acquisition method for dynamic web pages of the present invention shall fall within the scope of patent protection of the present invention.

Claims

1. A page data acquisition method for dynamic web pages, characterized in that it is implemented as follows: a script parsing environment is embedded in a distributed web crawler, and data acquisition from dynamic pages is achieved through the crawler's data mining, indexing, and search functions.

2. The page data acquisition method for dynamic web pages according to claim 1, characterized in that the data acquisition and mining process for dynamic pages is:

First, create the original URL list and inject the original URLs;

Generate the crawl list and fetch page data from the web over the network;

The indexing process is:

Merge the small indexes into a large index and build the index database;

The search process is:

After the search engine completes the search, the results are returned to the user.

3. The page data acquisition method for dynamic web pages according to claim 2, characterized in that the original URL list is an empty URL database, and the injected original URLs are the initial root URLs.

4. The page data acquisition method for dynamic web pages according to claim 2, characterized in that the crawl list is generated and fetched as follows: in a newly created segment directory, a fetchlist is generated from the URL database to hold the URLs to be collected; the relevant page data is then collected from the network according to the URL information in the fetchlist.

5. The page data acquisition method for dynamic web pages according to claim 2, characterized in that the search engine has the Nutch structure, which comprises a data acquisition part, an indexing part, and a search part, wherein:

6. The page data acquisition method for dynamic web pages according to claim 5, characterized in that, when the user sends a search request, Nutch converts it into a Lucene query and returns the results to the user.