CN106484828B - Distributed internet data rapid acquisition system and acquisition method

Info

Publication number: CN106484828B
Application number: CN201610864062.6A
Authority: CN
Inventors: 张晖; 杨春明; 李晓伟; 李波; 赵旭剑
Original assignee: Southwest University of Science and Technology
Current assignee: Sichuan Zhongke Rongchuang Technology Co ltd
Priority date: 2016-09-29
Filing date: 2016-09-29
Publication date: 2020-01-21
Anticipated expiration: 2036-09-29
Also published as: CN106484828A

Abstract

The invention discloses a distributed internet data rapid acquisition system comprising five layers: a seed website setting node, a hyperlink acquisition layer, a real-time queue, a webpage downloading and analyzing layer, and a webpage data storage layer. The seed website setting node is used for setting and storing the parameters and extraction rules of each data source. The hyperlink acquisition layer requests the hyperlink list webpages of a data source and extracts the hyperlinks of target webpages. The real-time queue stores and serves the URL hyperlinks extracted by the hyperlink acquisition layer together with their corresponding extraction rules, and also stores the URL hyperlinks that have already been accessed. The webpage downloading and analyzing layer requests and parses the unaccessed URL hyperlinks in the real-time queue and extracts the specified data in a formatted form. The webpage data storage layer stores the formatted target data extracted by the webpage downloading and analyzing layer. The invention acquires data through distributed, layered cooperation and can meet the needs of systems with large acquisition volumes, many data sources and high real-time requirements.

Description

Distributed internet data rapid acquisition system and acquisition method

Technical Field

The invention belongs to the technical field of internet big data acquisition, and particularly relates to a distributed internet data rapid acquisition system and an acquisition method.

Background

The rapid development of the internet has brought society into an information age of abundant, open data: the era of big data has arrived. Data plays an extremely important role in enterprise operation, government decision-making, social trend analysis and the like, so acquiring data rapidly and at scale has become a technical focus; existing data acquisition schemes, however, still need improvement. Traditional internet data acquisition relies mainly on web crawlers and targets structured or semi-structured text data. A web crawler is a program or script that automatically crawls internet text webpages according to certain rules. The text data is mostly nested in webpage source code. How promptly data is acquired directly determines its validity and timeliness, so rapid acquisition has become the most important factor.

Disclosure of Invention

In view of this, the invention provides a distributed internet data rapid acquisition system and acquisition method to address the problems of large data acquisition volume, many data sources and low real-time performance.

In order to solve the technical problem, the invention discloses a distributed internet data rapid acquisition system which comprises five layers, namely a seed website setting node, a hyperlink acquisition layer, a real-time queue, a webpage downloading and analyzing layer and a webpage data storage layer;

the seed website setting node is used for setting and storing the parameters, extraction rules and the like of each data source, and is a single node;

the hyperlink acquisition layer is used for requesting the hyperlink list webpages of a data source and extracting the hyperlinks of target webpages;

the real-time queue is used for storing and serving the URL hyperlinks extracted by the hyperlink acquisition layer together with their corresponding extraction rules, and for storing the URL hyperlinks that have already been accessed;

the webpage downloading and analyzing layer is used for requesting and parsing the unaccessed URL hyperlinks in the real-time queue and extracting the specified data in a formatted form;

the webpage data storage layer is used for storing the formatted target data extracted by the webpage downloading and analyzing layer.

Compared with the prior art, the invention can obtain the following technical effects:

1) The acquisition system of the invention runs in a waterfall mode, has high real-time performance and strong scalability, and copes well with systems that have many data sources and large acquisition volumes.

2) The invention acquires data through distributed, layered cooperation, can meet the requirements of systems with many data sources, large acquisition volumes and high real-time demands, and is highly scalable and customizable. Data extraction offers two schemes, a structured precise extraction scheme and a general text extraction scheme (for the body text only), so the extracted data is more complete.

Of course, it is not necessary for any one product in which the invention is practiced to achieve all of the above-described technical effects simultaneously.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a block diagram of a distributed Internet data rapid acquisition system of the present invention;

FIG. 2 is a flow chart of the distributed internet data rapid acquisition method of the present invention.

Detailed Description

The embodiments of the invention are described in detail below with reference to the accompanying drawings, so that how the invention applies technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented.

As shown in FIG. 1, the invention discloses a distributed internet data rapid acquisition system comprising a seed website setting node, a hyperlink acquisition layer, a real-time queue, a webpage downloading and analyzing layer and a webpage data storage layer.

The seed website setting node is used for setting and storing the parameters, extraction rules and the like of each data source; it is a single node and uses a relational database.

The hyperlink acquisition layer requests the hyperlink list webpages of a data source and extracts the hyperlinks of target webpages. It consists of one or more web crawler nodes that are physically isolated from each other and cooperate logically to extract the hyperlinks of target webpages; the layer is distributed and can be scaled horizontally according to the scale of the data sources. Each node runs on a timer: it combines each acquired hyperlink with its corresponding extraction rules and stores the result in the real-time queue.

The real-time queue stores and serves the URL hyperlinks extracted by the hyperlink acquisition layer together with their corresponding extraction rules, and also stores the URL hyperlinks that have already been accessed. It is the core component through which the crawler nodes cooperate. The layer has high real-time performance, supports real-time storage and retrieval of data, provides persistent storage, and also filters the collected URL hyperlinks. The real-time queue is deployed independently; unaccessed hyperlinks are stored and retrieved in real time, and accessed hyperlinks are stored persistently.
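
The queue behaviour described above (serve unaccessed hyperlinks, remember accessed ones, filter repeats) can be sketched as follows. This is a minimal in-memory illustration only: the class name `RealTimeQueue` and its methods are hypothetical, and a real deployment would use an independently deployed queue service with persistent storage rather than Python sets.

```python
from collections import deque


class RealTimeQueue:
    """In-memory stand-in for the real-time queue (hypothetical class name)."""

    def __init__(self):
        self._pending = deque()  # unaccessed entries, served first-in first-out
        self._queued = set()     # URLs currently waiting in the queue
        self._visited = set()    # URLs already handed out (persisted in a real system)

    def push(self, url, rules):
        """Enqueue a URL with its extraction rules; return False if filtered out."""
        if url in self._visited or url in self._queued:
            return False
        self._pending.append((url, rules))
        self._queued.add(url)
        return True

    def pop(self):
        """Return the next unaccessed (url, rules) pair, or None when empty."""
        if not self._pending:
            return None
        url, rules = self._pending.popleft()
        self._queued.discard(url)
        self._visited.add(url)
        return url, rules
```

Keeping the rules next to the URL in each entry mirrors the binding of a hyperlink to its extraction rules; the two sets implement the filtering role of the queue.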

The webpage downloading and analyzing layer requests and parses the unaccessed URL hyperlinks in the real-time queue and extracts the specified data in a formatted form. Like the hyperlink acquisition layer, it consists of multiple web crawler nodes that work independently of each other; the nodes are mainly responsible for structured information extraction of the target data, and the layer is distributed and can be scaled horizontally according to the number of hyperlinks extracted by the hyperlink acquisition layer. Each node includes an information structuring extraction method based on the HTML document tree and a general text information extraction method, and can switch between the two when extracting the body text of a webpage. Each node runs in real time: it reads hyperlinks from the real-time queue, then filters, parses and formats the target data and stores it.

The webpage data storage layer stores the formatted target data extracted by the webpage downloading and analyzing layer. It is implemented with an open-source big data storage database deployed on multiple nodes; it is well suited to storing webpage document data, offers large storage capacity and excellent read performance, and allows the number of nodes to be expanded dynamically.
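
To make the flow between the five layers concrete, the sketch below runs them in a single process. It is an illustration under stated assumptions, not the patented implementation: the function name `run_pipeline` and its callable parameters are hypothetical, page fetching is injected so that no network access is needed, and the distributed nodes, timers and databases are replaced by plain Python containers.

```python
from collections import deque


def run_pipeline(seeds, fetch, extract_links, extract_record):
    """Run one pass of the five-layer flow in a single process.

    seeds          -- layer 1: list of (seed_url, rules) pairs
    fetch          -- callable url -> page content (injected, so no network is needed)
    extract_links  -- callable (page, rules) -> iterable of target URLs
    extract_record -- callable (page, rules) -> dict of formatted fields
    """
    queue = deque()   # layer 3: unaccessed (url, rules) entries
    seen = set()      # layer 3: URLs already queued, used as a filter
    storage = []      # layer 5: formatted records

    # Layer 2: request each list page and queue the target links with their rules.
    for seed_url, rules in seeds:
        for url in extract_links(fetch(seed_url), rules):
            if url not in seen:
                seen.add(url)
                queue.append((url, rules))

    # Layer 4: download each queued page, extract a record, and store it.
    while queue:
        url, rules = queue.popleft()
        record = extract_record(fetch(url), rules)
        record["url"] = url
        storage.append(record)
    return storage
```

In the system described here, layers 2 and 4 would run on separate nodes and communicate only through the queue, which is what allows each layer to be scaled independently.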

A distributed network data rapid acquisition method, based on the distributed network data rapid acquisition system described above, comprises the following specific steps, as shown in FIG. 2:

Step 1: the seed website setting node sets all seed URLs, extraction rules, website encodings and other information.

The user adds, through a web system, all target websites to be collected; each target website contains several target sections of interest to the user, such as education or people's livelihood. The following settings are then made:

First, set the website information, including the domain name, type, page encoding, general target URL filtering regular expression and general information extraction rules (author, release time, body text, etc.) of the current website. The information set for the website applies to all collected sections under the website, as shown in Table 1:

Table 1. Website settings

Second, set the section information of the website, including the name and seed URL of each section. If a section's domain name, target URL filtering regular expression or information extraction rules differ from the general settings of the website, they are set individually in the corresponding fields. For example, the website's domain name may be set to http://newssc.org/ while a section's domain name is http://edu.newssc.org/. Settings of a section that are the same as the website's general settings do not need to be set again. The sections under "website 1" in Table 1 are set as shown in Table 2:

Table 2. Section settings for "website 1" of Table 1
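
The two levels of settings can be pictured as two record types, one per website and one per section, where a section leaves a field empty when it matches the website's general setting. The sketch below is an assumption about how such records might look; the class names, field names and the example.org values are hypothetical and are not taken from Table 1 or Table 2.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SiteSettings:
    name: str
    domain: str
    site_type: str
    encoding: str
    url_filter: str                             # general target-URL filtering regex
    rules: dict = field(default_factory=dict)   # general extraction rules


@dataclass
class SectionSettings:
    name: str
    seed_url: str
    domain: Optional[str] = None                # None means "same as the website"
    url_filter: Optional[str] = None            # None means "same as the website"
    rules: dict = field(default_factory=dict)   # only rules that differ from the website


# Hypothetical example values, for illustration only.
site = SiteSettings(
    name="website 1",
    domain="http://example.org/",
    site_type="news",
    encoding="GBK",
    url_filter=r"/\d+\.html$",
    rules={"author": "rule 2", "time": "rule 3", "text": "rule 4"},
)
education = SectionSettings(
    name="education",
    seed_url="http://edu.example.org/list.html",
    domain="http://edu.example.org/",
)
```

Here the education section overrides only the domain name, so its filter and extraction rules would fall back to the website's general settings.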

Step 2: the nodes in the hyperlink acquisition layer periodically read the data source information and acquire the URLs on each data source's list pages; the data source information and the URLs are formatted and stored in the real-time queue together with the corresponding structured extraction rules.

The nodes in the hyperlink acquisition layer periodically read the website and section information set in step 1 and format it with the section as the basic unit. For each section to be collected, the process is as follows:

First, the section inherits all information of the website to which it belongs.

Second, if the section sets its own domain name, URL filtering regular expression or page information extraction rules (author, release time, body text, etc.), these section-specific settings replace the corresponding settings inherited from the website, as shown in Table 3:

Table 3. Section formatting

Third, the node requests the section's seed URL page and extracts the set of target webpage URLs using the URL filtering regular expression of the current section (each section yields one URL set).

Fourth, taking each element of the current URL set as a target URL, the node completes the target URL using the domain name set for the current section (a target URL may appear in a page in a relative form such as '/abc.html' and must be joined into a complete form such as 'http://cu.china.com.cn/abc.html'), then combines it with the section information from the second step to form a new tuple, for example:

The URL set collected from the seed URL of section 1 (S1) is:

SET_1: (link_1, link_2, link_3, ..., link_m)

The tuples stored in the queue are then:

<link_1, section 1, website 1, GBK, 1, rule 2, rule 3, (general) rule 4>

<link_2, section 1, website 1, GBK, 1, rule 2, rule 3, (general) rule 4>

...

<link_m, section 1, website 1, GBK, 1, rule 2, rule 3, (general) rule 4>

The URL set collected from the seed URL of section 2 (S2) is:

SET_2: (link_1, link_2, link_3, ..., link_n)

The tuples stored in the queue are then:

<link_1, section 2, website 1, GBK, 1, (general) rule 2, (general) rule 3, rule 4>

...

<link_n, section 2, website 1, GBK, 1, (general) rule 2, (general) rule 3, rule 4>

The information in each tuple includes the target URL together with the page encoding, website name, section name, title extraction rule, author extraction rule, body text extraction rule, release time extraction rule, website type and so on. Finally, the assembled tuples are pushed into the real-time queue; this binds each target URL to its extraction rules in a one-to-one correspondence.
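
The four sub-steps of step 2 can be sketched for a single section as below. This is a simplified illustration under several assumptions: settings are plain dictionaries with hypothetical key names, the list page is passed in as a string instead of being requested, links are found with a basic `href="..."` pattern, and the field order of the tuple is chosen for the example rather than taken from the tables.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Simplification: only double-quoted href attributes are recognised.
HREF = re.compile(r'href="([^"]+)"')


def resolve(site, section):
    """Inherit every website setting, then let non-empty section settings override."""
    merged = dict(site)
    merged.update({k: v for k, v in section.items() if v is not None})
    return merged


def collect_section(site, section, list_page_html, queue):
    """Queue one tuple per target URL found on the section's seed page."""
    cfg = resolve(site, section)
    url_filter = re.compile(cfg["url_filter"])
    seen = set()
    for href in HREF.findall(list_page_html):
        if not url_filter.search(href):
            continue
        # Join relative links such as '/abc.html' onto the section's domain name.
        full_url = urljoin(cfg["domain"], href)
        if full_url in seen:
            continue
        seen.add(full_url)
        queue.append((full_url, cfg["section"], cfg["site"], cfg["encoding"],
                      cfg["site_type"], cfg["title_rule"], cfg["author_rule"],
                      cfg["text_rule"], cfg["time_rule"]))
    return len(seen)
```

Because each queued tuple carries its own copy of the resolved rules, a downstream node can process an entry without consulting the seed website setting node again.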

Step 3: the nodes in the webpage downloading and analyzing layer read webpage hyperlinks from the real-time queue in real time, request and download the pages, and extract data in a structured manner using the extraction rules transmitted with each hyperlink. If the body text is extracted successfully, the data is stored directly in the webpage data storage layer; otherwise the body text is extracted with a general text extraction method and then stored in the webpage data storage layer.

A node in the webpage downloading and analyzing layer reads in real time the tuples assembled in step 2 and requests the target page. For a successfully requested page, it parses the page with the corresponding extraction rules in the tuple and extracts the information other than the body text; if a rule is empty or extraction with the current rule fails, it reports a rule-empty exception or an extraction-failure exception and records a log. For the body text, if the body text extraction rule is empty or extraction fails, the general text extraction method is used (as a backup means that requires no rule); if this also fails, a general-extraction-failure exception is reported and a log is recorded. If extraction succeeds, the corresponding data is stored in the data storage layer as a single record.
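
The rule-first, fallback-second logic of step 3 can be sketched as below. The names `extract_fields`, `apply_rule`, `general_text` and `ExtractionError` are hypothetical; the rule engine and the fallback extractor are passed in as callables because the source does not specify their interfaces. One assumption is made explicit: a failed field other than the body text is logged and skipped, while a body text that cannot be extracted by either method raises an exception.

```python
import logging

log = logging.getLogger("parser_node")


class ExtractionError(Exception):
    """Raised when both rule-based and general body-text extraction fail."""


def extract_fields(page, rules, apply_rule, general_text):
    """Extract one record from a downloaded page.

    rules        -- dict of field name -> rule (or None); "text" is the body-text rule
    apply_rule   -- callable (page, rule) -> str or None; hypothetical rule engine
    general_text -- callable page -> str or None; rule-free fallback extractor
    """
    record = {}
    for name, rule in rules.items():
        if name == "text":
            continue  # the body text is handled separately below
        if not rule:
            log.warning("rule for %r is empty", name)
            continue
        value = apply_rule(page, rule)
        if value is None:
            log.warning("extraction failed for %r", name)
            continue
        record[name] = value

    text_rule = rules.get("text")
    text = apply_rule(page, text_rule) if text_rule else None
    if text is None:
        text = general_text(page)  # backup means, needs no rule
    if text is None:
        log.error("general text extraction failed")
        raise ExtractionError("no body text extracted")
    record["text"] = text
    return record
```

The returned dictionary corresponds to the single record that would be written to the data storage layer.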

Note: the general text extraction method used in the invention (a density-based text extraction method) has been published in a journal paper: [1] Zhu Zede, Li Miao, Zhang Jian, Chen Lei, Zeng Xinhua. Web text extraction based on a text density model [J]. Pattern Recognition and Artificial Intelligence, 2013, 07: 667-672.
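
To give a rough feel for what a density-based fallback does, the sketch below keeps only those lines of a page whose visible text is long and dominates the surrounding markup. It is a much-simplified stand-in written for illustration and is not the algorithm of the cited paper; the function name `dense_text` and the thresholds `min_chars` and `min_density` are assumptions.

```python
import re

TAG = re.compile(r"<[^>]+>")


def dense_text(html, min_chars=20, min_density=0.5):
    """Keep lines whose visible text is long and dominates the markup."""
    kept = []
    for line in html.splitlines():
        raw = line.strip()
        if not raw:
            continue
        text = TAG.sub("", raw).strip()
        # Density is the share of the line taken up by visible text.
        if len(text) >= min_chars and len(text) / len(raw) >= min_density:
            kept.append(text)
    return "\n".join(kept) or None
```

Navigation bars and link lists tend to have short text wrapped in a lot of markup, so they fall below the thresholds, while article paragraphs pass; the function returns None when nothing qualifies, which a caller can treat as an extraction failure.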

The foregoing description shows and describes several preferred embodiments of the invention. As noted above, however, the invention is not limited to the forms disclosed herein; the description should not be construed as excluding other embodiments, and the invention can be used in various other combinations, modifications and environments and can be changed within the scope of the inventive concept described herein, in line with the above teachings or the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the invention fall within the protection of the appended claims.

Claims

1. A distributed network data-based rapid acquisition method, based on a distributed network data rapid acquisition system, characterized in that the distributed network data rapid acquisition system comprises five layers, namely a seed website setting node, a hyperlink acquisition layer, a real-time queue, a webpage downloading and analyzing layer and a webpage data storage layer;

the seed website setting node is used for setting and storing the parameters and extraction rules of each data source, and is a single node;

the webpage data storage layer is used for storing the formatted target data extracted by the webpage downloading and analyzing layer;

the method comprises the following specific steps:

step 1, the seed website setting node sets all seed URLs, extraction rules and website encoding information;

step 3, a node in the webpage downloading and analyzing layer reads webpage hyperlinks from the real-time queue in real time, requests and downloads the pages, and extracts data in a structured manner using the extraction rules transmitted with each hyperlink; if the body text is extracted successfully, the data is stored directly in the webpage data storage layer; otherwise the body text is extracted by a general text extraction method and then stored in the webpage data storage layer;

step 1 specifically comprises: a user adds, through a web system, all target websites to be collected, each target website comprising a plurality of target sections of interest to the user, and then the following settings are made:

first, website information is set, comprising the domain name, type, page encoding, general target URL filtering regular expression and general information extraction rules of the current website, the information set for the website applying to all collected sections of the website;

second, the section information of the website is set, comprising the name and seed URL of each section, and if the domain name, target URL filtering regular expression or information extraction rules of a section differ from the general settings of the website, they are set individually in the corresponding fields;

step 2 specifically comprises: the nodes in the hyperlink acquisition layer periodically read the website and section information set in step 1 and format it with the section as the basic unit, the process for each section to be collected being as follows:

second, if the section sets its own domain name, URL filtering regular expression or page information extraction rules, the section uses these section-specific settings to replace the corresponding settings inherited from the website;

third, the section's seed URL page is requested and the set of target webpage URLs is extracted using the URL filtering regular expression of the current section, each section yielding one URL set;

fourth, each element of the current URL set is taken as a target URL, and the target URL, completed using the domain name set for the current section, is combined with the section information from the second step to form a new tuple, the information in the tuple comprising the target URL together with the page encoding, website name, section name, title extraction rule, author extraction rule, body text extraction rule, release time extraction rule and website type; finally, the assembled tuples are pushed into the real-time queue, binding each target URL to its extraction rules in a one-to-one correspondence.

2. The distributed network data-based rapid acquisition method according to claim 1, wherein step 3 specifically comprises: a node in the webpage downloading and analyzing layer reads in real time the tuples placed in the real-time queue by step 2 and requests the target page; for a successfully requested target page, the node parses the page with the corresponding extraction rules in the tuple and extracts the information other than the body text, and if a rule is empty or extraction with the current rule fails, reports a rule-empty exception or an extraction-failure exception and records a log; for the body text, if the body text extraction rule is empty or extraction fails, a general text extraction method is used, and if this also fails, a general-extraction-failure exception is reported and a log is recorded; if extraction succeeds, the corresponding data is stored in the data storage layer as a single record.

3. The distributed network data-based rapid acquisition method according to claim 1, wherein the seed website setting node uses a relational database.

4. The distributed network-based data rapid acquisition method as claimed in claim 1, wherein the hyperlink acquisition layer consists of one or more web crawler nodes, which are physically isolated from each other, cooperate logically to extract the hyperlinks of target webpages, and can be scaled horizontally according to the scale of the data sources; the hyperlink acquisition layer is distributed and multi-node, each node runs on a timer, and a node combines each acquired hyperlink with its corresponding extraction rules and stores the result in the real-time queue.

5. The distributed network-based data rapid acquisition method of claim 1, wherein the real-time queue is deployed independently, unaccessed hyperlinks are stored and retrieved in real time, and accessed hyperlinks are stored persistently.

6. The distributed network-based data rapid acquisition method as claimed in claim 1, wherein the webpage downloading and analyzing layer consists of a plurality of web crawler nodes, which work independently of each other and are responsible for structured information extraction of the target data, and the layer can be scaled horizontally according to the number of hyperlinks extracted by the hyperlink acquisition layer; each node comprises an information structuring extraction method based on the HTML document tree and a general text information extraction method, and can switch between the two when extracting the body text of a webpage;

the webpage downloading and analyzing layer is distributed and multi-node and can be scaled horizontally, each node runs in real time, and the nodes read hyperlinks from the real-time queue, then filter, parse and format the target data and store it.

7. The distributed network-based data rapid acquisition method according to claim 1, wherein the webpage data storage layer uses an open-source big data storage database that is multi-node and expandable.