CN106484828A

CN106484828A - Distributed system and method for rapid acquisition of Internet data

Info

Publication number: CN106484828A
Application number: CN201610864062.6A
Authority: CN
Inventors: 张晖; 杨春明; 李晓伟; 李波; 赵旭剑
Original assignee: Southwest University of Science and Technology
Current assignee: Sichuan Zhongke Rongchuang Technology Co ltd
Priority date: 2016-09-29
Filing date: 2016-09-29
Publication date: 2017-03-08
Anticipated expiration: 2036-09-29
Also published as: CN106484828B

Abstract

The invention discloses a distributed system for rapid acquisition of Internet data, comprising five layers: a seed-site configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer. The seed-site configuration node sets and stores the parameters and extraction rules of each data source. The hyperlink collection layer requests the hyperlink list pages of the data sources and extracts the hyperlinks of the target pages. The real-time queue holds the URL hyperlinks extracted by the hyperlink collection layer together with their corresponding extraction rules, as well as the URL hyperlinks that have already been visited. The page download and parsing layer requests and parses the unvisited URL hyperlinks in the real-time queue and extracts the specified data in a formatted form. The web page data storage layer stores the target data extracted by the page download and parsing layer. The invention acquires data through distributed, layered cooperation and can meet application demands involving large acquisition volumes, many data sources, and strict real-time requirements.

Description

Distributed system and method for rapid acquisition of Internet data

Technical field

The invention belongs to the technical field of Internet big data acquisition, and in particular relates to a distributed system and method for rapid acquisition of Internet data.

Background art

The rapid development of the Internet has brought society into an information age in which data are highly developed and openly available; the era of big data has arrived. Data play an extremely important role in enterprise operation, government decision-making, and the analysis of social trends, so large-scale, rapid data collection has become a technical focus. Existing collection methods, however, leave room for improvement. Traditional Internet data collection relies mainly on web crawlers and targets structured or semi-structured text data. A web crawler is a program or script that automatically traverses and crawls Internet text pages according to certain rules. The text data are mostly embedded in the program code of the web page. Because the real-time performance of acquisition directly determines how timely, and therefore how useful, the data are, rapid data acquisition is of primary importance.

Summary of the invention

In view of this, the invention addresses the problems of large acquisition volume, many data sources, and poor real-time performance by providing a distributed system and method for rapid acquisition of Internet data.

To solve the above technical problem, the invention discloses a distributed system for rapid acquisition of Internet data, comprising five layers: a seed-site configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer;

wherein the seed-site configuration node sets and stores the parameters, extraction rules, and similar information of each data source, and is a single node;

wherein the hyperlink collection layer requests the hyperlink list pages of the data sources and extracts the hyperlinks of the target pages;

wherein the real-time queue holds the URL hyperlinks extracted by the hyperlink collection layer, the extraction rules corresponding to those hyperlinks, and the URL hyperlinks that have already been visited;

wherein the page download and parsing layer requests and parses the unvisited URL hyperlinks in the real-time queue and extracts the specified data in a formatted form;

wherein the web page data storage layer stores the target data extracted by the page download and parsing layer.

Compared with the prior art, the invention can achieve the following technical effects:

1) The acquisition system runs in a pipelined manner, offers high real-time performance and strong scalability, and adapts well to system requirements involving many data sources and large acquisition volumes.

2) The invention acquires data through distributed, layered cooperation, can meet system requirements involving many data sources, large acquisition volumes, and high real-time performance, and is also highly scalable and customizable. Data extraction comprises two schemes, precise structured extraction and generic body-text extraction (for the body text only), so the extracted data are more complete.

Of course, a product implementing the invention does not necessarily need to achieve all of the above technical effects at the same time.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the invention and form part of it. The illustrative embodiments and their descriptions explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a structural diagram of the distributed system for rapid acquisition of Internet data according to the invention;

Fig. 2 is a flowchart of the distributed method for rapid acquisition of Internet data according to the invention.

Detailed description of the embodiments

Embodiments of the invention are described in detail below with reference to examples, so that the way the invention applies technical means to solve the technical problem and achieve the technical effects can be fully understood and implemented.

As shown in Fig. 1, the distributed system for rapid acquisition of Internet data according to the invention comprises five layers: a seed-site configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer.

The seed-site configuration node sets and stores the parameters, extraction rules, and similar information of each data source, and is a single node. It uses a relational database.

The hyperlink collection layer requests the hyperlink list pages of the data sources and extracts the hyperlinks of the target pages. It consists of one or more web crawler nodes that are physically isolated from one another and cooperate logically to extract the target-page hyperlinks. The layer is a distributed, multi-node layer that can be scaled horizontally according to the size of the data sources. Each node runs on a timer, combines the collected hyperlinks with their corresponding extraction rules, and stores them in the real-time queue.

The real-time queue holds the URL hyperlinks extracted by the hyperlink collection layer, the extraction rules corresponding to those hyperlinks, and the URL hyperlinks that have already been visited. It is the core component through which the crawling nodes cooperate. This layer has very high real-time performance: data can be stored and taken out in real time, the queue supports persistent storage, and it also filters the collected URL hyperlinks. The real-time queue is deployed independently; unvisited hyperlinks are accessed in real time, and visited hyperlinks are stored persistently.

The page download and parsing layer requests and parses the unvisited URL hyperlinks in the real-time queue and extracts the specified data in a formatted form. Like the hyperlink collection layer, it consists of multiple web crawler nodes that work independently of one another and are mainly responsible for structured extraction of the target data. The layer is a distributed, multi-node layer that can be scaled horizontally according to the number of hyperlinks extracted by the hyperlink collection layer. Each node includes one structured information extraction method based on the HTML document tree and one generic text extraction method, and can switch between them when extracting the body text of a page. Each node runs in real time: after reading a hyperlink from the real-time queue, it filters, parses, extracts the target data in a formatted form, and stores the result.

The web page data storage layer stores the target data extracted by the page download and parsing layer. It is implemented with an open-source big data store and uses multi-node storage; it offers strong storage capability for web document data, a large storage capacity, and excellent read performance, and the number of nodes can be expanded dynamically.
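The role of the real-time queue described above (holding unvisited hyperlinks with their rules, remembering visited hyperlinks, and filtering duplicates) can be illustrated with a minimal in-memory sketch. The class name `RealTimeQueue` and its `push`/`pop` methods are hypothetical and do not come from the patent; a real deployment would presumably use an independently deployed, persistent queue service rather than Python containers.

```python
from collections import deque


class RealTimeQueue:
    """In-memory stand-in for the real-time queue layer (hypothetical API)."""

    def __init__(self):
        self._pending = deque()  # unvisited (url, rules) entries, served first-in first-out
        self._seen = set()       # every URL accepted so far; persisted in a real system

    def push(self, url, rules):
        """Accept a URL with its extraction rules; return False for a duplicate."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append((url, rules))
        return True

    def pop(self):
        """Return the next unvisited (url, rules) entry, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


queue = RealTimeQueue()
queue.push("http://edu.newssc.org/a.html", {"title": "rule 1"})
queue.push("http://edu.newssc.org/a.html", {"title": "rule 1"})  # duplicate, filtered out
```

Keeping the set of accepted URLs separate from the pending entries is one simple way to obtain the filtering behaviour; the example URL path is made up for illustration.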

A distributed method for rapid acquisition of network data, based on the distributed rapid acquisition system described above, comprises the following steps, as shown in Fig. 2:

Step 1: the seed-site configuration node sets all seed URLs, extraction rules, website encodings, and similar information.

Through a web system, the user adds all target websites to be collected. Each target website contains multiple target columns (site sections) of interest to the user, such as education or people's livelihood. The settings are then made as follows:

First step: set the website information, which includes the domain name, name, type, and page encoding of the current website, the website's general regular expression for filtering target URLs, and its general information extraction rules (covering author, publication time, body text, and so on). The information set for the website applies to all collected columns under that website, as shown in Table 1:

Table 1: Website settings

Second step: set the column information of the website, including the column name and seed URL. If the column's domain name, target-URL filtering regular expression, or information extraction rules differ from the website's general settings, the corresponding item is set individually for the column. For example, the domain name set for the website is http://newssc.org/, while the domain name of the column is http://edu.newssc.org/. If an item of the current column is the same as the website's general setting, it need not be set again for the column. Table 2 shows the subordinate columns set for "website 1" in Table 1:

Table 2: Column settings for "website 1" in Table 1
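The website and column settings described in Step 1 can be sketched as two record types plus a function that applies the column's individual settings over the website's general ones. This is only an illustration under assumptions: the class names, field names, the `resolve` function, the seed URL, the filter expression, and the rule placeholders are all hypothetical; only the two domain names come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SiteSettings:
    """General settings for a website (field names are assumptions)."""
    domain: str
    name: str
    site_type: int
    encoding: str
    url_filter: str        # general target-URL filtering regular expression
    rules: Dict[str, str]  # general extraction rules keyed by field name


@dataclass
class ColumnSettings:
    """Settings for one column; None or a missing rule means 'use the website setting'."""
    name: str
    seed_url: str
    domain: Optional[str] = None
    url_filter: Optional[str] = None
    rules: Dict[str, str] = field(default_factory=dict)


def resolve(site, column):
    """Return the effective settings for a column: website defaults plus column overrides."""
    merged_rules = dict(site.rules)
    merged_rules.update(column.rules)  # column-specific rules replace general ones
    return {
        "site": site.name,
        "column": column.name,
        "seed_url": column.seed_url,
        "type": site.site_type,
        "encoding": site.encoding,
        "domain": column.domain or site.domain,
        "url_filter": column.url_filter or site.url_filter,
        "rules": merged_rules,
    }


site = SiteSettings("http://newssc.org/", "website 1", 1, "GBK", r"\.html$",
                    {"title": "rule 1", "author": "rule 2", "body": "rule 3", "time": "rule 4"})
column = ColumnSettings("column 1", "http://edu.newssc.org/list.html",  # hypothetical seed URL
                        domain="http://edu.newssc.org/", rules={"title": "column rule 1"})
effective = resolve(site, column)
```

Here `effective["domain"]` is the column's own domain, the title rule is the column's own rule, and the remaining rules are the website's general ones.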

Step 2: the nodes in the hyperlink collection layer periodically read the data source information, collect the URLs on the specified list pages of each data source, format them, and store them in the real-time queue together with the corresponding structured extraction rules.

The nodes in the hyperlink collection layer periodically read the website and column information set in Step 1 and perform the formatting with the column as the basic unit. For each collected column, the process is as follows:

First step: inherit all information of the website to which the current column belongs.

Second step: if the domain name, URL filtering regular expression, or page information extraction rules (covering author, publication time, body text, and so on) have been set for the column, the column's own individual settings replace the corresponding settings inherited from the website, as shown in Table 3:

Table 3: Formatted column settings

Third step: request the page at the column's seed URL and use the URL filtering regular expression of the current column to extract the URL set of the target pages (one URL set is collected per column).

Fourth step: take each element of the current URL set as a target URL and use the domain name set for the current column to assemble the complete target URL (a target URL may appear in the page as '/abc.html' or './abc.html' and must be assembled into the complete form 'http://cul.china.com.cn/abc.html'). Combine it with the column information set in the second step to form a new tuple, for example:

The URL set collected from the seed URL of column 1 (S1) is:

SET_1: (link_1, link_2, link_3, …, link_m)

The tuples stored in the queue are then:

<Link_1, column 1, website 1, GBK, 1, rule 1, rule 2, rule 3, (general) rule 4>

<Link_2, column 1, website 1, GBK, 1, rule 1, rule 2, rule 3, (general) rule 4>

…

<Link_m, column 1, website 1, GBK, 1, rule 1, rule 2, rule 3, (general) rule 4>;

The URL set collected from the seed URL of column 2 (S2) is:

SET_2: (link_1, link_2, link_3, …, link_n)

The tuples stored in the queue are then:

<Link_1, column 1, website 1, GBK, 1, (general) rule 1, (general) rule 2, (general) rule 3, rule 4>

…

<Link_n, column 1, website 1, GBK, 1, (general) rule 1, (general) rule 2, (general) rule 3, rule 4>.

The information in each tuple includes the target URL, the website name, the column name, the page encoding, the title extraction rule, the author extraction rule, the body-text extraction rule, the publication-time extraction rule, the website type, and so on. Finally, the assembled tuples are pushed into the real-time queue. This process binds each target URL to its extraction rules, so that the two correspond one to one.
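The third and fourth steps above (filtering the links on a list page, completing relative links, and binding each target URL to its rules) might be sketched as follows. The function `build_tuples`, the dictionary keys, the sample HTML, and the filter expression are assumptions made for illustration; the sketch uses the standard-library `urllib.parse.urljoin` for URL completion and a simple `href` regular expression instead of a full HTML parser.

```python
import re
from urllib.parse import urljoin


def build_tuples(list_page_html, column):
    """Extract target URLs from a list page and bind each one to the column's settings."""
    url_filter = re.compile(column["url_filter"])
    seen, tuples = set(), []
    for href in re.findall(r'href=["\']([^"\']+)["\']', list_page_html):
        if not url_filter.search(href):
            continue                              # not a target page
        url = urljoin(column["domain"], href)     # '/abc.html' or './abc.html' -> full URL
        if url in seen:
            continue                              # keep each URL once per column
        seen.add(url)
        tuples.append((url, column["column"], column["site"], column["encoding"],
                       column["type"], column["rules"]))
    return tuples


column = {
    "domain": "http://cul.china.com.cn/",
    "column": "column 1",
    "site": "website 1",
    "encoding": "GBK",
    "type": 1,
    "url_filter": r"/[a-z]+\.html$",              # hypothetical filter expression
    "rules": ("rule 1", "rule 2", "rule 3", "rule 4"),
}
html = '<a href="/abc.html">A</a> <a href="./def.html">B</a> <a href="/about/">C</a>'
tuples = build_tuples(html, column)
```

Each resulting tuple carries its own copy of the references to the extraction rules, so a downstream node can parse the page without consulting the configuration node again.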

Step 3: the nodes in the page download and parsing layer read the page hyperlinks from the real-time queue in real time, request and download the pages, and extract data in a structured way using the extraction rules passed along with each hyperlink. If the body text is extracted successfully, the data are stored directly in the web page data storage layer; otherwise the body text is extracted with the generic body-text extraction method and then stored in the web page data storage layer.

The nodes in the page download and parsing layer read, in real time, the tuples that Step 2 placed in the real-time queue and request the target pages. For each successfully requested target page, the page is parsed with the corresponding extraction rules in the tuple. For information other than the body text, if a rule is empty or extraction with the current rule fails, a rule-empty exception or an extraction-failure exception is raised and logged. For the body text, if the configured body-text extraction rule is empty or extraction fails, the generic body-text extraction method is used (as a fallback extraction means that requires no configured rule); if that also fails, a generic-extraction-failure exception is raised and logged. If extraction succeeds, the corresponding data are stored in the data storage layer as a single record.
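The control flow of Step 3 (rule-based extraction first, generic body-text extraction as a fallback, logging on failure) can be sketched as below. Several simplifications are assumptions rather than the patent's method: rules are represented as regular expressions with one capture group instead of rules over the HTML document tree, failures are logged rather than raised, a failed non-body field is stored as `None`, and `generic_body` is a crude "longest text block" heuristic that only stands in for the density-based method cited in the note below.

```python
import logging
import re

log = logging.getLogger("page_parser")


def extract_by_rule(html, rule):
    """Apply one regex rule; return the first captured group, or None if empty or no match."""
    if not rule:
        return None
    match = re.search(rule, html, re.S)
    return match.group(1).strip() if match else None


def generic_body(html):
    """Crude fallback: drop scripts and styles, split on block tags, keep the longest text block."""
    text = re.sub(r"(?is)<(script|style).*?</\1>", "", html)
    pieces = re.split(r"(?i)</?(?:p|div|br)[^>]*>", text)
    blocks = [re.sub(r"<[^>]+>", "", piece).strip() for piece in pieces]
    blocks = [block for block in blocks if block]
    return max(blocks, key=len) if blocks else None


def parse_page(html, rules):
    """Return one record for the page, or None if even the generic fallback finds no body."""
    record = {}
    for name in ("title", "author", "time"):
        value = extract_by_rule(html, rules.get(name))
        if value is None:
            log.warning("rule empty or extraction failed for field %r", name)
        record[name] = value
    body = extract_by_rule(html, rules.get("body"))
    if body is None:
        body = generic_body(html)  # fallback needs no configured rule
        if body is None:
            log.error("generic body-text extraction failed")
            return None
    record["body"] = body
    return record


page = '<html><h1>Hello</h1><div class="nav">Home</div><p>This is the long article body text.</p></html>'
record = parse_page(page, {"title": r"<h1>(.*?)</h1>", "body": ""})
```

In this usage the title comes from the configured rule, while the empty body rule triggers the fallback; the returned dictionary corresponds to the single record that would be written to the storage layer.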

Note: the generic body-text extraction method used in the invention (a density-based body-text extraction method) is described in a published paper: [1] Zhu Zede, Li Miao, Zhang Jian, Chen Lei, Zeng Xinhua. Web text extraction based on a text density model [J]. Pattern Recognition and Artificial Intelligence, 2013, 07: 667-672.

The above description illustrates and describes several preferred embodiments of the invention. As stated earlier, however, it should be understood that the invention is not limited to the forms disclosed here. These forms should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept described here through the above teachings or through the skill or knowledge of the related art. Changes and modifications made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.

Claims

1. A distributed system for rapid acquisition of Internet data, characterized in that it comprises five layers: a seed-site configuration node, a hyperlink collection layer, a real-time queue, a page download and parsing layer, and a web page data storage layer;

the seed-site configuration node sets and stores the parameters, extraction rules, and similar information of each data source, and is a single node;

the hyperlink collection layer requests the hyperlink list pages of the data sources and extracts the hyperlinks of the target pages;

the real-time queue holds the URL hyperlinks extracted by the hyperlink collection layer, the extraction rules corresponding to those hyperlinks, and the URL hyperlinks that have already been visited;

the page download and parsing layer requests and parses the unvisited URL hyperlinks in the real-time queue and extracts the specified data in a formatted form;

the web page data storage layer stores the target data extracted by the page download and parsing layer.

2. The distributed system for rapid acquisition of Internet data according to claim 1, characterized in that the seed-site configuration node uses a relational database.

3. The distributed system for rapid acquisition of Internet data according to claim 1, characterized in that the hyperlink collection layer consists of one or more web crawler nodes that are physically isolated from one another and cooperate logically to extract the hyperlinks of the target pages, and can be scaled horizontally according to the size of the data sources; the hyperlink collection layer is a distributed, multi-node layer, each node runs on a timer, and the nodes of this layer combine the collected hyperlinks with their corresponding extraction rules and store them in the real-time queue.

4. The distributed system for rapid acquisition of Internet data according to claim 1, characterized in that the real-time queue is deployed independently, unvisited hyperlinks are accessed in real time, and visited hyperlinks are stored persistently.

5. The distributed system for rapid acquisition of Internet data according to claim 1, characterized in that the page download and parsing layer consists of multiple web crawler nodes that work independently of one another, are mainly responsible for structured extraction of the target data, and can be scaled horizontally according to the number of hyperlinks extracted by the hyperlink collection layer; each node includes one structured information extraction method based on the HTML document tree and one generic text extraction method, and can switch between them when extracting the body text of a page;

the page download and parsing layer is a distributed, multi-node layer that can be scaled horizontally; each node runs in real time, and after reading a hyperlink from the real-time queue it filters, parses, extracts the target data in a formatted form, and stores the result.

6. The distributed system for rapid acquisition of Internet data according to claim 1, characterized in that the web page data storage layer uses an open-source big data store with an expandable number of nodes.

7. A distributed method for rapid acquisition of network data, based on the distributed rapid acquisition system according to any one of claims 1 to 6, characterized in that it comprises the following steps:

Step 1: the seed-site configuration node sets all seed URLs, extraction rules, website encodings, and similar information;

Step 2: the nodes in the hyperlink collection layer periodically read the data source information, collect the URLs on the specified list pages of each data source, format them, and store them in the real-time queue together with the corresponding structured extraction rules;

Step 3: the nodes in the page download and parsing layer read the page hyperlinks from the real-time queue in real time, request and download the pages, and extract data in a structured way using the extraction rules passed along with each hyperlink; if the body text is extracted successfully, the data are stored directly in the web page data storage layer; otherwise the body text is extracted with the generic body-text extraction method and then stored in the web page data storage layer.

8. The distributed method for rapid acquisition of network data according to claim 7, characterized in that Step 1 is specifically: through a web system, the user adds all target websites to be collected, each target website containing multiple target columns of interest to the user, and the settings are then made as follows:

First step: set the website information, which includes the domain name, name, type, and page encoding of the current website, the website's general regular expression for filtering target URLs, and its general information extraction rules; the information set for the website applies to all collected columns under that website;

Second step: set the column information of the website, including the column name and seed URL; if the column's domain name, target-URL filtering regular expression, or information extraction rules differ from the website's general settings, the corresponding item is set individually for the column.

9. The distributed method for rapid acquisition of network data according to claim 8, characterized in that Step 2 is specifically: the nodes in the hyperlink collection layer periodically read the website and column information set in Step 1 and perform the formatting with the column as the basic unit; for each collected column, the process is as follows:

First step: inherit all information of the website to which the current column belongs;

Second step: if the domain name, URL filtering regular expression, or page information extraction rules have been set for the column, the column's own individual settings replace the corresponding settings inherited from the website;

Third step: request the page at the column's seed URL and use the URL filtering regular expression of the current column to extract the URL set of the target pages, one URL set being collected per column;

Fourth step: take each element of the current URL set as a target URL, use the domain name set for the current column to assemble the complete target URL, and combine it with the column information set in the second step to form a new tuple; the information in the tuple includes the target URL, the website name, the column name, the page encoding, the title extraction rule, the author extraction rule, the body-text extraction rule, the publication-time extraction rule, the website type, and so on; finally, the assembled tuples are pushed into the real-time queue; this process binds each target URL to its extraction rules, so that the two correspond one to one.

10. The distributed method for rapid acquisition of network data according to claim 9, characterized in that Step 3 is specifically: the nodes in the page download and parsing layer read, in real time, the tuples that Step 2 placed in the real-time queue and request the target pages; for each successfully requested target page, the page is parsed with the corresponding extraction rules in the tuple; for information other than the body text, if a rule is empty or extraction with the current rule fails, a rule-empty exception or an extraction-failure exception is raised and logged; for the body text, if the configured body-text extraction rule is empty or extraction fails, the generic body-text extraction method is used, and if that also fails, a generic-extraction-failure exception is raised and logged; if extraction succeeds, the corresponding data are stored in the data storage layer as a single record.