CN104598536A

CN104598536A - Structured processing method of distributed network information

Info

Publication number: CN104598536A
Application number: CN201410840847.0A
Authority: CN
Inventors: 常鹏飞; 伍赛; 陈珂; 寿黎但; 陈刚
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2014-12-29
Filing date: 2014-12-29
Publication date: 2015-05-06
Anticipated expiration: 2034-12-29
Also published as: CN104598536B

Abstract

The invention discloses a structured processing method for distributed network information. The method comprises the following steps: configuring a network information acquisition task, in which webpages of interest to the user are saved by category as target webpages; acquiring the network information, in which multiple map/reduce processes cooperatively crawl the webpages, perform structured processing and save the results in HDFS (Hadoop Distributed File System); performing structural clustering on the processed webpages using tree edit distance; and performing structured extraction on the clustered webpage information and saving it in a database. Because a distributed architecture is adopted, very large volumes of network data can be processed using the computing and storage capacity of a cluster of inexpensive computers; the webpages are classified effectively; and the network information is extracted and saved in structured form, which facilitates its further analysis and processing.

Description

A structured processing method for distributed network information

Technical field

The present invention relates to a webpage information processing method in the field of network information acquisition, and in particular to a structured acquisition and processing method for distributed network information.

Background art

A distributed system is a system that performs large-scale data computation and storage by effectively organizing a cluster of inexpensive computers.

Unlike a single-machine system, a distributed system that uses a computer cluster for computation and storage must balance the computing power of individual nodes against the cost of communication between nodes, and must also deal with problems such as system failures caused by node faults in the cluster and the recoverability of data. Hadoop distributed processing and the HDFS distributed file system are open-source distributed computing and storage systems designed and developed on the basis of the Map/Reduce computation model proposed by Google. Because they effectively solve these problems of distributed systems, and because their architecture is simple and general, they have been widely applied in many fields.

Structural clustering is one kind of clustering method. Unlike methods that cluster by content, structural clustering groups items by their structure, which requires a different similarity measure. Tree edit distance is a measure of the structural similarity of trees: transforming one tree into another means applying a sequence of node insertions, deletions and substitutions, and each operation incurs a certain cost. If the structural difference between two trees is large, the total operation cost is high; a low cost indicates that the structural difference is small. The edit distance between two trees is therefore the minimum cost required to transform one into the other.
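The definition above can be illustrated with a minimal Python sketch. It is a memoized version of the classic ordered-forest recurrence with unit costs for insertion, deletion and substitution. The tree representation (`tree`, a label plus a tuple of children) and the function names are assumptions made for this illustration; the patent does not specify which tree edit distance algorithm or cost model is used.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an ordered, hashable tree node: (label, (child, child, ...))."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of trees), unit costs."""
    if not f and not g:
        return 0
    if not f:
        # Only insertions remain: insert the root of the rightmost tree of g.
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1
    if not g:
        # Only deletions remain: delete the root of the rightmost tree of f.
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    # Delete the rightmost root of f; its children are promoted into the forest.
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    # Match the two rightmost roots, substituting the label if it differs.
    substitute = (
        forest_distance(f_kids, g_kids)
        + forest_distance(f[:-1], g[:-1])
        + (0 if f_label == g_label else 1)
    )
    return min(delete, insert, substitute)


def tree_edit_distance(a, b):
    """Minimum number of node edits needed to turn tree a into tree b."""
    return forest_distance((a,), (b,))
```

This plain recurrence is adequate for small trees only; for large webpages a dedicated algorithm such as Zhang-Shasha would normally be preferred.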

Network information acquisition is usually also called web crawling. It is widely used by internet search engines and similar websites to obtain or update their content and retrieval methods. A crawler automatically collects all the page content it can access, for further processing by a search engine or similar system.

The information captured by existing web crawlers is stored in the storage system as raw webpages. This storage approach has the following shortcomings: first, storing raw webpages requires a large amount of storage space; second, the stored information contains a large amount of irrelevant content, such as advertisements; third, a webpage is a semi-structured way of holding information, and compared with structured storage, semi-structured storage hinders further use of the information.

Summary of the invention

The object of the invention is to address the shortcomings of existing network information acquisition technology by providing a structured acquisition and processing method for distributed network information.

The technical solution adopted by the invention comprises the following steps:

1) Configure the network information acquisition task: save the webpages of interest to the user by category, as target webpages;

2) Acquire the network information: crawl webpages cooperatively with multiple map/reduce processes, perform structured processing, and save the results in the HDFS file system;

3) Perform structural clustering on the processed webpages using tree edit distance;

4) Perform structured extraction on the clustered webpage information and save it in a database.

Step 2) specifically comprises:

2.1) Obtain the URL seed file and save the set of seed URLs to the to-crawl folder in the HDFS file system; the to-crawl folder holds the URLs to be crawled. Set the initial sequence number to 1;

2.2) Check whether the to-crawl folder is empty; if so, jump to step 2.7); otherwise, continue with step 2.3);

2.3) Use a map/reduce process to fetch the webpage corresponding to each URL in the URL seed file in the HDFS file system, and save the pages in a webpage storage file in the HDFS file system; the webpage storage file holds the raw webpages;

2.4) Use another map/reduce process to parse the crawled webpages in the webpage storage file and extract new URLs from them, and save the new URLs in the temporary folder of the HDFS file system; the temporary folder holds the parsed URLs;

2.5) Use a map/reduce process to clean up the temporary folder: filter the URLs in it and remove duplicates, then use the result to update the to-crawl folder in the HDFS file system;

2.6) Increment the sequence number by 1;

2.7) Check the sequence number: if the current sequence number is greater than the crawl depth value Depth, proceed to the merge step 2.7) below; otherwise, jump to step 2.2);

2.7) Use a map/reduce process to merge the multiple webpage storage files obtained in the above steps into a single webpage storage file, removing duplicate webpages.
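The control flow of steps 2.1) to 2.7) can be sketched in a single Python process. This is only an illustration of the loop structure: in-memory sets and dicts stand in for the HDFS folders, ordinary loops stand in for the map/reduce jobs, and `fetch` and `extract_links` are hypothetical callables supplied by the caller. Skipping URLs that were already crawled in an earlier round, and merging by URL, are assumptions of this sketch rather than details stated in the text.

```python
def crawl(seed_urls, fetch, extract_links, depth):
    """Iterative crawl up to `depth` rounds, returning {url: page}."""
    to_crawl = set(seed_urls)        # 2.1: the to-crawl folder
    sequence = 1                     # 2.1: initial sequence number
    seen = set()                     # assumption: remember crawled URLs
    page_files = []                  # one "webpage storage file" per round

    # 2.2 and the depth check: stop when nothing is left or depth is exceeded.
    while to_crawl and sequence <= depth:
        # 2.3: fetch every URL waiting to be crawled.
        pages = {url: fetch(url) for url in sorted(to_crawl)}
        page_files.append(pages)
        seen |= to_crawl

        # 2.4: parse new URLs out of the crawled pages (the temporary folder).
        temp = []
        for page in pages.values():
            temp.extend(extract_links(page))

        # 2.5: filter and de-duplicate, then update the to-crawl folder.
        to_crawl = set(temp) - seen

        # 2.6: increment the sequence number.
        sequence += 1

    # Final merge: combine the per-round files, keeping one copy per URL.
    merged = {}
    for pages in page_files:
        for url, page in pages.items():
            merged.setdefault(url, page)
    return merged
```

With the initial sequence number 1 and a crawl depth of D, the loop runs at most D rounds, which matches the check "sequence number greater than Depth" in the steps above.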

Step 3) performs the clustering with a map/reduce process; the specific steps are as follows:

3.1) In the map stage, for each webpage in the merged webpage storage file obtained in step 2), use the tree edit distance method to compute the tree edit distance DIS_ci between the webpage's tag tree TREE_x and the tag tree TREE_ci of each target webpage C_i, giving the set of tree edit distances {DIS_c1, DIS_c2, DIS_c3, ..., DIS_cn}, and generate the key-value pairs <C_i, WEB>. Then select the minimum tree edit distance DIS_cmin from the set and pass the corresponding key-value pair <C_min, WEB> to the reduce stage;

3.2) In the reduce stage, group the key-value pairs <C_i, WEB> by key: webpages with the same key are merged into one file DOC_ci as a single class of webpages and saved in the result folder of the HDFS file system. Each file DOC_ci holds webpages with the same page structure. This yields the structural webpage clustering result {DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn} and completes the structural clustering of the webpages.
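The map and reduce stages of step 3) can be sketched as two plain Python functions. This is not Hadoop code: `map_page` plays the role of the mapper for one webpage and `reduce_pages` the role of the reducer after the shuffle. The distance function is passed in as a parameter; `toy_distance` is a deliberately crude stand-in used only so the sketch runs on its own, and is not a tree edit distance. All names here are assumptions made for the illustration.

```python
from collections import defaultdict


def toy_distance(tags_a, tags_b):
    """Stand-in metric: size of the symmetric difference of two tag sets."""
    return len(set(tags_a) ^ set(tags_b))


def map_page(page, page_tree, target_trees, distance):
    """Map stage: emit the pair <C_min, WEB> for one webpage.

    target_trees maps each target name C_i to its tag tree TREE_ci.
    """
    distances = {name: distance(page_tree, t) for name, t in target_trees.items()}
    c_min = min(distances, key=distances.get)   # target with the smallest DIS_ci
    return c_min, page


def reduce_pages(pairs):
    """Reduce stage: collect webpages that share a key into one class DOC_ci."""
    clusters = defaultdict(list)
    for key, page in pairs:
        clusters[key].append(page)
    return dict(clusters)
```

Because each webpage is emitted under exactly one key, every page ends up in exactly one class, namely that of the structurally closest target webpage.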

Step 4) extracts each class of webpages according to the structural webpage clustering result {DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn} obtained in step 3): the information in the webpages is extracted and saved in the database.

The nodes of the tag tree of each webpage are extracted into the corresponding fields of the database.

Different classes of webpages use different extraction methods.

Step 1), the configuration of the network information acquisition task, is an interactive interface. Unlike a search engine's crawler, the main purpose of the invention is to monitor specific information sources on the network. The webpages of interest to the user, which serve as information sources, can be divided by content type into text, picture and video information, among others, or by content attribute into news, advertising and other information. The update frequency of different information sources also varies. Configuring the acquisition task determines the information about the sources to be collected, so that different acquisition methods can be applied to different kinds of information source.

Step 2) performs a distributed crawl of the network acquisition task defined in step 1), based on Hadoop distributed processing and the HDFS distributed file system.

Step 3) performs structural clustering of the webpages. The webpages crawled in step 2) are still stored in HDFS as semi-structured documents and cannot be extracted directly into the database. Step 3) uses tree edit distance to cluster the webpages by structure. Webpages are written in HTML, and a single HTML file can be abstracted as a tag tree; webpages carrying the same kind of information have tag trees with similar or even identical structure. To measure the similarity of tag trees, the invention uses the tree edit distance method to classify the webpages crawled in step 2).
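How an HTML file can be abstracted into a tag tree can be sketched with Python's standard `html.parser` module. The class name `TagTreeBuilder`, the `(tag, children)` node layout and the synthetic `#document` root are assumptions of this illustration; text content and attributes are dropped because only the structure matters for structural clustering.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Builds a tree of (tag, [children]) nodes, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = ("#document", [])
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self._stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)     # later tags become its children

    def handle_endtag(self, tag):
        # Close the nearest matching open tag; stray end tags are ignored.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def html_to_tag_tree(html):
    """Parse an HTML string and return its tag tree."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Two pages generated from the same template yield the same or nearly the same tree under this abstraction, which is what makes a structural distance between the trees a useful classification signal.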

Step 4) performs structured extraction of the webpage information: according to the structural clustering result of step 3) and the extraction method defined for each class of webpages in step 1), the information in the webpages is extracted and saved in the database.

The beneficial effects of the invention are:

The invention adopts a distributed architecture, using the computing and storage capacity of a cluster of inexpensive computers to process very large volumes of network data; it adopts structural clustering of webpages based on tree edit distance, which classifies webpages effectively; and it extracts and saves network information in structured form, which facilitates further analysis and processing of that information.

Brief description of the drawings

Fig. 1 is a flow chart of the steps of the invention.

Fig. 2 shows the webpage tag tree in step 3.1) of the invention.

Embodiment

The invention is further described below with reference to the drawings and embodiments.

As shown in Fig. 1, the invention comprises the following steps:

1) Configure the network information acquisition task: save the webpages of interest to the user by category, as target webpages. This yields the target webpage set {C_1, C_2, C_3, ..., C_n} used for clustering in the subsequent steps. When the target webpages are saved by category, different information types within the same website can be classified separately; for example, the information on a single website may be divided into news, product data, pictures and so on.

2) Acquire the network information: crawl webpages cooperatively with multiple map/reduce processes, perform structured processing, and save the results in the HDFS file system.

Step 2) performs a distributed crawl of the network acquisition task defined in step 1), based on Hadoop distributed processing and the HDFS distributed file system.

2.4) Use another map/reduce process to parse the crawled webpages in the webpage storage file and extract new URLs from them, and save the new URLs in the temporary folder of the HDFS file system; the temporary folder holds the parsed URLs;

2.6) Increment the sequence number by 1;

2.7) Check the sequence number: if the current sequence number is greater than the crawl depth value Depth, proceed to the merge step 2.7) below; otherwise, jump to step 2.2) and repeat steps 2.2) to 2.6) until the current sequence number is greater than the crawl depth value Depth;

2.7) Use a map/reduce process to merge the multiple webpage storage files obtained in the above steps into a single webpage storage file according to the hash value of each webpage, removing duplicate webpages. The merged webpage storage file can be saved in the webpage folder of the HDFS file system, which holds the crawled webpages.
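Merging by webpage hash value can be sketched as follows. The text does not say which hash function is used, so SHA-256 from Python's `hashlib` is an assumption here, as are the function name and the `(url, html)` record layout. In an actual map/reduce job the hash would typically serve as the key, so that identical pages meet in the same reducer.

```python
import hashlib


def merge_page_files(page_files):
    """Merge several lists of (url, html) records, keeping one copy per page body."""
    merged = {}
    for pages in page_files:
        for url, html in pages:
            # Identical page bodies produce the same digest, hence the same key.
            digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
            merged.setdefault(digest, (url, html))   # first occurrence wins
    return list(merged.values())
```

Keying on the page content rather than the URL also removes pages that are reachable under several different URLs.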

Step 3) performs the clustering with a map/reduce process; the specific steps are as follows:

3.1) In the map stage, take each webpage in the merged webpage storage file obtained in step 2). For example, if the webpage of interest to the user in step 1) is a page from the news section of a certain website, the map stage converts both the target webpage and the crawled webpages into tag trees, as shown in Fig. 2. Then use the tree edit distance method to compute the tree edit distance DIS_ci between each webpage's tag tree TREE_x and the tag tree TREE_ci of each target webpage C_i, giving the set of tree edit distances {DIS_c1, DIS_c2, DIS_c3, ..., DIS_cn}, and generate the key-value pairs <C_i, WEB>. Finally, select the minimum tree edit distance DIS_cmin from the set and pass the corresponding key-value pair <C_min, WEB> to the reduce stage;

3.2) In the reduce stage, group the key-value pairs <C_i, WEB> by key: webpages with the same key are merged into one file DOC_ci as a single class of webpages and saved in the result folder of the HDFS file system. The result folder contains the files DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn, and each file DOC_ci holds webpages with the same page structure. This yields the structural webpage clustering result {DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn} and completes the structural clustering of the webpages.

Step 4) extracts each class of webpages according to the structural webpage clustering result {DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn} obtained in step 3): the information in the webpages is extracted and saved in the database. The extraction can be done by extracting the nodes of the tag tree of each webpage into the corresponding fields of the database.

Different classes of webpages use different extraction methods. An extraction method can be defined for each class of webpages, giving the set {R_1, R_2, R_3, ..., R_n}, which is used to extract the information in the webpages and save it in the database.

A specific embodiment of the invention is as follows:

Take a certain electric vehicle website as an example. The user wants to obtain its news pages and its vehicle model parameter pages. For each of these two classes, one typical webpage is taken as the target webpage, forming the target webpage set {C_1, C_2}.

In step 2), a distributed crawl of the electric vehicle website is carried out to obtain its webpage data. The time this takes depends on the size of the website to be crawled and on the size of the cluster performing the crawl; for a ten-node cluster, the crawl speed at full load can reach 100,000 pages per hour.

Step 3) clusters the webpages obtained from the electric vehicle website into two classes, news pages and vehicle model parameter pages; the clustering accuracy of this process can exceed 95%.

Step 4) performs structured extraction on these two classes of webpages: for news pages, information such as the title, body text, publication date and source is extracted and saved in the database; for vehicle model parameter pages, the parameters of the different models are extracted and saved in the database.

An extraction method R_i takes a form in which PubTitle, Content, PublicationDate and DCSource are fields in the database, and each field is followed by a webpage XPath path that defines the position of that field in the webpage. For each class of webpage file DOC_ci, the corresponding extraction method R_i is used to extract the data in the webpage into the corresponding database fields.
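A field-to-XPath mapping of this kind can be sketched with Python's standard `xml.etree.ElementTree` module. The field names come from the text; the XPath paths in `RULE_NEWS`, the function name and the dict-based representation are hypothetical, since the concrete form of R_i is not reproduced here. ElementTree supports only a limited XPath subset and requires well-formed markup, so real-world HTML would need an HTML-aware parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction method for news pages: database field -> XPath path.
RULE_NEWS = {
    "PubTitle": "./body/div[@id='article']/h1",
    "Content": "./body/div[@id='article']/div[@class='text']",
    "PublicationDate": "./body/div[@id='article']/span[@class='date']",
    "DCSource": "./body/div[@id='article']/span[@class='source']",
}


def extract_record(page_source, rule):
    """Apply one extraction method to one well-formed page, returning a row dict."""
    root = ET.fromstring(page_source)        # root is the <html> element
    record = {}
    for field, path in rule.items():
        node = root.find(path)
        # Concatenate all text under the node; None marks a missing field.
        record[field] = "".join(node.itertext()).strip() if node is not None else None
    return record
```

Each returned dict corresponds to one database row; applying a different rule dict to each class DOC_ci gives the per-class extraction described above.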

As can be seen, throughout the crawl the user only needs to provide a typical webpage for each of the two classes of interest; the method then extracts the corresponding types of information from the target website in structured form and saves it in the database. The whole process provides fast information processing while maintaining high accuracy (above 95% in the embodiment).

Thus, based on the HDFS file system and in view of the semi-structured nature of network information, the invention proposes clustering webpages by page structure using tree edit distance and, on the basis of the clustering result, extracting information from each class of webpages according to its extraction method and saving it in the database, thereby achieving structured acquisition of webpage information with a significant technical effect.

Claims

1. A structured processing method for distributed network information, characterized in that the steps of the method are as follows:

2. The structured processing method for distributed network information according to claim 1, characterized in that step 2) specifically comprises:

2.6) Increment the sequence number by 1;

3. The structured processing method for distributed network information according to claim 1, characterized in that:

4. The structured processing method for distributed network information according to claim 1, characterized in that: step 4) extracts each class of webpages according to the structural webpage clustering result {DOC_c1, DOC_c2, DOC_c3, ..., DOC_cn} obtained in step 3), and the information in the webpages is extracted and saved in the database.

5. The structured processing method for distributed network information according to claim 4, characterized in that: the nodes of the tag tree of each webpage are extracted into the corresponding fields of the database.

6. The structured processing method for distributed network information according to claim 4, characterized in that: different classes of webpages use different extraction methods.