CN104598536B

CN104598536B - Distributed method for structured processing of network information

Info

Publication number: CN104598536B
Application number: CN201410840847.0A
Authority: CN
Inventors: 常鹏飞; 伍赛; 陈珂; 寿黎但; 陈刚
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2014-12-29
Filing date: 2014-12-29
Publication date: 2017-10-20
Anticipated expiration: 2034-12-29
Also published as: CN104598536A

Abstract

The invention discloses a distributed method for structured processing of network information. A network information acquisition task is configured, and the web pages the user is interested in are classified and saved as target web pages; the network information is acquired, with multiple map/reduce processes cooperating to crawl web pages and perform structuring processing, and the results are stored in the HDFS file system; the structurally processed web pages are structurally clustered using tree edit distance; structured extraction is performed on the clustered web page information, which is saved to a database. The invention adopts a distributed architecture, using the computing and storage capacity of a cluster of inexpensive computers to process very large volumes of network data; it classifies web pages effectively; and it extracts and saves network information in a structured manner, which facilitates further analysis and processing of the network information.

Description

Distributed method for structured processing of network information

Technical field

The present invention relates to a web page information processing method in the field of network information acquisition, and more particularly to a distributed method for structured acquisition and processing of network information.

Background art

A distributed system is a system that performs large-scale data computation and storage by effectively organizing a cluster of inexpensive computers.

A distributed system differs from a single-machine system: when a computer cluster is used for data computation and storage, the computing capacity of a single node must be balanced against the cost of communication between nodes, and problems such as the loss of system efficiency caused by node failures in the cluster and the recoverability of data must also be considered. The Hadoop distributed processing system, together with the HDFS distributed file system, is an open-source distributed computing and storage system designed and developed on the basis of the Map/Reduce computation model proposed by Google. Because it effectively solves the problems of distributed systems and its architecture is simple and general, it has been widely applied in many fields.

Structural clustering is one kind of clustering method. Unlike methods that cluster by content, structural clustering emphasizes clustering by structure, which requires a different similarity measure. Tree edit distance is a method for measuring the similarity of trees. Converting one tree into another means performing a series of node insertions, deletions and replacements between the two trees, and each operation incurs a certain cost. If the structural difference between two trees is large, the operation cost is high; a low operation cost indicates that the structural difference is small. The edit distance of two trees is therefore the minimum cost required to convert one tree into the other.
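The edit-distance idea described above can be illustrated with a minimal Python sketch. It assumes ordered trees represented as nested tuples (label, children) and a unit cost for every insertion, deletion and replacement; the representation, the function names and the cost model are illustrative assumptions, and the source does not state which tree edit distance algorithm is used. The sketch uses the plain recursive forest formulation with memoization, which is practical only for small trees.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples are used so results can be memoized.

def size(tree):
    """Return the number of nodes in a tree."""
    _label, children = tree
    return 1 + sum(size(child) for child in children)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Minimum number of unit-cost edits turning forest f1 into forest f2."""
    if not f1:
        return sum(size(t) for t in f2)   # insert every remaining node
    if not f2:
        return sum(size(t) for t in f1)   # delete every remaining node
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        # delete the root of the rightmost tree in f1
        forest_distance(f1[:-1] + kids1, f2) + 1,
        # insert the root of the rightmost tree in f2
        forest_distance(f1, f2[:-1] + kids2) + 1,
        # match the two rightmost roots, relabelling if they differ
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (0 if label1 == label2 else 1),
    )

def tree_edit_distance(t1, t2):
    """Edit distance between two single trees."""
    return forest_distance((t1,), (t2,))
```

Two tag trees that differ only by one extra leaf are one edit apart under this cost model, so pages built from the same template tend to receive small distances.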

Network information acquisition, commonly referred to as web crawling, is widely used by Internet search engines and similar websites to obtain or update their content and retrieval methods. A crawler can automatically collect the content of every page it is able to access, so that a search engine or similar system can process it further.

The information grabbed by existing web crawlers is stored in the storage system in the form of original web pages. This storage approach has the following shortcomings: first, storing original web pages requires a large amount of storage space; second, the stored information contains a large amount of irrelevant information, such as advertisements; third, saving information as web pages is a semi-structured approach which, compared with structured storage, hinders further use of the information.

Summary of the invention

The object of the present invention is to address the shortcomings of existing network information acquisition technology by providing a distributed method for structured acquisition and processing of network information.

The technical solution adopted by the present invention comprises the following steps:

1) A network information acquisition task is configured, and the web pages the user is interested in are classified and saved as target web pages;

2) The network information is acquired: multiple map/reduce processes cooperate to crawl web pages and perform structuring processing, and the results are stored in the HDFS file system;

3) The structurally processed web pages are structurally clustered using tree edit distance;

4) Structured extraction is performed on the clustered web page information, which is saved to a database.

Step 2) specifically comprises:

2.1) A URL seed file is obtained and the set of seed URLs is saved into the to-be-crawled folder of the HDFS file system; the to-be-crawled folder stores the URLs to be crawled. The initial sequence number is set to 1;

2.2) It is determined whether the to-be-crawled folder is empty; if so, the method jumps to step 2.7); otherwise it proceeds to step 2.3);

2.3) The web page corresponding to each seed URL in the HDFS file system is crawled by map/reduce processes and stored in the web page storage folder of the HDFS file system; the web page storage folder stores the raw crawled web pages;

2.4) The crawled web pages in the web page storage folder are parsed, again by map/reduce processes, to extract new URLs, and the new URLs are stored in a temporary folder of the HDFS file system; the temporary folder stores the parsed URLs;

2.5) The temporary folder is optimized by a map/reduce process: the URLs in it are filtered, duplicate URLs are removed, and the result is then written to the to-be-crawled folder of the HDFS file system;

2.6) The sequence number is incremented by 1;

2.7) The sequence number is checked: if the current sequence number is greater than the crawl depth value Depth, the method proceeds to the merging step 2.7) below; otherwise it jumps to step 2.2);

2.7) The multiple web page storage folders obtained in the above steps are merged into a single web page storage folder by a map/reduce process, and duplicate web pages are removed.
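The control flow of steps 2.1) to 2.7) can be sketched in Python as a single-process loop. This is only an illustration under stated assumptions: in-memory sets and dictionaries stand in for the HDFS folders, the caller-supplied functions `fetch` and `extract_links` stand in for the map/reduce crawling and parsing jobs, and the filter in step 2.5) is assumed to drop URLs that were already scheduled. None of these names come from the source.

```python
def crawl(seed_urls, fetch, extract_links, depth):
    """Crawl in rounds, stopping when nothing is left or depth is exceeded.

    fetch(url) returns the page text or None; extract_links(page) returns
    an iterable of URLs. Both are hypothetical stand-ins for map/reduce jobs.
    """
    to_crawl = set(seed_urls)       # 2.1) to-be-crawled folder
    scheduled = set(seed_urls)      # every URL ever placed in that folder
    seq = 1                         # 2.1) initial sequence number
    rounds = []                     # one page store per crawl round

    while to_crawl:                 # 2.2) empty folder: go to the merge
        pages = {}
        for url in sorted(to_crawl):            # 2.3) crawl each URL
            page = fetch(url)
            if page is not None:
                pages[url] = page
        rounds.append(pages)

        temp = set()                             # 2.4) temporary folder
        for page in pages.values():
            temp.update(extract_links(page))

        to_crawl = temp - scheduled              # 2.5) filter, drop repeats
        scheduled |= to_crawl

        seq += 1                                 # 2.6) next sequence number
        if seq > depth:                          # depth check
            break

    merged = {}                                  # merging step: one store
    for pages in rounds:
        for url, page in pages.items():
            merged.setdefault(url, page)
    return merged
```

With a crawl depth of 2, the loop performs two rounds: the seed pages and the pages they link to.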

Step 3) performs clustering by map/reduce processes, with the following specific steps:

3.1) In the map stage, for each web page in the merged web page storage folder obtained in step 2), the tree edit distance method is used to calculate the tree edit distance DIS_Ci between the tag tree TREE_x of the web page and the tag tree TREE_Ci of each target web page C_i, yielding the tree edit distance set {DIS_C1, DIS_C2, DIS_C3, ..., DIS_Cn} and the key-value pairs <C_i, WEB>. The minimum tree edit distance DIS_Cmin is then selected from the set, and the key-value pair <C_min, WEB> corresponding to DIS_Cmin is passed to the reduce stage;

3.2) In the reduce stage, according to the keys of the above key-value pairs <C_i, WEB>, the web pages with the same key are merged into one file DOC_Ci as web pages of the same class and stored in the result folder of the HDFS file system. Each file DOC_Ci holds web pages with the same page structure. The structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} is thus obtained, completing the structural clustering of the web pages.
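Steps 3.1) and 3.2) amount to a nearest-target assignment followed by a group-by on the key. A minimal single-process Python sketch follows; the function names, the dictionary inputs and the caller-supplied `distance` function are illustrative assumptions, and in the method itself the two functions would run as the map and reduce stages of a Hadoop job.

```python
from collections import defaultdict

def map_page(page_tree, targets, distance):
    """Map stage for one page: return the key C_min of the closest target.

    targets maps each class key C_i to its tag tree TREE_Ci; distance is
    any function returning the tree edit distance between two tag trees.
    """
    return min(targets, key=lambda key: distance(page_tree, targets[key]))

def cluster(pages, targets, distance):
    """Group page ids by closest target, playing the role of the reduce stage.

    pages maps a page id to its tag tree; the result maps each C_i that
    received pages to the list of page ids in DOC_Ci.
    """
    groups = defaultdict(list)
    for page_id, tree in pages.items():
        groups[map_page(tree, targets, distance)].append(page_id)
    return dict(groups)
```

Because every page is assigned to its nearest target, the number of clusters never exceeds the number of target web pages supplied in step 1).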

In step 4), according to the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} obtained in step 3), each class of web page is processed, and the information in the web pages is extracted and saved to the database.

The nodes of the corresponding tag tree in the web page are extracted into the corresponding fields of the database.

Web pages of different classes use different extraction patterns.

Step 1) configures the network information acquisition task through an interactive interface. Unlike a search-engine crawler, the main purpose of the present invention is to monitor specific information sources on the network. The web pages of interest to the user, which serve as information sources, can be divided by content type into text information, picture information, video information and so on, or by content attribute into news information, advertising information and so on. The update frequencies of different information sources also differ. Configuring the network information acquisition task determines the information about the sources to be collected, so that different acquisition methods can be applied to different types of information source.

Step 2) performs distributed crawling of the network acquisition task defined in step 1), based on the Hadoop distributed processing system and the HDFS distributed file system.

Step 3) performs structural clustering of the web pages. The web pages crawled in step 2) are still stored in HDFS as semi-structured documents and cannot be extracted directly into a database. Step 3) clusters the web pages by structure using tree edit distance. Web pages are written in HTML, and a single HTML file can be abstracted into a tag tree; web pages carrying the same kind of information have tag trees with similar or even identical structure. To measure the similarity of tag trees, the present invention uses the tree edit distance method to classify the web pages crawled in step 2).
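As an illustration of abstracting an HTML file into a tag tree, the following Python sketch builds a tree of tag names with the standard-library `html.parser` module. The node representation (tag, children), the `#document` root label and the handling of void elements are assumptions made for this example; the source does not describe how its tag trees are built.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class TagTreeBuilder(HTMLParser):
    """Build a tree of (tag, [children]) nodes, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = ("#document", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)          # children will attach here

    def handle_startendtag(self, tag, attrs):
        self.stack[-1][1].append((tag, []))  # self-closing tag, no children

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

def tag_tree(html):
    """Return the tag tree of an HTML string."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Text content and attributes are deliberately dropped, so two pages produced from the same template yield the same or nearly the same tree even when their text differs.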

Step 4) performs structured extraction of web page information: according to the structural clustering result of step 3) and the extraction pattern defined for each class of web page in step 1), the information in the web pages is extracted and saved to the database.

The invention has the following advantages:

The present invention adopts a distributed architecture and uses the computing and storage capacity of a cluster of inexpensive computers to process very large volumes of network data; it adopts structural clustering of web pages based on tree edit distance, which classifies web pages effectively; and it extracts and saves network information in a structured manner, which facilitates further analysis and processing of the network information.

Brief description of the drawings

Fig. 1 is a flow chart of the implementation steps of the present invention.

Fig. 2 shows a web page tag tree as used in step 3.1) of the present invention.

Detailed description

The invention will be further described with reference to the accompanying drawings and examples.

As shown in Fig. 1, the present invention comprises the following steps:

1) A network information acquisition task is configured, and the web pages the user is interested in are classified and saved as target web pages; this yields the target web page set {C_1, C_2, C_3, ..., C_n} used for clustering in subsequent steps. When the target web pages are classified and saved, different information types within the same website can be classified separately; for example, the information within one website may be divided into a news class, a product data class, a picture class and so on.

2) The network information is acquired: multiple map/reduce processes cooperate to crawl web pages and perform structuring processing, and the results are stored in the HDFS file system.

Step 2) performs distributed crawling of the network acquisition task defined in step 1), based on the Hadoop distributed processing system and the HDFS distributed file system.

2.4) The crawled web pages in the web page storage folder are parsed, again by map/reduce processes, to extract new URLs, and the new URLs are stored in a temporary folder of the HDFS file system; the temporary folder stores the parsed URLs;

2.6) The sequence number is incremented by 1;

2.7) The sequence number is checked: if the current sequence number is greater than the crawl depth value Depth, the method proceeds to the merging step 2.7) below; otherwise it jumps to step 2.2), repeating steps 2.2) to 2.6) until the current sequence number is greater than the crawl depth value Depth;

2.7) The multiple web page storage folders obtained in the above steps are merged into a single web page storage folder by a map/reduce process according to the hash values of the web pages, and duplicate web pages are removed; the merged web page storage folder can be stored in the web page folder of the HDFS file system, which holds the crawled web pages.
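Merging by hash value can be sketched in Python as follows. The source states only that pages are merged according to their hash values; hashing the full page text with SHA-256 and keeping the first page seen for each hash are assumptions made for this example, as are the function name and the dictionary inputs that stand in for the HDFS folders.

```python
import hashlib

def merge_page_stores(stores):
    """Merge several {url: html} stores, keeping one page per content hash."""
    merged = {}
    seen_hashes = set()
    for store in stores:
        for url, html in store.items():
            # Pages with identical content hash to the same digest.
            digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
            if digest not in seen_hashes:
                seen_hashes.add(digest)
                merged[url] = html
    return merged
```

Hashing the content rather than the URL also removes pages that are reachable under several different addresses.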

Step 3) performs clustering by map/reduce processes, with the following specific steps:

3.1) In the map stage, each web page in the merged web page storage folder obtained in step 2) and each target web page (the web pages of interest to the user from step 1), for example one page among the news pages of a certain website) are abstracted into tag trees, as shown in Fig. 2. Using the tree edit distance method, the tree edit distance DIS_Ci between the tag tree TREE_x of each web page and the tag tree TREE_Ci of each target web page C_i is calculated, yielding the tree edit distance set {DIS_C1, DIS_C2, DIS_C3, ..., DIS_Cn} and the key-value pairs <C_i, WEB>. The minimum tree edit distance DIS_Cmin is then selected from the set, and the key-value pair <C_min, WEB> corresponding to DIS_Cmin is passed to the reduce stage;

3.2) In the reduce stage, according to the keys of the above key-value pairs <C_i, WEB>, the web pages with the same key are merged into one file DOC_Ci as web pages of the same class and stored in the result folder of the HDFS file system. The result folder contains the files DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn, and each file DOC_Ci holds web pages with the same page structure. The structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} is thus obtained, completing the structural clustering of the web pages.

In step 4), according to the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} obtained in step 3), each class of web page is processed, and the information in the web pages is extracted and saved to the database; the extraction can be done by extracting the nodes of the corresponding tag tree in the web page into the corresponding fields of the database.

Web pages of different classes use different extraction patterns. An extraction pattern can be defined for each class of web page, giving {R_1, R_2, R_3, ..., R_n}, and the information in the web pages is extracted accordingly and saved to the database.

A specific embodiment of the present invention is as follows:

For a certain electric vehicle website, the user wants to obtain its news pages and its vehicle model parameter pages. For each of these two classes, one typical page is taken as the target web page, forming the target web page set {C_1, C_2}.

In step 2), distributed crawling is performed on the electric vehicle website to obtain its web page data. The time this takes depends on the size of the website to be crawled and the size of the cluster performing the crawl task; for a cluster of ten nodes, the crawl speed at full load can reach 100,000/hour.

In step 3), the web pages obtained from the electric vehicle website are divided by clustering into two classes, news pages and vehicle model parameter pages; the clustering accuracy can exceed 95%.

In step 4), structured extraction is performed on these two classes of web page. For news pages, the title, body text, publication date, source and similar information are extracted and saved to the database; for vehicle model parameter pages, the parameters of the different models are extracted and saved to the database.

In one form of an extraction pattern R_i, PubTitle, Content, PublicationDate and DCSource are fields in the database, and each field is followed by an XPath path that defines the position of that field in the web page. For each class file DOC_Ci, the corresponding extraction pattern R_i is used to extract the corresponding data in the web pages into the corresponding fields of the database.
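A pattern of this kind, mapping database fields to XPath paths, can be illustrated with a short Python sketch. The field names are taken from the text, but the XPath expressions, the sample page structure, the function names and the use of SQLite are hypothetical. The standard-library `xml.etree.ElementTree` module used here supports only a subset of XPath and requires well-formed markup, so real HTML pages would need an HTML-aware parser.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical extraction pattern R_i: database field -> XPath in the page.
NEWS_PATTERN = {
    "PubTitle": ".//h1",
    "Content": ".//div[@class='content']",
    "PublicationDate": ".//span[@class='date']",
    "DCSource": ".//span[@class='source']",
}

def extract(page_xml, pattern):
    """Apply an extraction pattern to one well-formed page."""
    root = ET.fromstring(page_xml)
    record = {}
    for field, xpath in pattern.items():
        node = root.find(xpath)
        # Join all text below the node; a missing node yields None.
        record[field] = "".join(node.itertext()).strip() if node is not None else None
    return record

def save(conn, table, record):
    """Insert one record; table and field names come from the trusted pattern."""
    columns = ", ".join(record)
    marks = ", ".join("?" for _ in record)
    conn.execute(f"INSERT INTO {table} ({columns}) VALUES ({marks})",
                 tuple(record.values()))
```

Each class of web page would have its own pattern dictionary and its own table, which matches the statement that different classes use different extraction patterns.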

It can be seen that, throughout the crawl, the user only needs to provide one typical page for each of the two classes of web page of interest; the method then extracts the information of the corresponding types from the target website in structured form and saves it to the database. The whole process provides fast information processing while maintaining high accuracy (above 95% in this embodiment).

Thus, based on the HDFS file system and in view of the semi-structured nature of network information, the present invention uses tree edit distance to cluster web pages by page structure and, on the basis of the clustering result, extracts information from each class of web page according to its pattern and saves it to the database, thereby achieving structured acquisition of web page information with a significant technical effect.

Claims

1. A distributed method for structured processing of network information, characterized in that the method comprises the following steps:

1) a network information acquisition task is configured, and the web pages the user is interested in are classified and saved as target web pages;

2) the network information is acquired: multiple map/reduce processes cooperate to crawl web pages and perform structuring processing, and the results are stored in the HDFS file system;

4) structured extraction is performed on the clustered web page information, which is saved to a database;

said step 2) specifically comprises:

2.1) a URL seed file is obtained and the set of seed URLs is saved into the to-be-crawled folder of the HDFS file system, the to-be-crawled folder storing the URLs to be crawled, and the initial sequence number is set to 1;

2.2) it is determined whether the to-be-crawled folder is empty; if so, the method jumps to step 2.7); otherwise it proceeds to step 2.3);

2.3) the web page corresponding to each seed URL in the HDFS file system is crawled by map/reduce processes and stored in the web page storage folder of the HDFS file system, the web page storage folder storing the raw crawled web pages;

2.4) the crawled web pages in the web page storage folder are parsed, again by map/reduce processes, to extract new URLs, and the new URLs are stored in a temporary folder of the HDFS file system, the temporary folder storing the parsed URLs;

2.5) the temporary folder is optimized by a map/reduce process: the URLs in it are filtered, duplicate URLs are removed, and the result is then written to the to-be-crawled folder of the HDFS file system;

2.6) the sequence number is incremented by 1;

2.7) the multiple web page storage folders obtained in the above steps are merged into a single web page storage folder by a map/reduce process, and duplicate web pages are removed;

3.1) in the map stage, for each web page in the merged web page storage folder obtained in step 2), the tree edit distance method is used to calculate the tree edit distance DIS_Ci between the tag tree TREE_x of the web page and the tag tree TREE_Ci of each target web page C_i, yielding the tree edit distance set {DIS_C1, DIS_C2, DIS_C3, ..., DIS_Cn} and the key-value pairs <C_i, WEB>; the minimum tree edit distance DIS_Cmin is then selected from the set, and the key-value pair <C_min, WEB> corresponding to DIS_Cmin is passed to the reduce stage;

3.2) in the reduce stage, according to the keys of the above key-value pairs <C_i, WEB>, the web pages with the same key are merged into one file DOC_Ci as web pages of the same class and stored in the result folder of the HDFS file system, each file DOC_Ci holding web pages with the same page structure, whereby the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} is obtained and the structural clustering of the web pages is completed.

2. The distributed method for structured processing of network information according to claim 1, characterized in that: in said step 4), according to the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} obtained in step 3), each class of web page is processed, and the information in the web pages is extracted and saved to the database.

3. The distributed method for structured processing of network information according to claim 2, characterized in that: the nodes of the corresponding tag tree in the web page are extracted into the corresponding fields of the database.

4. The distributed method for structured processing of network information according to claim 2, characterized in that: web pages of different classes use different extraction patterns.