CN102937989A

CN102937989A - Parallel distributed internet data capture method and system

Info

Publication number: CN102937989A
Application number: CN2012104225715A
Authority: CN
Inventors: 杨睿尘
Original assignee: Beijing Tengyi Science & Technology Development Co Ltd
Current assignee: Beijing Tengyi Science & Technology Development Co Ltd
Priority date: 2012-10-29
Filing date: 2012-10-29
Publication date: 2013-02-20
Anticipated expiration: 2032-10-29
Also published as: CN102937989B

Abstract

The invention provides a parallel distributed internet data capture method and system. The method comprises the following steps: setting crawl configuration information for a target site; according to the crawl configuration information, capturing one by one the links to the articles listed on a section index page of the target site, and following each article link to crawl the article pagination information and body content; and determining whether the article has comment data and, if so, further following the link to the comment pages to crawl the comment pagination information and comment content. The method and system offer high capture quality and high efficiency.

Description

Parallel distributed internet data capture method and system

Technical field

The present invention relates to the fields of computer application technology and information technology, and in particular to a parallel distributed internet data capture method and system.

Background technology

The internet is developing rapidly, and the number of internet users in China is growing explosively. The internet is gradually replacing traditional media (newspapers, books, radio, television and so on) as the main channel through which people obtain and publish information. Because the internet is free and open, simple to use, fast to propagate and widely used, information on it can spread and exert influence quickly. As the role of the internet grows, research on internet information is flourishing as well. Such research first requires capturing massive amounts of web page information in differing formats and converting it to a unified format to facilitate later analysis; second, it requires high-quality, high-efficiency capture technology. The parallel distributed internet data capture system was developed to meet this need.

Summary of the invention

The present invention aims to solve at least one of the above technical problems at least to some extent, or at least to provide a useful commercial alternative. To this end, one object of the present invention is to propose a parallel distributed internet data capture method and system with high quality and high efficiency.

One aspect of the present invention proposes a parallel distributed internet data capture method, comprising: setting crawl configuration information for a target website; according to the crawl configuration information, starting from a section index page of the target website, capturing one by one the links to the articles that appear on the section index page, and following each article link to crawl the article pagination information and body content; and determining whether the article includes comment data and, if so, further following the link to the comment pages to crawl the comment pagination information and comment content.

In an embodiment of the method of the present invention, the method further comprises: when an exception occurs during the crawl, recording log information and retrying the crawl until it succeeds.

In an embodiment of the method of the present invention, the data capture is carried out in a parallel, distributed manner.

Another aspect of the present invention proposes a parallel distributed internet data capture system, comprising: a configuration module, with which the user sets the crawl configuration information for a target website; an article capture module which, according to the crawl configuration information and starting from a section index page of the target website, captures one by one the links to the articles that appear on the section index page, and follows each article link to crawl the article pagination information and body content; a judgment module for determining whether the article includes comment data; and a comment capture module which, when the article includes comment data, further follows the link to the comment pages to crawl the comment pagination information and comment content.

In an embodiment of the system of the present invention, the system further comprises: a logging module for recording log information when an exception occurs during the crawl, whereupon the parallel distributed internet data capture system retries the crawl until it succeeds.

In an embodiment of the system of the present invention, the data capture module has a parallel distributed structure.

In summary, first, the present invention proposes a parallel distributed internet data capture method and system in which the set of target sites to be crawled can be freely extended through configuration, and which adopts a parallel and distributed design so that capture efficiency and timeliness are ensured. Second, the present invention adopts a flexible duplicate-detection and incremental-capture mechanism: without any secondary query of the database, simply by checking the files under the local page storage path, it detects duplicate pages and captures only new content, which guarantees the uniqueness of the captured data and saves substantial software and hardware resources. Third, the present invention supports unified capture of both dynamic and static web pages. The method and system of the present invention therefore have the advantages of high quality and high efficiency.

Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.

Description of drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a flowchart of the parallel distributed internet data capture method of an embodiment of the invention;

Fig. 2 is a structural block diagram of the parallel distributed internet data capture system of an embodiment of the invention;

Fig. 3 is a detailed flowchart of the parallel distributed internet data capture method of an embodiment of the invention under normal conditions;

Fig. 4 is a detailed flowchart of the parallel distributed internet data capture method of an embodiment of the invention under exceptional conditions;

Fig. 5 is a schematic diagram of the deployment structure of the parallel distributed data capture module of an embodiment of the invention; and

Fig. 6 is a schematic diagram of the unified capture of dynamic and static web pages of an embodiment of the invention.

Embodiment

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and are not to be construed as limiting it.

In the description of the present invention, it should be understood that terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise" and "counterclockwise" indicate orientations or positional relationships based on those shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be construed as limiting the present invention.

In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be construed as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may thus explicitly or implicitly include one or more such features. In the description of the present invention, "a plurality of" means two or more, unless expressly and specifically limited otherwise.

In the present invention, unless expressly specified and limited otherwise, terms such as "mounted", "linked", "connected" and "fixed" are to be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be a connection internal to two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the circumstances.

In the present invention, unless expressly specified and limited otherwise, a first feature being "on" or "under" a second feature may include the first and second features being in direct contact, and may also include the first and second features not being in direct contact but being in contact through another feature between them. Moreover, a first feature being "on", "above" or "over" a second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a higher level than the second feature. A first feature being "under", "below" or "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or merely indicates that the first feature is at a lower level than the second feature.

The present invention belongs to the fields of computer application technology and information technology, and mainly concerns the implementation of a focused web crawler based on depth-first crawling. The web crawler is the basis and prerequisite of internet information analysis: all analysis operations are carried out on the massive internet data captured by the web crawler.

The main purpose of the present invention is to solve three problems: efficient and accurate capture of massive internet data, flexible duplicate detection and incremental capture of the crawled data, and unified capture of dynamic and static web pages. A prominent requirement of internet-based data analysis is that the data be timely, and the volume of internet data is enormous; to ensure comprehensive coverage, a technique is needed that can capture massive internet data efficiently and accurately. We developed the parallel distributed internet data capture system for this purpose. To support its parallel and distributed operation, a flexible and efficient duplicate-detection and incremental-capture mechanism is implemented, which guarantees the uniqueness of the captured data and high utilization of software and hardware resources. Web pages on the internet today fall into two classes, dynamic and static, so the capture system must support both. We apply exactly the same capture method and flow to dynamic and static pages, which reduces program complexity and the difficulty of later maintenance.

One aspect of the present invention proposes a parallel distributed internet data capture method which, as shown in Fig. 1, comprises: S1. setting crawl configuration information for a target website; S2. according to the crawl configuration information, starting from a section index page of the target website, capturing one by one the links to the articles that appear on the section index page, and following each article link to crawl the article pagination information and body content; and S3. determining whether the article includes comment data and, if so, further following the link to the comment pages to crawl the comment pagination information and comment content.

Another aspect of the present invention proposes a parallel distributed internet data capture system, comprising: a configuration module 100, with which the user sets the crawl configuration information for a target website; an article capture module 200 which, according to the crawl configuration information and starting from a section index page of the target website, captures one by one the links to the articles that appear on the section index page, and follows each article link to crawl the article pagination information and body content; a judgment module 300 for determining whether the article includes comment data; and a comment capture module 400 which, when the article includes comment data, further follows the link to the comment pages to crawl the comment pagination information and comment content.

To help those skilled in the art better understand the technical solution of the present invention, it is further described below with reference to Figs. 3 to 6.

The present invention mainly addresses three problems: how to capture massive internet data efficiently and accurately, how to perform flexible duplicate detection and incremental capture on the crawled data, and how to capture dynamic and static web pages in a unified way. This part first presents the overall implementation of the system; then, building on the overall implementation, it describes the details of each part and the related program implementation.

1. Overall implementation of the internet data capture system

The overall design of the parallel distributed internet data crawler can be summarized as follows. The system takes as input the crawl configuration information that has been manually configured in advance for each section of the target website. When the data capture system is started, it begins at the section index page of the website, captures one by one all the article links that appear on it, and follows each article link to capture the article pagination information and body content. If an article has comment data, the system further captures the comment content on the comment pages (including paginated comments).
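The flow just described (index page, then article links, then paginated bodies, then paginated comments) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the `Page` structure, its field names and the injected `fetch` callable are all hypothetical, and real pages would have to be downloaded and parsed according to the crawl configuration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Page:
    """Hypothetical parsed page: its content plus the links found on it."""
    content: str = ""
    article_links: List[str] = field(default_factory=list)  # used on index pages
    next_page: str = ""      # link to the next page of a paginated body or comment list
    comment_link: str = ""   # link to the first comment page, empty if none


def crawl_paginated(url: str, fetch: Callable[[str], Page]) -> List[Page]:
    """Follow next_page links from url, guarding against pagination loops."""
    pages, seen = [], set()
    while url and url not in seen:
        seen.add(url)
        page = fetch(url)
        pages.append(page)
        url = page.next_page
    return pages


def crawl_section(index_url: str, fetch: Callable[[str], Page]) -> Dict[str, dict]:
    """Go deep into each article on the index page before moving to the next one."""
    results = {}
    for link in fetch(index_url).article_links:
        body_pages = crawl_paginated(link, fetch)
        # Assumption: the comment link, if any, is found on the first body page.
        comment_link = body_pages[0].comment_link if body_pages else ""
        comment_pages = crawl_paginated(comment_link, fetch) if comment_link else []
        results[link] = {
            "body": [p.content for p in body_pages],
            "comments": [p.content for p in comment_pages],
        }
    return results
```

Passing `fetch` in as a parameter keeps the traversal independent of how pages are downloaded and parsed, so the same loop can be exercised against an in-memory dictionary of pages.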

The operating flow of the data capture system is shown in Fig. 3, and the exception handling flow is shown in Fig. 4.
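The details of Fig. 4 are not reproduced in the text, but the behavior stated in the summary (record log information when the crawl fails, then retry until it succeeds) can be sketched as below. The function name, the fixed delay between attempts and the optional `max_attempts` cap are assumptions added for illustration; the source only says that retrying continues until success.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("crawler")


def fetch_with_retry(
    fetch: Callable[[str], T],
    url: str,
    delay_seconds: float = 1.0,
    max_attempts: Optional[int] = None,
) -> T:
    """Call fetch(url); on any exception, log it and try again.

    With max_attempts=None the loop retries until the fetch succeeds,
    as the source describes. The cap is a hypothetical safety valve.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return fetch(url)
        except Exception as exc:
            logger.warning("crawl failed (attempt %d) for %s: %s", attempt, url, exc)
            if max_attempts is not None and attempt >= max_attempts:
                raise
            time.sleep(delay_seconds)
```

In practice an unbounded retry loop can stall on a permanently broken URL, which is why the sketch leaves room for a cap even though the source does not mention one.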

2. Detailed implementation of the internet data capture system

This part addresses in turn the three problems named above (efficient and accurate capture of massive internet data, flexible duplicate detection and incremental capture of the crawled data, and unified capture of dynamic and static web pages) and describes the concrete implementation of each.

1) Efficient and accurate capture of massive internet data

To capture massive internet data efficiently and accurately, the present invention works on two fronts. The first is parallelism: multiple crawler instances are started on the same crawler server and crawl network data in parallel. The second is distribution: crawlers are deployed on multiple servers at the same time, and the crawler on each server works independently. As shown in the schematic diagram of Fig. 5, the whole data capture system is built around a central database. Multiple crawler servers are deployed around this central database, and each crawler server runs multiple crawling threads simultaneously. This program structure ensures efficient, real-time capture of massive internet data. It also, however, raises the problems of duplicated capture and of how to capture incrementally, which the next part addresses.
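The two levels can be illustrated with a short Python sketch: `partition_sections` stands for the distributed level (splitting the configured sections across servers) and `run_server` for the parallel level (several crawl threads on one server). Both function names are hypothetical, and the round-robin split is an assumption; the source does not say how work is assigned to servers or how the central database is used for coordination.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TypeVar

R = TypeVar("R")


def partition_sections(sections: List[str], server_count: int) -> List[List[str]]:
    """Assign section index URLs to servers round-robin (assumed policy)."""
    shards: List[List[str]] = [[] for _ in range(server_count)]
    for i, section in enumerate(sections):
        shards[i % server_count].append(section)
    return shards


def run_server(
    sections: List[str],
    crawl_section: Callable[[str], R],
    threads: int = 4,
) -> Dict[str, R]:
    """Crawl this server's sections concurrently on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the same order as the input sections.
        results = list(pool.map(crawl_section, sections))
    return dict(zip(sections, results))
```

Threads suit this workload because crawling is dominated by network waits rather than computation. Each server would call `run_server` on its own shard, independently of the others.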

2) Flexible duplicate detection and incremental capture of crawled data

An important measure of the success of an internet data capture system design is whether it supports duplicate detection and incremental capture. On the one hand, because crawlers run in a loop, they inevitably encounter web page information they have already captured. On the other hand, internet information changes in real time: between two crawls of the same website section at different times, the information under the section may have changed (comments in particular accumulate gradually). Since earlier crawls have already saved a large amount of data, the next crawl should not capture the saved data again. Repeated capture makes the data redundant, which makes later analysis inaccurate, and it seriously wastes system resources, network bandwidth and the like. To implement duplicate detection and incremental capture, the system adopts a simple scheme: it checks whether the corresponding web page file already exists under the directory where captured pages are saved, and thereby decides whether the current page is a repeat. Specifically, when captured data is saved, the directory structure encodes information about the web page: each directory level represents, respectively, the website, section, subsection and title the page comes from. This directory structure guarantees that identical web pages always map to the same save path. Because two files with the same name cannot be saved in the same directory, duplicates are detected. The design has the following advantages. First, it does not require frequent interaction with an external data source; the needed information is obtained simply by probing for the result file, so efficiency is higher. Second, it ensures the crawler module has a single entry and a single exit, reducing coupling with other modules.
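A minimal Python sketch of this path-based duplicate check follows. The helper names, the `.html` suffix and the character-sanitizing rule are hypothetical. The sketch also uses exclusive file creation (`open(..., "x")`) instead of a separate existence check, which is a substitution made here so that checking and saving happen in one step; the source only says that the directory is checked for an existing file.

```python
import re
from pathlib import Path


def safe_name(part: str) -> str:
    """Replace characters that are unsafe in file names (assumed rule)."""
    return re.sub(r'[\\/:*?"<>|\s]+', "_", part).strip("_") or "_"


def page_path(root: str, site: str, section: str, subsection: str, title: str) -> Path:
    """One directory level per attribute, so the same page always maps to the same path."""
    return Path(
        root, safe_name(site), safe_name(section), safe_name(subsection),
        safe_name(title) + ".html",
    )


def save_if_new(path: Path, content: str) -> bool:
    """Save the page unless a file already exists at path.

    Returns True if the page was new and saved, False if it was a duplicate.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" fails with FileExistsError if the file is already there.
        with open(path, "x", encoding="utf-8") as f:
            f.write(content)
    except FileExistsError:
        return False
    return True
```

No database query is involved in the check, which matches the advantage claimed above. For incremental capture, a crawler would skip downloading whenever the target path already exists and fetch only pages whose paths are new.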

3) Unified capture of dynamic and static web pages

Page crawling can be divided into static page crawling and dynamic page crawling. Crawling a static web page is simple: request the static URL over HTTP and save the response, and all the article text and comment content displayed on the page are obtained. A dynamic page cannot be handled so simply, because its content is generated dynamically: the page source does not contain the article text or comment content displayed on the page, only some JavaScript commands that send further requests to the server. By using the Fiddler tool to analyze the HTTP responses and activity that occur when the page is opened, one finds that the dynamic content can also be accessed in a definite form: it has simply been converted to another format and stored under a hidden URL, and accessing that hidden URL yields the dynamic content. Obtaining static and dynamic pages therefore reduces to the same procedure: find the URL that actually contains the required content, request it over HTTP, and save the response. The unified capture process for static and dynamic web pages is shown schematically in Fig. 6. The key to unified capture is finding the target URI. Once the target URI is found, the processing flow is uniform: the DownloadPage function sends an HTTP request to the target URI and receives the corresponding HTTP response, which contains the content of the web page. The same procedure applies to both article pages and comment pages. As to where the capture of dynamic and static pages differs, Fig. 6 shows that the target URI of a static page is easy to find (or rather, it is formed by the program combining simple URI patterns from the configuration), whereas the target URI of a dynamic page is usually hidden and can only be obtained through analysis with an external tool. Once the external tool has yielded the target URI of the dynamic page, or the pattern and rules of the target URI, the capture of dynamic and static pages is unified into one flow. The differences between dynamic and static capture shown in the figure are generally all handled at the crawl configuration stage, so the capture program itself implements only the processing that follows once the target URI or target URI rule is known.
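The unified flow can be sketched in Python as follows. The source names a `DownloadPage` function but does not show it; `download_page` below is a stand-in written with the standard library's `urllib.request.urlopen`. The configuration keys, the URI patterns and the `example.com` addresses are all hypothetical, chosen only to show that a static page and a dynamic page differ in configuration and not in code.

```python
from typing import Callable, Dict
from urllib.request import urlopen


def download_page(target_uri: str, timeout: float = 10.0) -> bytes:
    """Stand-in for the DownloadPage function: HTTP GET, return the raw response body."""
    with urlopen(target_uri, timeout=timeout) as response:
        return response.read()


def resolve_target_uri(config: Dict[str, str], params: Dict[str, object]) -> str:
    """Build the target URI from the pattern stored in the crawl configuration."""
    return config["uri_pattern"].format(**params)


def crawl(
    config: Dict[str, str],
    params: Dict[str, object],
    download: Callable[[str], bytes] = download_page,
) -> bytes:
    """Same code path for static and dynamic pages once the target URI is known."""
    return download(resolve_target_uri(config, params))


# Hypothetical configurations. The static pattern is the visible page URL; the
# dynamic pattern is the hidden URL found by inspecting traffic with a tool
# such as Fiddler.
STATIC_CONFIG = {"uri_pattern": "https://example.com/news/{article_id}.html"}
DYNAMIC_CONFIG = {
    "uri_pattern": "https://example.com/api/comments?article={article_id}&page={page}"
}
```

Calling `crawl(STATIC_CONFIG, {"article_id": 7})` and `crawl(DYNAMIC_CONFIG, {"article_id": 7, "page": 2})` would go through identical code; only the configured pattern differs.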

It should be noted that any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code comprising one or more executable instructions for implementing a specific logical function or step of the process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

In this specification, a description referring to the terms "an embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention without departing from its principle and purpose.

Claims

1. A parallel distributed internet data capture method, characterized in that it comprises the steps of:

setting crawl configuration information for a target website;

according to the crawl configuration information, starting from a section index page of the target website, capturing one by one the links to the articles that appear on the section index page, and following each article link to crawl the article pagination information and body content; and

determining whether the article includes comment data and, if so, further following the link to the comment pages to crawl the comment pagination information and comment content.

2. The parallel distributed internet data capture method as claimed in claim 1, characterized in that it further comprises: when an exception occurs during the crawl, recording log information and retrying the crawl until it succeeds.

3. The parallel distributed internet data capture method as claimed in claims 1 and 2, characterized in that the data capture is carried out in a parallel, distributed manner.

4. A parallel distributed internet data capture system, characterized in that it comprises:

a configuration module, with which the user sets the crawl configuration information for a target website;

an article capture module which, according to the crawl configuration information and starting from a section index page of the target website, captures one by one the links to the articles that appear on the section index page, and follows each article link to crawl the article pagination information and body content;

a judgment module for determining whether the article includes comment data; and

a comment capture module which, when the article includes comment data, further follows the link to the comment pages to crawl the comment pagination information and comment content.

5. The parallel distributed internet data capture system as claimed in claim 4, characterized in that it further comprises: a logging module for recording log information when an exception occurs during the crawl, whereupon the parallel distributed internet data capture system retries the crawl until it succeeds.

6. The parallel distributed internet data capture system as claimed in claims 4 and 5, characterized in that the data capture module has a parallel distributed structure.