CN103838786A

CN103838786A - Web data automatic collecting method

Info

Publication number: CN103838786A
Application number: CN201210490953.1A
Authority: CN
Inventors: 苏晓华; 李勇
Original assignee: DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Current assignee: DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority date: 2012-11-27
Filing date: 2012-11-27
Publication date: 2014-06-04

Abstract

The invention discloses a method for automatic Web data collection that combines network robot (crawler) technology with web page data extraction technology. The network robot technology covers the design of the robot's workflow, the formulation of robot design principles, a depth-first search strategy, a breadth-first search strategy, avoidance of network traps, balanced access, and hyperlink extraction. The web page data extraction technology covers extraction of plain text from web pages and analysis and processing of special characters in the text. By making full use of both technologies, the method collects valuable data from massive amounts of information for analysis and research, provides a basis for enterprise decisions, addresses the problem faced by data collection personnel and market researchers, broadens the usability of the Web, and contributes to the development of data collection, automatic data collection in particular.

Description

A method for automatic Web data collection

Technical field

The present invention relates to data collection technology, and in particular to a method for automatic Web data collection.

Background art

As Internet resources grow richer and the volume of online information keeps expanding, people depend on the network more and more, yet it has become inconvenient for users to quickly find the specific resources they need among the vast sea of Internet resources. Information has always been valuable. With the passage of time, humanity has entered the information age almost without noticing, and every trade and profession is now flooded with information. The value of information lies in the circulation of data: only when data can circulate and be transmitted in a timely manner can information deliver its real value. Under market economy conditions, data collection has become an important tool and means.

How to collect valuable data from massive amounts of information, analyze and study it, and turn it into a basis for enterprise decisions is the problem faced by data collection personnel and market researchers. Quickly finding and obtaining the information and services one needs from a large volume of data is becoming more and more difficult, and users often lose track of their target or obtain biased results when querying. Data produces value only after it has been collected, integrated, and analyzed; scattered information is merely news and cannot reflect real commercial value. For enterprises and information analysts, the task is on the one hand to filter out the points of real value from a large amount of information, and on the other hand to reduce the cost of obtaining that information, so that the practical value of the information exceeds the cost of collecting and analyzing it and the information adds value to enterprise decisions.

The spread of the Internet and the development of information technology have produced a large amount of information resources. Extracting useful resources from this mass of information is a problem in urgent need of a solution. The main information expressed by a Web page is usually hidden among a large amount of irrelevant structure and text, so users cannot quickly obtain the subject information, which limits the usability of the Web. Automatic Web collection helps to solve this problem. Automatic collection saves time and effort and covers a wide range of information, although the quality of the extracted information is not high, which affects precision. Most data collection work now adopts automatic collection, and automatic collection technology arose against this background.

Summary of the invention

In view of the above problems, the present invention provides a method for automatic Web data collection that applies network robot technology and web page data extraction technology.

The technical solution of the present invention is as follows:

A method for automatic Web data collection, characterized in that it comprises the following steps:

A, network robot technology:

A1, design the network robot workflow: the robot takes one URL or a group of URLs as its browsing starting point and accesses the corresponding WWW documents, the WWW documents being HTML documents;

A2, formulate the network robot design principles;

A21, formulate the robots exclusion standard: create a robots text file on the server, and state in this text file which links of the website may not be accessed and which robots the website refuses;

A22, formulate the robots META tag: the user adds a META tag to a page; this META tag lets the owner of the page specify whether robot programs are allowed to index the page or to extract links from it;

A3, depth-first search strategy and breadth-first search strategy;

A31, the depth-first search strategy starts from the start node: after the first document is analyzed, the page pointed to by its first link is fetched; after that page is analyzed, the document pointed to by its first link is fetched in turn; this is repeated until a document containing no hyperlinks is reached, which is defined as one complete chain; the search then returns to an earlier document and continues with the remaining hyperlinks in that document; the search ends when all hyperlinks have been searched;

A32, the breadth-first search strategy, after the first document is analyzed, searches all hyperlinks in that Web page and then continues with the search of the next layer, until the search of the bottom layer is complete;

A4, network traps;

A41, before a new URL is accessed, it is compared with the URLs in the to-be-searched URL list and the already-searched URL list; this comparison is a comparison between URL objects; a URL contained in neither list is added to the to-be-searched URL list, so as to avoid falling into a network trap;

A42, when hyperlinks are extracted from a Web document, all URLs that carry parameters are ignored;

A43, limit the search depth of the robot: the robot stops searching downward once the threshold search depth is reached, where each step into the next level of sub-links counts as reaching a new search depth; alternatively, set a maximum duration for accessing a Web server: timing starts when the robot accesses the first web page of the Web server, and once the maximum duration has elapsed, the robot programs crawling that server immediately drop all connections to it;

A5, balanced access: set a maximum number of threads for accessing one Web server, and use waiting to limit the frequency with which a robot program or process accesses a particular server or network segment; each time a robot program or process obtains a document from a Web site, it waits for a certain interval before accessing that Web site again; the length of the wait is determined by the processing capacity of the site and the communication capacity of the network; the time T1 of the next access to the Web site is the current time T2 plus the time required to access the Web site, where the time required to access the Web site is the network transmission time T3 multiplied by a set coefficient;

A6, hyperlink extraction: while obtaining URL links, the robot program continues to collect data from the Web source documents that the obtained links point to, and the Web source documents are converted into character streams;

B, web page data extraction technology;

B1, extraction of web page plain text: the obtained HTML source file is filtered to delete the tag control characters in it and extract the text information, and after the web page data is filtered, the character format of the web page data is unified;

B2, the special characters in the text are analyzed and processed.

By adopting the above technical solution, the method for automatic Web data collection provided by the present invention makes full use of network robot technology and web page data extraction technology to form an automatic Web collection method. It collects valuable data from massive amounts of information for analysis and research, provides a basis for enterprise decisions, solves the problem faced by data collection personnel and market researchers, broadens the usability of the Web, and contributes to the development of data collection, automatic data collection in particular.

Brief description of the drawings

Fig. 1 is the workflow diagram of the network robot of the present invention;

Fig. 2 is the workflow diagram of HTML web page plain text extraction of the present invention.

Embodiment

A network robot is a software program that uses the hyperlinks in Web documents to recursively access new documents. The automatic collection mechanism uses search software known as a network robot, or Robot, to automatically collect websites and web pages according to certain rules and add them to an index database;

The method for automatic Web data collection shown in Fig. 1 and Fig. 2 comprises the following steps:

A, network robot technology:

A1, first design the basic workflow of the network robot: the Robot takes one URL or a group of URLs as its browsing starting point and accesses the corresponding WWW documents, which are generally HTML documents;

A2, formulate the design principles;

A21, the Robots Exclusion standard: a Robots.txt file is created on the server, stating which links of the site may not be accessed and which Robots the site refuses;
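As an illustration of step A21, the following minimal sketch checks a URL against a robots exclusion file using Python's standard `urllib.robotparser` module. The file content, the robot names `ExampleBot` and `BadBot`, and the `example.com` URLs are hypothetical and are not taken from the original text.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots exclusion file: it refuses one robot entirely and
# marks one directory as inaccessible to every other robot.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved robot asks before every fetch.
allowed_public = parser.can_fetch("ExampleBot", "http://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "http://example.com/private/a.html")
allowed_badbot = parser.can_fetch("BadBot", "http://example.com/index.html")
```

In a real robot the file would be downloaded from the server root before any page of that server is requested; the sketch parses an in-memory string only to stay self-contained.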

A22, the Robots META tag: users can add a META tag to their own pages; the Robots META tag lets the owner of a page specify whether Robot programs are allowed to index the page or to extract links from it;
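Step A22 can be sketched as follows, assuming the common `noindex`, `nofollow`, and `none` directive values of the Robots META convention (the original text does not list the values). The class name `RobotsMetaParser` and the helper `robots_meta` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Reads <META NAME="ROBOTS" CONTENT="..."> and records the permissions."""

    def __init__(self):
        super().__init__()
        self.may_index = True    # robot may index this page
        self.may_follow = True   # robot may extract links from this page

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lower-cases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
        if "noindex" in directives or "none" in directives:
            self.may_index = False
        if "nofollow" in directives or "none" in directives:
            self.may_follow = False


def robots_meta(page_source: str):
    """Return (may_index, may_follow) for one HTML page."""
    parser = RobotsMetaParser()
    parser.feed(page_source)
    return parser.may_index, parser.may_follow
```

A page without such a tag leaves both permissions at their default of allowed, which matches the usual reading of the convention.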

A3, depth-first search strategy and breadth-first search strategy;

A31, the depth-first search strategy starts from the start node: after the first document is analyzed, the page pointed to by its first link is fetched; that page is then analyzed and the document pointed to by its first link is fetched in turn; this is repeated until a document containing no hyperlinks is reached, which counts as one complete chain; the search then returns to an earlier document and continues with the other hyperlinks in that document; the search ends when no further hyperlinks remain to be searched;

A32, the breadth-first search strategy, after the first document is analyzed, first searches all hyperlinks in that Web page and then continues with the search of the next layer, until the bottom layer is reached;

At present, the way the Web pages of a site are organized directly determines which strategy the designer prefers. Because the robot determines its search strategy by the way it accesses the URL list, the key implementation question is whether the to-be-searched list is treated as a queue or as a stack. If it is treated as a queue, new hyperlinks are added at the tail and taken from the head, which gives a breadth-first traversal; if it is treated as a stack, new hyperlinks are added at the head and taken from the head, which gives a depth-first traversal;
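The queue-versus-stack observation above can be sketched with a single to-be-searched list that differs only in where new hyperlinks are inserted. The link graph `LINKS` and the function name `crawl_order` are hypothetical; a real robot would fetch pages instead of reading a dictionary.

```python
from collections import deque

# Hypothetical site structure: page -> hyperlinks found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def crawl_order(start, links, strategy="bfs"):
    """Return the order in which pages are visited.

    Pages are always taken from the head of the list. New hyperlinks go
    to the tail (queue, breadth-first) or to the head (stack, depth-first).
    """
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        page = frontier.popleft()           # always take from the head
        order.append(page)
        new_links = [u for u in links.get(page, []) if u not in seen]
        seen.update(new_links)
        if strategy == "bfs":
            frontier.extend(new_links)      # queue: add at the tail
        else:
            # stack: add at the head, keeping the page's own link order
            frontier.extendleft(reversed(new_links))
    return order
```

For simplicity the sketch marks a link as seen when it is added to the list, which is enough for tree-shaped sites like the example; a site with many cross-links may be visited in a slightly different order than a textbook depth-first search.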

A4, network traps;

A41, before a new URL is accessed, it should first be compared with the URLs in the to-be-searched and already-searched URL lists; only a brand-new URL may be added to the to-be-searched URL list, which avoids falling into a network trap. Note that in the implementation this comparison should be a comparison between URL objects rather than between strings, so as to avoid the problem of several different URL strings corresponding to the same host;
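One way to approximate the URL-object comparison of step A41 is to reduce each URL to a normalized key before looking it up. The sketch below is an assumption about how such a comparison could be done with `urllib.parse`; it normalizes only the syntax (case of scheme and host, default port, fragment, empty path) and does not resolve host names to addresses. The names `url_key` and `Frontier` are hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def url_key(url: str):
    """Reduce a URL string to a comparable key; the fragment is dropped."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    path = parts.path or "/"
    return (scheme, host, port, path, parts.query)


class Frontier:
    """To-be-searched list plus the set of every URL already known."""

    def __init__(self):
        self.pending = []   # URLs waiting to be searched
        self.known = set()  # keys of URLs pending or already searched

    def add(self, url: str) -> bool:
        """Add the URL only if it is brand new; return True if added."""
        key = url_key(url)
        if key in self.known:
            return False
        self.known.add(key)
        self.pending.append(url)
        return True
```

With this key, `http://example.com/index.html` and `HTTP://Example.COM:80/index.html#top` count as the same URL, although their strings differ.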

A42, when hyperlinks are extracted from a WEB document, all URLs with parameters are ignored;

A43, in an actual Robot search, the search depth must be limited: each step into the next level of sub-links counts as reaching a new depth, and once the specified threshold depth is reached, the Robot stops searching further down. Alternatively, a maximum duration for accessing a Web server can be set: timing starts when the Robot accesses the first web page of the Web server, and once the maximum duration has elapsed, the Robot programs crawling that server immediately drop all connections to the Web server;
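Both limits of step A43 can be sketched in one loop. The function name `crawl_limited`, the `get_links` callback, and the injectable `clock` parameter are hypothetical and exist only to keep the example self-contained and testable; a real robot would fetch pages over the network and close its connections when the time limit is hit.

```python
import time


def crawl_limited(start, get_links, max_depth, max_seconds, clock=time.monotonic):
    """Visit pages from `start`, bounded by a depth threshold and a time budget.

    get_links(url) returns the hyperlinks found on that page.
    Returns the list of (url, depth) pairs that were visited.
    """
    started = clock()            # timing starts with the first page
    visited = []
    seen = {start}
    stack = [(start, 0)]
    while stack:
        if clock() - started > max_seconds:
            break                # time budget used up: stop at once
        url, depth = stack.pop()
        visited.append((url, depth))
        if depth >= max_depth:
            continue             # threshold depth reached: do not go deeper
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                stack.append((link, depth + 1))
    return visited
```

The depth limit is what protects against a trap that generates an endless chain of fresh URLs, since every one of those URLs is new and would pass the duplicate check.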

A5, balanced access: a Web server should be accessed with only a few threads. When the program is designed, a maximum number of threads for accessing one Web server is specified, which limits the number of threads accessing that server. In addition, the frequency with which a Robot program or process accesses a particular server or network segment must be limited; the basic method is to wait. Each time a Robot program or process obtains a document from a Web site, it must wait for a certain interval before accessing that Web site again; the length of the wait is generally determined by the processing capacity of the site and the communication capacity of the network. A common design is that the time T1 of the next access to the Web site is the current time T2 plus the time required to access the Web site, where the time required is mainly the network transmission time T3 multiplied by a set coefficient, the "good-guy factor", that is: T1=T2+T3*good-guyfactor;
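The waiting rule T1=T2+T3*good-guyfactor can be sketched as a small per-host scheduler. The class name `PoliteScheduler` and the default coefficient of 10 are assumptions made for this example; the original text does not give a value for the coefficient.

```python
class PoliteScheduler:
    """Tracks, per host, the earliest time T1 of the next allowed access."""

    def __init__(self, good_guy_factor=10.0):
        # Assumed value; the description only says the coefficient is "set".
        self.good_guy_factor = good_guy_factor
        self.next_allowed = {}   # host -> T1

    def record_fetch(self, host, now, transfer_time):
        """After a fetch at time T2 (`now`) that took T3 (`transfer_time`),
        store T1 = T2 + T3 * good_guy_factor for this host."""
        self.next_allowed[host] = now + transfer_time * self.good_guy_factor

    def wait_needed(self, host, now):
        """Seconds the robot still has to wait before accessing `host`."""
        return max(0.0, self.next_allowed.get(host, now) - now)
```

Because T3 is measured per fetch, a slow or heavily loaded server automatically earns a longer pause than a fast one, which is the point of tying the wait to the transmission time.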

A6, hyperlink extraction: this section concentrates on the method for extracting text hyperlinks. The syntax of a text hyperlink in an HTML document is as follows:

<A HREF="hyperlink URL address">hyperlink display text</A>

The goal of hyperlink extraction is to obtain the hyperlink URL address. A simple search procedure is to first convert all characters of the HTML source file uniformly to upper case or lower case, then locate the "HREF" mark that follows an "<A" mark in the document; once found, the link that follows it is analyzed, and only links that are in a web page format such as ".htm", ".html", ".shtml", ".jsp", ".asp", or ".php" and that carry no parameters are kept. This process is repeated until the "HREF" marks after all "<A" marks in the document have been handled. While obtaining URL links, the Robot program keeps collecting data from the WEB source documents that the obtained links point to, so as to obtain more WEB links and data. In the implementation, the source document should be converted into a character stream so that it can be displayed more accurately;
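The extraction rule above, together with the rule of step A42 that URLs with parameters are ignored, can be sketched with Python's standard `html.parser` instead of the manual case-folding and string search that the description outlines. This is a substitute technique, not the procedure of the original text; the names `LinkExtractor` and `extract_links` and the base URL used in the usage note are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Web page formats named in the description.
PAGE_SUFFIXES = (".htm", ".html", ".shtml", ".jsp", ".asp", ".php")


class LinkExtractor(HTMLParser):
    """Collects the HREF of every <A> tag that points to a web page
    and carries no parameters."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so <A HREF=...>
        # and <a href=...> are handled alike.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self._consider(urljoin(self.base_url, value))

    def _consider(self, url):
        parts = urlsplit(url)
        if parts.query:
            return                      # ignore URLs with parameters
        if parts.path.lower().endswith(PAGE_SUFFIXES):
            self.links.append(url)


def extract_links(page_source, base_url):
    extractor = LinkExtractor(base_url)
    extractor.feed(page_source)
    return extractor.links
```

For example, given a page at the hypothetical address `http://example.com/dir/`, a link to `a.html` is kept as an absolute URL, while `/b.php?id=1` (parameters) and `img.png` (not a web page format) are dropped.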

B, web page data extraction technology: this largely determines the efficiency and quality of information collection;

B1, extraction of web page plain text: the obtained HTML source file is first filtered to remove the Tag control characters in it and extract the text information. In the implementation, all "<" marks and ">" marks in the HTML source file can be handled as follows: first locate the position of a "<" mark, then locate the position of the nearest following ">" mark, and remove the string between the two positions; or first locate the position of a ">" mark, then locate the position of the nearest following "<" mark, and accumulate the string between the two positions. Script code has the same textual features as described above, so care must be taken to exclude it when text is extracted. One way to exclude it is, when a <script> start tag is encountered while the HTML is being parsed, to find the matching </script> end tag immediately and then continue parsing after it. Another way is to extract it provisionally as ordinary text and then judge whether it is script code; if it is script, it is not collected. When the text in a web page is stored, separators should be added between the separate pieces of text. In actual text processing, tags need to be divided into two classes according to their meaning: one class is dividing tags, the other is non-dividing tags. The latter class includes <A>, <B>, <I>, <EM>, <T2>, <BIG>, <SUB>, <SMALL>, <STRONG>, <STRIKE>, <BR>, and so on; such tags do not separate text semantically, and when one appears between two pieces of text, the two pieces should be regarded as continuous. After the web page data is filtered, the character format of the web page data is unified;
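The plain text extraction of step B1 can be sketched as follows. The sketch uses Python's standard `html.parser` rather than the manual search for "<" and ">" marks, skips everything between <script> and </script>, treats the tags listed above as non-dividing, and puts a separator between the remaining pieces of text. The names `TextExtractor` and `extract_text` and the newline separator are assumptions made for this example.

```python
from html.parser import HTMLParser

# Non-dividing tags, copied from the list in the description ("t2" as given).
NON_DIVIDING_TAGS = {
    "a", "b", "i", "em", "t2", "big", "sub", "small", "strong", "strike", "br",
}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.segments = []      # finished pieces of text
        self.current = []       # fragments of the piece being built
        self.in_script = False

    def flush(self):
        """Close the current piece of text and normalize its whitespace."""
        text = " ".join("".join(self.current).split())
        if text:
            self.segments.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        if tag not in NON_DIVIDING_TAGS:
            self.flush()        # a dividing tag ends the current piece

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        if tag not in NON_DIVIDING_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self.in_script:  # script code is not collected
            self.current.append(data)


def extract_text(page_source, separator="\n"):
    extractor = TextExtractor()
    extractor.feed(page_source)
    extractor.close()
    extractor.flush()
    return separator.join(extractor.segments)
```

Text on both sides of a non-dividing tag such as <b> ends up in the same piece, while a dividing tag such as <p> or <div> starts a new piece after the separator.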

B2, processing of special characters: some special characters appear in the text, and since the text is the main object to be processed, these special characters must be analyzed and handled before the text itself is processed. For example, "&copy; All rights reserved &copy;" in an HTML document is displayed in a browser as "© All rights reserved ©". When HTML is parsed with a high-level programming language, such characters may come out garbled, so the special characters must be resolved explicitly.
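As a minimal sketch of step B2, the special characters (HTML character references) in extracted text can be resolved with the standard `html.unescape` function. The function name `decode_special_characters` is hypothetical, and using a library call in place of a hand-written resolver is an assumption of this example.

```python
import html


def decode_special_characters(text: str) -> str:
    """Replace character references such as &copy; or &#169; with the
    characters they stand for."""
    return html.unescape(text)


decoded = decode_special_characters("&copy; All rights reserved &copy;")
```

Decoding is best done after the tags have been removed, so that a decoded "<" or ">" in the text is not mistaken for the start or end of a tag.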

The method for automatic Web data collection provided by the present invention makes full use of network robot technology and web page data extraction technology to form an automatic Web collection method. It collects valuable data from massive amounts of information for analysis and research, provides a basis for enterprise decisions, solves the problem faced by data collection personnel and market researchers, broadens the usability of the Web, and contributes to the development of data collection, automatic data collection in particular.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A method for automatic Web data collection, characterized in that it comprises the following steps:

A, network robot technology:

A2, formulate the network robot design principles;

A3, depth-first search strategy and breadth-first search strategy;

A4, network traps;

B, web page data extraction technology;

B2, the special characters in the text are analyzed and processed.