Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described herein serve only to explain the invention and are not intended to limit it.
Referring to Fig. 1, which is a flowchart of an embodiment of the web page collection method of the present invention, the process comprises:
Step 101: Set the entry URLs from which the system starts running, and save them in the URL database;
Step 102: The URL input console obtains URLs one by one from the URL database and parses the corresponding hostname from each URL;
Step 103: The URL input console sends the hostname to the DNS request console;
Step 104: The DNS request console matches the received hostname against the hostname hash table (HostName Hash Table) that it maintains, and determines whether a DNS request has already been carried out successfully for the hostname; if so, go to step 108; if not, go to step 105;
The hostname hash table stores the DNS request result corresponding to each hostname. A successful match therefore indicates that the DNS database currently holds IP information for the hostname, and the URL can be sent to the web page acquisition console for fetching. An unsuccessful match indicates that the hostname has not yet been requested, or that it was requested but the request failed.
Step 105: The DNS request console sends the hostname to the DNS result processing unit (DNS result collection), where it is matched against the blacklist maintained by that unit to determine whether the match succeeds; if so, go to step 106; if not, go to step 107;
Step 106: Discard the hostname;
Step 107: Confirm that the hostname requires a DNS request; the DNS request console sends the hostname to a DNS sub-unit host to carry out the DNS request, and the URL corresponding to the hostname is returned to the URL database to wait to be obtained again.
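The decision made in steps 104 to 107 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the embodiment: the function name `route_hostname`, the use of a dict for the hostname hash table (a hostname maps to True once its DNS request has succeeded), and the use of a set for the blacklist are all hypothetical.

```python
def route_hostname(hostname, host_table, blacklist):
    """Return the next action for a hostname, following steps 104 to 107.

    host_table: hypothetical dict, hostname -> True if a DNS request succeeded.
    blacklist:  hypothetical set of hostnames that repeatedly failed.
    """
    # Step 104: a successful match means the DNS database already has a result.
    if host_table.get(hostname):
        return "fetch"      # continue with step 108
    # Step 105: hostnames on the blacklist are not requested again.
    if hostname in blacklist:
        return "discard"    # step 106
    # Step 107: the hostname still needs a DNS request.
    return "resolve"
```

A hostname that was requested before but failed (stored as False in this sketch) is treated the same as one that was never requested, which matches the explanation given for step 104.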
The DNS request console carries out the DNS request as follows:
1. The DNS request console distributes DNS request tasks to the DNS sub-unit hosts (ADNS, Asynchronous Domain Name System);
The DNS request console contains a task queue. When the URL input console sends a hostname to the DNS request console, the hostname is inserted into this queue. The DNS request console is responsible for distributing the hostnames in the queue to its subordinate DNS sub-unit hosts; this module therefore performs scheduling only and does not initiate DNS requests directly. Tasks can be distributed appropriately by either of the following two methods:
First method: distribute DNS request tasks according to the system resource usage of the DNS sub-unit hosts;
The DNS request console can monitor the CPU, memory, thread count, network usage, and so on of its subordinate DNS sub-unit hosts. When the system resource usage of a DNS sub-unit host is high, no DNS request task is assigned to that host; when the host has enough free resources to initiate a DNS request, the corresponding task is assigned to it.
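The first method can be illustrated with a short Python sketch. The `HostStatus` record, the thresholds `MAX_LOAD` and `MAX_THREADS`, and the rule of taking the first available host are assumptions made for illustration; the embodiment does not specify concrete thresholds or a selection order.

```python
from dataclasses import dataclass

# Assumed thresholds; the embodiment gives no concrete values.
MAX_LOAD = 0.8      # fraction of CPU, memory, or network in use
MAX_THREADS = 64    # running threads on a sub-unit host


@dataclass
class HostStatus:
    """Hypothetical snapshot of one DNS sub-unit host as seen by the console."""
    name: str
    cpu: float       # 0.0 to 1.0
    memory: float    # 0.0 to 1.0
    threads: int
    network: float   # 0.0 to 1.0


def is_available(host):
    """A host may take a task only if every monitored resource is below its limit."""
    return (host.cpu < MAX_LOAD and host.memory < MAX_LOAD
            and host.network < MAX_LOAD and host.threads < MAX_THREADS)


def pick_host(hosts):
    """Return the first host with spare resources, or None if all are busy."""
    for host in hosts:
        if is_available(host):
            return host
    return None
```

When `pick_host` returns None, the hostname would simply stay in the task queue until a host frees up.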
Second method: take the hash value of the hostname modulo the number of hosts, and distribute tasks according to the result.
The DNS request console hashes each hostname once and takes the hash value modulo N, that is, num = (Hash mod N), where N is the number of DNS sub-unit hosts. The resulting num is the number of a DNS sub-unit host, and the hostname is assigned to the num-th DNS sub-unit host for the DNS request.
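As a minimal sketch of num = (Hash mod N), the following Python function uses CRC-32 as the hash. The choice of CRC-32 is an assumption (the embodiment does not name a hash function); it is used here because, unlike Python's built-in `hash()` for strings, it gives the same value in every process. Host numbers in this sketch start at 0.

```python
import zlib


def assign_host(hostname, n_hosts):
    """Map a hostname to a sub-unit host number in the range 0 .. n_hosts - 1."""
    if n_hosts <= 0:
        raise ValueError("n_hosts must be positive")
    # num = Hash mod N, with CRC-32 standing in for the unspecified hash.
    return zlib.crc32(hostname.encode("utf-8")) % n_hosts
```

Because the mapping depends only on the hostname and N, the same hostname always goes to the same sub-unit host as long as N does not change. The same function would apply unchanged to URLs in the web page acquisition console.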
2. The DNS sub-unit hosts carry out the DNS requests and send the results to the DNS result processing unit;
The number of DNS sub-unit hosts can be determined according to the scale of the system and is typically three. The standard ADNS library is installed on each host, and the program that initiates DNS requests is designed to call the interface of this library. Because the DNS requests are asynchronous, a host that has initiated one request need not wait for the result to return before initiating the next one, thereby achieving multithreaded DNS requests and improving the efficiency of obtaining DNS results.
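The benefit of asynchronous requests can be illustrated in Python. This sketch does not use the ADNS library named above; it substitutes the standard `asyncio` module, whose `loop.getaddrinfo` runs lookups without blocking the caller. The names `system_resolve` and `resolve_all` are hypothetical, and the resolver is passed in as a parameter so that it can be replaced.

```python
import asyncio
import socket


async def system_resolve(hostname):
    """Resolve one hostname to an IPv4 address using the system resolver."""
    loop = asyncio.get_running_loop()
    infos = await loop.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return infos[0][4][0]


async def resolve_all(hostnames, resolve=system_resolve):
    """Start all lookups at once and collect (hostname, ip, error) tuples.

    No lookup waits for an earlier one to finish before it is started.
    """
    async def one(name):
        try:
            return name, await resolve(name), None
        except OSError as exc:   # socket.gaierror is a subclass of OSError
            return name, None, exc

    return await asyncio.gather(*(one(name) for name in hostnames))
```

A caller would run `asyncio.run(resolve_all(hostnames))` and forward each tuple to the DNS result processing unit.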
3. The DNS result processing unit manages the results of the DNS requests.
The DNS result processing unit analyzes the results returned by each DNS sub-unit host. Results of successful DNS requests are saved in the DNS database; DNS requests that failed for any reason are sent to the DNS error management unit and are used to update the hostname hash table;
The DNS result processing unit can also be extended to analyze and evaluate correct DNS results, assign different priorities to the results, and store them in the database. The DNS error management module (Manage DNS errorHost) can also classify hostnames that did not return a correct DNS result by error category, and maintain a blacklist of hostnames that have repeatedly failed to return a correct result, so that a hostname is not requested again after repeated requests have all failed.
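The handling of results and the maintenance of the blacklist might look as follows in Python. The class `DnsErrorManager`, the function `process_result`, the limit `MAX_FAILURES`, and the dict-based stand-ins for the DNS database and the hostname hash table are assumptions; the embodiment does not state how many failures place a hostname on the blacklist.

```python
from collections import Counter

MAX_FAILURES = 3  # assumed limit; not specified in the embodiment


class DnsErrorManager:
    """Hypothetical DNS error management module."""

    def __init__(self):
        self.failures = Counter()   # hostname -> number of failed requests
        self.by_category = {}       # error category -> set of hostnames
        self.blacklist = set()      # hostnames that failed repeatedly

    def record_failure(self, hostname, category):
        self.failures[hostname] += 1
        self.by_category.setdefault(category, set()).add(hostname)
        if self.failures[hostname] >= MAX_FAILURES:
            self.blacklist.add(hostname)


def process_result(hostname, ip, category, dns_db, host_table, errors):
    """Store a successful result, or hand a failed one to the error manager."""
    if ip is not None:
        dns_db[hostname] = ip
        host_table[hostname] = True
    else:
        host_table[hostname] = False
        errors.record_failure(hostname, category or "unknown")
```

Once a hostname is on `errors.blacklist`, the check in step 105 would cause it to be discarded instead of being requested again.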
Step 108: The DNS request console sends the URLs whose DNS requests succeeded to the web page acquisition console;
Step 109: Meanwhile, the web page acquisition console fetches the page according to the hostname in the URL and the DNS request result, for example the IP address corresponding to the hostname;
The web page acquisition console maintains a URL hash table for determining whether the page specified by a URL has been fetched successfully. The URL hash table records whether the page specified by each URL has been fetched.
Step 109 is carried out concurrently with steps 104 to 108 above; that is, while the DNS request console is carrying out DNS requests one by one, the web page acquisition console is also fetching pages for the URLs whose DNS requests have succeeded.
The web page acquisition console fetches pages according to the hostname as follows:
1. The web page acquisition console queries the DNS database for the DNS result corresponding to the URL, using the hostname of the URL;
If the DNS result for the hostname cannot be found in the DNS database, the DNS result for this hostname has expired and a new request must be initiated. In this case the URL can be returned to the URL queue and its hostname sent to the DNS expiry management unit, and the hostname's entry in the hostname hash table is updated to indicate that the database currently holds no correct DNS result for this hostname and that the DNS server must be queried again to obtain one.
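The lookup and the handling of an expired result can be sketched in Python as follows. The function name `lookup_or_requeue` and the plain dict and list stand-ins for the DNS database, the hostname hash table, and the URL queue are hypothetical.

```python
from urllib.parse import urlparse


def lookup_or_requeue(url, dns_db, host_table, url_queue):
    """Return the stored IP for the URL's hostname, or None if it must be re-resolved.

    dns_db:     hypothetical dict, hostname -> IP address
    host_table: hypothetical dict, hostname -> True if a valid result exists
    url_queue:  hypothetical list of URLs waiting to be processed again
    """
    hostname = urlparse(url).hostname
    ip = dns_db.get(hostname)
    if ip is None:
        # The result has expired: put the URL back and mark the hostname
        # as having no correct DNS result, so that step 104 sends it on
        # for a new DNS request.
        url_queue.append(url)
        host_table[hostname] = False
    return ip
```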
2. The web page acquisition console distributes the URL and the DNS result found for its hostname to the web page acquisition sub-unit hosts;
The web page acquisition console can distribute tasks to the web page acquisition sub-unit hosts by either of the following two methods:
First method: distribute URLs and their corresponding DNS results according to the system resource usage of the web page acquisition sub-unit hosts;
The web page acquisition console can monitor the memory, thread count, network usage, and so on of its subordinate sub-units. When the system resource usage of a web page acquisition sub-unit host is high, no task is assigned to that host; when the host has sufficient free resources, the corresponding task is assigned;
Second method: take the hash value of the URL modulo the number of hosts, and distribute tasks according to the result;
The web page acquisition console hashes each URL once and takes the hash value modulo N, that is, num = (Hash mod N), where N is the number of web page acquisition sub-unit hosts. The resulting num is the number of a web page acquisition sub-unit host, and the URL is assigned to the num-th web page acquisition sub-unit host for fetching.
3. The web page acquisition sub-unit hosts fetch the resource specified by the URL.
The number of web page acquisition sub-unit hosts can be determined according to the scale of the system and is typically three. These hosts are mainly responsible for initiating HTTP GET requests, obtaining the page HTML code, and sending successfully obtained page HTML code to both the HTML code storage unit and the HTML parsing unit. Various errors may also occur while obtaining page HTML code; in that case, the information of the failed URL must be sent to the URL hash table and the URL's entry changed to "not successfully fetched", so that the URL is not mistaken for successfully fetched in later processing.
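A fetch that records its outcome in the URL hash table might be sketched in Python as follows. The names `http_get` and `fetch_and_record` and the dict used as the URL hash table are hypothetical. For brevity, this sketch lets the standard library resolve the hostname itself instead of reusing the stored DNS result, which differs from the embodiment.

```python
import urllib.error
import urllib.request


def http_get(url, timeout=10.0):
    """Issue an HTTP GET request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_and_record(url, url_table, fetch=http_get):
    """Fetch a page and record in url_table whether the fetch succeeded.

    url_table: hypothetical dict, URL -> True if the page was fetched.
    Returns the HTML text, or None if an error occurred.
    """
    try:
        html = fetch(url)
    except (urllib.error.URLError, OSError, ValueError):
        # Mark the URL as not fetched so it is not mistaken for a success.
        url_table[url] = False
        return None
    url_table[url] = True
    return html
```

On success, the returned HTML would be passed to both the HTML code storage unit and the HTML parsing unit.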
Step 110: The HTML code storage unit saves the HTML code;
The HTML code storage unit saves the successfully obtained page HTML code and the corresponding URL information in the HTML database in a suitable manner and builds an appropriate index for later queries.
Step 111: The HTML parsing unit parses the HTML code and extracts URLs;
The HTML parsing unit extracts information such as hyperlinks from the successfully obtained page HTML code and, after verifying the correctness of the extracted results, saves them in the URL database. During extraction, URL information can be extracted from tags in the HTML code, such as the href attribute of the <A> tag and the location attribute of the <AREA> tag. These URLs are then verified to ensure that they meet the system's fetching requirements, mainly by determining whether the ending of the hostname and the suffix of the file name are legal. If a URL is verified as legal, it is saved; if not, it is discarded.
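Extraction and verification can be sketched with Python's standard `html.parser` module. The allowed hostname endings and file suffixes below are invented for illustration, since the embodiment does not list them, and the sketch assumes that the link target of both <A> and <AREA> tags is carried in the href attribute. The names `LinkExtractor`, `is_valid`, and `extract_urls` are hypothetical.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed rules; the embodiment does not list the legal endings or suffixes.
ALLOWED_HOST_ENDINGS = (".com", ".org", ".net", ".cn")
ALLOWED_SUFFIXES = {"", ".html", ".htm", ".php", ".jsp"}


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a> and <area> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def is_valid(url):
    """Check the hostname ending and the file suffix against the assumed rules."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    if not parts.hostname.endswith(ALLOWED_HOST_ENDINGS):
        return False
    suffix = posixpath.splitext(parts.path)[1].lower()
    return suffix in ALLOWED_SUFFIXES


def extract_urls(html, base_url):
    """Return the legal URLs found in the page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return [url for url in parser.links if is_valid(url)]
```

Relative links are resolved against the URL of the page they were found on before verification, so that the hostname check applies to every extracted link.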
Step 112: The URL database saves the extracted URL information.
In the web page collection method provided by this embodiment of the present invention, the functional units neither interfere nor conflict with one another. While page HTML code is being obtained for URLs whose DNS requests have succeeded, DNS requests continue to be initiated, which keeps the system running at high speed and efficiency.
The present invention also provides another embodiment of the web page collection method, which specifically comprises:
Obtain URLs one by one from the URL database, and obtain the corresponding hostname from each URL;
Carry out a Domain Name System (DNS) request according to the hostname;
While the DNS requests are being carried out, fetch pages for the URLs whose DNS requests have succeeded.
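The three steps above amount to a two-stage pipeline in which DNS requests and page fetching run at the same time. A minimal Python sketch using two threads and a queue is given below; the function `run_pipeline` and its `resolve` and `fetch` parameters are hypothetical, and a real system would use several hosts per stage instead of one thread each.

```python
import queue
import threading
from urllib.parse import urlparse


def run_pipeline(urls, resolve, fetch):
    """Resolve and fetch concurrently; return a dict of URL -> page content.

    resolve(hostname) returns an IP address or raises OSError.
    fetch(url, ip) returns the page content.
    """
    resolved = queue.Queue()   # URLs whose DNS request succeeded
    pages = {}
    done = object()            # sentinel: no more URLs will arrive

    def dns_stage():
        try:
            for url in urls:
                try:
                    ip = resolve(urlparse(url).hostname)
                except OSError:
                    continue   # failed requests are not passed on
                resolved.put((url, ip))
        finally:
            resolved.put(done)

    def fetch_stage():
        # Starts fetching as soon as the first URL has been resolved.
        while True:
            item = resolved.get()
            if item is done:
                break
            url, ip = item
            pages[url] = fetch(url, ip)

    threads = [threading.Thread(target=dns_stage),
               threading.Thread(target=fetch_stage)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return pages
```

The queue decouples the two stages: the fetch stage never waits for the whole list of URLs to be resolved, only for the next resolved URL.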
Referring to Fig. 2, which is a structural diagram of an embodiment of the web page collection system of the present invention, the web page collection system comprises a URL input console 21, a DNS request processing device 22, and a web page acquisition processing device 23.
The URL input console 21 is configured to obtain URLs one by one from the URL database and to parse the corresponding hostname from each obtained URL;
The DNS request processing device 22 is configured to carry out DNS requests according to the hostname and to send the URLs whose DNS requests succeeded to the web page acquisition processing device 23;
The web page acquisition processing device 23 is configured to fetch pages, while the DNS request processing device 22 is carrying out DNS requests, for the URLs with successful DNS requests received from the DNS request processing device 22.
The DNS request processing device 22 further comprises:
A DNS request console 221, configured to distribute DNS request tasks according to the hostname and to send the URLs whose DNS requests succeeded to the URL input console;
The DNS request console 221 also maintains a hostname hash table 2211, which stores the DNS request result corresponding to each hostname. The DNS request console 221 matches a hostname against the hostname hash table (Host Name Hash Table) it maintains and determines whether a DNS request has already been carried out successfully for the hostname.
DNS sub-unit hosts 222, configured to carry out DNS requests according to the DNS request tasks distributed by the DNS request console 221.
The DNS request processing device 22 may further comprise:
A DNS result processing unit 223, configured to analyze and manage the DNS request results returned by the DNS sub-unit hosts 222.
The DNS request processing device 22 further comprises:
A DNS database 224, which stores the DNS request results that the DNS result processing unit 223 has determined to be successful;
A DNS error management unit 225, which stores the DNS request results that the DNS result processing unit 223 has determined to be failed.
The DNS request processing device 22 further comprises:
A DNS expiry management unit 226, which periodically updates the DNS request results in the DNS database 224 and deletes expired DNS request results.
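The periodic deletion of expired results could be sketched in Python as follows. The storage layout (hostname mapped to a pair of IP address and storage time), the function name `purge_expired`, and the notion of a fixed maximum age are assumptions; the embodiment does not state how expiry is determined.

```python
import time


def purge_expired(dns_db, max_age_seconds, now=None):
    """Delete entries older than max_age_seconds and return their hostnames.

    dns_db: hypothetical dict, hostname -> (ip, stored_at), where stored_at
            is a timestamp in seconds.
    """
    now = time.time() if now is None else now
    expired = [hostname for hostname, (_ip, stored_at) in dns_db.items()
               if now - stored_at > max_age_seconds]
    for hostname in expired:
        del dns_db[hostname]
    return expired
```

The returned hostnames could then be marked in the hostname hash table as having no valid DNS result, so that they are requested again.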
The web page acquisition processing device 23 further comprises:
A web page acquisition console 231, configured to query the DNS result corresponding to a URL according to the hostname and to distribute the URL and the corresponding DNS result;
The web page acquisition console 231 maintains a URL hash table 2311 for determining whether the page specified by a URL has been fetched successfully; the URL hash table records whether the page specified by each URL has been fetched.
Web page acquisition sub-unit hosts 232, configured to obtain page HTML code according to the URL and corresponding DNS result distributed by the web page acquisition console 231.
The web page acquisition processing device 23 further comprises:
An HTML code storage unit 233, configured to save the page HTML code obtained by the web page acquisition sub-unit hosts 232;
An HTML database 234, configured to store the correct HTML code among the page HTML code obtained by the web page acquisition sub-unit hosts 232;
An HTML parsing unit 235, configured to parse the page HTML code obtained by the web page acquisition sub-unit hosts 232 and to extract URLs;
A URL database 236, configured to store the URL information extracted by the HTML parsing unit 235.
With the web page collection method and system provided by the embodiments of the present invention, DNS requests and page fetching are carried out separately and concurrently. DNS requests therefore continue while page code is being obtained, which improves the operating efficiency of web page collection.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc.
The web page collection method and system provided by the present invention have been described in detail above. Specific examples have been used herein to set out the principles and embodiments of the invention, and the description of the above embodiments is intended only to aid understanding of the disclosed technical solutions. Meanwhile, a person of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.