CN101303700B - Method and system for collecting web page - Google Patents

Method and system for collecting web page

Info

Publication number
CN101303700B
Authority
CN
China
Prior art keywords
dns
url
dns request
obtains
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2008101112988A
Other languages
Chinese (zh)
Other versions
CN101303700A (en)
Inventor
辛阳
雷宇
李娜
刘利锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Digital Technologies Chengdu Co Ltd
Original Assignee
Huawei Symantec Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Symantec Technologies Co Ltd filed Critical Huawei Symantec Technologies Co Ltd
Priority to CN2008101112988A priority Critical patent/CN101303700B/en
Publication of CN101303700A publication Critical patent/CN101303700A/en
Application granted granted Critical
Publication of CN101303700B publication Critical patent/CN101303700B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of the invention provide a web page collection method and system. The method includes: obtaining URLs one by one from a URL database and deriving the corresponding host name from each URL; issuing a Domain Name System (DNS) request for the host name; and, while DNS requests are being issued, fetching pages for the URLs whose DNS requests have succeeded. With this technical scheme, DNS requests and page fetching run separately and at the same time, so DNS requests continue without interruption while page code is being fetched, which improves the operating efficiency of web page collection.

Description

Method and system for collecting web pages
Technical field
The present invention relates to the field of network communication technology, and in particular to a method and system for collecting web pages.
Background art
Web page collection is an essential basic function of systems such as web search engines, URL classification systems, and data mining systems. A powerful and complete web page collection function is the foundation that allows such systems to provide comprehensive, rich, and accurate information.
According to incomplete estimates, there are currently several hundred million web pages and nearly one hundred million websites worldwide. Indexing web data of this magnitude and keeping it up to date is a very difficult task. In recent years, web search engines in China and abroad have become widely used tools for finding information online, and most of them are backed by a powerful web page collection system (WC, Web Crawler). Because pages from every website on the Internet must be fetched and analyzed, a large amount of data has to be collected before full and accurate search results can be provided. A web page collection system generally takes a specified Uniform Resource Locator (URL) as its entry point and obtains the HyperText Markup Language (HTML) code of that page through a HyperText Transfer Protocol (HTTP) request. It then extracts information such as the hyperlinks in the page to obtain more URLs, takes the extracted URLs as new targets, and fetches the network resources they specify. By repeating this process, the system continuously fetches and indexes web page code. To improve efficiency, the system should as far as possible avoid fetching the same page twice, which reduces resource consumption. Fetching a web page can be divided into a Domain Name System (DNS) request part and a page code fetching part: once the host in the URL has been obtained, a DNS request can be issued, and after a correct answer is received, the resource that the URL points to can be fetched over HTTP.
At present there are many distributed web page collection systems. A commonly used one is the distributed meta-collection system, which has multiple individual web page collection engines and a central engine that combines the results of these distributed engines into a final result. Such a system requires the collection engine of each unit to use the same ranking algorithm and essentially the same data output structure so that the central engine can consolidate the results. This places very high demands on the load capacity of the central engine, and large-scale concurrency cannot be handled well. In addition, the update efficiency of the central engine is low, and the information sources of the individual meta engines tend to be unstable or too narrow, which reduces overall indexing efficiency.
In the course of implementing the invention, the inventors found that the prior art has at least the following problem: in the web page collection schemes of the prior art, the update efficiency of the central engine is low and the information sources of the individual meta engines tend to be unstable or too narrow, which reduces overall indexing efficiency, so web page collection efficiency is low.
Summary of the invention
The embodiments of the invention provide a method and system for collecting web pages that can improve the operating efficiency of web page collection.
An embodiment of the invention provides a method for collecting web pages, comprising:
obtaining URLs one by one from a URL database, and obtaining the corresponding host name from each URL;
issuing a DNS request according to the host name and managing the results of the DNS request, where the managing includes saving successful DNS request results in a DNS database;
while DNS requests are being issued, fetching pages according to the URLs in the DNS database whose DNS requests succeeded.
An embodiment of the invention also provides a system for collecting web pages, comprising:
a URL input console, configured to obtain URLs one by one from a URL database and to parse the corresponding host name from each URL;
a DNS request processing device, configured to issue a DNS request according to the host name and manage the results of the DNS request, where the managing includes saving successful DNS request results in a DNS database, and to send the URLs in the DNS database whose DNS requests succeeded;
a web page fetching processing device, configured to fetch pages, while the DNS request processing device is issuing DNS requests, according to the URLs with successful DNS requests received from the DNS request processing device.
With the method and system provided by the embodiments of the invention, DNS requests and page fetching are carried out separately and at the same time, so DNS requests continue while page code is being fetched, which improves the operating efficiency of web page collection.
Description of the drawings
To explain the technical schemes of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of the web page collection method of the invention;
Fig. 2 is a structural diagram of an embodiment of the web page collection system of the invention.
Detailed description of the embodiments
To make the purpose, technical scheme, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to explain the invention and do not limit it.
Referring to Fig. 1, which is a flowchart of an embodiment of the web page collection method of the invention, the detailed process includes:
Step 101: The entry URLs for the system run are set and saved in the URL database;
Step 102: The URL input console obtains URLs one by one from the URL database and parses the corresponding host name from each URL;
Step 103: The URL input console sends the host name to the DNS request console;
Step 104: The DNS request console matches the received host name against the host name hash table (Host Name Hash Table) that it maintains to determine whether a DNS request for the host name has already succeeded. If so, go to step 108; if not, go to step 105;
The host name hash table stores the DNS request result for each host name. A successful match therefore means that the DNS database already holds the IP information for this host name, and the URL can be sent to the web page fetching console for fetching. An unsuccessful match means that the host name has not yet been requested, or that it was requested once but an error occurred, and so on.
Step 105: The DNS request console sends the host name to the DNS result processing unit (DNS result collection), where it is matched against the blacklist maintained there. If the match succeeds, go to step 106; if not, go to step 107;
Step 106: The host name is discarded;
Step 107: It is confirmed that the host name requires a DNS request. The DNS request console sends the host name to a DNS sub-unit host to carry out the DNS request, and the URL corresponding to the host name is returned to the URL database to wait to be obtained next time.
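The decision flow of steps 102 to 107 can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name `route_url`, the use of a plain dictionary for the host name hash table, and the use of a set for the blacklist are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def route_url(url, resolved_hosts, blacklist):
    """Decide what to do with a URL, following steps 102 to 107.

    resolved_hosts: hypothetical stand-in for the host name hash table
                    (host name -> IP of a successful DNS request).
    blacklist:      hypothetical set of host names that repeatedly failed.
    Returns (action, host) where action is "fetch", "discard" or "resolve".
    """
    host = urlsplit(url).hostname      # step 102: parse the host name
    if host is None:
        return "discard", None         # assumption: URLs without a host are dropped
    if host in resolved_hosts:         # step 104: hash table match
        return "fetch", host           # step 108: hand over for page fetching
    if host in blacklist:              # step 105: blacklist match
        return "discard", host         # step 106
    return "resolve", host             # step 107: DNS request needed, URL is re-queued
```

A caller would put URLs with the action "resolve" back into the URL database while the host name goes to a DNS sub-unit host.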
The detailed process by which the DNS request console carries out a DNS request is as follows:
1. The DNS request console distributes DNS request tasks to the DNS sub-unit hosts (ADNS, Asynchronous Domain Name System);
The DNS request console contains a task queue. When the URL input console sends a host name to the DNS request console, the host name is inserted into this queue. The DNS request console is responsible for distributing the host names in the queue to its subordinate DNS sub-unit hosts, so this module only performs scheduling and does not issue DNS requests directly. Tasks can be distributed appropriately by either of the following two methods:
First method: distribute DNS request tasks according to the system resource occupancy of the DNS sub-unit hosts;
The DNS request console can monitor the CPU, memory, thread count, network occupancy, and so on of its subordinate DNS sub-unit hosts. When the system resource occupancy of a DNS sub-unit host is high, no DNS request task is assigned to it; when its available resources are sufficient to issue a DNS request, the corresponding task is assigned to it.
Second method: perform a modulo operation on the hash value of the host name, and distribute tasks according to the result.
The DNS request console applies a hash operation to each host name and takes the hash value modulo N, that is, num = (Hash mod N), where N is the number of DNS sub-unit hosts. The resulting num is the number of a DNS sub-unit host, and the host name is assigned to DNS sub-unit host number num to carry out the DNS request.
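The second method, num = (Hash mod N), might look like the sketch below. The patent does not name a hash function, so SHA-256 is an assumption here; it is chosen because Python's built-in `hash()` for strings varies between processes and would not give a stable assignment. The function name `assign_host_index` is hypothetical.

```python
import hashlib


def assign_host_index(host_name, n_hosts):
    """Map a host name to a sub-unit host number in the range 0..n_hosts-1."""
    # Assumption: any stable hash works; SHA-256 is used only for illustration.
    digest = hashlib.sha256(host_name.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    return hash_value % n_hosts  # num = (Hash mod N)
```

Because the mapping depends only on the host name and N, the same host name always goes to the same sub-unit host as long as N does not change.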
2. The DNS sub-unit host carries out the DNS request and sends the result of the DNS request to the DNS result processing unit;
The number of DNS sub-unit hosts can be determined by the size of the system and is generally three. The ADNS library is installed on each host, and the program that issues DNS requests is designed to call the interfaces of this library. Because the DNS requests are asynchronous, after issuing one request a host can issue the next DNS request without waiting for the result to return. This achieves multithreaded DNS requests and improves the efficiency of obtaining DNS results.
3. The DNS result processing unit manages the results of the DNS requests.
The DNS result processing unit analyzes the results returned by each DNS sub-unit host. Successful DNS request results are saved in the DNS database; DNS requests that failed for whatever reason are sent to the DNS error management unit and are used to update the host name hash table;
The DNS result processing unit can also be extended to analyze and evaluate correct DNS results, assign different priorities to them, and store them in the database. The DNS error management module (Manage DNS error Host) can also classify host names that did not return a correct DNS result by error type and maintain a blacklist of host names that have repeatedly failed to return a correct result, which prevents a host name from being requested again after repeated requests have all failed.
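The asynchronous request and result management described above can be sketched roughly as follows. The patent uses the ADNS library; this sketch substitutes Python's `asyncio` and the event loop's `getaddrinfo` as a stand-in. The names `resolve_batch`, `system_resolve`, and `MAX_FAILURES`, the dictionaries used as DNS database and failure counter, and the threshold of three failures are all assumptions.

```python
import asyncio
import socket

MAX_FAILURES = 3  # assumption: the text only says "repeatedly"


async def system_resolve(host):
    """Resolve a host name without blocking the event loop."""
    loop = asyncio.get_running_loop()
    infos = await loop.getaddrinfo(host, 80, type=socket.SOCK_STREAM)
    return infos[0][4][0]  # IP address of the first result


async def resolve_batch(hosts, dns_db, failures, blacklist, resolve=system_resolve):
    """Issue all DNS requests at once, then sort the results.

    Successful results go to dns_db; failures are counted, and a host
    that has failed MAX_FAILURES times is added to the blacklist.
    """
    results = await asyncio.gather(
        *(resolve(h) for h in hosts), return_exceptions=True
    )
    for host, result in zip(hosts, results):
        if isinstance(result, Exception):
            failures[host] = failures.get(host, 0) + 1
            if failures[host] >= MAX_FAILURES:
                blacklist.add(host)
        else:
            dns_db[host] = result
```

Passing the resolver as a parameter keeps the result handling independent of the lookup mechanism, so a different asynchronous DNS client could be swapped in.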
Step 108: The DNS request console sends the URLs corresponding to successful DNS requests to the web page fetching console;
Step 109: Meanwhile, the web page fetching console fetches the page according to the host name in the URL and the DNS request result, for example the IP address corresponding to the host name;
The web page fetching console maintains a URL hash table to determine whether the page specified by a URL has been fetched successfully. The URL hash table records the fetch status of the page specified by each URL.
Step 109 is carried out at the same time as steps 104 to 108 above; that is, while the DNS request console carries out DNS requests one by one, the web page fetching console fetches pages according to the URLs whose DNS requests succeeded.
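The simultaneous operation of DNS requests and page fetching can be illustrated with two threads connected by a queue. This is a minimal single-machine sketch under stated assumptions, not the distributed arrangement of consoles and sub-unit hosts in the patent; `run_pipeline`, `resolve`, and `fetch` are hypothetical names, and the resolver and fetcher are supplied by the caller.

```python
import queue
import threading


def run_pipeline(urls, resolve, fetch):
    """Run DNS resolution and page fetching concurrently.

    urls:    iterable of (url, host) pairs.
    resolve: callable host -> IP, raising OSError on failure (assumed).
    fetch:   callable (url, ip) -> page content (assumed).
    """
    ready = queue.Queue()  # URLs whose DNS request succeeded
    pages = {}
    done = object()        # sentinel marking the end of the stream

    def dns_worker():
        for url, host in urls:
            try:
                ip = resolve(host)
            except OSError:
                continue   # failed requests are not forwarded
            ready.put((url, ip))
        ready.put(done)

    def fetch_worker():
        while True:
            item = ready.get()
            if item is done:
                break
            url, ip = item
            pages[url] = fetch(url, ip)

    workers = [threading.Thread(target=dns_worker),
               threading.Thread(target=fetch_worker)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return pages
```

The fetching thread starts work as soon as the first resolved URL arrives, while the DNS thread keeps resolving the remaining host names.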
The detailed process by which the web page fetching console fetches a page according to the host name includes:
1. The web page fetching console queries the DNS database for the DNS result corresponding to the URL, using the host name of the URL;
If the DNS result for the host name cannot be found in the DNS database, the DNS result for that host name has expired and a new request is needed. In this case the URL is returned to the URL queue and the host name of the URL is sent to the DNS expiry management unit. The host name is updated in the host name hash table to indicate that it currently has no correct DNS result in the database and that the DNS server must be queried again to obtain one.
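The lookup and the handling of a missing or expired result could be sketched as below. The patent does not say how expiry is tracked; storing an expiry timestamp next to each IP address is an assumption, as are the names `lookup_dns` and `dispatch` and the use of a list and a set for the URL queue and host name hash table.

```python
import time


def lookup_dns(host, dns_db, now=None):
    """Return the stored IP for host, or None if missing or expired.

    dns_db maps host name -> (ip, expires_at); the timestamp is an assumption.
    """
    now = time.time() if now is None else now
    entry = dns_db.get(host)
    if entry is None:
        return None
    ip, expires_at = entry
    if expires_at <= now:
        del dns_db[host]  # delete the expired result
        return None
    return ip


def dispatch(url, host, dns_db, url_queue, resolved_hosts, now=None):
    """Return (url, ip) for fetching, or re-queue the URL if no result exists."""
    ip = lookup_dns(host, dns_db, now)
    if ip is None:
        url_queue.append(url)         # return the URL to the URL queue
        resolved_hosts.discard(host)  # host name needs a new DNS request
        return None
    return url, ip
```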
2. The web page fetching console distributes the URL and the queried DNS result for its host name to a web page fetching sub-unit host;
The web page fetching console can distribute tasks to the web page fetching sub-unit hosts by either of the following two methods:
First method: distribute the URL and the corresponding DNS result according to the system resource occupancy of the web page fetching sub-unit hosts;
The web page fetching console can monitor the memory, thread count, network occupancy, and so on of its subordinate sub-units. When the system resource occupancy of a web page fetching sub-unit host is high, no task is assigned to it; when its available resources are sufficient, the corresponding task is assigned;
Second method: perform a modulo operation on the hash value of the URL, and distribute tasks according to the result.
The web page fetching console applies a hash operation to each URL and takes the hash value modulo N, that is, num = (Hash mod N), where N is the number of web page fetching sub-unit hosts. The resulting num is the number of a web page fetching sub-unit host, and the URL is assigned to web page fetching sub-unit host number num for fetching.
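The first method, allocation by resource occupancy, might be sketched like this. The text only says that a heavily loaded host receives no task; choosing the least loaded host below a threshold, the single combined occupancy figure, the threshold value of 0.8, and the name `pick_least_loaded` are all assumptions for illustration.

```python
def pick_least_loaded(occupancy, threshold=0.8):
    """Pick a sub-unit host for the next task.

    occupancy: mapping of host id -> occupancy in the range 0.0 to 1.0
               (a combined figure for memory, threads, network; assumed).
    Returns the least loaded host below the threshold, or None if every
    host is too busy, in which case the task stays in the queue.
    """
    candidates = {h: load for h, load in occupancy.items() if load < threshold}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)
```

For example, with occupancies of 0.9, 0.4, and 0.6 for three hosts, the second host would receive the task.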
3. The web page fetching sub-unit host fetches the resource specified by the URL.
The number of web page fetching sub-unit hosts can be determined by the size of the system and is generally three. These hosts are mainly responsible for issuing HTTP GET requests, obtaining the page HTML code, and sending successfully fetched page HTML code to both the HTML code storage unit and the HTML parsing unit. Various errors can occur while fetching page HTML code. When one does, the information about the failing URL must be sent to the URL hash table, and the status of that URL is not changed to successfully fetched, which ensures that the URL is not mistaken for a successfully fetched one in later processing.
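The bookkeeping around a fetch can be sketched as follows. The HTTP client is passed in as a parameter so the sketch stays independent of any particular library; `fetch_and_record`, `http_get`, and the boolean status values in the URL hash table are assumptions.

```python
def fetch_and_record(url, http_get, url_table):
    """Fetch a page and record the outcome in the URL hash table.

    http_get:  callable url -> HTML text, raising OSError on failure (assumed);
               for instance a thin wrapper around urllib.request.urlopen.
    url_table: dict url -> True if fetched successfully, False otherwise.
    Returns the HTML on success, or None on failure.
    """
    try:
        html = http_get(url)
    except OSError:
        url_table[url] = False  # never marked as successfully fetched
        return None
    url_table[url] = True
    return html
```

On success the returned HTML would be passed on to both the storage unit and the parsing unit.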
Step 110: The HTML code storage unit saves the HTML code;
The HTML code storage unit saves the successfully fetched page HTML code and the corresponding URL information in the HTML database in a suitable form and builds a suitable index for later queries.
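As one possible reading of "a suitable form" and "a suitable index", the sketch below stores pages in SQLite keyed by URL. The patent does not specify a database or schema; the table name `pages`, its columns, and the helper names are assumptions. Declaring the URL as primary key gives an index on it automatically.

```python
import sqlite3


def open_html_db(path=":memory:"):
    """Open the (hypothetical) HTML database and create the table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, html TEXT NOT NULL)"
    )
    return conn


def save_page(conn, url, html):
    """Store or replace the HTML code for a URL."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)",
            (url, html),
        )


def load_page(conn, url):
    """Return the stored HTML for a URL, or None if it is not present."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None
```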
Step 111: The HTML parsing unit parses the HTML code and extracts URLs;
The HTML parsing unit extracts information such as hyperlinks from the successfully fetched page HTML code, checks the extracted results for correctness, and saves them in the URL database. During extraction, URL information can be taken from tags in the HTML code, such as the href attribute of the <A> tag and the location attribute of the <AREA> tag. These URLs are then validated to ensure that they meet the system's fetching requirements, mainly by checking whether the ending of the host name and the suffix of the file name are legal. If the URL is found to be legal it is saved; if not, it is discarded.
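Extraction and validation could be sketched with Python's standard HTML parser as shown below. Several details are assumptions: the sketch reads the `href` attribute of both tags, the lists of allowed host name endings and rejected file suffixes are hypothetical policies, and relative links are resolved against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

ALLOWED_HOST_ENDINGS = (".com", ".org", ".net", ".cn")       # hypothetical policy
BLOCKED_SUFFIXES = (".jpg", ".png", ".gif", ".zip", ".exe")  # hypothetical policy


class LinkExtractor(HTMLParser):
    """Collect link targets from <a> and <area> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def is_acceptable(url):
    """Check the scheme, the host name ending, and the file name suffix."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return (
        parts.scheme in ("http", "https")
        and host.endswith(ALLOWED_HOST_ENDINGS)
        and not parts.path.lower().endswith(BLOCKED_SUFFIXES)
    )


def extract_urls(html, base_url):
    """Return the acceptable URLs found in the page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return [u for u in parser.links if is_acceptable(u)]
```

URLs that pass the check would be saved in the URL database; the rest are discarded.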
Step 112: The URL database saves the extracted URL information.
In the web page collection method provided by this embodiment, the functional units neither interfere nor conflict with one another. While page HTML code is being fetched according to the URLs whose DNS requests succeeded, DNS requests continue to be issued, which keeps the system running at high speed and high efficiency.
The invention also provides another embodiment of the web page collection method, which specifically includes:
obtaining URLs one by one from a URL database, and obtaining the corresponding host name from each URL;
issuing a Domain Name System (DNS) request according to the host name;
while DNS requests are being issued, fetching pages according to the URLs whose DNS requests succeeded.
Referring to Fig. 2, which is a structural diagram of an embodiment of the web page collection system of the invention, the system includes a URL input console 21, a DNS request processing device 22, and a web page fetching processing device 23.
The URL input console 21 is configured to obtain URLs one by one from the URL database and to parse the corresponding host name from each obtained URL;
The DNS request processing device 22 is configured to issue DNS requests according to the host name and to send the URLs whose DNS requests succeeded to the web page fetching processing device 23;
The web page fetching processing device 23 is configured to fetch pages, while the DNS request processing device 22 is issuing DNS requests, according to the URLs with successful DNS requests received from the DNS request processing device 22.
The DNS request processing device 22 further includes:
a DNS request console 221, configured to distribute DNS request tasks according to the host name and to send the URLs whose DNS requests succeeded to the URL input console;
The DNS request console 221 also maintains a host name hash table 2211, which stores the DNS request result for each host name. The DNS request console 221 matches a host name against the host name hash table (Host Name Hash Table) that it maintains to determine whether a DNS request for the host name has already succeeded.
a DNS sub-unit host 222, configured to carry out DNS requests according to the DNS request tasks distributed by the DNS request console 221.
The DNS request processing device 22 may also include:
a DNS result processing unit 223, configured to analyze and manage the DNS request results returned by the DNS sub-unit host 222.
The DNS request processing device 22 also includes:
a DNS database 224, which stores the DNS request results that the DNS result processing unit 223 has determined to be successful;
a DNS error management unit 225, which stores the DNS request results that the DNS result processing unit 223 has determined to have failed.
The DNS request processing device 22 also includes:
a DNS expiry management unit 226, which regularly updates the DNS request results in the DNS database 224 and deletes expired DNS request results.
The web page fetching processing device 23 further includes:
a web page fetching console 231, configured to query the DNS result corresponding to a URL according to the host name and to distribute the URL and the corresponding DNS result;
The web page fetching console 231 maintains a URL hash table 2311 to determine whether the page specified by the URL has been fetched successfully; the URL hash table records the fetch status of the page specified by each URL.
a web page fetching sub-unit host 232, configured to fetch page HTML code according to the URL and corresponding DNS result distributed by the web page fetching console 231.
The web page fetching processing device 23 also includes:
an HTML code storage unit 233, configured to save the page HTML code fetched by the web page fetching sub-unit host 232;
an HTML database 234, configured to store the correct HTML code among the page HTML code fetched by the web page fetching sub-unit host 232;
an HTML parsing unit 235, configured to parse the page HTML code fetched by the web page fetching sub-unit host 232 and extract URLs;
a URL database 236, configured to store the URL information extracted by the HTML parsing unit 235.
With the web page collection method and system provided by the embodiments of the invention, DNS requests and page fetching are carried out separately and at the same time, so DNS requests continue while page code is being fetched, which improves the operating efficiency of web page collection.
A person of ordinary skill in the art will understand that all or some of the steps of the method in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium such as a ROM/RAM, magnetic disk, or optical disc.
The web page collection method and system provided by the invention have been described in detail above. Specific examples are used here to explain the principle and embodiments of the invention, and the description of the above embodiments is intended only to help in understanding the disclosed technical scheme. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the invention.

Claims (17)

1. A method for collecting web pages, characterized by comprising:
obtaining URLs one by one from a URL database, and obtaining the corresponding host name from each URL;
issuing a Domain Name System (DNS) request according to the host name and managing the results of the DNS request, wherein the managing comprises saving successful DNS request results in a DNS database;
while DNS requests are being issued, fetching pages according to the URLs in the DNS database whose DNS requests succeeded.
2. The method for collecting web pages according to claim 1, characterized in that, before the DNS request is issued according to the host name, the method further comprises:
a DNS request console determining whether a DNS request for the host name has already succeeded; if so, sending the URL corresponding to the successful DNS request to a web page fetching processing device; if not, issuing the DNS request according to the host name.
3. The method for collecting web pages according to claim 1, characterized in that, after the DNS request is issued according to the host name, the method further comprises:
storing the successful DNS request results.
4. The method for collecting web pages according to claim 1, characterized in that issuing the DNS request according to the host name specifically comprises:
a DNS request console distributing DNS request tasks to DNS sub-unit hosts;
the DNS sub-unit hosts carrying out the DNS requests;
a DNS result processing unit analyzing and managing the results of the DNS requests.
5. The method for collecting web pages according to claim 4, characterized in that the DNS request console distributing DNS request tasks to the DNS sub-unit hosts specifically comprises:
distributing DNS request tasks according to the system resource occupancy of the DNS sub-unit hosts; or
performing a modulo operation on the hash value of the host name and distributing DNS request tasks according to the result.
6. The method for collecting web pages according to claim 4, characterized in that the DNS result processing unit analyzing and managing the results of the DNS requests is specifically:
analyzing the results returned by each DNS sub-unit host, saving successful DNS request results in the DNS database, and sending host names whose DNS requests failed to a DNS error management unit for handling.
7. The method for collecting web pages according to claim 1, characterized in that fetching pages according to the URLs whose DNS requests succeeded specifically comprises:
a web page fetching console querying the DNS database for the DNS result corresponding to the URL according to the host name;
the web page fetching console distributing the URL and the queried DNS result for the host name to a web page fetching sub-unit host;
the web page fetching sub-unit host fetching the page HTML code.
8. The method for collecting web pages according to claim 7, characterized in that, if the web page fetching console finds no DNS result corresponding to the URL according to the host name, the URL is returned to the URL queue, the host name of the URL is sent to a DNS expiry management unit, and the DNS server is queried again to obtain a DNS result.
9. The method for collecting web pages according to claim 7, characterized in that the web page fetching console distributing the URL and the queried DNS result for the host name to a web page fetching sub-unit host specifically comprises:
distributing the URL and the corresponding DNS result according to the system resource occupancy of the web page fetching sub-unit hosts; or
performing a modulo operation on the hash value of the URL and distributing the URL and the corresponding DNS result according to the result.
10. The method for collecting web pages according to claim 7, characterized in that, after the web page fetching sub-unit host fetches the page HTML code, the method further comprises:
an HTML code storage unit saving the HTML code;
an HTML parsing unit parsing the HTML code and extracting URLs;
the URL database saving the extracted URL information.
11. the system of a collecting web page is characterized in that, comprising:
URL imports control desk, is used for obtaining URL one by one from url database, and resolves corresponding host name according to URL;
DNS Request Processing device, be used for according to described host name, carry out the DNS request, the result that DNS is asked manages, described management comprises and will ask successful DNS request results to be kept in the DNS database, and sends the URL of the DNS request that request is successful in the described DNS database;
Webpage obtains treating apparatus, is used for when described DNS Request Processing device carries out the DNS request, and the successful URL of DNS request according to receiving from DNS Request Processing device carries out the page and obtains.
12. the system of collecting web page according to claim 11 is characterized in that, described DNS Request Processing device comprises:
DNS asks control desk, is used for the name according to main frame host, distributes DNS request task, and the concurrent uniform resource position mark URL of referring to the DNS request of the merit of hoping for success is imported control desk to URL;
DNS sub-element main frame is used for carrying out the DNS request according to the DNS request task that DNS request control desk distributes.
13. the system of collecting web page according to claim 12 is characterized in that, described DNS Request Processing device also comprises:
DNS result treatment unit is used for the DNS request results that DNS sub-element main frame returns is analyzed and managed.
14. the system of collecting web page according to claim 13 is characterized in that, described DNS Request Processing device also comprises:
The DNS database, preserving by DNS result treatment element analysis is the successful DNS request results of request;
DNS mismanage unit, handling by DNS result treatment element analysis is the DNS request results of request failure.
15. the system of collecting web page according to claim 14 is characterized in that, described DNS Request Processing device also comprises:
The expired administrative unit of DNS to the DNS request results regular update in the described DNS database, is deleted expired DNS request results.
16. The web page collection system according to claim 11, characterized in that the web page acquisition processing device comprises:
a web page acquisition console, configured to query the DNS database, according to the host name, for the DNS result corresponding to a uniform resource locator (URL), and to distribute the URL together with its corresponding DNS result;
a web page acquisition sub-element host, configured to obtain the page HTML code according to the URL and the corresponding DNS result distributed by the web page acquisition console.
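The console step in claim 16, pairing each URL with the DNS result already stored for its host, might be sketched as follows. The function name `plan_fetch_tasks` and the plain dict standing in for the DNS database are assumptions; holding back URLs without a stored result is likewise an illustrative choice, not something the claim specifies.

```python
from urllib.parse import urlsplit


def plan_fetch_tasks(urls, dns_records):
    """Pair each URL with the stored DNS result for its host.

    dns_records: dict mapping host name -> IP string (stand-in for the
    DNS database). Returns (tasks, pending), where tasks is a list of
    (url, ip) pairs ready to distribute and pending lists URLs whose
    host has no stored DNS result yet.
    """
    tasks, pending = [], []
    for url in urls:
        host = urlsplit(url).hostname
        ip = dns_records.get(host)
        if ip is None:
            pending.append(url)
        else:
            tasks.append((url, ip))
    return tasks, pending
```

Sending the resolved address along with the URL means the fetching host does not need to issue its own DNS request before connecting.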
17. The web page collection system according to claim 16, characterized in that the web page acquisition processing device further comprises:
an HTML code saving unit, configured to save the page HTML code obtained by the web page acquisition sub-element host;
an HTML database, configured to store the correct HTML code among the page HTML code obtained by the web page acquisition sub-element host;
an HTML parsing unit, configured to parse the page HTML code obtained by the web page acquisition sub-element host and extract URLs;
a URL database, configured to store the URL information extracted by the HTML parsing unit.
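The parsing step in claim 17, which extracts URLs from fetched HTML so they can be stored for later collection, can be illustrated with Python's standard-library `html.parser`. This is a sketch under assumptions: the names `LinkExtractor` and `extract_urls` are hypothetical, and restricting extraction to `<a href>` links with an HTTP(S) scheme, dropping fragments, and de-duplicating are illustrative choices the claim does not make.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _ = urldefrag(urljoin(self.base_url, value))
                if (absolute.startswith(("http://", "https://"))
                        and absolute not in self.links):
                    self.links.append(absolute)


def extract_urls(html, base_url):
    """Return the distinct absolute links found in html, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The returned list is what would be written to the URL database, closing the loop back to the first step of the collection method.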
CN2008101112988A 2008-06-13 2008-06-13 Method and system for collecting web page Expired - Fee Related CN101303700B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008101112988A CN101303700B (en) 2008-06-13 2008-06-13 Method and system for collecting web page

Publications (2)

Publication Number Publication Date
CN101303700A CN101303700A (en) 2008-11-12
CN101303700B true CN101303700B (en) 2010-04-21

Family

ID=40113603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008101112988A Expired - Fee Related CN101303700B (en) 2008-06-13 2008-06-13 Method and system for collecting web page

Country Status (1)

Country Link
CN (1) CN101303700B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102316099B (en) * 2011-07-28 2014-10-22 中国科学院计算机网络信息中心 Network fishing detection method and apparatus thereof
CN109347996A (en) * 2018-12-10 2019-02-15 中共中央办公厅电子科技学院 A kind of DNS domain name acquisition system and method
CN110891090B (en) * 2019-11-29 2023-01-31 北京声智科技有限公司 Request method, device, server, system and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101046820A (en) * 2006-03-29 2007-10-03 国际商业机器公司 System and method for prioritizing websites during a webcrawling process
CN101178736A (en) * 2007-12-11 2008-05-14 腾讯科技(深圳)有限公司 Web page collecting method and web page collecting server

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
朴星海.面向主题的网络爬行器相关技术研究.哈尔滨工业大学工学硕士学位论文.2007,8-15,38-45. *
苏旋.分布式网络爬虫技术的研究与实现.哈尔滨工业大学工学硕士学位论文.2006,3-4,16-19,28-31. *

Also Published As

Publication number Publication date
CN101303700A (en) 2008-11-12

Similar Documents

Publication Publication Date Title
CN107895009B (en) Distributed internet data acquisition method and system
US10839038B2 (en) Generating configuration information for obtaining web resources
US20180191811A1 (en) Distributed server systems and data processing methods
CN102819591B (en) A kind of content-based Web page classification method and system
CN107885777A (en) A kind of control method and system of the crawl web data based on collaborative reptile
CN105243159A (en) Visual script editor-based distributed web crawler system
CN101355587B (en) Method and apparatus for obtaining URL information as well as method and system for implementing searching engine
CN102761628B (en) Pan-domain name identification and processing device and method
CN103853743A (en) Distributed system and log query method thereof
CN102710795A (en) Hotspot collecting method and device
CN110855766A (en) Method and device for accessing Web resources and proxy server
CN102833233A (en) Method and device for recognizing web pages
CN105930502B (en) System, client and method for collecting data
CN110808868B (en) Test data acquisition method and device, computer equipment and storage medium
CN104239353B (en) WEB classification control and log audit method
CN103778908A (en) Karaoke member VOD system and VOD method thereof
CN103248707B (en) File access method, system and equipment
CN107911466A (en) A kind of association method under multi-layer framework
CN102999424B (en) Parallel remote automated testing method
CN101303700B (en) Method and system for collecting web page
CN103513986B (en) A kind of method utilizing CGI technology to realize dynamic web server in without operating system equipment
CN102917067A (en) Method and device for increasing response speed based on self-adaption concurrency control of client
CN1249608C (en) System and method of mediating web page
CN110737645A (en) data migration method between different systems, data migration system and related equipment
CN103647774A (en) Web content information filtering method based on cloud computing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: CHENGDU CITY HUAWEI SAIMENTEKE SCIENCE CO., LTD.

Free format text: FORMER OWNER: HUAWEI TECHNOLOGY CO., LTD.

Effective date: 20090424

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20090424

Address after: Qingshui River District, Chengdu high tech Zone, Sichuan Province, China: 611731

Applicant after: CHENGDU HUAWEI SYMANTEC TECHNOLOGIES Co.,Ltd.

Address before: 518129 Huawei headquarters office building, Bantian, Longgang District, Shenzhen, Guangdong Province, China

Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.

C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee

Owner name: HUAWEI DIGITAL TECHNOLOGY (CHENGDU) CO., LTD.

Free format text: FORMER NAME: CHENGDU HUAWEI SYMANTEC TECHNOLOGIES CO., LTD.

CP01 Change in the name or title of a patent holder

Address after: 611731 Chengdu high tech Zone, Sichuan, West Park, Qingshui River

Patentee after: HUAWEI DIGITAL TECHNOLOGIES (CHENG DU) Co.,Ltd.

Address before: 611731 Chengdu high tech Zone, Sichuan, West Park, Qingshui River

Patentee before: CHENGDU HUAWEI SYMANTEC TECHNOLOGIES Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100421