CN106202077B - Task distribution method and device - Google Patents



Publication number
CN106202077B
Authority
CN
China
Prior art keywords
page
url
thread
crawled
hash value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510217232.7A
Other languages
Chinese (zh)
Other versions
CN106202077A (en
Inventor
左啸冰
罗纯杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Institute of Computing Technology of CAS
Original Assignee
Huawei Technologies Co Ltd
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd, Institute of Computing Technology of CAS filed Critical Huawei Technologies Co Ltd
Priority to CN201510217232.7A priority Critical patent/CN106202077B/en
Publication of CN106202077A publication Critical patent/CN106202077A/en
Application granted granted Critical
Publication of CN106202077B publication Critical patent/CN106202077B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)
  • Computer And Data Communications (AREA)

Abstract

Embodiments of the invention disclose a task distribution method and a task distribution device for improving the crawling efficiency of a web crawler. The method comprises: parsing a Uniform Resource Locator (URL) of a first page and a hash value of the parent page of that URL into a task list; judging whether the URL of the first page has been crawled; when the URL of the first page has not been crawled, determining whether the first page and a second page are adjacent pages according to the sub domain name in the URL of the first page and the hash value of the parent page of that URL; when the first page and the second page are determined not to be adjacent pages, allocating the first page to a new thread; when the first page and the second page are determined to be adjacent pages, allocating the first page to the thread where the second page is located; and executing the download tasks on the allocated threads under timing control.

Description

Task distribution method and device
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method and an apparatus for task distribution.
Background
In web crawler technology, a major problem is how to obtain the content of a set of target URL (Uniform Resource Locator) web pages in batches without the web crawler being mistaken for a malicious web crawler by the server that hosts the target pages. To crawl efficiently, the web crawler needs to start multiple download threads that concurrently send access requests for the web pages in the URL task list. However, the target website server may determine whether requests come from a malicious web crawler by examining the sequence and frequency of the URL access requests sent from the same IP (Internet Protocol) address. If the set of pages requested by an IP address within a short time period matches the website's definition of adjacent pages, the server concludes that a malicious web crawler is running at that IP address. Various techniques and strategies are currently used to avoid this.
The prior art provides a distributed web crawler technique that sorts the task list by priority points. When a distributed web crawler crawls web pages, each node is given one fixed IP address, the maximum number of threads executed concurrently on each node is set to 3 or fewer, and timing control of URL requests is applied on each thread. In addition, a column of data items named points is added to the task list; the points are determined by analyzing the URL or the content of the web page, and the task queue is reordered by points in descending order. This method greatly reduces the number of IP addresses required. Because the task list is reordered, the access order of adjacent pages that share a parent page may be scrambled; because each IP address runs at most 3 concurrent crawling threads, it sends at most 3 access requests at a time; and because of timing control, the requests are not too dense. The crawler is therefore less likely to be identified as a malicious web crawler.
With this method, however, the number of concurrent threads per node is too small, so the crawling efficiency of the web crawler (the number of web pages crawled per unit time) is very low; and adjacent pages may still be accessed intensively within the same time period, so the crawler may still be identified as a malicious web crawler.
Disclosure of Invention
The invention provides a task distribution method and a task distribution device, which are used for improving the crawling efficiency of a web crawler.
The invention provides a task distribution method in a first aspect, which comprises the following steps:
parsing a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list;
judging whether the URL of the first page is crawled or not;
when the URL of the first page is not crawled, determining whether the first page and the second page are adjacent pages according to a sub domain name in the URL of the first page and a hash value of a parent page of the URL of the first page;
when the first page and the second page are determined not to be adjacent pages, allocating the first page to a new thread;
when the first page and the second page are determined to be adjacent pages, the first page is distributed to a thread where the second page is located;
and executing the download tasks on the allocated threads under timing control.
With reference to the first aspect of the present invention, in a first implementation manner of the first aspect of the present invention, the determining whether the URL of the first page has been crawled specifically includes:
and judging whether the URL of the first page is crawled or not according to the URL list of the crawled page, and deleting the URL of the first page from the task list if the URL of the first page is crawled.
With reference to the first aspect of the present invention, in a second implementation manner of the first aspect of the present invention, the determining, according to a sub domain name in the URL of the first page and a hash value of a parent page of the URL of the first page, whether the first page and the second page are adjacent pages specifically includes:
judging whether the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page, and judging whether the hash value of the father page of the URL of the first page is the same as the hash value of the father page of the URL of the second page;
when the sub domain names are the same or the hash values are the same, determining that the first page and the second page are adjacent pages;
and when the sub domain names are different and the hash values are not equal, determining that the first page and the second page are not adjacent pages.
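The adjacency rule stated above can be illustrated with a minimal Python sketch. It assumes that the "sub domain name" is the host name of the URL; the source does not define exactly how the sub domain name is extracted, so the helper `sub_domain` and the function name `are_adjacent` are hypothetical.

```python
from urllib.parse import urlsplit


def sub_domain(url: str) -> str:
    # Assumption: the "sub domain name" is the URL's host name, lowercased.
    return (urlsplit(url).hostname or "").lower()


def are_adjacent(url_a: str, parent_hash_a: str,
                 url_b: str, parent_hash_b: str) -> bool:
    # Adjacent when the sub domain names are the same OR the parent-page
    # hash values are equal; not adjacent only when both differ.
    same_sub_domain = sub_domain(url_a) == sub_domain(url_b)
    same_parent = parent_hash_a == parent_hash_b
    return same_sub_domain or same_parent
```

Under this assumption, the two example URLs on www.very.com would be treated as adjacent even if they were found on different parent pages.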
With reference to the first aspect of the present invention, or the first implementation manner of the first aspect, or the second implementation manner of the first aspect, in a third implementation manner of the first aspect of the present invention, the allocating the first page to a new thread specifically includes:
distributing the first page to a new thread, and writing a thread number corresponding to the new thread into a thread number item of the URL of the first page; the new thread saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
and deleting the URL of the first page from the task list, and writing the URL of the first page into a URL list of the crawled pages.
With reference to the first aspect of the present invention, or the first implementation manner of the first aspect, or the second implementation manner of the first aspect, in a fourth implementation manner of the first aspect of the present invention, the allocating the first page to the thread in which the second page is located specifically includes:
distributing the first page to the thread where the second page is located, and writing the thread number corresponding to the thread where the second page is located in the thread number item of the URL of the first page; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
and deleting the URL of the first page from the task list, and writing the URL of the first page into a URL list of the crawled pages.
A second aspect of the present invention provides a task distributing apparatus, including:
a parsing unit, configured to parse a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list;
the judging unit is used for judging whether the URL of the first page is crawled or not;
a determining unit, configured to determine, when the URL of the first page is not crawled, whether the first page and the second page are adjacent pages according to a sub domain name in the URL of the first page and a hash value of a parent page of the URL of the first page;
the first allocation unit is used for allocating the first page to a new thread after the first page and the second page are determined not to be adjacent pages;
the second allocating unit is used for allocating the first page to a thread where the second page is located after the first page and the second page are determined to be adjacent pages;
and an execution unit, configured to execute the download tasks on the allocated threads under timing control.
With reference to the second aspect of the present invention, in a first implementation manner of the second aspect of the present invention, the judging unit is specifically configured to judge whether the URL of the first page has been crawled according to a URL list of crawled pages, and if so, delete the URL of the first page from the task list.
With reference to the second aspect of the present invention, in a second implementation manner of the second aspect of the present invention, the determining unit specifically includes:
the judging module is used for judging whether the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page or not and judging whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page or not when the URL of the first page is not crawled;
a first determining module, configured to determine that the first page and the second page are adjacent pages when the sub domain names are the same or the hash values are equal;
and the second determining module is used for determining that the first page and the second page are not adjacent pages when the sub domain names are different and the hash values are not equal.
With reference to the second aspect of the present invention, or the first implementation manner of the second aspect, or the second implementation manner of the second aspect, in a third implementation manner of the second aspect of the present invention, the first allocating unit specifically includes:
the first distribution module is used for distributing the first page to a new thread and writing a thread number corresponding to the new thread into a thread number item of the URL of the first page; the new thread saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
and the first deleting module is used for deleting the URL of the first page from the task list and writing the URL of the first page into the URL list of the crawled page.
With reference to the second aspect of the present invention, or the first implementation manner of the second aspect, or the second implementation manner of the second aspect, in a fourth implementation manner of the second aspect of the present invention, the second allocating unit specifically includes:
the second distribution module is used for distributing the first page to the thread where the second page is located and writing the thread number corresponding to the thread where the second page is located in the thread number item of the URL of the first page; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
and the second deleting module is used for deleting the URL of the first page from the task list and writing the URL of the first page into the URL list of the crawled page.
As can be seen from the above technical solutions, the embodiments of the invention have the following advantages. A Uniform Resource Locator (URL) of a first page and a hash value of the parent page of that URL are parsed into a task list, and it is judged whether the URL of the first page has been crawled. When it has not been crawled, whether the first page and a second page are adjacent pages is determined according to the sub domain name in the URL of the first page and the hash value of the parent page of that URL. When they are determined not to be adjacent pages, the first page is allocated to a new thread and the download task is executed under timing control; when they are determined to be adjacent pages, the first page is allocated to the thread where the second page is located and the download task is executed under timing control. Because the URL tasks of the distributed web crawler are allocated in this way, adjacent pages are allocated to the same thread, which ensures that the adjacent pages allocated to the same IP address are downloaded serially, while non-adjacent pages are allocated to different threads and downloaded concurrently. Therefore, with a limited number of IP addresses, the crawling efficiency of the web crawler is improved, and the crawler is effectively prevented from being identified as a malicious web crawler.
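The serial-within-a-thread, concurrent-across-threads behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: `fetch` is a hypothetical stand-in for the real download routine, `assignments` is a hypothetical mapping from thread number to the URLs allocated to that thread, and `delay_range` stands for the random timing-control interval.

```python
import random
import threading
import time


def run_worker(urls, fetch, results, lock, delay_range):
    # Pages allocated to one thread are downloaded one after another.
    for url in urls:
        body = fetch(url)
        with lock:
            results.append((threading.current_thread().name, url, body))
        # Timing control: wait a random interval before the next request.
        time.sleep(random.uniform(*delay_range))


def run_all(assignments, fetch, delay_range=(0.0, 0.0)):
    # Each thread number gets its own worker; workers run concurrently.
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=run_worker, name=f"thread-{no}",
                         args=(urls, fetch, results, lock, delay_range))
        for no, urls in assignments.items()
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

For example, `run_all({0: ["a1", "a2"], 1: ["b1"]}, fetch)` would download "a1" before "a2" on one thread while "b1" is downloaded independently on another.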
Drawings
FIG. 1 is a flowchart illustrating a task distribution method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a task distribution method according to another embodiment of the present invention;
FIG. 3 is a flowchart illustrating a task distribution method according to another embodiment of the present invention;
FIG. 4 is a flowchart illustrating a task distribution method according to another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a task distribution apparatus provided in the present invention;
FIG. 6 is a schematic structural diagram of another embodiment of a task distribution apparatus provided in the present invention;
FIG. 7 is a schematic structural diagram of another embodiment of a task distribution apparatus provided in the present invention;
FIG. 8 is a schematic structural diagram of another embodiment of a task distribution apparatus provided in the present invention;
FIG. 9 is a schematic structural diagram of another embodiment of a task distribution apparatus provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It is to be understood that, although the terms first, second, etc. may be used to describe various users or terminals in the embodiments of the present invention, the users or terminals should not be limited by these terms. These terms are only used to distinguish one user or terminal from another. For example, a first user may also be referred to as a second user, and similarly, a second user may also be referred to as a first user, without departing from the scope of embodiments of the present invention; likewise, the second user may also be referred to as a third user, etc., and the embodiment of the present invention is not limited thereto.
Some abbreviations and key terms to which the invention relates are first defined:
Web crawler: a program or script that automatically fetches web information according to certain rules.
URL: on the WWW, each information resource has a uniform and unique address on the network, called a URL (Uniform Resource Locator), i.e. a network address.
Adjacent pages: each website defines adjacent pages differently. The definition generally includes the different pages pointed to by the multiple URL links contained in the same web page, and pages with the same sub domain name (for example, http://www.very.com/entries/793336/ and http://www.very.com/entries/792442/).
Timing control in crawler threads: setting the time interval between the URL access requests that a web crawler download thread sends to the target website; the interval is greater than the time a person needs to click a web page and must be a random value.
IP address: the 32-bit address (in IPv4) assigned to each host connected to the Internet.
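The timing control defined above can be sketched in a few lines of Python. The bounds `min_seconds` and `max_seconds` are hypothetical values chosen only for illustration; the source requires only that the interval be random and longer than the time a person needs to click a web page.

```python
import random


def next_request_delay(min_seconds: float = 2.0,
                       max_seconds: float = 6.0,
                       rng=random) -> float:
    # Assumption: min_seconds approximates the time a person needs to click
    # a page; the delay is drawn uniformly at random between the two bounds.
    return rng.uniform(min_seconds, max_seconds)
```

A download thread would sleep for `next_request_delay()` seconds between two consecutive requests to the target website.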
An embodiment of the present invention provides a task distribution method, which is a method executed by a task distribution device, and referring to fig. 1, an embodiment of the task distribution method provided by the present invention includes:
101. Parsing a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list;
The parser is responsible for passing data to the task list. Specifically, it parses the acquired URL of the first page and the hash value of the parent page of that URL into the task list. The data type is therefore no longer just the parsed URL string, but a two-element tuple containing the URL string and the hash value of the parent page of the URL. Accordingly, each entry in the task list no longer lists only a URL string; it lists both the URL string and the hash value that identifies the parent page of the URL.
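A task-list entry of this kind might be represented as below. The sketch assumes that the parent page is identified by the SHA-1 hash of its URL; the source does not say which hash function is used or what exactly is hashed, so `page_hash`, `TaskEntry` and `make_entry` are hypothetical names.

```python
import hashlib
from typing import NamedTuple


class TaskEntry(NamedTuple):
    url: str          # the URL string of the page to crawl
    parent_hash: str  # hash value identifying the parent page of the URL


def page_hash(parent_url: str) -> str:
    # Assumption: hash the parent page's URL with SHA-1.
    return hashlib.sha1(parent_url.encode("utf-8")).hexdigest()


def make_entry(url: str, parent_url: str) -> TaskEntry:
    return TaskEntry(url, page_hash(parent_url))
```

Two entries created from links found on the same parent page carry the same `parent_hash`, which is what the later adjacency check compares.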
102. Judging whether the URL of the first page is crawled or not;
It should be noted that when a new URL task enters the task waiting queue in the task list, deduplication is performed first; it is therefore first necessary to judge whether the URL of the first page has been crawled.
103. When the URL of the first page is not crawled, determining whether the first page and the second page are adjacent pages according to a sub domain name in the URL of the first page and a hash value of a parent page of the URL of the first page;
It should be noted that the second page is a page that is different from the first page and belongs to a running thread.
104. When the first page and the second page are determined not to be adjacent pages, the first page is allocated to a new thread;
and when the first page and the second page are determined not to be adjacent pages, allocating the first page to a new thread.
105. When the first page and the second page are determined to be adjacent pages, the first page is distributed to a thread where the second page is located;
and when the first page and the second page are determined to be adjacent pages, distributing the first page to the thread where the second page is located.
106. Executing the download tasks on the allocated threads under timing control.
In this embodiment of the invention, the Uniform Resource Locator (URL) of a first page and the hash value of the parent page of that URL are parsed into a task list, and it is judged whether the URL of the first page has been crawled. When it has not been crawled, whether the first page and the second page are adjacent pages is determined according to the sub domain name in the URL of the first page and the hash value of the parent page of that URL. When they are determined not to be adjacent pages, the first page is allocated to a new thread; when they are determined to be adjacent pages, the first page is allocated to the thread where the second page is located; the download tasks are then executed on the allocated threads under timing control. Because the URL tasks of the distributed web crawler are allocated in this way, adjacent pages are allocated to the same thread, which ensures that the adjacent pages allocated to the same IP address are downloaded serially, while non-adjacent pages are allocated to different threads and downloaded concurrently. Therefore, with a limited number of IP addresses, the crawling efficiency of the web crawler is improved, and the crawler is effectively prevented from being identified as a malicious web crawler.
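Steps 101 to 105 can be sketched end to end as a single allocation function. The sketch makes several assumptions that are not in the source: the sub domain name is taken to be the URL's host name, `threads` is a hypothetical dictionary mapping a thread number to the (URL, parent hash) pairs already allocated to it, and a URL is recorded in `crawled` as soon as it is allocated.

```python
from urllib.parse import urlsplit


def sub_domain(url):
    # Assumption: the "sub domain name" is the URL's host name.
    return (urlsplit(url).hostname or "").lower()


def distribute(url, parent_hash, crawled, threads):
    """Return the thread number the page is allocated to, or None if crawled."""
    # Step 102: deduplicate against the set of crawled URLs.
    if url in crawled:
        return None
    # Step 103: compare with every page held by a running thread.
    for thread_no, pages in threads.items():
        for other_url, other_hash in pages:
            if (sub_domain(url) == sub_domain(other_url)
                    or parent_hash == other_hash):
                # Step 105: adjacent, so reuse the second page's thread.
                pages.append((url, parent_hash))
                crawled.add(url)
                return thread_no
    # Step 104: no adjacent page found, so allocate a new thread number.
    new_no = max(threads, default=-1) + 1
    threads[new_no] = [(url, parent_hash)]
    crawled.add(url)
    return new_no
```

Choosing the first matching thread is a design simplification of this sketch; the source does not say what happens when the first page is adjacent to pages in more than one running thread.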
Referring to fig. 2, another embodiment of the task distribution method provided by the present invention includes:
201. Parsing a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list;
202. judging whether the URL of the first page is crawled or not according to the URL list of the crawled pages, and deleting the URL of the first page from the task list if the URL of the first page is crawled;
it should be noted that, whether the URL of the first page has been crawled is determined according to the URL list of the crawled page, and if the URL of the first page has been crawled, the URL of the first page is deleted from the task list.
203. When the URL of the first page is not crawled, determining whether the first page and the second page are adjacent pages according to a sub domain name in the URL of the first page and a hash value of a parent page of the URL of the first page;
204. when the first page and the second page are determined not to be adjacent pages, the first page is allocated to a new thread;
205. when the first page and the second page are determined to be adjacent pages, the first page is distributed to a thread where the second page is located;
it should be noted that, the specific processes of steps 201, 203 to 205 may respectively correspond to steps 101, 103 to 105 in the embodiment shown in fig. 1, and are not described herein again.
206. Executing the download tasks on the allocated threads under timing control.
It should be noted that the web crawler download thread executes the download task under timing control, parses the downloaded page content, and sends each newly parsed URL together with the hash value of its parent page to the URL task list; after deduplication, a new URL task waiting queue is generated.
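The parse-and-enqueue step described above might look like the following sketch, which uses Python's standard `html.parser` module. It assumes that only `<a href>` links are followed and that the parent page is identified by the SHA-1 hash of its URL; neither detail is specified in the source, and `LinkCollector` and `extract_tasks` are hypothetical names.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_tasks(page_url, html, crawled):
    # Assumption: the parent hash is the SHA-1 of the downloaded page's URL.
    parent_hash = hashlib.sha1(page_url.encode("utf-8")).hexdigest()
    collector = LinkCollector()
    collector.feed(html)
    tasks, seen = [], set()
    for href in collector.hrefs:
        url = urljoin(page_url, href)  # resolve relative links
        # Deduplicate against crawled URLs and against this page's own links.
        if url in crawled or url in seen:
            continue
        seen.add(url)
        tasks.append((url, parent_hash))
    return tasks
```

The returned (URL, parent hash) pairs correspond to the new entries that would form the next URL task waiting queue.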
In the embodiment of the invention, whether the URL of the first page is crawled is judged according to the URL list of the crawled page, if so, the URL of the first page is deleted from the task list, so that the crawling efficiency of the web crawler is improved.
Referring to fig. 3, another embodiment of the task distribution method provided by the present invention includes:
301. Parsing a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list;
302. judging whether the URL of the first page is crawled or not according to the URL list of the crawled pages, and deleting the URL of the first page from the task list if the URL of the first page is crawled;
303. when the URL of the first page is not crawled, judging whether the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page or not, and judging whether the hash value of the father page of the URL of the first page is the same as the hash value of the father page of the URL of the second page or not;
It should be noted that these judgments are made by comparing the sub domain name of the URL of the first page and the hash value of the parent page of the URL of the first page, respectively, with those of the other pages in all running threads.
304. When the sub domain names are the same or the hash values are equal, determining that the first page and the second page are adjacent pages;
it should be noted that, if it is determined through the step 303 that the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page, or the hash value of the parent page of the URL of the first page is equal to the hash value of the parent page of the URL of the second page, it is determined that the first page and the second page are adjacent pages.
305. When the sub domain names are different and the hash values are not equal, determining that the first page and the second page are not adjacent pages;
it should be noted that, if it is determined through the step 303 that the sub domain name of the URL of the first page is not the same as the sub domain name of the URL of the second page, and the hash value of the parent page of the URL of the first page is not equal to the hash value of the parent page of the URL of the second page, it is determined that the first page and the second page are not adjacent pages.
306. When the first page and the second page are determined not to be adjacent pages, the first page is allocated to a new thread;
307. when the first page and the second page are determined to be adjacent pages, the first page is distributed to a thread where the second page is located;
308. Executing the download tasks on the allocated threads under timing control.
It should be noted that, the specific processes of steps 301, 302, 306 to 308 can respectively correspond to steps 201, 202, 204 to 206 in the embodiment shown in fig. 2, and are not described herein again.
In the embodiment of the invention, when the URL of the first page is not crawled, whether the first page and the second page are adjacent pages is determined by judging whether the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page and judging whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page; when the sub domain names are the same or the hash values are the same, determining that the first page and the second page are adjacent pages; and when the sub domain names are different and the hash values are not equal, determining that the first page and the second page are not adjacent pages. Therefore, the URL tasks of the distributed web crawlers are reasonably distributed, and the crawling efficiency of the web crawlers is improved.
Referring to fig. 4, another embodiment of the task distribution method provided by the present invention includes:
401. Parsing a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list;
402. judging whether the URL of the first page is crawled or not according to the URL list of the crawled pages, and deleting the URL of the first page from the task list if the URL of the first page is crawled;
403. when the URL of the first page is not crawled, judging whether the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page or not, and judging whether the hash value of the father page of the URL of the first page is the same as the hash value of the father page of the URL of the second page or not;
404. when the sub domain names are the same or the hash values are equal, determining that the first page and the second page are adjacent pages;
405. when the sub domain names are different and the hash values are not equal, determining that the first page and the second page are not adjacent pages;
406. when the first page and the second page are determined not to be adjacent pages, the first page is allocated to a new thread, and a thread number corresponding to the new thread is written into a thread number item of the URL of the first page; deleting the URL of the first page from the task list, and writing the URL into a URL list of crawled pages; the new thread stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
407. when the first page and the second page are determined to be adjacent pages, the first page is distributed to the thread where the second page is located, and the thread number corresponding to the thread where the second page is located is written in the thread number item of the URL of the first page; deleting the URL of the first page from the task list, and writing the URL into a URL list of crawled pages; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
408. Executing the download tasks on the allocated threads under timing control.
It should be noted that, the specific processes of steps 401 to 405 and 408 can respectively correspond to steps 301 to 305 and 308 in the embodiment shown in fig. 3, and are not described herein again.
In the embodiment of the invention, when the first page and the second page are determined not to be adjacent pages, the first page is allocated to a new thread, and a thread number corresponding to the new thread is written into a thread number item of a URL (uniform resource locator) of the first page; when the first page and the second page are determined to be adjacent pages, the first page is distributed to the thread where the second page is located, and the thread number corresponding to the thread where the second page is located is written in the thread number item of the URL of the first page; meanwhile, deleting the URL of the first page from the task list, and writing the URL into the URL list of the crawled page; therefore, the same thread is allocated to the adjacent web pages, each adjacent page allocated to the same IP address is ensured to be serially downloaded, and the non-adjacent pages are allocated to different threads to be downloaded concurrently; therefore, under the condition that the number of the IP is limited, the crawling efficiency of the web crawlers is improved, and the web crawlers with the limited number of the IP can be effectively prevented from being identified as malicious web crawlers.
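The bookkeeping performed in steps 406 and 407 can be sketched as follows. The record layouts are assumptions made for illustration: `Task.thread_no` stands for the "thread number item" of the URL, and `ThreadState.saved` stands for the hash values and URL names that a thread saves; the source does not prescribe these structures.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    url: str
    parent_hash: str
    thread_no: Optional[int] = None  # the "thread number item" of the URL


@dataclass
class ThreadState:
    # (parent_hash, url) pairs saved by the thread for later comparisons.
    saved: list = field(default_factory=list)


def assign(task, thread_no, task_list, crawled, thread_states):
    # Write the thread number into the task's thread number item.
    task.thread_no = thread_no
    # The thread saves the parent-page hash value and the URL name.
    state = thread_states.setdefault(thread_no, ThreadState())
    state.saved.append((task.parent_hash, task.url))
    # Delete the URL from the task list and record it as crawled.
    task_list.remove(task)
    crawled.append(task.url)
```

The same function serves both cases: `thread_no` is either a newly created thread number (step 406) or the number of the thread where the second page is located (step 407).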
For ease of understanding, the task distribution method in the embodiment of the present invention is described below in a specific application scenario:
sending a request for acquiring a first page to a target web server, and at the same time parsing the Uniform Resource Locator (URL) of the first page and the hash value of the parent page of the URL of the first page into a task list;
performing deduplication on the URL of the first page in the task list, specifically: judging, according to the URL list of crawled pages, whether the URL of the first page in the task list has been crawled, and deleting the URL of the first page from the task list if it has been crawled;
when the URL of the first page has not been crawled, the first page is allocated according to the following judgment conditions: judging whether the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page, and judging whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page; here the sub domain name of the URL of the first page and the hash value of its parent page are compared with those of the other pages in all running threads;
the result of the judgment is processed as follows: if the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page, or the hash value of the parent page of the URL of the first page is equal to the hash value of the parent page of the URL of the second page, the first page and the second page are determined to be adjacent pages; in this case, the first page is allocated to the thread where the second page is located, and the thread number corresponding to that thread is written into the thread number item of the URL of the first page; the URL of the first page is deleted from the task list and written into the URL list of crawled pages; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
if the sub domain name of the URL of the first page is different from the sub domain name of the URL of the second page, and the hash value of the parent page of the URL of the first page is not equal to the hash value of the parent page of the URL of the second page, the first page and the second page are determined not to be adjacent pages; in this case, the first page is allocated to a new thread, and the thread number corresponding to the new thread is written into the thread number item of the URL of the first page; the URL of the first page is deleted from the task list and written into the URL list of crawled pages; the new thread stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
after the adjacent and non-adjacent pages have been allocated in this way, the web crawler download threads execute the downloading tasks under time-sequence control and parse the downloaded page content; each newly parsed URL and the hash value of its parent page are sent to the URL task list, and a new queue of waiting URL tasks is generated after deduplication.
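The scenario above relies on two pieces of data per URL: its sub domain name and the hash value of its parent page. The following Python sketch shows one way these could be derived when newly parsed links are fed back into the task list. Several choices here are assumptions, since the text does not specify them: the hash is SHA-256 computed over the parent page's URL, the "sub domain name" is taken to be the full host name, and the function names are hypothetical.

```python
import hashlib
from typing import Iterable, List, Tuple
from urllib.parse import urljoin, urlsplit


def parent_hash(parent_url: str) -> str:
    # Assumption: the parent page is identified by hashing its URL.
    return hashlib.sha256(parent_url.encode("utf-8")).hexdigest()


def sub_domain(url: str) -> str:
    # Assumption: the host name stands in for the "sub domain name".
    return (urlsplit(url).hostname or "").lower()


def to_task_entries(parent_url: str, links: Iterable[str]) -> List[Tuple[str, str]]:
    """Turn links found on a downloaded page into (URL, parent hash) entries."""
    h = parent_hash(parent_url)
    # urljoin resolves relative links against the page they were found on.
    return [(urljoin(parent_url, link), h) for link in links]
```

All links extracted from the same page share one parent hash, which is what later allows them to be recognised as adjacent even when their host names differ.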
The task distribution method has been described above. The following description is given from the perspective of a task distribution device, which may be integrated in a chip, and the chip may be mounted in a terminal. Referring to fig. 5, the device includes:
a parsing unit 501, configured to parse a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list;
a judging unit 502, configured to judge whether the URL of the first page has been crawled;
a determining unit 503, configured to determine, when the URL of the first page is not crawled, whether the first page and the second page are adjacent pages according to a sub domain name in the URL of the first page and a hash value of a parent page of the URL of the first page;
a first allocating unit 504, configured to allocate the first page to a new thread after determining that the first page and the second page are not adjacent pages;
a second allocating unit 505, configured to allocate the first page to a thread where the second page is located after determining that the first page and the second page are adjacent pages;
and an execution unit 506, configured to execute the downloading tasks on the allocated threads under time-sequence control.
In the embodiment of the present invention, the parsing unit 501 parses a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list; the judging unit 502 judges whether the URL of the first page has been crawled; when the URL of the first page has not been crawled, the determining unit 503 determines whether the first page and the second page are adjacent pages according to the sub domain name in the URL of the first page and the hash value of the parent page of the URL of the first page; when it is determined that the first page and the second page are not adjacent pages, the first allocating unit 504 allocates the first page to a new thread; when it is determined that they are adjacent pages, the second allocating unit 505 allocates the first page to the thread where the second page is located; and the execution unit 506 executes the downloading tasks on the allocated threads under time-sequence control. Because the URL tasks of the distributed web crawler are allocated in this way, adjacent pages are allocated to the same thread, which ensures that adjacent pages allocated to the same IP address are downloaded serially, while non-adjacent pages are allocated to different threads and downloaded concurrently. As a result, when the number of IP addresses is limited, the crawling efficiency of the web crawler is improved, and a web crawler with a limited number of IP addresses can be effectively prevented from being identified as a malicious web crawler.
Based on the task distribution apparatus in the foregoing embodiment, optionally, the judging unit 502 is specifically configured to judge, according to the URL list of crawled pages, whether the URL of the first page has been crawled, and if so, to delete the URL of the first page from the task list, so that no page is crawled twice and the crawling efficiency of the web crawler is improved.
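As a rough illustration of this deduplication step, the sketch below filters a task list against the crawled-page list. The function name `remove_crawled` is hypothetical, and the crawled list is held in a set (an assumption; the text only speaks of a "URL list") so that each membership check is cheap.

```python
from typing import List, Set


def remove_crawled(task_list: List[str], crawled: Set[str]) -> List[str]:
    """Delete already-crawled URLs from task_list in place.

    Returns the URLs that were removed, for logging or inspection.
    """
    removed = [url for url in task_list if url in crawled]
    task_list[:] = [url for url in task_list if url not in crawled]
    return removed
```

Only the URLs that survive this filter go on to the adjacency check.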
Based on the task distribution apparatus in the foregoing embodiment, optionally, as shown in fig. 6, the determining unit 503 specifically includes:
a judging module 601, configured to judge, when the URL of the first page has not been crawled, whether the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page, and whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page;
a first determining module 602, configured to determine that the first page and the second page are adjacent pages when the sub domain names are the same or the hash values are equal;
a second determining module 603, configured to determine that the first page and the second page are not adjacent pages when the sub-domain names are different and the hash values are not equal to each other.
In the embodiment of the present invention, when the URL of the first page has not been crawled, the judging module 601 judges whether the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page, and whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page, in order to establish whether the first page and the second page are adjacent pages; when the sub domain names are the same or the hash values are equal, the first determining module 602 determines that the first page and the second page are adjacent pages; when the sub domain names are different and the hash values are not equal, the second determining module 603 determines that the first page and the second page are not adjacent pages. The URL tasks of the distributed web crawler are thereby allocated on the basis of adjacency, which improves the crawling efficiency of the web crawler.
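The adjacency rule is a simple disjunction, and its negation (different sub domain names and unequal hash values) follows by De Morgan's law. A minimal Python sketch, with the hypothetical names `PageKey`, `is_adjacent` and `find_thread`, might look as follows; the lookup across running threads reflects the comparison against pages in all running threads described earlier, and returning the first matching thread is an assumption.

```python
from typing import Dict, List, NamedTuple, Optional


class PageKey(NamedTuple):
    sub_domain: str
    parent_hash: str


def is_adjacent(first: PageKey, second: PageKey) -> bool:
    # Adjacent if the sub domain names match OR the parent-page hashes match.
    return (first.sub_domain == second.sub_domain
            or first.parent_hash == second.parent_hash)


def find_thread(first: PageKey, running: Dict[int, List[PageKey]]) -> Optional[int]:
    """Return the number of the first running thread holding an adjacent page."""
    for thread_no, pages in running.items():
        if any(is_adjacent(first, page) for page in pages):
            return thread_no
    return None  # no adjacent page anywhere: a new thread is needed
```

A return value of `None` corresponds to the case handled by the second determining module 603.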
Based on the task distributing apparatus in the foregoing embodiment, optionally, as shown in fig. 7, the first allocating unit 504 specifically includes:
a first distribution module 701, configured to allocate the first page to a new thread, and write a thread number corresponding to the new thread into a thread number entry of the URL of the first page; the new thread stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
a first deleting module 702, configured to delete the URL of the first page from the task list and write the URL into the URL list of the crawled page.
In this embodiment of the present invention, when it is determined that the first page and the second page are not adjacent pages, the first distribution module 701 allocates the first page to a new thread, and writes the thread number corresponding to the new thread into the thread number entry of the URL of the first page; the first deleting module 702 deletes the URL of the first page from the task list and writes the URL into the URL list of crawled pages. Non-adjacent pages are thus allocated to different threads and downloaded concurrently. As a result, when the number of IP addresses is limited, the crawling efficiency of the web crawler is improved, and a web crawler with a limited number of IP addresses can be effectively prevented from being identified as a malicious web crawler.
Based on the task distributing apparatus in the foregoing embodiment, optionally, as shown in fig. 8, the second allocating unit 505 specifically includes:
a second distribution module 801, configured to allocate the first page to a thread where the second page is located, and write a thread number corresponding to the thread where the second page is located in a thread number entry of the URL of the first page; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
a second deleting module 802, configured to delete the URL of the first page from the task list and write the URL into the URL list of the crawled page.
In the embodiment of the present invention, when it is determined that the first page and the second page are adjacent pages, the second distribution module 801 allocates the first page to the thread where the second page is located, and writes the thread number corresponding to that thread into the thread number entry of the URL of the first page; the second deleting module 802 deletes the URL of the first page from the task list and writes the URL into the URL list of crawled pages. Adjacent pages are thus allocated to the same thread, which ensures that adjacent pages allocated to the same IP address are downloaded serially. As a result, when the number of IP addresses is limited, the crawling efficiency of the web crawler is improved, and a web crawler with a limited number of IP addresses can be effectively prevented from being identified as a malicious web crawler.
The embodiments shown in fig. 5 to 8 explain the specific structure of the task distribution apparatus from the perspective of the functional unit, and the following describes the specific structure of the task distribution apparatus from the perspective of hardware in conjunction with the embodiment shown in fig. 9:
as shown in fig. 9, the task distributing apparatus includes: a receiver 901, a transmitter 902, a processor 903, and a memory 904.
The task distribution apparatus in embodiments of the present invention may have more or fewer components than those shown in fig. 9, may combine two or more components, or may have a different arrangement or configuration of components; each component may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
The processor 903 is used for reading the instructions stored in the memory 904 to execute the following operations:
parsing a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list;
judging whether the URL of the first page is crawled or not;
when the URL of the first page is not crawled, determining whether the first page and the second page are adjacent pages according to a sub domain name in the URL of the first page and a hash value of a parent page of the URL of the first page;
when the first page and the second page are determined not to be adjacent pages, the first page is allocated to a new thread;
when the first page and the second page are determined to be adjacent pages, the first page is distributed to a thread where the second page is located;
and controlling to execute the downloading task on the distributed threads according to the time sequence.
The processor 903 is specifically configured to perform the following operations:
and judging whether the URL of the first page is crawled or not according to the URL list of the crawled page, and deleting the URL of the first page from the task list if the URL of the first page is crawled.
The processor 903 is specifically configured to perform the following operations:
judging whether the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page, and judging whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page;
when the sub domain names are the same or the hash values are equal, determining that the first page and the second page are adjacent pages;
and when the sub domain names are different and the hash values are not equal, determining that the first page and the second page are not adjacent pages.
The processor 903 is specifically configured to perform the following operations:
distributing the first page to a new thread, and writing a thread number corresponding to the new thread into a thread number item of the URL of the first page; the new thread stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
and deleting the URL of the first page from the task list, and writing the URL into the URL list of the crawled pages.
The processor 903 is specifically configured to perform the following operations:
distributing the first page to the thread where the second page is located, and writing the thread number corresponding to the thread where the second page is located in the thread number item of the URL of the first page; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
and deleting the URL of the first page from the task list, and writing the URL into the URL list of the crawled pages.
In the embodiment of the present invention, the processor 903 parses a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list; performs deduplication on the URL of the first page; determines whether the first page and the second page are adjacent pages according to the sub domain name in the URL of the first page and the hash value of the parent page of the URL of the first page; when the first page and the second page are determined not to be adjacent pages, allocates the first page to a new thread and executes the downloading task under time-sequence control; and when the first page and the second page are determined to be adjacent pages, allocates the first page to the thread where the second page is located and executes the downloading task under time-sequence control. Because the URL tasks of the distributed web crawler are allocated in this way, adjacent pages are allocated to the same thread, which ensures that adjacent pages allocated to the same IP address are downloaded serially, while non-adjacent pages are allocated to different threads and downloaded concurrently. As a result, when the number of IP addresses is limited, the crawling efficiency of the web crawler is improved, and a web crawler with a limited number of IP addresses can be effectively prevented from being identified as a malicious web crawler.
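The resulting execution pattern, serial within a thread and concurrent across threads, can be sketched in Python as below. This is only an illustration under stated assumptions: `run_assigned` and `fetch` are hypothetical names, "time-sequence control" is interpreted simply as processing each thread's URLs in the order they were assigned, and the actual download is left to a caller-supplied function so that the sketch makes no network requests itself.

```python
import threading
from typing import Callable, Dict, List


def run_assigned(assignments: Dict[int, List[str]],
                 fetch: Callable[[str], None]) -> None:
    """Download each thread's URLs in order; run the threads concurrently."""

    def worker(urls: List[str]) -> None:
        for url in urls:  # serial: one request at a time within a thread
            fetch(url)

    workers = [threading.Thread(target=worker, args=(urls,))
               for urls in assignments.values()]
    for w in workers:
        w.start()  # different threads proceed concurrently
    for w in workers:
        w.join()   # wait until every assigned download has finished
```

Because adjacent pages share a thread, a given sub domain sees at most one in-flight request from that thread at a time, which is the behaviour the embodiment relies on to avoid being flagged as a malicious crawler.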
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is only a logical division, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for task distribution, comprising:
parsing a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list;
judging whether the URL of the first page has been crawled;
when the URL of the first page has not been crawled, determining whether the first page and a second page are adjacent pages according to a sub domain name in the URL of the first page and the hash value of the parent page of the URL of the first page, wherein the first page and the second page are pages allocated to the same IP address;
when the first page and the second page are determined not to be adjacent pages, allocating the first page to a new thread;
when the first page and the second page are determined to be adjacent pages, the first page is distributed to a thread where the second page is located;
and controlling to execute the downloading task on the distributed threads according to the time sequence.
2. The task distribution method of claim 1, wherein the determining whether the URL of the first page has been crawled specifically comprises:
and judging whether the URL of the first page is crawled or not according to the URL list of the crawled page, and deleting the URL of the first page from the task list if the URL of the first page is crawled.
3. The task distribution method according to claim 1, wherein the determining whether the first page and the second page are adjacent pages according to a sub domain name in the URL of the first page and a hash value of a parent page of the URL of the first page specifically comprises:
judging whether the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page, and judging whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page;
when the sub domain names are the same or the hash values are the same, determining that the first page and the second page are adjacent pages;
and when the sub domain names are different and the hash values are not equal, determining that the first page and the second page are not adjacent pages.
4. The task distribution method according to any one of claims 1 to 3, wherein the allocating the first page into a new thread specifically comprises:
distributing the first page to a new thread, and writing a thread number corresponding to the new thread into a thread number item of the URL of the first page; the new thread saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
and deleting the URL of the first page from the task list, and writing the URL of the first page into a URL list of the crawled pages.
5. The task distribution method according to any one of claims 1 to 3, wherein the allocating the first page to the thread in which the second page is located specifically comprises:
distributing the first page to the thread where the second page is located, and writing the thread number corresponding to the thread where the second page is located in the thread number item of the URL of the first page; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
and deleting the URL of the first page from the task list, and writing the URL of the first page into a URL list of the crawled pages.
6. A task distribution apparatus, comprising:
a parsing unit, configured to parse a Uniform Resource Locator (URL) of a first page and a hash value of a parent page of the URL of the first page into a task list;
a judging unit, configured to judge whether the URL of the first page has been crawled;
a determining unit, configured to determine, when the URL of the first page has not been crawled, whether the first page and a second page are adjacent pages according to a sub domain name in the URL of the first page and the hash value of the parent page of the URL of the first page, wherein the first page and the second page are pages allocated to the same IP address;
a first allocating unit, configured to allocate the first page to a new thread after it is determined that the first page and the second page are not adjacent pages;
a second allocating unit, configured to allocate the first page to the thread where the second page is located after it is determined that the first page and the second page are adjacent pages;
and an execution unit, configured to execute the downloading tasks on the allocated threads under time-sequence control.
7. The task distribution apparatus according to claim 6, wherein the judging unit is specifically configured to judge, according to a URL list of crawled pages, whether the URL of the first page has been crawled, and if so, delete the URL of the first page from the task list.
8. The task distribution apparatus according to claim 6, wherein the determining unit specifically includes:
the judging module is used for judging whether the sub domain name of the URL of the first page is the same as the sub domain name of the URL of the second page or not and judging whether the hash value of the parent page of the URL of the first page is the same as the hash value of the parent page of the URL of the second page or not when the URL of the first page is not crawled;
a first determining module, configured to determine that the first page and the second page are adjacent pages when the sub domain names are the same or the hash values are equal;
and the second determining module is used for determining that the first page and the second page are not adjacent pages when the sub domain names are different and the hash values are not equal.
9. The task distribution apparatus according to any one of claims 6 to 8, wherein the first allocation unit specifically comprises:
the first distribution module is used for distributing the first page to a new thread and writing a thread number corresponding to the new thread into a thread number item of the URL of the first page; the new thread saves the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
and the first deleting module is used for deleting the URL of the first page from the task list and writing the URL of the first page into the URL list of the crawled page.
10. The task distribution apparatus according to any one of claims 6 to 8, wherein the second allocating unit specifically includes:
the second distribution module is used for distributing the first page to the thread where the second page is located and writing the thread number corresponding to the thread where the second page is located in the thread number item of the URL of the first page; the thread where the second page is located stores the hash value of the parent page of the URL of the first page and the name of the URL of the first page;
and the second deleting module is used for deleting the URL of the first page from the task list and writing the URL of the first page into the URL list of the crawled page.
CN201510217232.7A 2015-04-30 2015-04-30 Task distribution method and device Active CN106202077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510217232.7A CN106202077B (en) 2015-04-30 2015-04-30 Task distribution method and device

Publications (2)

Publication Number Publication Date
CN106202077A CN106202077A (en) 2016-12-07
CN106202077B true CN106202077B (en) 2020-01-21

Family

ID=57457752

Country Status (1)

Country Link
CN (1) CN106202077B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102790700A (en) * 2011-05-19 2012-11-21 北京启明星辰信息技术股份有限公司 Method and device for recognizing webpage crawler
CN103514301A (en) * 2013-10-24 2014-01-15 深圳市同洲电子股份有限公司 Method and system for scheduling tasks of distributed network crawlers
CN103533097A (en) * 2013-10-10 2014-01-22 北京京东尚科信息技术有限公司 Web crawler downloading and analyzing method and device
CN104391979A (en) * 2014-12-05 2015-03-04 北京国双科技有限公司 Malicious web crawler recognition method and device
CN104408182A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Method and device for processing web crawler data on distributed system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010043210A1 (en) * 1999-01-14 2001-11-22 John Gilbert System and method for the construction of data
US6321265B1 (en) * 1999-11-02 2001-11-20 Altavista Company System and method for enforcing politeness while scheduling downloads in a web crawler
US20080147669A1 (en) * 2006-12-14 2008-06-19 Microsoft Corporation Detecting web spam from changes to links of web sites
US8799262B2 (en) * 2011-04-11 2014-08-05 Vistaprint Schweiz Gmbh Configurable web crawler
US9258289B2 (en) * 2013-04-29 2016-02-09 Arbor Networks Authentication of IP source addresses

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant