WO2021197392A1 - Task queue generation (任务队列生成) - Google Patents

Task queue generation

Info

Publication number
WO2021197392A1
WO2021197392A1 (PCT/CN2021/084658)
Authority
WO
WIPO (PCT)
Prior art keywords
task
processed
tasks
processing node
weight
Prior art date
Application number
PCT/CN2021/084658
Other languages
English (en)
French (fr)
Inventor
官砚楚
彭雪银
金戈
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2021197392A1 publication Critical patent/WO2021197392A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 16/951: Indexing; web crawling techniques
    • G06F 16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F 9/5038: Allocation of resources to service a request, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F 2209/484: Indexing scheme relating to G06F 9/48; precedence
    • G06F 2209/5021: Indexing scheme relating to G06F 9/50; priority

Definitions

  • the embodiments of this specification relate to the field of information technology, and in particular to task queue generation.
  • when performing tasks such as batch forensics and downloading, crawler technology often needs to be used. For example, crawler nodes are used to crawl and download material from a page related to evidence, and so on. In the process of batch crawling, if the tasks are not processed, it often happens that the same crawler node visits a certain site too frequently, causing the crawler node to be blocked, which hinders the crawling of material and reduces crawling efficiency. Based on this, there is a need for a task queue generation scheme that can improve crawling efficiency.
  • the purpose of the embodiments of the present application is to provide a task queue generation solution that can improve crawling efficiency.
  • the embodiments of this specification provide a task queue generation method, which includes: obtaining a plurality of tasks to be processed, wherein any task to be processed contains a uniform resource locator (URL); for any task to be processed, determining the host name in the URL it contains and calculating the hash value of the host name; determining, based on a predefined array of length N, the remainder of the hash value of the host name with respect to the length N; allocating the task to be processed to the array element corresponding to the remainder in the array, wherein the array elements are used to store tasks to be allocated; and taking one task to be processed out of each array element of the array and adding the N extracted tasks to the end of the current task queue to generate a new task queue.
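  • To make this flow concrete, a minimal Python sketch is given below. It is an illustration only, not the patented implementation: the task shape (a dict with a "url" field), the choice of CRC-32 as the hash, and the array length N = 8 are all assumptions.

```python
import binascii
from collections import deque
from urllib.parse import urlsplit

N = 8  # illustrative array length; the embodiments leave N to be set per actual needs

def bucket_index(url, n=N):
    """Hash the host name in a URL and take the remainder modulo the array length N."""
    host = urlsplit(url).hostname or ""
    digest = binascii.crc32(host.encode("utf-8"))  # 32-bit unsigned hash value
    return digest % n

def extend_queue(pending, queue, n=N):
    """Distribute pending tasks into N array elements by host-name hash, then
    append one task from each non-empty element to the end of the current queue."""
    buckets = [[] for _ in range(n)]
    for task in pending:
        buckets[bucket_index(task["url"], n)].append(task)
    for bucket in buckets:
        if bucket:
            queue.append(bucket.pop(0))  # delete from the element to avoid duplicates
    return queue
```

Tasks with the same host name always land in the same array element, so each pass contributes at most one task per host to the queue, which is what spaces same-host accesses apart.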
  • the embodiments of this specification also provide a method for allocating tasks based on the aforementioned task queue, including: for any task processing node, obtaining the weight of the task processing node; taking out tasks to be processed in sequence from the other end of the task queue, and allocating the extracted tasks to each task processing node according to the weight of each task processing node, where the weight of a task processing node is positively correlated with the probability of being assigned a task.
  • an embodiment of this specification also provides a task queue generation device, including: an acquisition module, which acquires a plurality of tasks to be processed, wherein any task to be processed contains a uniform resource locator (URL); a calculation module, which, for any task to be processed, determines the host name in the URL it contains and calculates the hash value of the host name; a determination module, which, based on a predefined array of length N, determines the remainder of the hash value of the host name with respect to the length N; an allocation module, which allocates the task to be processed to the array element corresponding to the remainder in the array, wherein the array elements are used to store tasks to be allocated; and an adding module, which takes one task to be processed out of each array element of the array and adds the N extracted tasks to the end of the current task queue to generate a new task queue.
  • the embodiments of this specification also provide a task allocation device based on the aforementioned task queue, including: a weight acquisition module, which, for any task processing node, acquires the weight of the task processing node; and an allocation module, which takes out tasks to be processed in sequence from the other end of the task queue and allocates them to each task processing node according to the weight of each task processing node, where the weight of a task processing node is positively correlated with the probability of being assigned a task.
  • the tasks that need to be crawled are hashed according to the hosts they need to access and allocated into an N-dimensional array of predefined length; then one pending task is taken from each element in the array and placed at the end of the task queue, thereby forming a new task queue.
  • tasks containing URLs of the same host can be effectively spaced apart, avoiding high-frequency access to the same host, and improving crawling efficiency.
  • when multiple nodes process the aforementioned task queue, the task queue can also be allocated based on the weight of each node, improving the efficiency with which the task queue is processed while ensuring that the same node is not accessed at high frequency.
  • FIG. 1 is a schematic flowchart of a task queue generation method provided by an embodiment of this specification
  • FIG. 2 is a logical schematic diagram of an array provided by an embodiment of the specification
  • FIG. 3 is a schematic diagram of canonicalization preprocessing provided by an embodiment of this specification
  • FIG. 4 is a schematic diagram of a task queue allocation method provided by an embodiment of this specification
  • FIG. 5 is a schematic structural diagram of a task queue generating device provided by an embodiment of this specification.
  • FIG. 6 is a schematic structural diagram of a task queue allocation device provided by an embodiment of this specification.
  • FIG. 7 is a schematic structural diagram of a device for configuring the method of the embodiments of this specification.
  • the application of crawler technology is currently very extensive.
  • for example, in a blockchain-based forensics service, users deposit their own works as evidence in the blockchain network, and the server can automatically monitor the deposited works on the network and collect evidence.
  • the result of monitoring is a suspected infringing work, found on the public network, that is similar to a work whose evidence has already been deposited.
  • the method of collecting evidence is to save the monitoring result locally, form evidence, and upload it to the blockchain network for future proof.
  • the embodiments of this specification provide a task queue generation method in which crawling is performed based on the task queue generated by this solution before the tasks to be processed reach the crawling nodes, which can avoid high-frequency access to the same host and improve crawling efficiency.
  • FIG. 1 is a schematic flowchart of a task queue generation method provided by an embodiment of this specification, including:
  • S101 Acquire multiple tasks to be processed, where any task to be processed includes a uniform resource locator URL.
  • when the server invokes the blockchain forensics service, it first needs to create a forensic task (that is, a task to be processed) and then send the task to be processed to the soft-chain machine system, and the soft-chain machine system assigns the corresponding task to a crawler node (there may be one crawler node or multiple). After the crawling task is completed, the crawling result is returned to the soft-chain machine system, and the chaining process is then completed.
  • a task to be processed needs to include a clear Uniform Resource Locator (URL).
  • the URL of each to-be-processed task may be the same or different. For example, among 5 tasks to be processed, it is possible that the URLs contained in tasks 1 to 4 are all the same "URL1", and the URL contained in task 5 is "URL2".
  • S103 For any task to be processed, determine the host name in the URL contained therein, and calculate the hash value of the host name.
  • the host name is used to locate the server that provides the resource specified by the URL. For example, in the URL "http://www.alipay.com/a/b", "alipay.com" is the host name.
  • the host name specifies which server the crawling node should access when obtaining the resource pointed to by the URL.
  • the hash value corresponding to the host name can be calculated.
  • Commonly used hash algorithms can include md5, sha256, CRC-32, and so on.
  • the hash value obtained based on actual needs can usually be a long unsigned integer.
  • the hash value is a 32-bit unsigned integer.
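  • As a hedged illustration of this step, CRC-32 (one of the algorithms named above) can be computed over the UTF-8 bytes of the host name; Python's binascii.crc32 already returns an unsigned 32-bit value:

```python
import binascii

def host_hash(hostname):
    """CRC-32 over the UTF-8 bytes of the host name; binascii.crc32 returns
    an unsigned 32-bit integer on Python 3."""
    return binascii.crc32(hostname.encode("utf-8"))
```

The same host name always yields the same hash value, which is what lets later steps route same-host tasks to the same array element.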
  • the URLs contained in different tasks to be processed can be different or the same. It follows that the host names included in the tasks to be processed can also be the same or different.
  • S105 Determine the remainder of the hash value of the host name to the length N based on a predefined array of length N.
  • the length of the array N means that the array contains N array elements, and the specific value of N can be set based on actual needs.
  • Each element in the array contains a corresponding index identifier.
  • the index identification may be a natural number from 0 to N-1.
  • FIG. 2 is a logical schematic diagram of an array provided by an embodiment of the specification.
  • the hash value of the host name contained in each task to be processed has already been determined in the preceding step. Therefore, at this point, the remainder (modulo) of that hash value with respect to N can be computed for each task to be processed, that is, the remainder of the hash value of the host name with respect to the length N.
  • S107 Allocate the task to be processed to an array element corresponding to the remainder in the array, where the array element is used to store the task to be allocated.
  • the task to be processed can be allocated to the array element corresponding to the remainder, based on the correspondence between the index identifiers and the remainder.
  • for example, if the index identifiers of the array elements are the natural numbers from 0 to N-1 and the remainder of the hash value of the host name of a task to be processed with respect to N is "1", the task is allocated to array element No. 1.
  • S109 Take out one task to be processed from each array element of the array, and add the N extracted tasks to the end of the current task queue to generate a new task queue.
  • Tasks taken out of the array elements should be deleted from the array elements to avoid duplicate tasks in the queue.
  • adding the N pending tasks to the end of the task queue ensures that, within any short stretch of N tasks in the newly generated queue, a task with the same host name is usually encountered only once.
  • the end can be the head or the tail.
  • the "ends" should be consistent when adding pending tasks, that is, all are added to the head or all are added to the tail.
  • tasks should be extracted for processing in order from the other end toward the end; that is, tasks close to the end are always processed later, and tasks close to the other end are processed first.
  • if there is no current task queue, the N extracted pending tasks can be directly used as a new task queue.
  • the internal order of the newly taken out N tasks to be processed may be a random order or an order based on the index identification, which is not limited here.
  • tasks can then be taken out of the task queue in sequence from the other end, and, based on the URL of each task to be assigned in the queue, the resource can be accessed in turn at the server specified by the host name contained in each task.
  • the tasks that need to be crawled are hashed according to the host names they need to access, and they are allocated to an N-dimensional array with a predefined length, and then each element in the array is separately Take out a task to be processed and put it at the end of the task queue to form a new task queue.
  • tasks containing URLs with the same host name can be effectively separated, avoiding high-frequency access to the same host, and improving crawling efficiency.
  • the server can perform automatic infringement monitoring on works whose evidence has been deposited in the blockchain. If a suspected infringing work is found, its URL is determined, the forensics service is called based on the URL, and resource crawling is performed on the pages containing the suspected infringing work. In practice, because suspected infringing works are often concentrated on a few websites, if the forensic tasks are not handled, the forensics service may need to visit the same site frequently. Based on the task queue generation method of the embodiments of this specification, crawling tasks for the same site can be staggered before evidence is crawled automatically, which effectively avoids high-frequency forensics on the same site and can effectively improve the efficiency and quality of forensics.
  • the URL contained in the task to be processed may be manually designated, or may be obtained by the server during the monitoring process.
  • the server may obtain the URL contained in the task to be processed.
  • the URLs entered by some users are capitalized, such as "HTTP://WWW.ALIPAY.com"; as another example, some URLs also contain extra default suffixes, default values, characters representing empty query strings, and so on.
  • FIG. 3 is a schematic diagram of a standardized preprocessing provided by an embodiment of this specification.
  • in FIG. 3, the URL before the arrow is the URL before canonicalization, and the URL after the arrow is the URL after canonicalization. Canonicalization can further reduce the chance that tasks containing the same host name are assigned to nearby positions in the task queue.
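  • A minimal canonicalization sketch along these lines is shown below. The exact normalization rules are not spelled out in the text, so the steps chosen here, lowercasing the scheme and host, dropping default ports, defaulting an empty path to "/", and letting an empty query string fall away, are illustrative assumptions:

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url):
    """Normalize a URL so that equivalent spellings of the same resource
    produce the same string (and thus the same host-name hash)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep only non-default ports; the default port for the scheme is dropped.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # urlunsplit omits the "?" when the query string is empty.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

With this sketch, "HTTP://WWW.ALIPAY.com" and "http://www.alipay.com/" canonicalize to the same string, so both would hash into the same array element.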
  • each task to be processed can also be classified.
  • each task to be processed carries a hierarchical number, "1" represents a priority task, and "0" represents a normal task.
  • the priority tasks can be further extracted, removed from the end, and added to the other end of the current task queue. As mentioned earlier, because the task queue is always processed starting from the other end, priority tasks will in this way always be processed first.
  • if the extracted N tasks contain multiple priority levels (for example, "2" represents the highest-priority task), the priority tasks of different levels can also be placed in sequence at the other end of the current task queue according to their priority classification numbers, with higher-level tasks added to the task queue later.
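  • One possible reading of this priority handling, sketched with a double-ended queue in which normal tasks are appended at the tail (the "end") and priority tasks are inserted at the head (the "other end"), higher levels inserted later so they sit frontmost; the "level" field is an assumed task attribute:

```python
from collections import deque

def merge_batch(queue, batch):
    """Append normal tasks (level 0) at the tail, then move priority tasks
    (level > 0) to the head; higher levels are inserted later so they end
    up closest to the processing end."""
    normal = [t for t in batch if t["level"] == 0]
    priority = sorted((t for t in batch if t["level"] > 0), key=lambda t: t["level"])
    queue.extend(normal)        # the "end" (tail)
    for task in priority:       # ascending level
        queue.appendleft(task)  # the "other end" (head), processed first
    return queue
```

Because processing always starts from the head, the highest-level task in the batch is handled first, then lower priority levels, then the ordinary queue.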
  • the "other end” is the other end opposite to the aforementioned "end”.
  • if the number of tasks to be processed in the newly generated task queue exceeds a preset threshold (the specific value can be determined according to the efficiency and performance of the crawling nodes), the acquisition of new tasks to be processed can also be suspended, so as to prevent the task queue from becoming too long to process conveniently.
  • the classification identifier corresponding to the index identifier of the array element can also be determined, where classification identifiers correspond to index identifiers one-to-one.
  • the classification identifier is added or bound to the task to be processed, and the task to be processed including the classification identifier is generated.
  • the classification identifier may simply be the index identifier. For example, assuming the index identifier of an array element is 3, "3" can be added at a specified position in the data of each task to be assigned in that array element as a classification identifier, so that other related devices can recognize from the classification identifier that the task comes from array element No. 3, which is convenient for subsequent task statistics.
  • FIG. 4 is an allocation method based on the foregoing task queue provided by an embodiment of this specification, including:
  • the weight of a task processing node is related to the historical task data of the task node.
  • the historical task data refers to the data generated by the task processing node after receiving the task to be assigned and performing the crawling process until the task succeeds or fails.
  • S403 Take out the tasks to be processed in sequence from the other end of the task queue, and assign the extracted tasks to each task processing node according to the weight of each task processing node, where the weight of a task processing node is positively correlated with the probability of being assigned a task.
  • a weight-based random algorithm can be used to select task processing nodes: the greater a node's weight, the greater the probability that a task will be assigned to it, rather than tasks being assigned directly to the most heavily weighted node. This avoids all tasks within a period being assigned to heavily weighted task processing nodes while nodes with lower weights sit idle.
  • for example, one can calculate the total weight S of all task allocation nodes and then, for a node with weight W, compute the ratio W/S, so that during task allocation the probability of assigning a task to that node is kept consistent with this ratio.
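  • A sketch of such weight-proportional selection is given below; Python's random.choices implements exactly this W/S behavior, and the node structure (a dict with "name" and "weight") is an assumption:

```python
import random

def pick_node(nodes):
    """Choose a task processing node with probability proportional to its
    weight, i.e. W / S where S is the total weight of all nodes."""
    weights = [n["weight"] for n in nodes]
    return random.choices(nodes, weights=weights, k=1)[0]
```

A node with weight 0 is never selected, and a node holding three quarters of the total weight receives roughly three quarters of the tasks over time, without ever monopolizing any single assignment.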
  • allocating the task queue based on the weight of each node can thus improve the efficiency of task queue processing while ensuring that the same node is not accessed frequently.
  • when tasks carry classification identifiers, the calculation of the weight of any task processing node should also be performed per category; that is, the weights of a task processing node are calculated separately for each category of task.
  • if the categories of two pending tasks are different, the hosts they need to access must be different; and if the categories of the two pending tasks are the same, the hosts they need to access may be the same. In other words, by setting category identifiers and category weights, the frequency with which the same task processing node accesses the same host is further reduced, the probability of being blocked is further lowered, and crawling efficiency is improved.
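  • As an illustrative reading of this per-category weighting, the inverse-count decay below is an assumption; the specification only states that a node's weight for type-i tasks is negatively related to the number of type-i tasks that node is currently processing:

```python
def category_weight(base_weight, in_flight_i):
    """Weight of a node for tasks of type i, shrinking as more type-i tasks
    are already being processed on that node (assumed 1 / (1 + count) decay)."""
    return base_weight / (1 + in_flight_i)
```

Under this sketch, a node already busy with many type-i tasks becomes progressively less likely to receive more of them, spreading same-category (and therefore potentially same-host) tasks across nodes.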
  • an embodiment of this specification also provides a task queue generating device, as shown in FIG. 5, which is a schematic structural diagram of a task queue generating device provided by an embodiment of this specification, including:
  • the obtaining module 501, which obtains a plurality of tasks to be processed, wherein any task to be processed contains a uniform resource locator (URL); the calculation module 503, which, for any task to be processed, determines the host name in the URL it contains and calculates the hash value of the host name;
  • the determining module 505 determines the remainder of the hash value of the host name to the length N based on a predefined array of length N;
  • the allocation module 507, which allocates the task to be processed to the array element corresponding to the remainder in the array, where the array elements are used to store tasks to be allocated;
  • the adding module 509 extracts one task to be processed from each array element of the array, and adds the extracted N tasks to be processed to the end of the current task queue to generate a new task queue.
  • the obtaining module 501 performs canonicalization preprocessing on the URL and obtains the task to be processed containing the canonicalized URL.
  • the adding module 509 determines the priority level of each task in the N to-be-processed tasks, removes the to-be-processed task with a high priority level from the end, and adds it to the other end of the current task queue.
  • the device further includes a pause module 511, which suspends the acquisition of new tasks to be processed if the number of tasks to be processed in the new task queue exceeds a preset value.
  • the allocation module 507 determines the classification identifier corresponding to the index identifier of the array element and generates a task to be processed containing the classification identifier, where classification identifiers correspond to index identifiers one-to-one; the tasks to be processed carrying classification identifiers are allocated to the array elements corresponding to the remainder in the array.
  • the embodiments of this specification also provide an allocation device based on the aforementioned task queue, as shown in FIG. 6, which is a schematic structural diagram of the allocation device based on the aforementioned task queue provided by an embodiment of this specification, including:
  • the weight obtaining module 601, which, for any task processing node, obtains the weight of the task processing node;
  • the allocation module 603, which takes out the tasks to be processed in sequence from the other end of the task queue and allocates them to each task processing node according to the weight of each task processing node, where the weight of a task processing node is positively correlated with the probability of being assigned a task.
  • in some embodiments, the weight obtaining module 601 obtains the weight of the task processing node for each type of task, wherein the weight of a task processing node for tasks of type i is negatively related to the number of tasks of type i currently being processed by that node, N ≥ i ≥ 1; the allocation module 603 assigns the tasks of type i to each task processing node according to the weight of each task processing node for tasks of type i.
  • the embodiments of this specification also provide a computer device, which at least includes a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the program, implements the task queue generation method shown in FIG. 1.
  • the embodiments of this specification also provide a computer device, which at least includes a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the program, implements the task queue allocation method shown in FIG. 4.
  • FIG. 7 shows a more specific hardware structure diagram of a computing device provided by an embodiment of this specification.
  • the device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050.
  • the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 realize the communication connection between each other in the device through the bus 1050.
  • the processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute related programs to implement the technical solutions provided in the embodiments of this specification.
  • the memory 1020 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, etc.
  • the memory 1020 may store an operating system and other application programs. When the technical solutions provided in the embodiments of this specification are implemented by software or firmware, related program codes are stored in the memory 1020 and called and executed by the processor 1010.
  • the input/output interface 1030 is used to connect an input/output module to realize information input and output.
  • the input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and an output device may include a display, a speaker, a vibrator, an indicator light, and the like.
  • the communication interface 1040 is used to connect a communication module (not shown in the figure) to realize the communication interaction between the device and other devices.
  • the communication module can realize communication through wired means (such as USB, network cable, etc.), or through wireless means (such as mobile network, WIFI, Bluetooth, etc.).
  • the bus 1050 includes a path to transmit information between various components of the device (for example, the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040).
  • although the above device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050, in a specific implementation the device may also include other components necessary for normal operation.
  • the above-mentioned device may also include only the components necessary to implement the solutions of the embodiments of the present specification, and not necessarily include all the components shown in the figures.
  • the embodiment of this specification also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the task queue generating method shown in FIG. 1 is implemented.
  • the embodiment of this specification also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for allocating the task queue shown in FIG. 4 is implemented.
  • Computer-readable media includes permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer readable instructions, data structures, program modules, or other data.
  • examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
  • a typical implementation device is a computer.
  • the specific form of the computer can be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

Abstract

A task queue generation method, apparatus, and device. Pending tasks that need to be crawled are hashed according to the hosts they need to access and allocated into an N-dimensional array of predefined length; then one pending task is taken from each array element in the array and placed at the end of the task queue, thereby forming a new task queue.

Description

Task Queue Generation

Technical Field

The embodiments of this specification relate to the field of information technology, and in particular to task queue generation.

Background

Tasks such as batch evidence collection and batch downloading frequently rely on web-crawler technology, for example using a crawler node to crawl and download material from a page related to a piece of evidence. During batch crawling, if the tasks are left unprocessed, a single crawler node often ends up accessing one site so frequently that the node is blocked, which interferes with crawling the material and lowers crawling efficiency. On this basis, a task queue generation scheme that can improve crawling efficiency is needed.
Summary

The purpose of the embodiments of this application is to provide a task queue generation scheme that can improve crawling efficiency.

To solve the above technical problem, the embodiments of this application are implemented as follows.

In one aspect, an embodiment of this specification provides a task queue generation method, including: obtaining multiple pending tasks, where each pending task contains a uniform resource locator (URL); for any pending task, determining the hostname in the URL it contains and computing a hash value of the hostname; based on a predefined array of length N, determining the remainder of the hostname's hash value modulo the length N; assigning the pending task to the array element corresponding to that remainder, where the array elements are used to store tasks awaiting assignment; and taking one pending task from each array element, adding the N retrieved pending tasks to one end of the current task queue, and generating a new task queue.

In another aspect, an embodiment of this specification further provides a method for allocating the foregoing task queue, including: for any task processing node, obtaining the weight of that task processing node; and taking pending tasks in order from the other end of the task queue and allocating the retrieved tasks to the task processing nodes according to each node's weight, where a task processing node's weight is positively correlated with its probability of being allocated a task.

Corresponding to the first aspect, an embodiment of this specification further provides a task queue generation apparatus, including: an obtaining module that obtains multiple pending tasks, where each pending task contains a uniform resource locator (URL); a computing module that, for any pending task, determines the hostname in the URL it contains and computes a hash value of the hostname; a determining module that, based on a predefined array of length N, determines the remainder of the hostname's hash value modulo the length N; an assigning module that assigns the pending task to the array element corresponding to that remainder, where the array elements are used to store tasks awaiting assignment; and an adding module that takes one pending task from each array element, adds the N retrieved pending tasks to one end of the current task queue, and generates a new task queue.

Corresponding to the other aspect, an embodiment of this specification further provides an apparatus for allocating the foregoing task queue, including: a weight obtaining module that, for any task processing node, obtains the weight of that task processing node; and an allocating module that takes pending tasks in order from the other end of the task queue and allocates the retrieved tasks to the task processing nodes according to each node's weight, where a task processing node's weight is positively correlated with its probability of being allocated a task.

With the scheme provided by the embodiments of this specification, the tasks to be crawled are hashed according to the host each needs to access and distributed into an array of predefined length N; one pending task is then taken from each element of the array and placed at one end of the task queue, forming a new task queue. In this way, tasks whose URLs contain the same host are effectively spaced apart within the batch-crawling task queue, high-frequency access to any single host is avoided, and crawling efficiency is improved.

Furthermore, when multiple nodes process the foregoing task queue, the queue can also be allocated according to the weight of each node, which improves the throughput of the task queue while still ensuring that the same host is not accessed at high frequency.

It should be understood that the general description above and the detailed description below are merely exemplary and explanatory, and do not limit the embodiments of this specification. Moreover, no single embodiment of this specification needs to achieve all of the above effects.
Brief Description of the Drawings

To explain the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some of the embodiments recorded in this specification; a person of ordinary skill in the art could derive other drawings from them.

FIG. 1 is a schematic flowchart of a task queue generation method provided by an embodiment of this specification;

FIG. 2 is a logical diagram of an array provided by an embodiment of this specification;

FIG. 3 is a schematic diagram of normalization preprocessing provided by an embodiment of this specification;

FIG. 4 is a schematic flowchart of a task queue allocation method provided by an embodiment of this specification;

FIG. 5 is a schematic structural diagram of a task queue generation apparatus provided by an embodiment of this specification;

FIG. 6 is a schematic structural diagram of a task queue allocation apparatus provided by an embodiment of this specification;

FIG. 7 is a schematic structural diagram of a device for implementing the methods of the embodiments of this specification.
Detailed Description

To help those skilled in the art better understand the technical solutions in the embodiments of this specification, those solutions are described in detail below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of this specification; all other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein shall fall within the scope of protection.

Crawler technology is currently applied very widely. For example, in a blockchain-based evidence collection service, a user deposits a work as evidence in a blockchain network, and the server side can automatically monitor the network for the deposited work and collect evidence. The monitoring result is a suspected infringing work found on the public network that resembles the deposited work; evidence collection means saving the monitored result locally, forming evidence, and uploading it to the blockchain network for later proof.

In practice, because one work may be monitored over a long period, a very large number of works are under monitoring at any time, which in turn produces a very large number of evidence collection tasks that usually must be processed in batches. This easily causes an evidence collection node to access the same site at high frequency and get blocked. Current countermeasures generally optimize the request headers, set crawl intervals, or use virtual IP addresses, but there has been no scheme that processes the tasks before they reach the crawler node so as to avoid blocking and thereby improve efficiency.

On this basis, the embodiments of this specification provide a task queue generation method that is applied before pending tasks reach the crawler node. Crawling according to a task queue generated by this scheme avoids high-frequency access to the same host and improves crawling efficiency.
As shown in FIG. 1, FIG. 1 is a schematic flowchart of a task queue generation method provided by an embodiment of this specification, including the following steps.

S101: obtain multiple pending tasks, where each pending task contains a uniform resource locator (URL).

As mentioned above, batch crawling tasks frequently arise in the blockchain evidence collection scenario. Specifically, each time the server side invokes the blockchain evidence collection service, it first creates an evidence collection task (that is, a pending task) and sends it to the soft chaining system, which assigns the task to one or more crawler nodes; when crawling finishes, the crawling result is returned to the soft chaining system and the on-chain recording process is completed.

In practice, a pending task must contain an explicit uniform resource locator (URL). For example, "http://www.alipay.com/a/b" is a URL that specifies the concrete resource for a crawling task.

Among the multiple pending tasks, the URLs of the individual tasks may be identical or different. For example, among 5 pending tasks, tasks 1 through 4 might all contain the same "URL1", while task 5 contains "URL2".
S103: for any pending task, determine the hostname in the URL it contains and compute a hash value of the hostname.

The hostname locates the server that provides the resource specified by the URL. For example, in the URL "http://www.alipay.com/a/b", "alipay.com" is the hostname; it specifies which server the crawler node should access when fetching the resource the URL points to.

A hash value corresponding to the hostname can then be computed. Common hash algorithms include md5, sha256, CRC-32, and so on. Depending on practical needs, the hash value is usually a long unsigned integer, for example a 32-bit unsigned integer.
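The hostname extraction and hashing step described above can be sketched as follows. This is a minimal illustration using Python's standard library; the choice of md5 and the truncation to 32 bits are assumptions, since the text only lists md5, sha256, and CRC-32 as candidates. Note that `urlsplit` reports the full host (`www.alipay.com`); whether a `www` prefix is stripped first is a design choice not specified here.

```python
import hashlib
from urllib.parse import urlsplit

def host_hash(url):
    """Return a 32-bit unsigned integer hash of the URL's hostname."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Take the first 4 bytes of the digest as a 32-bit unsigned integer.
    return int.from_bytes(digest[:4], "big")

h = host_hash("http://www.alipay.com/a/b")
assert 0 <= h < 2**32
```

Two tasks whose URLs name the same host always produce the same hash value, which is the property the assignment step below relies on.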
As mentioned above, the URLs contained in different pending tasks may be identical or different; it follows that the hostnames contained in the pending tasks may likewise be identical or different.
S105: based on a predefined array of length N, determine the remainder of the hostname's hash value modulo the length N.

The length N means that the array contains N array elements; the specific value of N can be set according to actual needs.

Each element of the array has a corresponding index identifier. For example, the index identifiers may be the natural numbers 0 through N-1. Alternatively, they may be N labels in one-to-one correspondence with the array elements; for example, when N=3, labels such as "A", "B", and "C" can serve as the index identifiers.

Each array element is a linked list or a sub-array, used to hold tasks awaiting assignment that have not yet been added to the queue. As shown in FIG. 2, FIG. 2 is a logical diagram of such an array provided by an embodiment of this specification.

For a given pending task, the hash value of the hostname it contains has already been determined above, so a remainder or modulo operation on that hash value can now compute the remainder of the hostname's hash value modulo the length N.
S107: assign the pending task to the array element in the array corresponding to the remainder, where the array elements are used to store tasks awaiting assignment.

As described above, each array element has a corresponding index identifier, so the pending task can be assigned to the array element corresponding to the remainder based on the correspondence between index identifiers and remainders.

For example, if the index identifiers are the natural numbers 0 through N-1 and the remainder of a pending task's hostname hash modulo N is "1", the task is assigned to array element 1.

It is easy to see that under this assignment scheme, tasks awaiting assignment that contain the same hostname are always assigned to the same array element. At the same time, the tasks held within a single array element may contain a variety of hostnames, as shown in FIG. 2.
S109: take one pending task from each element of the array, add the N retrieved pending tasks to one end of the current task queue, and generate a new task queue.

A task that has been taken out of an array element should be deleted from that element, to avoid duplicate tasks appearing in the queue.

As described above, tasks awaiting assignment that contain the same hostname are always assigned to the same array element. Therefore, taking one pending task from each array element guarantees that no two of the N pending tasks in this batch contain the same hostname.

By adding the N retrieved pending tasks to one end of the queue in this way, the newly generated task queue guarantees that, in the short term, while N tasks are processed, a task containing any given hostname is usually encountered only once. The "end" here may be the head or the tail of the queue; within a single task queue it must be consistent, i.e. tasks are always added to the head, or always to the tail.

Specifically, when the task queue is processed, tasks should be extracted in order from the other end toward this end; that is, tasks near this end are always processed later, and tasks near the other end are processed first.

Of course, in practice there may not yet be a current task queue, in which case the N retrieved pending tasks can directly serve as the new task queue. Moreover, the internal order of the N newly retrieved tasks may be random or follow the order of the index identifiers; no restriction is placed here.
With a task queue generated in the foregoing way, if there is only one crawler node, tasks can be extracted from the queue in order starting from the other end, and based on each task's URL the hosts named in the tasks can be accessed in sequence.

With the scheme provided by the embodiments of this specification, the tasks to be crawled are hashed according to the hostname each needs to access and distributed into an array of predefined length N; one pending task is then taken from each element of the array and placed at one end of the task queue, forming a new task queue. In this way, tasks whose URLs contain the same hostname are effectively spaced apart within the batch-crawling task queue, high-frequency access to any single host is avoided, and crawling efficiency is improved.
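The whole flow (hash the hostname, take the remainder modulo N, bucket the task, then drain one task per bucket into the queue) can be sketched as below. This is a minimal illustration under assumed choices: CRC-32 as the hash, plain Python lists as the array elements, and a single drain round (in a real system, leftover tasks would stay in the buckets for the next round). The value N=4 is arbitrary.

```python
import zlib
from collections import deque
from urllib.parse import urlsplit

N = 4  # predefined array length, chosen per deployment

def bucket_index(url):
    """Remainder of the hostname's hash modulo N selects the array element."""
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % N

def enqueue_batch(urls, queue):
    buckets = [[] for _ in range(N)]  # array element = list of pending tasks
    for url in urls:
        buckets[bucket_index(url)].append(url)
    # Take one pending task from each non-empty array element and append it
    # to the tail (the "end") of the current queue; the task is removed from
    # the element so it cannot be enqueued twice.
    for bucket in buckets:
        if bucket:
            queue.append(bucket.pop(0))
    return queue

q = enqueue_batch(
    ["http://a.example/1", "http://a.example/2", "http://b.example/1"],
    deque(),
)
```

Consumers would call `q.popleft()`, i.e. drain from the other end, so earlier-enqueued tasks run first; within each batch at most one task per hostname is present.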
For example, in the blockchain deposit and evidence collection scenario described above, the server side can automatically monitor works already deposited in the blockchain for infringement; when a suspected infringing work is found, it determines the URL, invokes the evidence collection service, and crawls the resources of the page hosting the suspected infringing work based on the URL. In practice, suspected infringing works tend to be concentrated on a handful of websites, so if the evidence collection tasks are left unprocessed, the evidence collection service may need to access the same site at high frequency. With the task queue generation scheme of the embodiments of this specification, crawling tasks targeting the same site are staggered before the evidence is automatically crawled, which effectively avoids high-frequency evidence collection from a single site and effectively improves both the efficiency and the quality of evidence collection.
In one implementation, because the URL contained in a pending task may have been specified manually or obtained by the server side during monitoring, the URLs follow no uniform convention, which easily causes subsequent processing to fail.

For example, some user-entered URLs are upper-case, such as "HTTP://WWW.ALIPAY.com"; other URLs contain redundant parts such as default suffixes, default values, or characters representing an empty query string.

If such URLs are used directly, on one hand the irregular form easily causes subsequent crawling to fail; on the other hand, the hostname extracted from an irregular URL also disturbs the subsequent assignment.

For example, for the hostnames "ALIPAY.com" and "alipay.com", the difference in letter case yields different hash values, so tasks containing "ALIPAY.com" and "alipay.com" are very likely not assigned to the same array element. When one task is then taken from each of the N array elements, the batch may contain the same hostname "alipay.com" more than once, which obviously raises the access frequency to that host in the newly generated task queue.

It is therefore necessary to apply normalization preprocessing to the URLs and obtain pending tasks containing normalized URLs. The normalization specifically includes lower-casing the URL scheme and hostname, and deleting redundant, ineffective parts of the URL, for example default suffixes, default values, and characters representing an empty query string. As shown in FIG. 3, FIG. 3 is a schematic diagram of normalization preprocessing provided by an embodiment of this specification; in that figure, the URL before the arrow is the un-normalized form and the URL after it is the normalized form. Normalization further reduces the chance that tasks containing the same hostname end up at nearby positions in the task queue.
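A normalization preprocessing step of this kind might look like the following sketch. The exact rule set (which suffixes, values, and query forms count as redundant) is not specified here, so this version implements only the rules explicitly mentioned: lower-case the scheme and host, and drop default ports, empty query strings, and fragments; a production rule set would likely differ.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # hostname lower-cased
    # Keep the port only when it is not the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"                # empty path becomes "/"
    query = parts.query                     # an empty query string is omitted
    # Fragments are dropped: they are never sent to the server anyway.
    return urlunsplit((scheme, host, path, query, ""))

assert normalize("HTTP://WWW.ALIPAY.com/a/b") == "http://www.alipay.com/a/b"
```

After this step, "ALIPAY.com" and "alipay.com" hash identically, so tasks for the same host land in the same array element as intended.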
In one implementation, the pending tasks can additionally be ranked. For example, each pending task carries a rank number, where "1" denotes a priority task and "0" an ordinary task. After the N tasks have been added to one end of the current task queue, the priority tasks can be further extracted, removed from that end, and added to the other end of the current task queue. As described above, processing of the task queue always starts from the other end, so under this arrangement priority tasks are always processed first.

Further, if the batch of N retrieved tasks contains multiple priority levels, for example with "2" denoting a highest-priority task, the priority tasks can be placed at the other end of the current task queue level by level according to their rank numbers, with higher-level tasks added later. The "other end" is simply the end opposite the "end" mentioned above.
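Using a double-ended queue, the priority step above can be sketched as follows. The `(level, url)` tuple representation is an assumption for illustration; the key point is that higher-level tasks are pushed onto the consuming end later, so they end up closest to it and are processed first.

```python
from collections import deque

def apply_priority(queue, batch):
    """Append a batch: ordinary tasks (level 0) go to the tail, priority
    tasks (level > 0) go to the head, highest level pushed last so it
    lands frontmost."""
    normal = [t for t in batch if t[0] == 0]
    priority = sorted((t for t in batch if t[0] > 0), key=lambda t: t[0])
    queue.extend(normal)          # ordinary tasks join the tail ("end")
    for task in priority:         # higher levels are pushed later...
        queue.appendleft(task)    # ...so they sit nearer the head
    return queue

q = apply_priority(deque(), [(0, "u1"), (1, "u2"), (2, "u3"), (0, "u4")])
# Consuming with q.popleft() yields the level-2 task first, then level 1.
```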
In one implementation, if the number of pending tasks in the newly generated task queue exceeds a preset threshold (the specific value can depend on the efficiency and performance of the crawler nodes), the acquisition of new pending tasks can be paused, keeping the task queue from growing too long to handle.

In one implementation, when a pending task is assigned to an array element, a classification identifier corresponding to that element's index identifier can also be determined, where classification identifiers and index identifiers are in one-to-one correspondence. The classification identifier is then added to or bound with the pending task, generating a pending task that contains the classification identifier. In one implementation, the classification identifier may simply be the index identifier; for example, if an array element's index identifier is 3, then "3" can be added at a designated position in the data of every task assigned to that element. Other related devices can then recognize the identifier, i.e. that the task came from array element 3, which facilitates subsequent task statistics.
With the task queue obtained as above, actual processing may use a single crawler node that extracts tasks in order from the other end of the queue. In practice, multiple task processing nodes (i.e., crawler nodes) may also extract tasks in order from the other end of the queue, with the tasks allocated to the individual task processing nodes. As shown in FIG. 4, FIG. 4 illustrates a method for allocating the foregoing task queue provided by an embodiment of this specification, including the following steps.

S401: for any task processing node, obtain the weight of that task processing node.

The weight of a task processing node is related to the node's historical task data, which refers to the data produced by the node from receiving an assigned task, through executing the crawl, until the task succeeds or fails.

Performance analysis based on a node's historical task data yields indicators such as success rate, average response time, and failure causes. For example, based on each node's currently available thread count, success rate, and average response time, the weight can be set by the formula W = l*a + s*b + t*g, where W is the weight, l is the node's currently available thread count (obtained in real time), s is the success rate, t is the average response time, and a, b, and g are tunable parameters.
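The weight formula just quoted can be expressed directly in code. The parameter values below are illustrative only; in particular, making g negative (so a slower average response time lowers the weight) is an assumption about how the tunable parameters would typically be chosen, not something the text states.

```python
def node_weight(free_threads, success_rate, avg_response_s,
                a=1.0, b=10.0, g=-2.0):
    """W = l*a + s*b + t*g, with illustrative tunable parameters a, b, g."""
    return free_threads * a + success_rate * b + avg_response_s * g

# A node with 8 free threads, 95% success rate, 1.2 s average response:
w = node_weight(free_threads=8, success_rate=0.95, avg_response_s=1.2)
```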
S403: take pending tasks in order from the other end of the task queue and allocate the retrieved tasks to the task processing nodes according to each node's weight, where a task processing node's weight is positively correlated with its probability of being allocated a task.

Specifically, a weight-based random algorithm can be used to select the task processing node: the larger a node's weight, the higher its probability of being allocated a task. Tasks are not handed directly to the high-weight nodes, which prevents all tasks over a period of time from going to the high-weight task processing nodes while low-weight nodes sit idle.

For example, compute the sum S of all node weights, then for each node compute the ratio of its weight W to S, and keep the probability of allocating a task to that node consistent with the ratio.
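A weight-proportional random selection of this kind can be sketched as below: a node with weight W is chosen with probability W/S, where S is the sum of all weights.

```python
import random

def pick_node(weights, rng=random):
    """Pick a node key with probability proportional to its weight."""
    total = sum(weights.values())          # S, the sum of all node weights
    r = rng.uniform(0, total)              # a point in [0, S]
    cumulative = 0.0
    for node, w in weights.items():
        cumulative += w
        if r <= cumulative:
            return node
    return node  # numeric edge case: fall back to the last node
```

Over many draws, a node with weight 3 receives roughly three times as many tasks as a node with weight 1, yet the weight-1 node still gets work rather than idling.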
When multiple nodes are used to process the foregoing task queue, allocating the task queue according to each node's weight improves the throughput of the queue while still ensuring that no host is accessed at high frequency.

In one implementation, if the pending tasks in the task queue also contain classification identifiers, the computation of each task processing node's weight should likewise be performed per class, i.e. computing one weight per node per task class.

Specifically, the number k of currently executing tasks of each class on each task node can first be recorded in memory. When computing a node's weight for class-i tasks (where, obviously, N ≥ i ≥ 1), the number of class-i tasks the node is currently processing can be added as a parameter, for example W = l*a + s*b + t*g - k*d, where d is also a tunable parameter. With k introduced as a negatively correlated parameter, the more class-i tasks a node is processing, the lower its weight for class i, and thus the lower the probability that a class-i pending task is allocated to it.
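The class-aware variant just described adds the k*d penalty term to the earlier formula. As before, the parameter values are illustrative assumptions:

```python
def class_weight(free_threads, success_rate, avg_response_s, running_in_class,
                 a=1.0, b=10.0, g=-2.0, d=0.5):
    """W = l*a + s*b + t*g - k*d, where k (running_in_class) is the number
    of class-i tasks the node is currently executing."""
    return (free_threads * a + success_rate * b + avg_response_s * g
            - running_in_class * d)
```

A node already running several class-i tasks thus sees its class-i weight drop, steering further class-i tasks toward other nodes.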
As described above, if two pending tasks belong to different classes, the hosts they need to access are necessarily different, whereas if two pending tasks belong to the same class, the hosts they need to access may be the same. In other words, setting classification identifiers and per-class weights further reduces the frequency with which a single task processing node accesses the same host, further lowers the probability of being blocked, and improves crawling efficiency.
Corresponding to the first aspect, an embodiment of this specification further provides a task queue generation apparatus. As shown in FIG. 5, FIG. 5 is a schematic structural diagram of a task queue generation apparatus provided by an embodiment of this specification, including:

an obtaining module 501, which obtains multiple pending tasks, where each pending task contains a uniform resource locator (URL); a computing module 503, which, for any pending task, determines the hostname in the URL it contains and computes a hash value of the hostname;

a determining module 505, which, based on a predefined array of length N, determines the remainder of the hostname's hash value modulo the length N;

an assigning module 507, which assigns the pending task to the array element corresponding to the remainder, where the array elements are used to store tasks awaiting assignment;

an adding module 509, which takes one pending task from each element of the array, adds the N retrieved pending tasks to one end of the current task queue, and generates a new task queue.
Further, the obtaining module 501 applies normalization preprocessing to the URL and obtains a pending task containing the normalized URL.

Further, the adding module 509 determines the priority level of each of the N pending tasks, removes the high-priority pending tasks from the end, and adds them to the other end of the current task queue.

Further, the apparatus also includes a pausing module 511, which pauses the acquisition of new pending tasks if the number of pending tasks in the new task queue exceeds a preset value.

Further, the assigning module 507 determines the classification identifier corresponding to the array element's index identifier and generates a pending task containing the classification identifier, where classification identifiers and index identifiers are in one-to-one correspondence; it then assigns the pending task containing the classification identifier to the array element corresponding to the remainder.
Corresponding to the other aspect, an embodiment of this specification further provides an apparatus for allocating the foregoing task queue. As shown in FIG. 6, FIG. 6 is a schematic structural diagram of such an allocation apparatus provided by an embodiment of this specification, including:

a weight obtaining module 601, which, for any task processing node, obtains the weight of that task processing node;

an allocating module 603, which takes pending tasks in order from the other end of the task queue and allocates the retrieved tasks to the task processing nodes according to each node's weight, where a task processing node's weight is positively correlated with its probability of being allocated a task.

Further, when the pending tasks in the task queue also contain classification identifiers, the weight obtaining module 601 correspondingly obtains the node's weight for each class of task, where a node's weight for class-i tasks is negatively correlated with the number of class-i tasks the node is currently processing, N ≥ i ≥ 1; and the allocating module 603 allocates class-i pending tasks to the task processing nodes according to each node's weight for class-i tasks.
An embodiment of this specification further provides a computer device that includes at least a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the task queue generation method shown in FIG. 1.

An embodiment of this specification further provides a computer device that includes at least a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the task queue allocation method shown in FIG. 4.
FIG. 7 shows a more specific schematic hardware structure of a computing device provided by an embodiment of this specification. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050, where the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively connected to one another inside the device via the bus 1050.

The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to realize the technical solutions provided by the embodiments of this specification.

The memory 1020 may be implemented as ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, and the like. The memory 1020 may store an operating system and other application programs; when the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.

The input/output interface 1030 connects input/output modules to enable information input and output. An input/output module may be configured as a component within the device (not shown) or attached externally to the device to provide the corresponding function. Input devices may include keyboards, mice, touch screens, microphones, and various sensors; output devices may include displays, speakers, vibrators, and indicator lights.

The communication interface 1040 connects a communication module (not shown) to enable communication between this device and other devices. The communication module may communicate by wire (e.g., USB, network cable) or wirelessly (e.g., mobile network, WIFI, Bluetooth).

The bus 1050 comprises a pathway that transfers information between the components of the device (e.g., the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).

It should be noted that although only the processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050 are shown for the above device, in specific implementations the device may also include other components necessary for normal operation. Moreover, those skilled in the art will appreciate that the device may contain only the components necessary to realize the solutions of the embodiments of this specification, and need not contain all the components shown in the figure.
An embodiment of this specification further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the task queue generation method shown in FIG. 1.

An embodiment of this specification further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the task queue allocation method shown in FIG. 4.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
From the description of the above implementations, those skilled in the art can clearly understand that the embodiments of this specification can be implemented by means of software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of this specification, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments, or in certain parts of the embodiments, of this specification.

The systems, methods, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may specifically take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments can be referenced against one another, and each embodiment focuses on its differences from the others. In particular, since the apparatus embodiments are substantially similar to the method embodiments, their description is relatively brief, and the relevant parts of the description of the method embodiments may be consulted. The apparatus embodiments described above are merely illustrative; the modules described as separate components may or may not be physically separate, and when implementing the solutions of the embodiments of this specification, the functions of the modules may be realized in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

The above are merely specific implementations of the embodiments of this specification. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the embodiments of this specification, and such improvements and refinements shall also be regarded as falling within the scope of protection of the embodiments of this specification.

Claims (16)

  1. A task queue generation method, comprising:
    obtaining multiple pending tasks, wherein each pending task contains a uniform resource locator (URL);
    for any pending task, determining the hostname in the URL it contains and computing a hash value of the hostname;
    based on a predefined array of length N, determining the remainder of the hostname's hash value modulo the length N;
    assigning the pending task to the array element in the array corresponding to the remainder, wherein the array elements are used to store tasks awaiting assignment; and
    taking one pending task from each element of the array, adding the N retrieved pending tasks to one end of the current task queue, and generating a new task queue.
  2. The method of claim 1, wherein obtaining a pending task containing a uniform resource locator (URL) comprises:
    applying normalization preprocessing to the URL and obtaining a pending task containing the normalized URL.
  3. The method of claim 1, wherein adding the N retrieved pending tasks to one end of the current task queue comprises:
    determining the priority level of each of the N pending tasks, removing the high-priority pending tasks from that end, and adding them to the other end of the current task queue.
  4. The method of claim 1, further comprising:
    pausing the acquisition of new pending tasks if the number of pending tasks in the new task queue exceeds a preset value.
  5. The method of claim 1, wherein assigning the pending task to the array element in the array corresponding to the remainder comprises:
    determining the classification identifier corresponding to the array element's index identifier and generating a pending task containing the classification identifier, wherein classification identifiers and index identifiers are in one-to-one correspondence; and
    assigning the pending task containing the classification identifier to the array element in the array corresponding to the remainder.
  6. A method for allocating a task queue according to any one of claims 1 to 5, comprising:
    for any task processing node, obtaining the weight of that task processing node; and
    taking pending tasks in order from the other end of the task queue and allocating the retrieved tasks to the task processing nodes according to each node's weight, wherein a task processing node's weight is positively correlated with its probability of being allocated a task.
  7. The method of claim 6, wherein, when the pending tasks in the task queue also contain classification identifiers:
    correspondingly, obtaining the weight of the task processing node comprises: obtaining the node's weight for each class of task, wherein the node's weight for class-i tasks is negatively correlated with the number of class-i tasks the node is currently processing, N ≥ i ≥ 1; and
    correspondingly, allocating according to each task processing node's weight comprises: allocating class-i pending tasks to the task processing nodes according to each node's weight for class-i tasks.
  8. A task queue generation apparatus, comprising:
    an obtaining module, which obtains multiple pending tasks, wherein each pending task contains a uniform resource locator (URL);
    a computing module, which, for any pending task, determines the hostname in the URL it contains and computes a hash value of the hostname;
    a determining module, which, based on a predefined array of length N, determines the remainder of the hostname's hash value modulo the length N;
    an assigning module, which assigns the pending task to the array element in the array corresponding to the remainder, wherein the array elements are used to store tasks awaiting assignment; and
    an adding module, which takes one pending task from each element of the array, adds the N retrieved pending tasks to one end of the current task queue, and generates a new task queue.
  9. The apparatus of claim 8, wherein the obtaining module applies normalization preprocessing to the URL and obtains a pending task containing the normalized URL.
  10. The apparatus of claim 8, wherein the adding module determines the priority level of each of the N pending tasks, removes the high-priority pending tasks from the end, and adds them to the other end of the current task queue.
  11. The apparatus of claim 8, further comprising a pausing module, which pauses the acquisition of new pending tasks if the number of pending tasks in the new task queue exceeds a preset value.
  12. The apparatus of claim 8, wherein the assigning module determines the classification identifier corresponding to the array element's index identifier and generates a pending task containing the classification identifier, wherein classification identifiers and index identifiers are in one-to-one correspondence; and assigns the pending task containing the classification identifier to the array element in the array corresponding to the remainder.
  13. An apparatus for allocating a task queue according to any one of claims 1 to 5, comprising:
    a weight obtaining module, which, for any task processing node, obtains the weight of that task processing node; and
    an allocating module, which takes pending tasks in order from the other end of the task queue and allocates the retrieved tasks to the task processing nodes according to each node's weight, wherein a task processing node's weight is positively correlated with its probability of being allocated a task.
  14. The apparatus of claim 13, wherein, when the pending tasks in the task queue also contain classification identifiers:
    correspondingly, the weight obtaining module obtains the node's weight for each class of task, wherein the node's weight for class-i tasks is negatively correlated with the number of class-i tasks the node is currently processing, N ≥ i ≥ 1; and
    correspondingly, the allocating module allocates class-i pending tasks to the task processing nodes according to each node's weight for class-i tasks.
  15. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 5.
  16. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method of either of claims 6 and 7.
PCT/CN2021/084658 2020-04-02 2021-03-31 任务队列生成 WO2021197392A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010254907.6A CN111158892B (zh) 2020-04-02 2020-04-02 一种任务队列生成方法、装置及设备
CN202010254907.6 2020-04-02

Publications (1)

Publication Number Publication Date
WO2021197392A1 true WO2021197392A1 (zh) 2021-10-07

Family

ID=70567726

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084658 WO2021197392A1 (zh) 2020-04-02 2021-03-31 任务队列生成

Country Status (2)

Country Link
CN (1) CN111158892B (zh)
WO (1) WO2021197392A1 (zh)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158892B (zh) * 2020-04-02 2020-10-02 支付宝(杭州)信息技术有限公司 一种任务队列生成方法、装置及设备
CN112416626B (zh) * 2020-12-02 2023-06-06 中国联合网络通信集团有限公司 一种数据处理方法和装置

Citations (5)

Publication number Priority date Publication date Assignee Title
WO2001033345A1 (en) * 1999-11-02 2001-05-10 Alta Vista Company System and method for enforcing politeness while scheduling downloads in a web crawler
CN103870329A (zh) * 2014-03-03 2014-06-18 同济大学 基于加权轮叫算法的分布式爬虫任务调度方法
CN107273409A (zh) * 2017-05-03 2017-10-20 广州赫炎大数据科技有限公司 一种网络数据采集、存储及处理方法及系统
CN110555147A (zh) * 2018-03-30 2019-12-10 上海媒科锐奇网络科技有限公司 网站数据抓取方法、装置、设备及其介质
CN111158892A (zh) * 2020-04-02 2020-05-15 支付宝(杭州)信息技术有限公司 一种任务队列生成方法、装置及设备

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN104967698B (zh) * 2015-02-13 2018-11-23 腾讯科技(深圳)有限公司 一种爬取网络数据的方法和装置


Cited By (2)

Publication number Priority date Publication date Assignee Title
CN114924877A (zh) * 2022-05-17 2022-08-19 江苏泰坦智慧科技有限公司 一种基于数据流的动态分配计算方法、装置和设备
CN114924877B (zh) * 2022-05-17 2023-10-17 江苏泰坦智慧科技有限公司 一种基于数据流的动态分配计算方法、装置和设备

Also Published As

Publication number Publication date
CN112199175A (zh) 2021-01-08
CN111158892B (zh) 2020-10-02
CN111158892A (zh) 2020-05-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21780068

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21780068

Country of ref document: EP

Kind code of ref document: A1