CN106599094B - Asynchronous network content grabbing system and method - Google Patents

Publication number
CN106599094B
Authority
CN
China
Prior art keywords
url
task
grabbing
crawling
network content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611053534.6A
Other languages
Chinese (zh)
Other versions
CN106599094A (en)
Inventor
卢刚
孙鹏宇
覃安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201611053534.6A priority Critical patent/CN106599094B/en
Publication of CN106599094A publication Critical patent/CN106599094A/en
Publication of CN106599094B publication Critical patent/CN106599094B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Abstract

The invention provides an asynchronous network content crawling system and method. The system comprises a task queue manager, a scheduler, a driver and an executor. The task queue manager provides at least one task queue. The scheduler reads Uniform Resource Locators (URLs) of the network content to be crawled from each task queue and triggers the driver to schedule each URL according to the environment type of the back end of the task to which the URL belongs. The driver, after being triggered by the scheduler, reads the task information of the task to which the URL belongs, injects the URL into the crawling pool based on the task information, and controls the frequency at which URLs are injected into the crawling pool according to the task information, wherein the task information comprises a queries-per-second (QPS) rate and a concurrency value. The executor reads the URL from the crawling pool and crawls it. The invention can keep the crawling system stable under high concurrency, save system resources and improve crawling performance.

Description

Asynchronous network content grabbing system and method
Technical Field
The invention relates to the technical field of internet, in particular to a network content asynchronous capturing system and a network content asynchronous capturing method.
Background
With the development of the internet, the internet may contain a large amount of network contents, and in some application scenarios, some computer technologies are required to extract the network contents required by the user from the large amount of network contents, and the computer technologies are called capturing. For example, web content may be crawled by using a crawler.
In the related art, a crawler adopts either a concurrency control strategy or a queries-per-second (QPS) control strategy. The concurrency control strategy independently controls the length of the total concurrent queue through threads or processes, each of which crawls synchronously; the total queue length is fixed, so the pressure on the system is fixed. The QPS control strategy crawls at a fixed frequency.
In the two modes, the control granularity is too coarse, the capturing performance is poor for a slow back-end system, the stability of capturing network contents cannot be fully guaranteed, and the avalanche effect of a capturing system is easily caused.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide an asynchronous capturing system for network content, which can guarantee the stability of the capturing system at high concurrency, effectively save system resources, and improve the capturing performance.
The invention also aims to provide a network content asynchronous grabbing method.
To achieve the above object, an embodiment of the first aspect of the present invention provides a system for asynchronously crawling network content, including: a task queue manager for providing at least one task queue; a scheduler for reading Uniform Resource Locators (URLs) of the network content to be crawled from each task queue and triggering a driver to schedule each URL according to the environment type of the back end of the task to which the URL belongs; the driver, for reading task information of the task to which the URL belongs after being triggered by the scheduler, injecting the URL into a crawling pool based on the task information, and controlling the frequency at which URLs are injected into the crawling pool according to the task information, wherein the task information includes a queries-per-second (QPS) rate and a concurrency value; and an executor for reading the URL from the crawling pool and crawling it.
According to the asynchronous network content crawling system of the embodiment of the first aspect, the URL of the network content to be crawled is read from each task queue; the driver is triggered to schedule the URL according to the environment type of the back end of the task to which the URL belongs; the task information of that task is read, the URL is injected into the crawling pool based on the task information, and the injection frequency is controlled according to the task information, which includes the QPS rate and the concurrency value; and the URL is read from the crawling pool and crawled. The crawling system therefore remains stable under high concurrency, system resources are saved, and crawling performance is improved.
To achieve the above object, an embodiment of the second aspect of the present invention provides a method for asynchronously crawling network content, including: acquiring at least one task queue; reading a Uniform Resource Locator (URL) of network content to be crawled from each task queue, and triggering a driver to schedule the URL according to the environment type of the back end of the task to which the URL belongs; reading task information of the task to which the URL belongs, injecting the URL into a crawling pool based on the task information, and controlling the frequency at which URLs are injected into the crawling pool according to the task information, wherein the task information includes a queries-per-second (QPS) rate and a concurrency value; and reading the URL from the crawling pool and crawling it.
According to the asynchronous network content crawling method of the embodiment of the second aspect, the URL of the network content to be crawled is read from each task queue; the driver is triggered to schedule the URL according to the environment type of the back end of the task to which the URL belongs; the task information of that task is read, the URL is injected into the crawling pool based on the task information, and the injection frequency is controlled according to the task information, which includes the QPS rate and the concurrency value; and the URL is read from the crawling pool and crawled. The crawling system therefore remains stable under high concurrency, system resources are saved, and crawling performance is improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic structural diagram of a network content asynchronous crawling system according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system for asynchronously crawling web content according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of the crawling efficiency in an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for asynchronously crawling network content according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for asynchronously crawling network content according to another embodiment of the present invention;
FIG. 6 is a flowchart illustrating a method for asynchronously crawling network content according to another embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
Fig. 1 is a schematic structural diagram of a network content asynchronous crawling system according to an embodiment of the present invention.
Referring to FIG. 1, the network content asynchronous crawling system comprises: a task queue manager 100 for providing at least one task queue; a scheduler 200 configured to read a Uniform Resource Locator (URL) of the web content to be crawled from each task queue and to trigger the driver 300 to schedule the URL according to the environment type of the back end of the task to which the URL belongs; a driver 300 configured to read, after being triggered by the scheduler 200, the task information of the task to which the URL belongs, inject the URL into the crawling pool 400 based on the task information, and control the frequency at which URLs are injected into the crawling pool 400 according to the task information, where the task information includes a queries-per-second (QPS) rate and a concurrency value; and an executor 500 for reading the URL from the crawling pool 400 and crawling it.
In one embodiment of the invention, the network content asynchronous grabbing system comprises: a task queue manager 100 for providing at least one task queue.
In the embodiment of the present invention, the task queues are pre-placed in the task queue manager 100, where the task queues are at least one, and each task queue includes at least one uniform resource locator URL of the web content to be crawled.
In the embodiment of the present invention, task information may be configured for each task queue before the queue is placed in the task queue manager 100. The task information may include, for example, the ID of the task to which the queue belongs, the environment type of the back end where the task is located, the QPS, and the concurrency value required for executing the task. After the task information of each queue is configured, it may be written into a data table in a database for use in subsequent scheduling, which is not limited herein.
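As an illustration of the task information described above, the following minimal Python sketch models one record and writes it to a table. The field names, the `register_task` helper and the plain dictionary standing in for the database table are all assumptions made for the example; the patent does not specify a schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TaskInfo:
    """Per-queue task settings; field names are illustrative assumptions."""
    task_id: str       # ID of the task the queue belongs to
    env_type: str      # environment type of the back end the task targets
    qps: float         # queries per second allowed for this task
    concurrency: int   # maximum number of in-flight URLs for this task

    def __post_init__(self):
        if self.qps <= 0 or self.concurrency <= 0:
            raise ValueError("qps and concurrency must be positive")


# A plain dict stands in for the database table that holds task information.
task_table = {}


def register_task(info: TaskInfo) -> None:
    """Write the task information to the (stand-in) data table."""
    task_table[info.task_id] = asdict(info)


register_task(TaskInfo(task_id="task-1", env_type="A", qps=5.0, concurrency=20))
```

A scheduler or driver could later look up `task_table["task-1"]` to obtain the environment type, QPS and concurrency value for that queue.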
In an embodiment of the present invention, each task queue further includes additional information for each URL of the web content to be crawled, such as a header, for use in subsequent scheduling, which is not limited herein.
In the embodiment of the invention, the task queue may use the list data structure provided by the redis service to push and pop the URLs of the network content to be crawled, thereby implementing a queue. For example, when a URL in the task queue needs to be scheduled, it can be popped with the rpop method, and when a URL needs to be written into the task queue, it can be pushed with the rpush method. The operation is simple and easy to implement.
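The list-based queue can be sketched without a running redis server by using an in-memory stand-in. The `ListQueue` class below is hypothetical and only mimics the semantics of the Redis LPUSH, RPUSH and RPOP commands. Note that in Redis both RPUSH and RPOP act on the tail of the list, so pairing them yields last-in, first-out order; the sketch pairs `lpush` with `rpop` to obtain first-in, first-out order, which is an assumption about the intended behaviour rather than something the text states.

```python
from collections import deque


class ListQueue:
    """In-memory stand-in for a Redis list used as a task queue."""

    def __init__(self):
        self._items = deque()

    def lpush(self, value):
        """Add a value at the head of the list (like Redis LPUSH)."""
        self._items.appendleft(value)

    def rpush(self, value):
        """Add a value at the tail of the list (like Redis RPUSH)."""
        self._items.append(value)

    def rpop(self):
        """Remove and return the tail value, or None if empty (like Redis RPOP)."""
        return self._items.pop() if self._items else None


task_queue = ListQueue()
for url in ["http://example.com/a", "http://example.com/b"]:
    task_queue.lpush(url)      # producer side: write URLs into the queue

first = task_queue.rpop()      # consumer side: the oldest URL comes out first
second = task_queue.rpop()
```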
In one embodiment of the invention, the network content asynchronous grabbing system comprises: the scheduler 200 is configured to read a uniform resource locator URL of the to-be-captured web content from each task queue, and trigger the driver 300 to schedule the URL according to an environment type of a backend of a task to which the URL belongs.
In the embodiment of the present invention, the scheduler 200 implements global policy control. While network content is being crawled, the scheduler 200 may traverse the URLs of the network content to be crawled in each task queue and obtain the environment type of the back end of the task to which each URL belongs, so as to schedule according to that environment type. For example, according to the environment type of the back end, it can determine which tasks currently need to be crawled, which need to be paused, and which need to be finished. In this way, back ends of multiple environment types can be linked, the control of the asynchronous crawling system is strengthened, and the flexibility of crawling network content under high concurrency is improved.
For example, the background server of the scheduler 200 may read a preset data table that records each environment type and its corresponding concurrency information. The concurrency information is, for example, the total concurrency value and the single-environment concurrency value of the back ends of environment type A, where the total concurrency value is the total number of URLs to be crawled that the back ends of environment type A can bear, and the single-environment concurrency value is the number of URLs to be crawled that one back end of environment type A can bear. After reading the environment type of the back end of the task to which a URL belongs, the scheduler 200 may calculate the remaining concurrency value of that environment type at the current time point; if the remaining concurrency value is insufficient, crawling of the URL is not triggered.
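The admission check described in this paragraph can be sketched as follows. This is one possible reading, not the patent's implementation: the table contents, the `in_flight` bookkeeping and the rule "remaining = total minus in-flight" are assumptions made for the example.

```python
# Stand-in for the preset data table: environment type -> concurrency limits.
# All names and numbers here are hypothetical.
ENV_TABLE = {
    "A": {"total": 100, "per_backend": 25},
    "B": {"total": 40, "per_backend": 10},
}


def remaining_concurrency(env_type, in_flight):
    """Total concurrency of the environment type minus URLs currently in flight."""
    return ENV_TABLE[env_type]["total"] - in_flight.get(env_type, 0)


def should_trigger(env_type, in_flight, needed=1):
    """Trigger the driver only if enough concurrency is left for `needed` URLs."""
    return remaining_concurrency(env_type, in_flight) >= needed


# Number of URLs currently being crawled per environment type (example data).
in_flight = {"A": 100, "B": 12}
```

With this data, environment type A is saturated, so `should_trigger("A", in_flight)` is false and its URLs stay in the task queue, while type B still has capacity.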
In the embodiment of the present invention, when the scheduler 200 determines that the URL of a piece of web content to be crawled is a task that needs to be crawled currently, the state of the task is further set to be in execution, and the corresponding driver 300 of the task is started.
Optionally, in some embodiments, referring to fig. 2, the scheduler 200 comprises:
a reading module 210, configured to read a URL from each task queue; and
a scheduling module 220, configured to trigger the driver 300 to schedule the URL according to the environment type of the back end of the task to which the URL belongs.
Optionally, in some embodiments, referring to fig. 2, the scheduling module 220 includes:
a first obtaining sub-module 221, configured to obtain the environment type of the back end of the task to which the URL belongs;
a second obtaining sub-module 222, configured to obtain the concurrency information corresponding to the environment type according to the correspondence between environment types and concurrency;
a judging sub-module 223, configured to judge, according to the concurrency information, whether the remaining concurrency value of the environment type reaches a preset threshold; and
a scheduling sub-module 224, configured to trigger the driver 300 to schedule the URL when the remaining concurrency value does not reach the preset threshold, and not to trigger the driver 300 to schedule the URL when the remaining concurrency value reaches the preset threshold.
In one embodiment of the invention, the network content asynchronous grabbing system comprises: and the driver 300 is configured to, after being triggered by the scheduler 200, read task information of a task to which the URL belongs, inject the URL into the capture pool 400 based on the task information, and control frequency of the URL injection into the capture pool 400 according to the task information, where the task information includes a query rate per second and a concurrency value.
In an embodiment of the present invention, referring to fig. 1, each task queue corresponds to one driver 300, and it can be understood that a plurality of task queues correspond to a plurality of drivers 300.
In an embodiment of the present invention, a driver 300 is the policy controller for the task of one task queue, and each driver 300 schedules its corresponding task. The driver 300 is triggered and started by the scheduler 200. Once started, it reads the QPS and the concurrency value from the task information and, according to the task information, executes the rpop method to pop URLs of the web content to be crawled from the task queue.
In some embodiments, driver 300 is further configured to: and acquiring the identifier of the URL, and correspondingly storing the identifier and the corresponding URL based on a set data structure of the redis service to generate record information of the URL.
In the embodiment of the present invention, after the driver 300 generates the record information of the URL, the URL may be sent to the capture pool 400, and after each capture result is returned, the callback function of the worker deletes the record information of the URL from the set data structure, so that the storage space can be effectively saved.
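The record-keeping lifecycle (store a record when a URL is injected, delete it in the callback when the result returns) can be sketched as below. A Python `set` stands in for the redis set (whose add and remove commands are SADD and SREM); the hash-based identifier and the function names are hypothetical choices for the example.

```python
import hashlib

in_flight_records = set()   # stand-in for a Redis set of (identifier, URL) records


def url_id(url):
    """Hypothetical identifier: a short hash of the URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]


def inject(url, pool):
    """Record the URL as in flight, then hand it to the crawling pool."""
    in_flight_records.add((url_id(url), url))
    pool.append(url)


def on_result(url):
    """Callback run when a crawl result returns: delete the URL's record."""
    in_flight_records.discard((url_id(url), url))


pool = []
inject("http://example.com/a", pool)
inject("http://example.com/b", pool)
on_result("http://example.com/a")   # only the record for .../b remains
```

Under this sketch, the size of `in_flight_records` is the number of URLs that have been injected but not yet completed, which is one way a driver could track its current concurrency.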
In the embodiment of the present invention, the driver 300 controls the frequency at which URLs are injected into the crawling pool 400 according to the task information, which includes the QPS and the concurrency value. This keeps the concurrency for the URLs of a single website's content controllable, and controlling the QPS for the URLs of a single website's content implements a crawling strategy for that website.
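One simple way to combine the two limits is to cap each injection round by the QPS budget for the elapsed interval and by the free concurrency slots. The sketch below shows that idea; the function names, the fixed-interval budget rule and the in-memory queue and pool are assumptions for illustration, and the patent does not state the exact pacing algorithm.

```python
from collections import deque


def urls_to_inject(qps, concurrency, in_flight, elapsed, backlog):
    """How many URLs the driver may inject after `elapsed` seconds.

    The count is capped by the QPS budget for the interval, by the free
    concurrency slots, and by the number of URLs waiting in the task queue.
    """
    qps_budget = int(qps * elapsed)
    free_slots = max(concurrency - in_flight, 0)
    return max(min(qps_budget, free_slots, backlog), 0)


def tick(task_queue, pool, in_flight, qps, concurrency, elapsed=1.0):
    """Move the allowed number of URLs from the task queue into the pool."""
    n = urls_to_inject(qps, concurrency, in_flight, elapsed, len(task_queue))
    for _ in range(n):
        pool.append(task_queue.popleft())
    return n


task_queue = deque(f"http://example.com/{i}" for i in range(10))
pool = []
# QPS allows 5 URLs this second, but only 8 - 6 = 2 concurrency slots are free.
injected = tick(task_queue, pool, in_flight=6, qps=5, concurrency=8)
```

Because the budget is truncated to an integer, a production version would likely carry the fractional remainder over to the next interval; that refinement is omitted here.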
In the embodiment of the present invention, in the process of capturing the URL of a single website content to be captured, the driver 300 may scan and read the task information of the task to which the URL belongs at a preset time point, so as to dynamically monitor the change of the task information, and further improve the flexibility of capturing the network content at high concurrency.
In an embodiment of the present invention, the network content asynchronous crawling system may further include a crawling pool 400.
The crawling pool 400 includes URLs of a plurality of web contents to be crawled.
Specifically, the driver 300 corresponding to each task queue may place the URLs of tasks determined to need crawling currently into the crawling pool 400.
In the embodiment of the present invention, the capture pool 400 may adopt a blocking queue method (i.e., a list data structure and a brpop method cooperate) provided by the redis service, which can effectively improve the capture efficiency.
In one embodiment of the invention, the network content asynchronous grabbing system comprises: and the executor 500 is used for reading the URL from the crawling pool 400 and crawling the URL.
Optionally, in some embodiments, referring to fig. 2, the network content asynchronous crawling system further includes:
and the obtaining module 500 is configured to obtain an identifier of the captured URL as a target identifier, and delete record information of the URL corresponding to the target identifier in the set data structure.
In an embodiment of the present invention, the executor 500 is the execution unit that performs the crawl and carries out encapsulation and forwarding, and there may be one or more executors 500. The executors 500 block on the crawling pool 400 through the brpop method; when the crawling pool 400 receives a URL of website content to be crawled, the executors 500 compete to take and execute it. In addition, because the executor 500 consumes resources, the executors 500 may be deployed in a distributed manner. Because the list data structure and the brpop method are used together, the list and the blocked executors 500 need not be deployed on the same host, so different numbers of executors 500 can be started on hosts of different performance, thereby achieving load balancing. After the crawl completes, the executor 500 calls the callback function for the URL and deletes the record information of the URL held by the driver 300, which completes the crawl lifecycle of the URL.
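The blocking, competing-consumer behaviour of the executors can be sketched with threads and Python's `queue.Queue`, whose blocking `get` plays the role of brpop on a redis list. The `fetch` placeholder, the sentinel-based shutdown and the function names are assumptions for the example; a distributed deployment would run executors on separate hosts against a shared redis instance instead.

```python
import queue
import threading


def run_executors(urls, num_executors=3):
    """Crawl every URL once using executors that block on a shared pool."""
    pool = queue.Queue()            # stand-in for a Redis list read with BRPOP
    done, lock = [], threading.Lock()

    def fetch(url):
        # Placeholder for the real crawl; returns a fake result.
        return f"fetched:{url}"

    def executor():
        while True:
            url = pool.get()        # blocks until a URL is available
            if url is None:         # sentinel: shut this executor down
                break
            result = fetch(url)
            with lock:              # callback step: record completion
                done.append(result)

    workers = [threading.Thread(target=executor) for _ in range(num_executors)]
    for w in workers:
        w.start()
    for url in urls:
        pool.put(url)
    for _ in workers:
        pool.put(None)              # one sentinel per executor
    for w in workers:
        w.join()
    return done


results = run_executors([f"http://example.com/{i}" for i in range(10)])
```

Each URL is taken by exactly one executor, whichever is free first, so faster executors naturally process more URLs. The order of `results` is not deterministic.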
As an example, FIG. 3 is a schematic diagram of the crawling efficiency in the embodiment of the present invention. As can be seen from FIG. 3, before November 12, 2015, the crawling time was always over 30 minutes, while the system design requires a target crawling time below 30 minutes; the original asynchronous crawler clearly could not meet the design requirement, and its crawling performance was poor. After November 12, once the asynchronous network content crawling system of the embodiment went online, the crawling time met the target of below 30 minutes, crawling efficiency improved by about 20%, the load became more balanced, the concurrency control strategy became more reasonable, and the combined effect of factors such as intermediate process consumption was reduced.
In this embodiment, the URL of the network content to be crawled is read from each task queue; the driver is triggered to schedule the URL according to the environment type of the back end of the task to which the URL belongs; the task information of that task is read, the URL is injected into the crawling pool based on the task information, and the injection frequency is controlled according to the task information, which includes the QPS rate and the concurrency value; and the URL is read from the crawling pool and crawled. The crawling system therefore remains stable under high concurrency, system resources are saved, and crawling performance is improved.
Fig. 4 is a flowchart illustrating a method for asynchronously capturing network content according to an embodiment of the present invention.
Referring to fig. 4, the method for asynchronously crawling network content includes:
S41: Acquiring at least one task queue.
S42: Reading a Uniform Resource Locator (URL) of the network content to be crawled from each task queue, and triggering a driver to schedule the URL according to the environment type of the back end of the task to which the URL belongs.
In the embodiment of the invention, the environment types of the back ends of the tasks to which the URLs belong may be the same or different.
In some embodiments, referring to fig. 5, step S42 specifically includes:
S51: Reading a Uniform Resource Locator (URL) of the network content to be crawled from each task queue, and acquiring the environment type of the back end of the task to which the URL belongs.
S52: Acquiring the concurrency information corresponding to the environment type according to the correspondence between environment types and concurrency.
S53: Judging, according to the concurrency information, whether the remaining concurrency value of the environment type reaches a preset threshold.
S54: Triggering the driver to schedule the URL when the remaining concurrency value does not reach the preset threshold, and not triggering the driver to schedule the URL when the remaining concurrency value reaches the preset threshold.
In the embodiment, the concurrent information corresponding to the environment type is acquired according to the corresponding relation between the environment type of the back end where the task to which the URL belongs and the concurrency, whether the remaining concurrent value of the environment type reaches the preset threshold value is judged according to the concurrent information, when the remaining concurrent value does not reach the preset threshold value, the driver is triggered to schedule the URL, and when the remaining concurrent value reaches the preset threshold value, the driver is not triggered to schedule the URL, so that the function of global policy control can be realized, the linkage of the back ends of a plurality of environment types is realized, the control effect of the network content asynchronous capture system is enhanced, and the flexibility of network content capture during concurrency is effectively improved.
S43: Reading task information of the task to which the URL belongs, injecting the URL into the crawling pool based on the task information, and controlling the frequency at which URLs are injected into the crawling pool according to the task information, wherein the task information comprises a queries-per-second (QPS) rate and a concurrency value.
In an embodiment of the present invention, the capture pool may store the URL using a list data structure of a redis database.
S44: Reading the URL from the crawling pool and crawling the URL.
In some embodiments, referring to fig. 6, the method for asynchronously crawling web content further includes:
S61: Acquiring the identifier of the URL, and storing the identifier together with the corresponding URL in a set data structure of the redis service to generate record information of the URL.
S62: Acquiring the identifier of the crawled URL as a target identifier, and deleting the record information of the URL corresponding to the target identifier from the set data structure.
It should be noted that the explanation of the embodiment of the network content asynchronous capture system in the foregoing embodiments of fig. 1 to fig. 3 also applies to the network content asynchronous capture method of the embodiment, and the implementation principle thereof is similar, and is not described herein again.
In this embodiment, the captured URL identifier is acquired as the target identifier, and the record information of the URL corresponding to the target identifier in the set data structure is deleted, so that the storage space can be effectively saved.
In this embodiment, the URL of the network content to be crawled is read from each task queue; the driver is triggered to schedule the URL according to the environment type of the back end of the task to which the URL belongs; the task information of that task is read, the URL is injected into the crawling pool based on the task information, and the injection frequency is controlled according to the task information, which includes the QPS rate and the concurrency value; and the URL is read from the crawling pool and crawled. The crawling system therefore remains stable under high concurrency, system resources are saved, and crawling performance is improved.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as a stand-alone product, it may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
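As a non-limiting illustration of how a driver might control the frequency of injecting URLs into the pool, the following Python sketch throttles injection by a per-task queries-per-second rate and a concurrency value. It is a sketch under stated assumptions, not the implementation of the invention: the names `TaskInfo`, `Driver`, `try_inject` and `executor_step` are hypothetical, an in-memory `deque` stands in for the Redis list, and treating the concurrency value as a cap on the number of URLs waiting in the pool is a simplifying assumption.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TaskInfo:
    """Per-task limits read by the driver (field names are hypothetical)."""
    qps: float        # maximum number of URLs injected per second
    concurrency: int  # assumed here to cap the URLs waiting in the pool


class Driver:
    """Injects URLs into a pool no faster than the task's QPS allows."""

    def __init__(self, task: TaskInfo, pool: deque):
        self.task = task
        self.pool = pool
        self.next_allowed = 0.0  # earliest time the next injection may happen

    def try_inject(self, url: str, now: float) -> bool:
        """Inject url at time `now` (seconds); return False if throttled."""
        if now < self.next_allowed:
            return False  # QPS limit: too soon after the previous injection
        if len(self.pool) >= self.task.concurrency:
            return False  # concurrency limit: the pool is full
        self.pool.append(url)
        self.next_allowed = now + 1.0 / self.task.qps
        return True


def executor_step(pool: deque):
    """Executor side: take the oldest URL from the pool, or None if empty."""
    return pool.popleft() if pool else None
```

Passing the current time as an argument keeps the sketch deterministic. With a real Redis list, the `append` and `popleft` calls would correspond to the `RPUSH` and `LPOP` commands.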

Claims (13)

1. A system for asynchronously crawling network content, comprising:
a task queue manager, configured to provide at least one task queue;
a scheduler, configured to read Uniform Resource Locators (URLs) of network content to be crawled from each task queue, and to trigger a driver to schedule each URL according to the environment type of the backend of the task to which the URL belongs;
the driver, configured to, after being triggered by the scheduler, read task information of the task to which the URL belongs, inject the URL into a crawling pool based on the task information, and control the frequency of injecting URLs into the crawling pool according to the task information, wherein the task information comprises a queries-per-second (QPS) rate and a concurrency value;
and an executor, configured to read the URL from the crawling pool and crawl the URL.
2. The system for asynchronously crawling network content according to claim 1, wherein the scheduler comprises:
a reading module, configured to read the URL from each task queue;
and a scheduling module, configured to trigger the driver to schedule the URL according to the environment type of the backend of the task to which the URL belongs.
3. The system for asynchronously crawling network content according to claim 2, wherein the scheduling module comprises:
a first obtaining submodule, configured to obtain the environment type of the backend of the task to which the URL belongs;
a second obtaining submodule, configured to obtain concurrency information corresponding to the environment type according to a correspondence between environment types and concurrency;
a judgment submodule, configured to judge, according to the concurrency information, whether the remaining concurrency value of the environment type reaches a preset threshold;
and a scheduling submodule, configured to trigger the driver to schedule the URL when the remaining concurrency value does not reach the preset threshold, and not to trigger the driver to schedule the URL when the remaining concurrency value reaches the preset threshold.
4. The system for asynchronously crawling network content according to claim 1, wherein the crawling pool stores the URLs using the list data structure of a Redis database.
5. The system for asynchronously crawling network content according to claim 1, wherein the driver is further configured to:
acquire an identifier of the URL, and store the identifier in association with the corresponding URL using the set data structure of a Redis service, so as to generate record information of the URL.
6. The system for asynchronously crawling network content according to claim 5, further comprising:
an acquisition module, configured to acquire the identifier of a crawled URL as a target identifier, and to delete from the set data structure the record information of the URL corresponding to the target identifier.
7. The system for asynchronously crawling network content according to claim 1, 2 or 3, wherein the environment types of the backends of the tasks to which the URLs belong are different or the same.
8. A method for asynchronously crawling network content, comprising:
acquiring at least one task queue;
reading Uniform Resource Locators (URLs) of network content to be crawled from each task queue, and triggering a driver to schedule each URL according to the environment type of the backend of the task to which the URL belongs;
reading task information of the task to which the URL belongs, injecting the URL into a crawling pool based on the task information, and controlling the frequency of injecting URLs into the crawling pool according to the task information, wherein the task information comprises a queries-per-second (QPS) rate and a concurrency value;
and reading the URL from the crawling pool, and crawling the URL.
9. The method for asynchronously crawling network content according to claim 8, wherein triggering the driver to schedule the URL according to the environment type of the backend of the task to which the URL belongs comprises:
acquiring the environment type of the backend of the task to which the URL belongs;
acquiring concurrency information corresponding to the environment type according to a correspondence between environment types and concurrency;
judging, according to the concurrency information, whether the remaining concurrency value of the environment type reaches a preset threshold;
and when the remaining concurrency value does not reach the preset threshold, triggering the driver to schedule the URL, and when the remaining concurrency value reaches the preset threshold, not triggering the driver to schedule the URL.
10. The method for asynchronously crawling network content according to claim 8, wherein the crawling pool stores the URLs using the list data structure of a Redis database.
11. The method for asynchronously crawling network content according to claim 8, further comprising:
acquiring an identifier of the URL, and storing the identifier in association with the corresponding URL using the set data structure of a Redis service, so as to generate record information of the URL.
12. The method for asynchronously crawling network content according to claim 11, further comprising:
acquiring the identifier of a crawled URL as a target identifier, and deleting from the set data structure the record information of the URL corresponding to the target identifier.
13. The method for asynchronously crawling network content according to claim 8 or 9, wherein the environment types of the backends of the tasks to which the URLs belong are different or the same.
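
For illustration only, and without limiting the claims, the concurrency check of claims 3 and 9 and the record bookkeeping of claims 5, 6, 11 and 12 might be sketched in Python as follows. Several points are assumptions rather than statements of the claimed method: "reaches the preset threshold" is read as the remaining concurrency falling to or below the threshold, a crawled URL is assumed to free one concurrency slot, a plain `dict` stands in for the Redis set data structure, and the scheduling check and record bookkeeping are merged into one hypothetical `Scheduler` class for brevity. The names `EnvConcurrency`, `schedule` and `mark_crawled` are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EnvConcurrency:
    """Concurrency bookkeeping for one backend environment type."""
    limit: int       # total concurrent URLs the environment may handle
    in_use: int = 0  # URLs scheduled and not yet crawled

    @property
    def remaining(self) -> int:
        return self.limit - self.in_use


@dataclass
class Scheduler:
    envs: dict          # environment type -> EnvConcurrency
    threshold: int = 0  # preset threshold for the remaining concurrency
    records: dict = field(default_factory=dict)  # URL identifier -> (env type, URL)

    def schedule(self, url_id: str, url: str, env_type: str) -> bool:
        """Return True if the driver should be triggered for this URL."""
        env = self.envs[env_type]
        if env.remaining <= self.threshold:
            return False  # remaining concurrency has reached the threshold
        env.in_use += 1
        self.records[url_id] = (env_type, url)  # record info for the pending URL
        return True

    def mark_crawled(self, url_id: str) -> None:
        """Delete the record of a crawled URL and free its concurrency slot."""
        env_type, _ = self.records.pop(url_id)
        self.envs[env_type].in_use -= 1
```

Because the remaining concurrency is tracked per environment type, a saturated environment does not block URLs whose tasks belong to a different environment.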

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611053534.6A CN106599094B (en) 2016-11-24 2016-11-24 Asynchronous network content grabbing system and method

Publications (2)

Publication Number Publication Date
CN106599094A CN106599094A (en) 2017-04-26
CN106599094B CN106599094B (en) 2020-05-22

Family

ID=58591924

Country Status (1)

Country Link
CN (1) CN106599094B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291824A (en) * 2017-05-25 2017-10-24 北京小度信息科技有限公司 Data grab method and device
CN110955469B (en) * 2019-11-25 2023-09-26 中国银行股份有限公司 Method and device for online transaction of distributed batch call of X86 platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6377984B1 (en) * 1999-11-02 2002-04-23 Alta Vista Company Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue
CN102184227A (en) * 2011-05-10 2011-09-14 北京邮电大学 General crawler engine system used for WEB service and working method thereof
CN103559083A (en) * 2013-10-11 2014-02-05 北京奇虎科技有限公司 Web crawl task scheduling method and task scheduler
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8090684B2 (en) * 2009-08-26 2012-01-03 Oracle International Corporation System and method for asynchronous crawling of enterprise applications

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6377984B1 (en) * 1999-11-02 2002-04-23 Alta Vista Company Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue
CN102184227A (en) * 2011-05-10 2011-09-14 北京邮电大学 General crawler engine system used for WEB service and working method thereof
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
CN103559083A (en) * 2013-10-11 2014-02-05 北京奇虎科技有限公司 Web crawl task scheduling method and task scheduler

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
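A cached non-blocking asynchronous [title as listed] for a web crawler (一种网络爬虫的带缓存非阻塞异步); Chen Yan et al.; Software Guide (《软件导刊》); 2009-12-31; Vol. 8 (No. 11); pp. 143-146 *
Research and optimization of a web crawler system in a distributed environment (分布式环境下的网络爬虫系统研究与优化); Geng Lingbao; China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》); 2015-08-15 (No. 08); I139-310 *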

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant