CN106599094A - Network content asynchronous grasping system and method - Google Patents

Publication number
CN106599094A
Authority
CN
China
Prior art keywords
url
asynchronous
web content
task
grasping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611053534.6A
Other languages
Chinese (zh)
Other versions
CN106599094B (en)
Inventor
卢刚
孙鹏宇
覃安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201611053534.6A
Publication of CN106599094A
Application granted
Publication of CN106599094B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Abstract

The invention proposes an asynchronous network content grasping system and method. The system comprises: a task queue manager for providing at least one task queue; a scheduler for reading the uniform resource locator (URL) of network content to be grasped from each task queue and, according to the environment type of the back end hosting the task to which the URL belongs, triggering a driver to schedule the URL; the driver, which, after being triggered by the scheduler, reads the task information of the task to which the URL belongs, injects the URL into a grasping pool based on the task information, and controls the frequency of injecting URLs into the grasping pool according to the task information, wherein the task information comprises a queries-per-second (QPS) value and a concurrency value; and an executor for reading the URL from the grasping pool and grasping it. The invention can ensure the stability of the grasping system under high concurrency, effectively save system resources, and improve grasping performance.

Description

Asynchronous web content grasping system and method
Technical field
The present invention relates to the field of Internet technology, and in particular to an asynchronous web content grasping system and method.
Background
With the development of the Internet, the Internet has come to contain massive amounts of web content. In some application scenarios, computer techniques are needed to extract the web content a user needs from this mass of content; such a technique is called grasping (that is, crawling). For example, web content can be grasped using a grabber.
In the related art, a grabber adopts either a concurrency control strategy or a queries-per-second (QPS) control strategy. Under the concurrency control strategy, threads or processes independently control the total length of the concurrent queue, and each process or thread performs grasping synchronously, which keeps the total queue length fixed and the pressure on the system constant. Under the QPS control strategy, grasping is performed at a fixed frequency.
Under both approaches, the control granularity is too coarse. For slow back-end systems, grasping performance is poor, the stability of web content grasping cannot be adequately guaranteed, and an avalanche effect in the grasping system is easily triggered.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, one object of the present invention is to propose an asynchronous web content grasping system that can ensure the stability of the grasping system under high concurrency, effectively save system resources, and improve grasping performance.
Another object of the present invention is to propose an asynchronous web content grasping method.
To achieve the above objects, an embodiment of the first aspect of the present invention proposes an asynchronous web content grasping system, comprising: a task queue manager, configured to provide at least one task queue; a scheduler, configured to read the uniform resource locator (URL) of web content to be grasped from each task queue, and to trigger a driver to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the driver, configured to, after being triggered by the scheduler, read the task information of the task to which the URL belongs, inject the URL into a grasping pool based on the task information, and control the frequency at which URLs are injected into the grasping pool according to the task information, the task information including a queries-per-second (QPS) value and a concurrency value; and an executor, configured to read the URL from the grasping pool and grasp the URL.
In the asynchronous web content grasping system proposed by the embodiment of the first aspect of the present invention, the URL of web content to be grasped is read from each task queue; a driver is triggered to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the task information of that task is read; the URL is injected into the grasping pool based on the task information, and the injection frequency is controlled according to the task information, which includes a QPS value and a concurrency value; and the URL is read from the grasping pool and grasped. This ensures the stability of the grasping system under high concurrency, effectively saves system resources, and improves grasping performance.
To achieve the above objects, an embodiment of the second aspect of the present invention proposes an asynchronous web content grasping method, comprising: obtaining at least one task queue; reading the uniform resource locator (URL) of web content to be grasped from each task queue, and triggering a driver to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; reading the task information of the task to which the URL belongs, injecting the URL into a grasping pool based on the task information, and controlling the frequency at which URLs are injected into the grasping pool according to the task information, the task information including a queries-per-second (QPS) value and a concurrency value; and reading the URL from the grasping pool and grasping the URL.
In the asynchronous web content grasping method proposed by the embodiment of the second aspect of the present invention, the URL of web content to be grasped is read from each task queue; a driver is triggered to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the task information of that task is read; the URL is injected into the grasping pool based on the task information, and the injection frequency is controlled according to the task information, which includes a QPS value and a concurrency value; and the URL is read from the grasping pool and grasped. This ensures the stability of the grasping system under high concurrency, effectively saves system resources, and improves grasping performance.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a structural diagram of an asynchronous web content grasping system proposed by one embodiment of the present invention;
Fig. 2 is a structural diagram of an asynchronous web content grasping system proposed by another embodiment of the present invention;
Fig. 3 is a diagram of grasping efficiency in an embodiment of the present invention;
Fig. 4 is a flowchart of an asynchronous web content grasping method proposed by one embodiment of the present invention;
Fig. 5 is a flowchart of an asynchronous web content grasping method proposed by another embodiment of the present invention;
Fig. 6 is a flowchart of an asynchronous web content grasping method proposed by yet another embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, where the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting it. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
Fig. 1 is a structural diagram of an asynchronous web content grasping system proposed by one embodiment of the present invention.
Referring to Fig. 1, the asynchronous web content grasping system includes: a task queue manager 100, configured to provide at least one task queue; a scheduler 200, configured to read the uniform resource locator (URL) of web content to be grasped from each task queue, and to trigger a driver 300 to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the driver 300, configured to, after being triggered by the scheduler 200, read the task information of the task to which the URL belongs, inject the URL into a grasping pool 400 based on the task information, and control the frequency at which URLs are injected into the grasping pool 400 according to the task information, the task information including a queries-per-second (QPS) value and a concurrency value; and an executor 500, configured to read the URL from the grasping pool and grasp the URL.
In one embodiment of the present invention, the asynchronous web content grasping system includes a task queue manager 100, configured to provide at least one task queue.
In an embodiment of the present invention, task queues are placed in the task queue manager 100 in advance. There is at least one task queue, and each task queue contains the URL of at least one piece of web content to be grasped.
In an embodiment of the present invention, before the task queues are placed into the task queue manager 100, the task information of each task queue may be configured. The task information may include, for example, the ID of the task to which the task queue belongs, the environment type of the back end hosting the task, the QPS, and the concurrency value required to perform the task. Further, once the task information of each task queue has been configured, it may be written to a data table in a database for use in subsequent scheduling; no limitation is imposed here.
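As a non-limiting sketch of the task information described above, the following Python fragment models one record and an in-memory stand-in for the data table. The field names, the `TaskInfo` class, the `task_table` dictionary, and the validation rule are all hypothetical and are not part of the original disclosure.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TaskInfo:
    task_id: str      # ID of the task to which the task queue belongs
    env_type: str     # environment type of the back end hosting the task
    qps: float        # queries per second allowed for the task
    concurrency: int  # concurrency value required to perform the task


# Hypothetical in-memory stand-in for the database data table, keyed by task ID.
task_table = {}


def save_task_info(info):
    """Validate a task record and write it to the stand-in table."""
    if info.qps <= 0 or info.concurrency <= 0:
        raise ValueError("qps and concurrency must be positive")
    task_table[info.task_id] = asdict(info)


save_task_info(TaskInfo("task-1", "env-A", qps=5.0, concurrency=2))
```

A real deployment would write the same fields to whatever database table the scheduler and driver later read.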
In an embodiment of the present invention, each task queue also contains additional information for the URL of each piece of web content to be grasped, for example a header, for use in subsequent scheduling; no limitation is imposed here.
In an embodiment of the present invention, a task queue may use the list data structure provided by the redis service to push and pop URLs of web content to be grasped, thereby implementing the queue data structure. For example, when a URL in the task queue needs to be scheduled, it can be popped with the rpop method; when a URL needs to be written to the task queue, it can be pushed with the rpush method. This is simple to operate and easy to implement.
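As a non-limiting sketch of the push and pop operations, the following Python class is an in-memory stand-in that mimics the semantics of the Redis list commands RPUSH, LPUSH, and RPOP; it is not a Redis client, and the class name and example URLs are hypothetical. One point worth noting: in Redis, RPUSH and RPOP both act on the tail of the list, so pairing them returns the most recently pushed URL first, whereas pairing LPUSH with RPOP returns URLs in first-in, first-out order.

```python
from collections import deque


class ListStandIn:
    """In-memory stand-in for a single Redis list (not a Redis client)."""

    def __init__(self):
        self._items = deque()

    def rpush(self, value):
        self._items.append(value)       # add at the tail, like RPUSH
        return len(self._items)

    def lpush(self, value):
        self._items.appendleft(value)   # add at the head, like LPUSH
        return len(self._items)

    def rpop(self):
        # Remove from the tail, like RPOP; None when the list is empty.
        return self._items.pop() if self._items else None


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

tail_only = ListStandIn()
for u in urls:
    tail_only.rpush(u)
rpush_order = [tail_only.rpop() for _ in urls]   # newest URL comes out first

fifo = ListStandIn()
for u in urls:
    fifo.lpush(u)
lpush_order = [fifo.rpop() for _ in urls]        # oldest URL comes out first
```

Which pairing is appropriate depends on whether the order of URLs within a task queue matters to the task.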
In one embodiment of the present invention, the asynchronous web content grasping system includes a scheduler 200, configured to read the URL of web content to be grasped from each task queue, and to trigger the driver 300 to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs.
In an embodiment of the present invention, the scheduler 200 can implement global policy control. During web content grasping, the scheduler 200 can traverse the URLs of web content to be grasped in each task queue, obtain the environment type of the back end hosting the task to which each URL belongs, and schedule according to that environment type. For example, based on the back-end environment type it can determine which tasks currently need to be grasped, which need to be paused, and which need to be terminated. This enables linkage among back ends of multiple environment types, strengthens the control exercised by the asynchronous web content grasping system, and effectively improves the flexibility of web content grasping under high concurrency.
For example, the background server of the scheduler 200 may read a preset data table that records each environment type together with the concurrency information corresponding to it. The concurrency information is, for example, the total concurrency value and the single-environment concurrency value of the back ends of environment type A. The total concurrency value is the sum of the numbers of URLs of web content to be grasped that all back ends of environment type A can carry; the single-environment concurrency value is the number of such URLs that one back end of environment type A can carry. Further, after reading the environment type of the back end hosting the task to which a URL belongs, the scheduler 200 may calculate the remaining concurrency value of that environment type at the current point in time; if the remaining concurrency value is insufficient, grasping of the URL is not triggered. In an embodiment of the present invention, because different environment types share nothing with one another, the scheduler 200 can schedule multiple task queues independently without mutual interference, which decouples policy from framework.
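As a non-limiting sketch of the remaining-concurrency check, the following Python fragment keeps a hypothetical preset table and decides whether to trigger a driver. The table contents, the function names, and the rule that remaining concurrency equals total concurrency minus URLs currently in flight are assumptions made for illustration; the disclosure does not specify how the remaining value is computed.

```python
# Hypothetical preset table: environment type -> concurrency information.
ENV_TABLE = {
    "env-A": {"total_concurrency": 4, "single_env_concurrency": 2},
    "env-B": {"total_concurrency": 1, "single_env_concurrency": 1},
}


def remaining_concurrency(env_type, in_flight):
    """Assumed rule: total concurrency minus URLs currently being grasped."""
    return ENV_TABLE[env_type]["total_concurrency"] - in_flight.get(env_type, 0)


def should_trigger(env_type, in_flight, needed=1):
    """Trigger the driver only if enough concurrency remains."""
    return remaining_concurrency(env_type, in_flight) >= needed


# Each environment type is evaluated on its own counters, so a saturated
# env-B does not affect scheduling for env-A.
in_flight = {"env-A": 3, "env-B": 1}
decisions = {env: should_trigger(env, in_flight) for env in ENV_TABLE}
```

Here `env-A` still has one slot free and is triggered, while `env-B` is fully occupied and is skipped.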
In an embodiment of the present invention, when the scheduler 200 determines that the URL of a piece of web content to be grasped belongs to a task that currently needs to be grasped, it may further set the state of the task to "executing" and start the driver 300 corresponding to the task.
Optionally, in some embodiments, referring to Fig. 2, the scheduler 200 includes:
A read module 210, configured to read the URL from each task queue.
A scheduling module 220, configured to trigger the driver 300 to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs.
Optionally, in some embodiments, referring to Fig. 2, the scheduling module 220 includes:
A first acquisition submodule 221, configured to obtain the environment type of the back end hosting the task to which the URL belongs.
A second acquisition submodule 222, configured to obtain the concurrency information corresponding to the environment type according to the correspondence between environment types and concurrency.
A judging submodule 223, configured to judge, according to the concurrency information, whether the remaining concurrency value of the environment type reaches a preset threshold.
A scheduling submodule 224, configured to trigger the driver 300 to schedule the URL when the remaining concurrency value has not reached the preset threshold, and not to trigger the driver 300 to schedule the URL when the remaining concurrency value reaches the preset threshold.
In one embodiment of the present invention, the asynchronous web content grasping system includes a driver 300, configured to, after being triggered by the scheduler 200, read the task information of the task to which the URL belongs, inject the URL into the grasping pool 400 based on the task information, and control the frequency at which URLs are injected into the grasping pool 400 according to the task information, the task information including a QPS value and a concurrency value.
In an embodiment of the present invention, referring to Fig. 1, each task queue corresponds to one driver 300; it can be understood that multiple task queues correspond to multiple drivers 300.
In an embodiment of the present invention, the driver 300 is the policy controller for the task of one task queue, and each driver 300 performs the scheduling of its corresponding task. A driver 300 is triggered to start by the scheduler 200. While in the started state, it can read the QPS and the concurrency value from the task information and execute the rpop method according to the task information, so as to take URLs of web content to be grasped out of the task queue.
In some embodiments, the driver 300 is further configured to obtain an identifier of the URL and, based on the set data structure of the redis service, store the identifier in correspondence with the URL, so as to generate record information for the URL.
In an embodiment of the present invention, after generating the record information of the URL, the driver 300 may send the URL to the grasping pool 400. Each time a grasping result is returned, the worker's callback function deletes the record information of that URL from the set data structure, which effectively saves storage space.
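As a non-limiting sketch of the record lifecycle, the following Python fragment creates a record when a URL is handed to the grasping pool and removes it in a callback when a result returns. A plain dictionary stands in for the redis set data structure, and deriving the identifier from a SHA-1 digest of the URL is an assumption; the disclosure does not say how the identifier is obtained.

```python
import hashlib

# identifier -> URL; in-memory stand-in for the record store.
in_flight = {}


def url_id(url):
    """Assumed identifier scheme: hex SHA-1 digest of the URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def record(url):
    """Create the record for a URL before it is sent to the grasping pool."""
    ident = url_id(url)
    in_flight[ident] = url
    return ident


def on_result(ident):
    """Callback run when a grasping result returns: delete the record."""
    return in_flight.pop(ident, None)


first = record("http://example.com/a")
second = record("http://example.com/b")
finished = on_result(first)   # the first URL has been grasped
```

After the callback runs, only the record of the URL still being grasped remains, so the store holds no more entries than there are URLs in flight.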
In an embodiment of the present invention, the driver 300 controls the frequency at which URLs are injected into the grasping pool 400 according to the task information, which includes the QPS and the concurrency value; this keeps the concurrency for the URLs of a single website's content to be grasped controllable. By controlling the QPS of the URLs of a single website's content to be grasped, the grasping strategy for those URLs is realized.
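As a non-limiting sketch of how injection might be limited by both values at once, the following Python class admits a URL only when the minimum interval implied by the QPS has elapsed and the number of URLs in flight is below the concurrency value. The class name, the fixed-interval pacing rule, and the explicit `now` argument (used here instead of a wall clock so the behavior is deterministic) are assumptions, not the disclosed implementation.

```python
class InjectionThrottle:
    """Gate URL injection by a QPS value and a concurrency value."""

    def __init__(self, qps, concurrency):
        self.interval = 1.0 / qps      # minimum spacing between injections
        self.concurrency = concurrency
        self.next_allowed = 0.0        # earliest time the next injection may occur
        self.in_flight = 0             # URLs injected but not yet finished

    def try_inject(self, now):
        """Return True and reserve a slot if a URL may be injected at `now`."""
        if self.in_flight >= self.concurrency or now < self.next_allowed:
            return False
        self.in_flight += 1
        self.next_allowed = now + self.interval
        return True

    def done(self):
        """Release a slot when a grasping result comes back."""
        if self.in_flight > 0:
            self.in_flight -= 1


throttle = InjectionThrottle(qps=2, concurrency=2)
results = [
    throttle.try_inject(0.0),   # admitted
    throttle.try_inject(0.1),   # too soon for 2 QPS
    throttle.try_inject(0.5),   # admitted, now 2 in flight
    throttle.try_inject(1.0),   # blocked by the concurrency value
]
throttle.done()                 # one URL finishes
results.append(throttle.try_inject(1.0))  # admitted again
```

The QPS bound smooths bursts, while the concurrency bound keeps a slow back end from accumulating unfinished requests.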
In an embodiment of the present invention, while the URLs of a single website's content to be grasped are being grasped, the driver 300 may, at preset time points, scan and re-read the task information of the task to which the URL belongs. Changes to the task information can thus be monitored dynamically, further improving the flexibility of web content grasping under high concurrency.
In one embodiment of the present invention, the asynchronous web content grasping system may also include a grasping pool 400.
The grasping pool 400 contains the URLs of multiple pieces of web content to be grasped.
Specifically, the driver 300 corresponding to each task queue may place the tasks determined to currently need grasping into the grasping pool 400.
In an embodiment of the present invention, the grasping pool 400 may use the blocking queue method provided by the redis service (that is, the list data structure combined with the brpop method), which can effectively improve grasping efficiency.
In one embodiment of the present invention, the asynchronous web content grasping system includes an executor 500, configured to read the URL from the grasping pool 400 and grasp the URL.
Optionally, in some embodiments, referring to Fig. 2, the asynchronous web content grasping system also includes:
An acquisition module 500, configured to obtain the identifier of a URL whose grasping has finished as a target identifier, and to delete the record information of the URL corresponding to the target identifier from the set data structure.
In an embodiment of the present invention, the executor 500 is the execution unit that performs grasping and packages and forwards the result; there may be at least one executor 500. Each executor 500 blocks on the grasping pool 400 via the brpop method. When the grasping pool 400 is observed to receive the URL of a piece of website content to be grasped, the multiple executors 500 compete to execute that URL. Moreover, because executors 500 consume resources, they may be deployed in a distributed manner. Because the list data structure is combined with the brpop method, the list data structure and the blocked executors 500 need not be deployed on the same host, so different numbers of executors 500 can be started on hosts of different performance, thereby achieving load balancing. After performing the grasp, the executor 500 may call the callback function for the URL of the website content to be grasped, which deletes the record information of that URL in the driver 300 and thereby completes the grasping life cycle of the URL.
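As a non-limiting sketch of several executors competing on one blocking pool, the following Python fragment uses `queue.Queue` and threads as an in-process stand-in for the redis list combined with brpop. The worker count, the sentinel used for shutdown, and the example URLs are hypothetical, and the "grasp" is reduced to recording which executor took which URL.

```python
import queue
import threading

pool = queue.Queue()              # stand-in for the blocking grasping pool
results = []                      # (executor name, URL) pairs
results_lock = threading.Lock()
STOP = object()                   # sentinel telling an executor to exit


def executor(name):
    while True:
        url = pool.get()          # blocks until a URL (or STOP) is available
        if url is STOP:
            break
        with results_lock:
            results.append((name, url))   # placeholder for the actual grasp


workers = [
    threading.Thread(target=executor, args=(f"executor-{i}",)) for i in range(3)
]
for w in workers:
    w.start()

urls = [f"http://example.com/page/{i}" for i in range(10)]
for u in urls:
    pool.put(u)
for _ in workers:
    pool.put(STOP)                # one sentinel per executor
for w in workers:
    w.join()

grasped = sorted(url for _, url in results)
```

Each URL is taken by exactly one executor, and which executor takes it depends only on which one is idle, which is the property that allows more executors to be started on stronger hosts.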
As an example, referring to Fig. 3, which is a diagram of grasping efficiency in an embodiment of the present invention: before November 12, 2015, with the original asynchronous grabber, the grasping time exceeded 30 minutes, whereas the system design requires a target grasping time of less than 30 minutes. The original asynchronous grabber therefore clearly failed to meet the system design requirement, and its grasping performance was poor. After November 12, once the asynchronous web content grasping system of the embodiment of the present invention went into operation, the grasping time met the target of less than 30 minutes, grasping efficiency improved by about 20%, the load was more balanced, and concurrency control was more reasonable, reducing the combined impact of factors such as intermediate-process overhead.
In this embodiment, the URL of web content to be grasped is read from each task queue; a driver is triggered to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the task information of that task is read; the URL is injected into the grasping pool based on the task information, and the injection frequency is controlled according to the task information, which includes a QPS value and a concurrency value; and the URL is read from the grasping pool and grasped. This ensures the stability of the grasping system under high concurrency, effectively saves system resources, and improves grasping performance.
Fig. 4 is a flowchart of an asynchronous web content grasping method proposed by one embodiment of the present invention.
Referring to Fig. 4, the asynchronous web content grasping method includes:
S41: Obtain at least one task queue.
S42: Read the URL of web content to be grasped from each task queue, and trigger a driver to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs.
In an embodiment of the present invention, the environment types of the back ends hosting the tasks to which the URLs belong may be the same or different.
In some embodiments, referring to Fig. 5, step S42 specifically includes:
S51: Read the URL of web content to be grasped from each task queue, and obtain the environment type of the back end hosting the task to which the URL belongs.
S52: Obtain the concurrency information corresponding to the environment type according to the correspondence between environment types and concurrency.
S53: Judge, according to the concurrency information, whether the remaining concurrency value of the environment type reaches a preset threshold.
S54: When the remaining concurrency value has not reached the preset threshold, trigger the driver to schedule the URL; when the remaining concurrency value reaches the preset threshold, do not trigger the driver to schedule the URL.
In this embodiment, the concurrency information corresponding to the environment type is obtained according to the correspondence between environment types and concurrency; whether the remaining concurrency value of the environment type reaches a preset threshold is judged according to the concurrency information; the driver is triggered to schedule the URL when the remaining concurrency value has not reached the preset threshold and is not triggered when it has. This implements global policy control, enables linkage among back ends of multiple environment types, strengthens the control exercised by the asynchronous web content grasping system, and effectively improves the flexibility of web content grasping under high concurrency.
S43: Read the task information of the task to which the URL belongs, inject the URL into the grasping pool based on the task information, and control the frequency at which URLs are injected into the grasping pool according to the task information, the task information including a QPS value and a concurrency value.
In an embodiment of the present invention, the grasping pool may store URLs in the list data structure of a redis database.
S44: Read the URL from the grasping pool, and grasp the URL.
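As a non-limiting sketch of how steps S41 to S44 fit together, the following Python function makes a single synchronous pass over in-memory task queues. It is a simplification under stated assumptions: the function and parameter names are hypothetical, remaining concurrency is modeled as a per-environment total minus URLs already injected, QPS pacing is omitted, and `fetch` is a caller-supplied placeholder for the actual grasp.

```python
from collections import deque


def run_once(task_queues, task_info, env_total, fetch):
    """One simplified pass over S41-S44; returns the list of fetch results."""
    pool = deque()                                # stand-in for the grasping pool
    in_use = {env: 0 for env in env_total}        # URLs injected per environment
    # S41/S42: visit each task queue and look up its back-end environment type.
    for task_id, q in task_queues.items():
        info = task_info[task_id]
        env = info["env_type"]
        # S43: inject URLs while the task's concurrency value and the
        # environment's remaining concurrency both allow it.
        injected = 0
        while q and injected < info["concurrency"] and in_use[env] < env_total[env]:
            pool.append(q.popleft())
            in_use[env] += 1
            injected += 1
    # S44: read every URL from the pool and grasp it.
    return [fetch(pool.popleft()) for _ in range(len(pool))]


queues = {
    "t1": deque(["http://example.com/a", "http://example.com/b", "http://example.com/c"]),
    "t2": deque(["http://example.com/d"]),
}
info = {
    "t1": {"env_type": "env-A", "concurrency": 2},
    "t2": {"env_type": "env-A", "concurrency": 5},
}
out = run_once(queues, info, {"env-A": 3}, fetch=lambda url: url.rsplit("/", 1)[-1])
```

In this run, task `t1` is limited to two URLs by its own concurrency value, task `t2` takes the last free slot of `env-A`, and one URL of `t1` stays queued for a later pass.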
In some embodiments, referring to Fig. 6, the asynchronous web content grasping method also includes:
S61: Obtain an identifier of the URL and, based on the set data structure of the redis service, store the identifier in correspondence with the URL, so as to generate record information for the URL.
S62: Obtain the identifier of a URL whose grasping has finished as a target identifier, and delete the record information of the URL corresponding to the target identifier from the set data structure.
It should be noted that the explanations of the asynchronous web content grasping system embodiments in Figs. 1 to 3 above also apply to the asynchronous web content grasping method of this embodiment; the implementation principles are similar and are not repeated here.
In this embodiment, obtaining the identifier of a URL whose grasping has finished as a target identifier, and deleting the record information of the URL corresponding to the target identifier from the set data structure, can effectively save storage space.
In this embodiment, the URL of web content to be grasped is read from each task queue; a driver is triggered to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the task information of that task is read; the URL is injected into the grasping pool based on the task information, and the injection frequency is controlled according to the task information, which includes a QPS value and a concurrency value; and the URL is read from the grasping pool and grasped. This ensures the stability of the grasping system under high concurrency, effectively saves system resources, and improves grasping performance.
It should be noted that in describing the invention, term " first ", " second " etc. are not only used for describing purpose, and not It is understood that to indicate or implying relative importance.Additionally, in describing the invention, unless otherwise stated, the implication of " multiple " It is two or more.
In flow chart or here any process described otherwise above or method description are construed as, expression includes It is one or more for realizing specific logical function or process the step of the module of code of executable instruction, fragment or portion Point, and the scope of the preferred embodiment of the present invention includes other realization, wherein can not press shown or discussion suitable Sequence, including according to involved function by it is basic simultaneously in the way of or in the opposite order, carry out perform function, this should be of the invention Embodiment person of ordinary skill in the field understood.
It should be appreciated that each several part of the present invention can be realized with hardware, software, firmware or combinations thereof.Above-mentioned In embodiment, the software that multiple steps or method can in memory and by suitable instruction execution system be performed with storage Or firmware is realizing.For example, if realized with hardware, and in another embodiment, can be with well known in the art Any one of row technology or their combination are realizing:With for realizing the logic gates of logic function to data signal Discrete logic, the special IC with suitable combinational logic gate circuit, programmable gate array (PGA), scene Programmable gate array (FPGA) etc..
Those skilled in the art are appreciated that to realize all or part of step that above-described embodiment method is carried Suddenly the hardware that can be by program to instruct correlation is completed, and described program can be stored in a kind of computer-readable storage medium In matter, the program upon execution, including one or a combination set of the step of embodiment of the method.
Additionally, each functional unit in each embodiment of the invention can be integrated in a processing module, it is also possible to It is that unit is individually physically present, it is also possible to which two or more units are integrated in a module.Above-mentioned integrated mould Block both can be realized in the form of hardware, it would however also be possible to employ the form of software function module is realized.The integrated module is such as Fruit is realized and as independent production marketing or when using using in the form of software function module, it is also possible to be stored in a computer In read/write memory medium.
Storage medium mentioned above can be read only memory, disk or CD etc..
In the description of this specification, reference term " one embodiment ", " some embodiments ", " example ", " specifically show The description of example " or " some examples " etc. means to combine specific features, structure, material or spy that the embodiment or example are described Point is contained at least one embodiment of the present invention or example.In this manual, to the schematic representation of above-mentioned term not Necessarily refer to identical embodiment or example.And, the specific features of description, structure, material or feature can be any One or more embodiments or example in combine in an appropriate manner.
Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims (13)

1. An asynchronous network content grabbing system, characterized by comprising:
a job queue, configured to provide at least one task queue;
a scheduler, configured to read, from each task queue, the uniform resource locator (URL) of network content to be crawled, and to trigger a driver to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs;
the driver, configured to, after being triggered by the scheduler, read the task information of the task to which the URL belongs, inject the URL into a crawl pool based on the task information, and control the frequency at which the URL is injected into the crawl pool according to the task information, the task information including a queries-per-second rate and a concurrency value;
an executor, configured to read the URL from the crawl pool and to crawl the URL.
2. The asynchronous network content grabbing system according to claim 1, characterized in that the scheduler comprises:
a reading module, configured to read the URL from each task queue;
a scheduling module, configured to trigger the driver to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs.
3. The asynchronous network content grabbing system according to claim 1, characterized in that the scheduling module comprises:
a first acquisition submodule, configured to obtain the environment type of the back end hosting the task to which the URL belongs;
a second acquisition submodule, configured to obtain concurrency information corresponding to the environment type according to a correspondence between environment types and concurrency;
a judging submodule, configured to judge, according to the concurrency information, whether the remaining concurrency value of the environment type has reached a preset threshold;
a scheduling submodule, configured to trigger the driver to schedule the URL when the remaining concurrency value has not reached the preset threshold, and not to trigger the driver to schedule the URL when the remaining concurrency value has reached the preset threshold.
4. The asynchronous network content grabbing system according to claim 1, characterized in that the crawl pool stores the URL using the list data structure of a redis database.
5. The asynchronous network content grabbing system according to claim 1, characterized in that the driver is further configured to:
obtain an identifier of the URL, and store the identifier in correspondence with the corresponding URL using the set data structure of the redis service, so as to generate record information of the URL.
6. The asynchronous network content grabbing system according to claim 5, characterized by further comprising:
an acquisition module, configured to obtain the identifier of a URL that has finished being crawled as a target identifier, and to delete the record information of the URL corresponding to the target identifier from the set data structure.
7. The asynchronous network content grabbing system according to claim 1, 2, or 3, characterized in that the environment types of the back ends hosting the tasks to which the URLs belong are different or the same.
8. An asynchronous network content grabbing method, characterized by comprising the following steps:
obtaining at least one task queue;
reading, from each task queue, the uniform resource locator (URL) of network content to be crawled, and triggering a driver to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs;
reading the task information of the task to which the URL belongs, injecting the URL into a crawl pool based on the task information, and controlling the frequency at which the URL is injected into the crawl pool according to the task information, the task information including a queries-per-second rate and a concurrency value;
reading the URL from the crawl pool, and crawling the URL.
9. The asynchronous network content grabbing method according to claim 8, characterized in that triggering the driver to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs comprises:
obtaining the environment type of the back end hosting the task to which the URL belongs;
obtaining concurrency information corresponding to the environment type according to a correspondence between environment types and concurrency;
judging, according to the concurrency information, whether the remaining concurrency value of the environment type has reached a preset threshold;
triggering the driver to schedule the URL when the remaining concurrency value has not reached the preset threshold, and not triggering the driver to schedule the URL when the remaining concurrency value has reached the preset threshold.
10. The asynchronous network content grabbing method according to claim 8, characterized in that the crawl pool stores the URL using the list data structure of a redis database.
11. The asynchronous network content grabbing method according to claim 8, characterized by further comprising:
obtaining an identifier of the URL, and storing the identifier in correspondence with the corresponding URL using the set data structure of the redis service, so as to generate record information of the URL.
12. The asynchronous network content grabbing method according to claim 11, characterized by further comprising:
obtaining the identifier of a URL that has finished being crawled as a target identifier, and deleting the record information of the URL corresponding to the target identifier from the set data structure.
13. The asynchronous network content grabbing method according to claim 8 or 9, characterized in that the environment types of the back ends hosting the tasks to which the URLs belong are different or the same.
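
For illustration only, the flow recited in the claims (scheduler check per environment type, rate-limited injection into the crawl pool, URL record keeping, and the executor) might be sketched as follows. This is a minimal, synchronous sketch under stated assumptions, not the patented implementation: an in-memory deque and set stand in for the redis list and set, the names ENV_CONCURRENCY, url_id, should_trigger, drive, execute, and run are hypothetical, and "remaining concurrency reaching the threshold" is read here as the remaining slots having dropped to the threshold value.

```python
import hashlib
from collections import deque

# In-memory stand-ins (assumption) for the redis list (crawl pool)
# and the redis set (URL record information).
crawl_pool = deque()
url_records = set()

# Assumed correspondence: environment type -> concurrency limit.
ENV_CONCURRENCY = {"online": 2, "offline": 1}


def url_id(url):
    """Hypothetical URL identifier: a short hash of the URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]


def should_trigger(env_type, in_flight, threshold=0):
    """Scheduler check: trigger the driver only while the remaining
    concurrency of this environment type is still above the threshold."""
    remaining = ENV_CONCURRENCY[env_type] - in_flight.get(env_type, 0)
    return remaining > threshold


def drive(urls, qps, sleep):
    """Driver: record each URL and inject it into the crawl pool,
    pausing 1/qps between consecutive injections."""
    interval = 1.0 / qps
    for i, url in enumerate(urls):
        if i:  # no pause before the first URL
            sleep(interval)
        url_records.add(url_id(url))
        crawl_pool.append(url)


def execute(fetch):
    """Executor: drain the pool, fetch each URL, then delete its record."""
    results = []
    while crawl_pool:
        url = crawl_pool.popleft()
        results.append(fetch(url))
        url_records.discard(url_id(url))
    return results


def run(tasks, fetch, sleep):
    """Schedule each task, then crawl everything that was injected.
    in_flight is never decremented here because the sketch is synchronous."""
    in_flight = {}
    skipped = []
    for task in tasks:
        env = task["env_type"]
        if not should_trigger(env, in_flight):
            skipped.append(task["name"])
            continue
        in_flight[env] = in_flight.get(env, 0) + 1
        drive(task["urls"], task["qps"], sleep)
    return execute(fetch), skipped
```

Passing `sleep` and `fetch` as parameters keeps the sketch free of real network calls and delays; in practice they could be `time.sleep` and an HTTP client call. With a real redis client, the deque and set operations would correspond to list and set commands such as RPUSH/LPOP and SADD/SREM.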
CN201611053534.6A 2016-11-24 2016-11-24 Asynchronous network content grabbing system and method Active CN106599094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611053534.6A CN106599094B (en) 2016-11-24 2016-11-24 Asynchronous network content grabbing system and method

Publications (2)

Publication Number Publication Date
CN106599094A true CN106599094A (en) 2017-04-26
CN106599094B CN106599094B (en) 2020-05-22

Family

ID=58591924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611053534.6A Active CN106599094B (en) 2016-11-24 2016-11-24 Asynchronous network content grabbing system and method

Country Status (1)

Country Link
CN (1) CN106599094B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6377984B1 (en) * 1999-11-02 2002-04-23 Alta Vista Company Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue
US20110055194A1 (en) * 2009-08-26 2011-03-03 Oracle International Corporation System and Method for Asynchronous Crawling of Enterprise Applications
CN102184227A (en) * 2011-05-10 2011-09-14 北京邮电大学 General crawler engine system used for WEB service and working method thereof
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
CN103559083A (en) * 2013-10-11 2014-02-05 北京奇虎科技有限公司 Web crawl task scheduling method and task scheduler

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
耿令宝: "分布式环境下的网络爬虫系统研究与优化", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
陈言等: "一种网络爬虫的带缓存非阻塞异步", 《软件导刊》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291824A (en) * 2017-05-25 2017-10-24 北京小度信息科技有限公司 Data grab method and device
CN110955469A (en) * 2019-11-25 2020-04-03 中国银行股份有限公司 Method and device for online transaction by X86 platform distributed batch call
CN110955469B (en) * 2019-11-25 2023-09-26 中国银行股份有限公司 Method and device for online transaction of distributed batch call of X86 platform

Also Published As

Publication number Publication date
CN106599094B (en) 2020-05-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant