CN106599094A - Network content asynchronous grasping system and method - Google Patents
Network content asynchronous grasping system and method
- Publication number
- CN106599094A (application CN201611053534.6)
- Authority
- CN
- China
- Prior art keywords
- url
- asynchronous
- web content
- task
- grasping
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
Abstract
The invention proposes an asynchronous web content crawling system and method. The system comprises: a task queue manager for providing at least one task queue; a scheduler for reading the uniform resource locator (URL) of web content to be crawled from each task queue, and for triggering a driver to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the driver, for reading, after being triggered by the scheduler, the task information of the task to which the URL belongs, injecting the URL into a crawl pool based on the task information, and controlling the frequency at which the URL is injected into the crawl pool according to the task information, wherein the task information comprises a queries-per-second (QPS) value and a concurrency value; and an executor for reading the URL from the crawl pool and crawling the URL. The invention can ensure the stability of the crawling system under high concurrency, effectively save system resources, and improve crawling performance.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to an asynchronous web content crawling system and method.
Background
With the development of the Internet, the Internet has come to hold a massive amount of web content. In some application scenarios, computer techniques are needed to extract the web content that a user needs from this mass of content; such a technique is referred to as crawling. For example, web content can be fetched by using a crawler.
In the related art, a crawler adopts either a concurrency control strategy or a queries-per-second (Query Per Second, QPS) control strategy. Under the concurrency control strategy, threads or processes independently control the total length of the concurrent queue, and each process or thread performs crawling synchronously, which keeps the total queue length fixed and the pressure on the system constant. Under the QPS control strategy, crawling is performed at a fixed frequency.
Under both approaches the control granularity is too coarse. For slow back-end systems, crawling performance is poor, the stability of web content crawling cannot be adequately guaranteed, and an avalanche effect in the crawling system is easily triggered.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, one object of the present invention is to propose an asynchronous web content crawling system that can guarantee the stability of the crawling system under high concurrency, effectively save system resources, and improve crawling performance.
Another object of the present invention is to propose an asynchronous web content crawling method.
To achieve the above objects, the asynchronous web content crawling system proposed by an embodiment of the first aspect of the present invention includes: a task queue manager for providing at least one task queue; a scheduler for reading the uniform resource locator (URL) of web content to be crawled from each task queue, and for triggering a driver to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; a driver for reading, after being triggered by the scheduler, the task information of the task to which the URL belongs, injecting the URL into a crawl pool based on the task information, and controlling the frequency at which the URL is injected into the crawl pool according to the task information, the task information including a queries-per-second (QPS) value and a concurrency value; and an executor for reading the URL from the crawl pool and crawling the URL.
In the asynchronous web content crawling system proposed by the embodiment of the first aspect of the present invention, the URL of web content to be crawled is read from each task queue; a driver is triggered to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the task information of that task, which includes a queries-per-second (QPS) value and a concurrency value, is read; the URL is injected into the crawl pool based on the task information, with the injection frequency controlled according to the task information; and the URL is read from the crawl pool and crawled. This guarantees the stability of the crawling system under high concurrency, effectively saves system resources, and improves crawling performance.
To achieve the above objects, the asynchronous web content crawling method proposed by an embodiment of the second aspect of the present invention includes: obtaining at least one task queue; reading the uniform resource locator (URL) of web content to be crawled from each task queue, and triggering a driver to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; reading the task information of the task to which the URL belongs, injecting the URL into a crawl pool based on the task information, and controlling the frequency at which the URL is injected into the crawl pool according to the task information, the task information including a queries-per-second (QPS) value and a concurrency value; and reading the URL from the crawl pool and crawling the URL.
In the asynchronous web content crawling method proposed by the embodiment of the second aspect of the present invention, the URL of web content to be crawled is read from each task queue; a driver is triggered to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the task information of that task, which includes a queries-per-second (QPS) value and a concurrency value, is read; the URL is injected into the crawl pool based on the task information, with the injection frequency controlled according to the task information; and the URL is read from the crawl pool and crawled. This guarantees the stability of the crawling system under high concurrency, effectively saves system resources, and improves crawling performance.
Additional aspects and advantages of the present invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a structural schematic diagram of the asynchronous web content crawling system proposed by one embodiment of the present invention;
Fig. 2 is a structural schematic diagram of the asynchronous web content crawling system proposed by another embodiment of the present invention;
Fig. 3 is a schematic diagram of crawling efficiency in an embodiment of the present invention;
Fig. 4 is a schematic flowchart of the asynchronous web content crawling method proposed by one embodiment of the present invention;
Fig. 5 is a schematic flowchart of the asynchronous web content crawling method proposed by another embodiment of the present invention;
Fig. 6 is a schematic flowchart of the asynchronous web content crawling method proposed by another embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it. On the contrary, the embodiments of the present invention cover all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
Fig. 1 is a structural schematic diagram of the asynchronous web content crawling system proposed by one embodiment of the present invention.
Referring to Fig. 1, the asynchronous web content crawling system includes: a task queue manager 100 for providing at least one task queue; a scheduler 200 for reading the uniform resource locator (URL) of web content to be crawled from each task queue, and for triggering a driver 300 to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the driver 300, for reading, after being triggered by the scheduler 200, the task information of the task to which the URL belongs, injecting the URL into a crawl pool 400 based on the task information, and controlling the frequency at which the URL is injected into the crawl pool 400 according to the task information, the task information including a queries-per-second (QPS) value and a concurrency value; and an executor 500 for reading the URL from the crawl pool 400 and crawling the URL.
In one embodiment of the present invention, the asynchronous web content crawling system includes a task queue manager 100 for providing at least one task queue.
In embodiments of the present invention, task queues are placed in the task queue manager 100 in advance. There is at least one task queue, and each task queue contains the URL of at least one item of web content to be crawled.
In embodiments of the present invention, before a task queue is placed in the task queue manager 100, the task information of each task queue can be configured. The task information may include, for example, the ID of the task to which the task queue belongs, the environment type of the back end hosting the task, the QPS, and the concurrency value required to execute the task. Further, once the task information of each task queue has been configured, it can be written to a data table in a database for use in subsequent scheduling; no limitation is imposed in this regard.
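The task information described above can be pictured as a small record per task queue. The following is a minimal sketch only: the field names, the `TaskInfo` class, and the in-memory `task_table` dictionary (standing in for the database table) are illustrative assumptions and do not come from the patent text.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TaskInfo:
    """Hypothetical per-queue task information (field names are assumptions)."""
    task_id: str       # ID of the task the queue belongs to
    backend_env: str   # environment type of the back end hosting the task
    qps: float         # queries per second allowed for the task
    concurrency: int   # concurrency value required to execute the task


# A plain dict keyed by task ID stands in for the database table.
task_table = {}


def save_task_info(info):
    """Validate the task information and write it to the stand-in table."""
    if info.qps <= 0 or info.concurrency <= 0:
        raise ValueError("qps and concurrency must be positive")
    task_table[info.task_id] = asdict(info)


save_task_info(TaskInfo(task_id="task-1", backend_env="env-A", qps=5.0, concurrency=10))
```

The scheduler and driver sketched later in this description would read these records rather than hard-coding limits.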
In embodiments of the present invention, each task queue also contains additional information for the URL of each item of web content to be crawled, for example a header (Header), for use in subsequent scheduling; no limitation is imposed in this regard.
In embodiments of the present invention, a task queue can use the list data structure provided by the redis service to push and pop the URLs of web content to be crawled, thereby implementing the queue. For example, when a URL in the task queue needs to be scheduled, it can be popped with the rpop method; when a URL needs to be written to the task queue, it can be pushed with the rpush method. This is simple to operate and easy to implement.
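To make the push and pop behaviour concrete without requiring a running redis server, the sketch below mimics the relevant list commands in memory. The `ListQueue` class is a hypothetical stand-in, not a redis client. It illustrates one property of the real commands worth keeping in mind: `rpush` and `rpop` both act on the tail of the list, so used together they return the most recently pushed URL first, while pairing `lpush` with `rpop` returns URLs in arrival order.

```python
from collections import deque


class ListQueue:
    """In-memory stand-in mimicking a few redis list commands."""

    def __init__(self):
        self._items = deque()

    def rpush(self, *values):
        """Append values at the tail; return the new length."""
        self._items.extend(values)
        return len(self._items)

    def lpush(self, *values):
        """Insert values one by one at the head; return the new length."""
        for value in values:
            self._items.appendleft(value)
        return len(self._items)

    def rpop(self):
        """Remove and return the tail element, or None if the list is empty."""
        return self._items.pop() if self._items else None


# rpush + rpop: the last URL pushed is the first one popped.
tail_queue = ListQueue()
tail_queue.rpush("url-1", "url-2", "url-3")
tail_order = [tail_queue.rpop() for _ in range(3)]

# lpush + rpop: URLs come out in the order they were pushed.
fifo_queue = ListQueue()
fifo_queue.lpush("url-1", "url-2", "url-3")
fifo_order = [fifo_queue.rpop() for _ in range(3)]
```

Which pairing is appropriate depends on whether crawl order matters for the task; the text above only names the two commands.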
In one embodiment of the present invention, the asynchronous web content crawling system includes a scheduler 200 for reading the uniform resource locator (URL) of web content to be crawled from each task queue, and for triggering the driver 300 to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs.
In embodiments of the present invention, the scheduler 200 implements global policy control. During web content crawling, the scheduler 200 can traverse the URLs of web content to be crawled in each task queue and obtain the environment type of the back end hosting the task to which each URL belongs, so as to schedule according to that environment type. For example, the environment type of the back end can be used to determine which tasks currently need to be crawled, which need to be paused, and which need to be terminated. This allows back ends of multiple environment types to be coordinated, strengthens the control exercised by the asynchronous web content crawling system, and effectively improves the flexibility of web content crawling under high concurrency.
For example, the background server of the scheduler 200 can read a preset data table that records each environment type together with its corresponding concurrency information. The concurrency information is, for example, the total concurrency value and the single-environment concurrency value of the back ends of environment type A. The total concurrency value is the sum of the numbers of URLs of web content to be crawled that all back ends of environment type A can carry, and the single-environment concurrency value is the number of such URLs that one back end of environment type A can carry. Further, after reading the environment type of the back end hosting the task to which a URL belongs, the scheduler 200 can calculate the remaining concurrency value of that environment type at the current point in time; if the remaining concurrency value is insufficient, crawling of the URL is not triggered. In embodiments of the present invention, because different environment types share nothing with one another, the scheduler 200 can schedule multiple task queues independently without their affecting each other, which decouples policy from framework.
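The remaining-concurrency check described above can be sketched as follows. This is an illustration under stated assumptions: the `EnvConcurrency` record, the `in_flight` counter, and the rule that "insufficient" means no remaining capacity are all hypothetical choices, since the text does not define how the remaining value is computed.

```python
from dataclasses import dataclass


@dataclass
class EnvConcurrency:
    """Hypothetical row of the preset data table for one environment type."""
    total: int          # total concurrency value across all back ends of this type
    per_backend: int    # single-environment concurrency value of one back end
    in_flight: int = 0  # URLs currently being crawled against this type (assumed)

    def remaining(self):
        return self.total - self.in_flight


def schedule(urls, env_table):
    """Split (url, env_type) pairs into triggered and deferred URLs.

    A URL is triggered only while its environment type still has remaining
    concurrency; environment types are accounted for independently.
    """
    triggered, deferred = [], []
    for url, env_type in urls:
        info = env_table.get(env_type)
        if info is not None and info.remaining() > 0:
            info.in_flight += 1
            triggered.append(url)
        else:
            deferred.append(url)
    return triggered, deferred


env_table = {
    "A": EnvConcurrency(total=2, per_backend=1),
    "B": EnvConcurrency(total=1, per_backend=1),
}
pending = [("u1", "A"), ("u2", "A"), ("u3", "A"), ("u4", "B")]
triggered, deferred = schedule(pending, env_table)
```

Here the third URL for environment type A is deferred because A is exhausted, while the URL for type B is still triggered, which mirrors the independence between environment types noted in the text.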
In embodiments of the present invention, when the scheduler 200 determines that the URL of an item of web content to be crawled belongs to a task that currently needs to be crawled, it can further set the state of that task to "executing" and start the driver 300 corresponding to the task.
Optionally, in some embodiments, referring to Fig. 2, the scheduler 200 includes:
A reading module 210 for reading the URL from each task queue.
A scheduling module 220 for triggering the driver 300 to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs.
Optionally, in some embodiments, referring to Fig. 2, the scheduling module 220 includes:
A first acquisition submodule 221 for obtaining the environment type of the back end hosting the task to which the URL belongs.
A second acquisition submodule 222 for obtaining the concurrency information corresponding to the environment type according to the correspondence between environment types and concurrency information.
A judging submodule 223 for judging, according to the concurrency information, whether the remaining concurrency value of the environment type reaches a preset threshold.
A scheduling submodule 224 for triggering the driver 300 to schedule the URL when the remaining concurrency value does not reach the preset threshold, and for not triggering the driver 300 to schedule the URL when the remaining concurrency value reaches the preset threshold.
In one embodiment of the present invention, the asynchronous web content crawling system includes a driver 300 for reading, after being triggered by the scheduler 200, the task information of the task to which the URL belongs, injecting the URL into the crawl pool 400 based on the task information, and controlling the frequency at which the URL is injected into the crawl pool 400 according to the task information, the task information including a queries-per-second (QPS) value and a concurrency value.
In embodiments of the present invention, referring to Fig. 1, each task queue corresponds to one driver 300; it will be understood that multiple task queues correspond to multiple drivers 300.
In embodiments of the present invention, a driver 300 is the policy controller for the task of one task queue, and each driver 300 performs the scheduling of its corresponding task. A driver 300 can be started by the scheduler 200. While in the started state, the driver 300 can read the QPS and the concurrency value in the task information and execute the rpop method according to the task information, so as to schedule URLs of web content to be crawled from the task queue.
In some embodiments, the driver 300 is further configured to obtain an identifier of the URL and, based on the set data structure of the redis service, store the identifier in correspondence with the URL to generate record information for the URL.
In embodiments of the present invention, after generating the record information of a URL, the driver 300 can send the URL to the crawl pool 400. Each time a crawl result is returned, the worker's callback function can delete the record information of that URL from the set data structure, which effectively saves storage space.
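The record-keeping life cycle described above (record a URL before it enters the crawl pool, delete the record when the result comes back) can be sketched in memory as follows. The `InFlightRecords` class is a hypothetical stand-in for the redis set, and deriving the identifier as a SHA-1 digest of the URL is purely an assumption; the text does not say how identifiers are produced.

```python
import hashlib


def url_id(url):
    """Derive an identifier for a URL (SHA-1 hex digest; an assumed scheme)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class InFlightRecords:
    """In-memory stand-in for the record information kept per in-flight URL."""

    def __init__(self):
        self._by_id = {}

    def add(self, url):
        """Record a URL before it is sent to the crawl pool; return its ID."""
        uid = url_id(url)
        self._by_id[uid] = url
        return uid

    def on_crawl_done(self, uid):
        """Callback run when a crawl result returns; True if a record was removed."""
        return self._by_id.pop(uid, None) is not None

    def __len__(self):
        return len(self._by_id)


records = InFlightRecords()
first = records.add("http://example.com/a")
second = records.add("http://example.com/b")
count_before = len(records)              # both URLs are in flight
removed = records.on_crawl_done(first)   # result for the first URL arrives
count_after = len(records)               # only the second URL remains
```

Because records are removed as soon as results return, the structure only ever holds URLs that are currently in flight.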
In embodiments of the present invention, the driver 300 controls the frequency at which URLs are injected into the crawl pool 400 according to the task information, which includes the QPS and the concurrency value. This keeps the concurrency of the URLs of a single website's content to be crawled under control. By controlling the QPS of the URLs of a single website's content to be crawled, a crawling policy for the URLs of that website is implemented.
In embodiments of the present invention, while the URLs of a single website's content are being crawled, the driver 300 can scan and read the task information of the task to which the URLs belong at preset points in time. This allows changes to the task information to be monitored dynamically and further improves the flexibility of web content crawling under high concurrency.
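The periodic re-reading of task information can be sketched as a small cache that reloads after a fixed interval. The `RefreshingTaskInfo` class, the `load` callable, and the interval-based reload rule are illustrative assumptions; the text only states that the driver reads the task information at preset points in time.

```python
class RefreshingTaskInfo:
    """Cache task information and reload it once the refresh interval elapses."""

    def __init__(self, load, refresh_interval):
        self._load = load                  # callable returning the current task info
        self._interval = refresh_interval  # seconds between reloads
        self._loaded_at = None
        self._info = None

    def get(self, now):
        """Return task info, reloading it if the interval has elapsed at `now`."""
        if self._loaded_at is None or now - self._loaded_at >= self._interval:
            self._info = dict(self._load())  # copy so later edits are not seen early
            self._loaded_at = now
        return self._info


stored = {"qps": 5, "concurrency": 10}   # stand-in for the database row
cache = RefreshingTaskInfo(lambda: stored, refresh_interval=10.0)

qps_initial = cache.get(0.0)["qps"]      # loaded on first use
stored["qps"] = 2                        # an operator lowers the QPS
qps_before_refresh = cache.get(5.0)["qps"]   # still the cached value
qps_after_refresh = cache.get(10.0)["qps"]   # reloaded at the preset point
```

A driver using such a cache would pick up a lowered QPS or concurrency value at the next reload without being restarted.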
In one embodiment of the present invention, the asynchronous web content crawling system may also include a crawl pool 400.
The crawl pool 400 contains the URLs of multiple items of web content to be crawled.
Specifically, the driver 300 corresponding to each task queue can place the tasks determined to currently need crawling into the crawl pool 400.
In embodiments of the present invention, the crawl pool 400 can use the blocking queue mechanism provided by the redis service (that is, the list data structure combined with the brpop method), which can effectively improve crawling efficiency.
In one embodiment of the present invention, the asynchronous web content crawling system includes an executor 500 for reading the URL from the crawl pool 400 and crawling the URL.
Optionally, in some embodiments, referring to Fig. 2, the asynchronous web content crawling system also includes:
An acquisition module 500 for obtaining the identifier of a URL whose crawling has finished as a target identifier, and for deleting the record information of the URL corresponding to the target identifier from the set data structure.
In embodiments of the present invention, the executor 500 is the execution unit that performs crawling and packages and forwards the results; there can be one or more executors 500. The executors 500 block on the crawl pool 400 via the brpop method; when the crawl pool 400 is observed to receive the URL of an item of website content to be crawled, multiple executors 500 compete to take and execute that URL. Moreover, because the executors 500 consume resources, in embodiments of the present invention they can be deployed in a distributed manner. Because the list data structure is used together with the brpop method, the list data structure and the blocked executors 500 need not be deployed on the same host, so different numbers of executors 500 can be started on hosts of different performance, thereby achieving load balancing. After finishing a crawl, an executor 500 can call the callback function for the URL of the website content to be crawled, which deletes the record information of that URL held by the driver 300 and thereby completes the crawl life cycle of the URL.
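The competing-executor pattern described above can be sketched with threads and a blocking in-process queue. This is an analogy under stated assumptions, not the patented deployment: `queue.Queue.get()` stands in for the blocking `brpop` call, threads stand in for executors that could run on separate hosts, and the `fetch` callable is a hypothetical placeholder for the actual crawl.

```python
import queue
import threading


def run_executors(urls, num_executors, fetch):
    """Crawl `urls` with several executors that block on a shared pool.

    Each URL is taken by exactly one executor; returns {url: fetch(url)}.
    """
    pool = queue.Queue()   # stand-in for the crawl pool
    results = {}
    results_lock = threading.Lock()
    stop = object()        # sentinel telling an executor to exit

    def executor():
        while True:
            url = pool.get()          # blocks until an item is available
            if url is stop:
                break
            body = fetch(url)         # perform the crawl
            with results_lock:
                results[url] = body   # stand-in for packaging and forwarding

    threads = [threading.Thread(target=executor) for _ in range(num_executors)]
    for thread in threads:
        thread.start()
    for url in urls:
        pool.put(url)
    for _ in threads:
        pool.put(stop)                # one sentinel per executor
    for thread in threads:
        thread.join()
    return results


crawled = run_executors(["u1", "u2", "u3", "u4", "u5"], 3, lambda url: url.upper())
```

Because every executor pulls work only when it is free, faster executors naturally take more URLs, which is the load-balancing effect the text attributes to the blocking pop.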
As an example, refer to Fig. 3, which is a schematic diagram of crawling efficiency in an embodiment of the present invention. As Fig. 3 shows, before 12 November 2015, with the original asynchronous crawler, the crawl time exceeded 30 minutes, whereas the system design required a target crawl time of less than 30 minutes. The original asynchronous crawler clearly failed to meet the system design requirement, and its crawling performance was poor. After 12 November, once the asynchronous web content crawling system of the embodiment of the present invention went into operation, the crawl time met the target of less than 30 minutes and crawling efficiency improved by about 20%, owing to the combined effect of factors such as more balanced load, more reasonable concurrency control, and reduced intermediate-process overhead.
In this embodiment, the URL of web content to be crawled is read from each task queue; a driver is triggered to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the task information of that task, which includes a queries-per-second (QPS) value and a concurrency value, is read; the URL is injected into the crawl pool based on the task information, with the injection frequency controlled according to the task information; and the URL is read from the crawl pool and crawled. This guarantees the stability of the crawling system under high concurrency, effectively saves system resources, and improves crawling performance.
Fig. 4 is a schematic flowchart of the asynchronous web content crawling method proposed by one embodiment of the present invention.
Referring to Fig. 4, the asynchronous web content crawling method includes:
S41: Obtain at least one task queue.
S42: Read the uniform resource locator (URL) of web content to be crawled from each task queue, and trigger a driver to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs.
In embodiments of the present invention, the environment types of the back ends hosting the tasks to which different URLs belong may be different or the same.
In some embodiments, referring to Fig. 5, step S42 specifically includes:
S51: Read the uniform resource locator (URL) of web content to be crawled from each task queue, and obtain the environment type of the back end hosting the task to which the URL belongs.
S52: Obtain the concurrency information corresponding to the environment type according to the correspondence between environment types and concurrency information.
S53: Judge, according to the concurrency information, whether the remaining concurrency value of the environment type reaches a preset threshold.
S54: When the remaining concurrency value does not reach the preset threshold, trigger the driver to schedule the URL; when the remaining concurrency value reaches the preset threshold, do not trigger the driver to schedule the URL.
In this embodiment, the concurrency information corresponding to the environment type of the back end hosting the task to which the URL belongs is obtained according to the correspondence between environment types and concurrency information, and whether the remaining concurrency value of the environment type reaches a preset threshold is judged according to that concurrency information. When the remaining concurrency value does not reach the preset threshold, the driver is triggered to schedule the URL; when it reaches the preset threshold, the driver is not triggered. This implements global policy control, allows back ends of multiple environment types to be coordinated, strengthens the control exercised by the asynchronous web content crawling system, and effectively improves the flexibility of web content crawling under high concurrency.
S43: Read the task information of the task to which the URL belongs, inject the URL into the crawl pool based on the task information, and control the frequency at which the URL is injected into the crawl pool according to the task information, the task information including a queries-per-second (QPS) value and a concurrency value.
In embodiments of the present invention, the crawl pool can store URLs using the list data structure of a redis database.
S44: Read the URL from the crawl pool and crawl the URL.
In some embodiments, referring to Fig. 6, the asynchronous web content crawling method also includes:
S61: Obtain an identifier of the URL and, based on the set data structure of the redis service, store the identifier in correspondence with the URL to generate record information for the URL.
S62: Obtain the identifier of a URL whose crawling has finished as a target identifier, and delete the record information of the URL corresponding to the target identifier from the set data structure.
It should be noted that the explanation of the asynchronous web content crawling system embodiments given above for Figs. 1 to 3 also applies to the asynchronous web content crawling method of this embodiment; the implementation principle is similar and is not repeated here.
In this embodiment, the identifier of a URL whose crawling has finished is obtained as a target identifier, and the record information of the URL corresponding to the target identifier is deleted from the set data structure, which effectively saves storage space.
In this embodiment, the URL of web content to be crawled is read from each task queue; a driver is triggered to schedule the URL according to the environment type of the back end hosting the task to which the URL belongs; the task information of that task, which includes a queries-per-second (QPS) value and a concurrency value, is read; the URL is injected into the crawl pool based on the task information, with the injection frequency controlled according to the task information; and the URL is read from the crawl pool and crawled. This guarantees the stability of the crawling system under high concurrency, effectively saves system resources, and improves crawling performance.
It should be noted that in describing the invention, term " first ", " second " etc. are not only used for describing purpose, and not
It is understood that to indicate or implying relative importance.Additionally, in describing the invention, unless otherwise stated, the implication of " multiple "
It is two or more.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention pertain.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one of the following technologies well known in the art, or a combination of them, can be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiment methods can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, may each exist physically as separate units, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.
Claims (13)
1. An asynchronous network content grabbing system, characterized by comprising:
a task queue manager, configured to provide at least one task queue;
a scheduler, configured to read, from each task queue, the uniform resource locator (URL) of network content to be
grabbed, and to trigger a driver to schedule the URL according to the environment type of the back end on which
the task to which the URL belongs is located;
the driver, configured to, after being triggered by the scheduler, read task information of the task to which the
URL belongs, inject the URL into a grabbing pool based on the task information, and control the frequency at which
the URL is injected into the grabbing pool according to the task information, wherein the task information
includes a queries-per-second (QPS) rate and a concurrency value;
an executor, configured to read the URL from the grabbing pool and to grab the URL.
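As an illustration only, the frequency control attributed to the driver in claim 1 can be sketched as follows. The names `TaskInfo`, `Driver`, `try_inject`, and `mark_done` are hypothetical and do not come from the patent; the sketch assumes the QPS rate is enforced as a minimum spacing between injections and the concurrency value as a cap on URLs injected but not yet finished.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TaskInfo:
    qps: float        # assumed meaning: maximum injections per second for the task
    concurrency: int  # assumed meaning: maximum URLs injected but not yet finished


class Driver:
    """Injects URLs into a grabbing pool, paced by the task's QPS and concurrency."""

    def __init__(self, info: TaskInfo, pool: deque):
        self.info = info
        self.pool = pool
        self.in_flight = 0        # URLs injected and not yet reported as finished
        self.next_allowed = 0.0   # earliest time at which the next injection may occur

    def try_inject(self, url: str, now: float) -> bool:
        if self.in_flight >= self.info.concurrency:
            return False          # concurrency value exhausted
        if now < self.next_allowed:
            return False          # injecting now would exceed the QPS rate
        self.pool.append(url)
        self.in_flight += 1
        self.next_allowed = now + 1.0 / self.info.qps
        return True

    def mark_done(self) -> None:
        # Called when an executor has finished grabbing one URL of this task.
        self.in_flight -= 1
```

Passing the current time as `now` keeps the sketch deterministic; a real driver would presumably read a monotonic clock instead.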
2. The asynchronous network content grabbing system as claimed in claim 1, characterized in that the scheduler
comprises:
a reading module, configured to read the URL from each task queue;
a scheduling module, configured to trigger the driver to schedule the URL according to the environment type of the
back end on which the task to which the URL belongs is located.
3. The asynchronous network content grabbing system as claimed in claim 1, characterized in that the scheduling
module comprises:
a first acquisition submodule, configured to obtain the environment type of the back end on which the task to
which the URL belongs is located;
a second acquisition submodule, configured to obtain concurrency information corresponding to the environment type
according to a correspondence between environment types and concurrency;
a judging submodule, configured to judge, according to the concurrency information, whether the remaining
concurrency value of the environment type reaches a preset threshold;
a scheduling submodule, configured to trigger the driver to schedule the URL when the remaining concurrency value
does not reach the preset threshold, and not to trigger the driver to schedule the URL when the remaining
concurrency value reaches the preset threshold.
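The decision made by the scheduling submodule in claim 3 might look like the minimal sketch below. The table `ENV_CONCURRENCY`, its environment names, and the function `should_trigger_driver` are assumptions for illustration. The claim does not define how the remaining value "reaches" the threshold; here it is assumed to mean that the unused concurrency has fallen to the threshold or below.

```python
from typing import Dict

# Hypothetical correspondence between environment type and its concurrency limit.
ENV_CONCURRENCY: Dict[str, int] = {"env_a": 100, "env_b": 50}


def should_trigger_driver(env_type: str, in_use: Dict[str, int], threshold: int = 0) -> bool:
    """Return True if the driver should be triggered for a URL of this environment type."""
    limit = ENV_CONCURRENCY[env_type]            # concurrency information for the type
    remaining = limit - in_use.get(env_type, 0)  # remaining concurrency value
    # Assumption: "reaching the threshold" means remaining <= threshold.
    return remaining > threshold
```

Under this reading, a URL whose back-end environment is saturated stays in its task queue until capacity is released.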
4. The asynchronous network content grabbing system as claimed in claim 1, characterized in that the grabbing pool
stores the URL using the list data structure of a redis database.
5. The asynchronous network content grabbing system as claimed in claim 1, characterized in that the driver is
further configured to:
obtain an identifier of the URL, and store the identifier in correspondence with the corresponding URL based on the
set data structure of the redis service, so as to generate record information of the URL.
6. The asynchronous network content grabbing system as claimed in claim 5, characterized by further comprising:
an acquisition module, configured to obtain the identifier of a URL whose grabbing has finished as a target
identifier, and to delete the record information of the URL corresponding to the target identifier from the set
data structure.
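The list-backed grabbing pool of claim 4 and the set-backed record information of claims 5 and 6 can be pictured with the sketch below. To stay runnable without a server it uses `InMemoryRedis`, a hypothetical stand-in whose method names mirror the Redis commands LPUSH, RPOP, SADD, SREM, and SISMEMBER; the key names and the choice of a SHA-1 digest of the URL as its identifier are likewise assumptions, not details from the patent.

```python
import hashlib
from collections import deque

POOL_KEY = "grab_pool"       # hypothetical key of the list holding URLs to grab
RECORD_KEY = "grab_records"  # hypothetical key of the set holding record information


class InMemoryRedis:
    """Tiny local stand-in for the few Redis list and set commands used here."""

    def __init__(self):
        self.lists = {}
        self.sets = {}

    def lpush(self, key, value):
        self.lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        items = self.lists.get(key)
        return items.pop() if items else None

    def sadd(self, key, member):
        self.sets.setdefault(key, set()).add(member)

    def srem(self, key, member):
        self.sets.get(key, set()).discard(member)

    def sismember(self, key, member):
        return member in self.sets.get(key, set())


def url_id(url: str) -> str:
    # Assumed identifier scheme: a digest derived from the URL itself.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def inject(store: InMemoryRedis, url: str) -> str:
    """Record the URL's identifier in the set, then push the URL onto the pool list."""
    uid = url_id(url)
    store.sadd(RECORD_KEY, uid)
    store.lpush(POOL_KEY, url)
    return uid


def finish(store: InMemoryRedis, url: str) -> None:
    """Delete the record of a URL whose grabbing has finished."""
    store.srem(RECORD_KEY, url_id(url))
```

Pushing on the left and popping on the right makes the list behave as a first-in, first-out pool, which is one plausible reading of how executors consume it.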
7. The asynchronous network content grabbing system as claimed in claim 1, 2, or 3, characterized in that the
environment types of the back ends on which the tasks to which the URLs belong are located are different or the same.
8. An asynchronous network content grabbing method, characterized by comprising the following steps:
obtaining at least one task queue;
reading, from each task queue, the uniform resource locator (URL) of network content to be grabbed, and triggering
a driver to schedule the URL according to the environment type of the back end on which the task to which the URL
belongs is located;
reading task information of the task to which the URL belongs, injecting the URL into a grabbing pool based on the
task information, and controlling the frequency at which the URL is injected into the grabbing pool according to
the task information, wherein the task information includes a queries-per-second (QPS) rate and a concurrency
value;
reading the URL from the grabbing pool, and grabbing the URL.
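For orientation, the overall flow of the method in claim 8 can be sketched with Python's `asyncio`. This is a simplified illustration under stated assumptions, not the patented implementation: `fake_grab` is a placeholder for a real non-blocking fetch, the grabbing pool is an `asyncio.Queue` rather than a redis list, one QPS rate is applied to all tasks, and the environment-type check of the scheduling step is omitted.

```python
import asyncio


async def fake_grab(url: str) -> str:
    # Placeholder for a real non-blocking fetch (hypothetical; no network access).
    await asyncio.sleep(0)
    return f"content of {url}"


async def executor(pool: asyncio.Queue, results: dict) -> None:
    # Executor: read URLs from the grabbing pool and grab them.
    while True:
        url = await pool.get()
        results[url] = await fake_grab(url)
        pool.task_done()


async def run(task_queues: dict, qps: float, workers: int = 2) -> dict:
    pool: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    execs = [asyncio.create_task(executor(pool, results)) for _ in range(workers)]
    for urls in task_queues.values():      # read URLs from each task queue
        for url in urls:
            await pool.put(url)            # inject the URL into the grabbing pool
            await asyncio.sleep(1.0 / qps) # pace injection by the QPS rate
    await pool.join()                      # wait until every injected URL is grabbed
    for task in execs:
        task.cancel()
    await asyncio.gather(*execs, return_exceptions=True)
    return results
```

Because injection and grabbing run as separate coroutines, the producer side can be throttled without blocking executors that are already working on earlier URLs.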
9. The asynchronous network content grabbing method as claimed in claim 8, characterized in that triggering the
driver to schedule the URL according to the environment type of the back end on which the task to which the URL
belongs is located comprises:
obtaining the environment type of the back end on which the task to which the URL belongs is located;
obtaining concurrency information corresponding to the environment type according to a correspondence between
environment types and concurrency;
judging, according to the concurrency information, whether the remaining concurrency value of the environment type
reaches a preset threshold;
triggering the driver to schedule the URL when the remaining concurrency value does not reach the preset
threshold, and not triggering the driver to schedule the URL when the remaining concurrency value reaches the
preset threshold.
10. The asynchronous network content grabbing method as claimed in claim 8, characterized in that the grabbing pool
stores the URL using the list data structure of a redis database.
11. The asynchronous network content grabbing method as claimed in claim 8, characterized by further comprising:
obtaining an identifier of the URL, and storing the identifier in correspondence with the corresponding URL based
on the set data structure of the redis service, so as to generate record information of the URL.
12. The asynchronous network content grabbing method as claimed in claim 11, characterized by further comprising:
obtaining the identifier of a URL whose grabbing has finished as a target identifier, and deleting the record
information of the URL corresponding to the target identifier from the set data structure.
13. The asynchronous network content grabbing method as claimed in claim 8 or 9, characterized in that the
environment types of the back ends on which the tasks to which the URLs belong are located are different or the same.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611053534.6A CN106599094B (en) | 2016-11-24 | 2016-11-24 | Asynchronous network content grabbing system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599094A (en) | 2017-04-26 |
CN106599094B (en) | 2020-05-22 |
Family
ID=58591924
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611053534.6A Active CN106599094B (en) | 2016-11-24 | 2016-11-24 | Asynchronous network content grabbing system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599094B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6377984B1 (en) * | 1999-11-02 | 2002-04-23 | Alta Vista Company | Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue |
US20110055194A1 (en) * | 2009-08-26 | 2011-03-03 | Oracle International Corporation | System and Method for Asynchronous Crawling of Enterprise Applications |
CN102184227A (en) * | 2011-05-10 | 2011-09-14 | 北京邮电大学 | General crawler engine system used for WEB service and working method thereof |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN103559083A (en) * | 2013-10-11 | 2014-02-05 | 北京奇虎科技有限公司 | Web crawl task scheduling method and task scheduler |
Non-Patent Citations (2)
Title |
---|
耿令宝: "分布式环境下的网络爬虫系统研究与优化", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
陈言等: "一种网络爬虫的带缓存非阻塞异步", 《软件导刊》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291824A (en) * | 2017-05-25 | 2017-10-24 | 北京小度信息科技有限公司 | Data grab method and device |
CN110955469A (en) * | 2019-11-25 | 2020-04-03 | 中国银行股份有限公司 | Method and device for online transaction by X86 platform distributed batch call |
CN110955469B (en) * | 2019-11-25 | 2023-09-26 | 中国银行股份有限公司 | Method and device for online transaction of distributed batch call of X86 platform |
Also Published As
Publication number | Publication date |
---|---|
CN106599094B (en) | 2020-05-22 |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |