CN108268498A - Method and apparatus for processing batch crawler tasks - Google Patents
Method and apparatus for processing batch crawler tasks
- Publication number
- CN108268498A (application CN201611261546.8A)
- Authority
- CN
- China
- Prior art keywords
- URL
- crawl
- crawler task
- task
- crawler
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and an apparatus for processing batch crawler tasks. The method includes: obtaining multiple URLs to crawl that share the same configuration information; generating a crawler task based on the configuration information and the multiple URLs; injecting the crawler task into a crawler program; and executing the crawler task with the crawler program to obtain the crawl results of the multiple URLs. The invention addresses the technical problem that processing a batch of crawl tasks with the same configuration information is inefficient.
Description
Technical field
The present invention relates to the field of crawler task processing, and in particular to a method and an apparatus for processing batch crawler tasks.
Background
In existing temporary crawler technology, before each website (URL, a network resource address) is crawled, the relevant crawl configuration information is set, such as the crawl depth, the number of pages to crawl, and whether the page is a directory page. The crawler program crawls according to this configuration information and finally stores the result of crawling the website in a database or in a file. In this process, however, one URL corresponds to exactly one task, and one task corresponds to exactly one stored result. When a batch of tasks with the same configuration information needs to be crawled, the injection process has to be repeated for each task, which is inefficient for the user of the crawler. In addition, if the user needs the results of multiple crawl tasks with the same configuration information to be stored in a single result, the operation becomes rather cumbersome: the unique identifier of each task must be found first, the crawl results of these tasks must then be parsed, and the results must finally be merged into one. This way of crawling is clearly inefficient.
For the above problem that processing a batch of crawl tasks with the same configuration information is inefficient, no effective solution has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a method and an apparatus for processing batch crawler tasks, to at least solve the technical problem that processing a batch of crawl tasks with the same configuration information is inefficient.
According to one aspect of the embodiments of the present invention, a method for processing batch crawler tasks is provided. The method includes: obtaining multiple URLs to crawl that share the same configuration information; generating a crawler task based on the configuration information and the multiple URLs; injecting the crawler task into a crawler program; and executing the crawler task with the crawler program to obtain the crawl results of the multiple URLs.
Further, after the crawler task is executed with the crawler program to obtain the crawl results of the multiple URLs, the method further includes: recording the crawl results of the multiple URLs in the same result file.
Further, each URL corresponds to an identifier, and recording the crawl results of the multiple URLs in the same result file includes: recording the crawl results of the multiple URLs in the same result file according to the identifier corresponding to each URL.
Further, generating the crawler task based on the configuration information and the multiple URLs includes: using the configuration information as the configuration information of the crawler task; and injecting the multiple URLs into a designated field of the crawler task, where the designated field is an element collection containing multiple element objects, and each element object holds one URL injected into the crawler task.
Further, executing the crawler task with the crawler program to obtain the crawl results of the multiple URLs includes: splitting the crawler task into multiple subtasks, where each subtask corresponds to one URL; and executing the multiple subtasks to obtain the crawl results of the multiple URLs.
Further, injecting the crawler task into the crawler program includes: serializing the crawler task to obtain task information, and injecting the task information into the crawler program. Splitting the crawler task into multiple subtasks includes: deserializing the task information in the crawler program into the crawler task; and splitting the crawler task into multiple subtasks based on the multiple URLs corresponding to the crawler task, so that each subtask corresponds to one URL.
Further, executing the multiple subtasks to obtain the crawl results of the multiple URLs includes: serializing each subtask and sending each serialized subtask to a pre-created crawl message queue; and starting the crawler program and executing each subtask in the crawl message queue to obtain the crawl result of the URL corresponding to each subtask.
According to another aspect of the embodiments of the present invention, an apparatus for processing batch crawler tasks is also provided. The apparatus may include: a first obtaining unit, configured to obtain multiple URLs to crawl that share the same configuration information; a generating unit, configured to generate a crawler task based on the configuration information and the multiple URLs; an injecting unit, configured to inject the crawler task into a crawler program; and an executing unit, configured to execute the crawler task with the crawler program to obtain the crawl results of the multiple URLs.
Further, the processing apparatus includes: a recording unit, configured to record the crawl results of the multiple URLs in the same result file after the crawler task is executed with the crawler program and the crawl results of the multiple URLs are obtained.
Further, the recording unit includes: a recording module, configured to record the crawl results of the multiple URLs in the same result file according to the identifier corresponding to each URL.
In the embodiments of the present invention, after multiple URLs with the same configuration information are obtained, a crawler task is generated based on the configuration information and the multiple URLs, and the crawler task containing the multiple URLs is injected into the crawler program. Multiple subtasks are thereby injected into the crawler program together rather than one by one, which achieves batch injection of URLs that share the same configuration information and greatly improves task injection efficiency. Further, after the crawler task is injected into the crawler program, the crawler program executes it to obtain the crawl results of the multiple URLs, so that the crawl results of URLs with the same configuration information can be recorded directly in the same result file, which is simple to operate and efficient. The above embodiments solve the prior-art problem that processing a batch of crawl tasks with the same configuration information is inefficient, and achieve the effect of improving that efficiency.
Brief description of the drawings
The drawings described herein provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for processing batch crawler tasks according to an embodiment of the present invention;
Fig. 2 is a data flow diagram of a method for processing batch crawler tasks according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an apparatus for processing batch crawler tasks according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims, and the above drawings are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
According to an embodiment of the present invention, a method embodiment of a method for processing batch crawler tasks is provided. It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order.
Fig. 1 is a flowchart of a method for processing batch crawler tasks according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain multiple URLs to crawl that share the same configuration information;
Step S104: generate a crawler task based on the configuration information and the multiple URLs;
Step S106: inject the crawler task into the crawler program;
Step S108: execute the crawler task with the crawler program to obtain the crawl results of the multiple URLs.
Through the above steps, after multiple URLs with the same configuration information are obtained, a crawler task is generated based on the configuration information and the multiple URLs, and the crawler task covering all URLs with that configuration is injected into the crawler program as a batch rather than one URL at a time, which greatly improves the efficiency of injecting crawler tasks.
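As a non-limiting illustration, steps S102 to S108 might be sketched in Python as follows. The function names, the dictionary layout of the task, and the stand-in `fake_crawl` function are assumptions made for this sketch and do not appear in the embodiment; no real network access is performed.

```python
from typing import Callable, Dict, List


def generate_task(config: Dict[str, object], urls: List[str]) -> Dict[str, object]:
    # S104: one task carries the shared configuration plus every URL.
    return {"config": dict(config), "urls": list(urls)}


def inject_and_execute(task: Dict[str, object],
                       crawl: Callable[[str, Dict[str, object]], str]) -> Dict[str, str]:
    # S106 + S108: the task is handed over once; each URL is then crawled
    # with the same configuration.
    return {url: crawl(url, task["config"]) for url in task["urls"]}


def fake_crawl(url: str, config: Dict[str, object]) -> str:
    # Hypothetical stand-in for a real crawl, so the sketch runs offline.
    return f"{url} (depth={config['depth']})"


urls = ["http://a.example/", "http://b.example/"]  # S102: URLs sharing one configuration
task = generate_task({"depth": 2, "max_pages": 5}, urls)
results = inject_and_execute(task, fake_crawl)
```

The point of the sketch is that `inject_and_execute` is called once for the whole batch, whereas the prior-art process described in the background would require one injection per URL.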
In the embodiment shown in Fig. 1, step S110 may also be performed after step S108: the crawl results of the multiple URLs are recorded in the same result file.
Optionally, each URL corresponds to an identifier, and recording the crawl results of the multiple URLs in the same result file includes: recording the crawl results of the multiple URLs in the same result file according to the identifier corresponding to each URL. With this embodiment, crawl results can be recorded quickly.
Further, after the crawler task is injected into the crawler program, the crawler program executes it to obtain the crawl results of the multiple URLs, and these results are recorded in the result file. The crawl results of URLs with the same configuration information can thus be recorded directly in the same result file, which is simple to operate and efficient. The above embodiment solves the prior-art problem that processing a batch of crawl tasks with the same configuration information is inefficient, and achieves the effect of improving that efficiency.
In the above embodiment, configuration information can be set for each crawler task before crawling, and this information can be added to the corresponding attribute fields of the crawler task. Generating the crawler task based on the configuration information and the multiple URLs may include: using the configuration information as the configuration information of the crawler task; and injecting the multiple URLs into a designated field of the crawler task, where the designated field is an element collection containing multiple element objects, each of which holds one URL injected into the crawler task.
Specifically, crawl information can be configured for each crawler task before crawling, and this configuration information can be added to the corresponding attributes (or fields) of the crawler task. When the crawler task is generated from multiple URLs, the configuration information is added to the corresponding fields and the URLs are added to the designated field UrlList of the crawler task. This field is a generic List collection (the element collection mentioned above), and each element of the collection corresponds to one Url object.
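By way of example only, the task structure described above might look as follows in Python. The class names `Url` and `CrawlerTask`, the attribute names, and the helper `make_task` are hypothetical; only the idea of configuration attributes plus a UrlList-style collection of Url objects is taken from the embodiment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Url:
    # One element object: holds a single URL to crawl.
    address: str


@dataclass
class CrawlerTask:
    # Configuration attributes (names are illustrative assumptions).
    depth: int
    max_pages: int
    is_directory_page: bool
    # Counterpart of the UrlList field: one Url object per URL to crawl.
    url_list: List[Url] = field(default_factory=list)


def make_task(depth: int, max_pages: int, is_directory_page: bool,
              addresses: List[str]) -> CrawlerTask:
    # The configuration goes into the attribute fields; the URLs go into url_list.
    task = CrawlerTask(depth, max_pages, is_directory_page)
    for address in addresses:
        task.url_list.append(Url(address))
    return task


batch = make_task(2, 5, False, ["http://a.example/", "http://b.example/"])
```

Keeping the URLs in a single collection field means the configuration is stored once, however many URLs the batch contains.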
In the above embodiment, the crawler task may be injected into the crawler program through a Web API module (a web application programming interface).
Injecting the crawler task into the crawler program includes: serializing the crawler task to obtain task information, and injecting the task information into the crawler program. Specifically, a serializer can serialize the information of the crawler task into task information, so that multiple pieces of task information with the same configuration information are injected as a batch.
Further, executing the crawler task with the crawler program to obtain the crawl results of the multiple URLs includes: deserializing the task information in the crawler program into the crawler task; and splitting the crawler task into multiple subtasks based on the multiple URLs corresponding to the crawler task, so that each subtask corresponds to one URL. In the above embodiment, the crawler task is split into multiple subtasks, each corresponding to one URL, and the subtasks are executed to obtain the crawl results of the multiple URLs.
Specifically, splitting the crawler task into multiple subtasks includes: deserializing the task information into the crawler task; and splitting the multiple element objects in the element collection of the crawler task into multiple subtasks.
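The splitting step might be sketched as follows, purely for illustration. The `CrawlerTask` class and its fields are assumptions of this sketch; the choice to copy the task identifier and configuration unchanged into every subtask follows the description of Fig. 2 elsewhere in this document.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class CrawlerTask:
    task_id: str            # hypothetical unique identifier of the task
    depth: int              # hypothetical configuration item
    urls: Tuple[str, ...]   # URLs carried by the task


def split_task(task: CrawlerTask) -> List[CrawlerTask]:
    # One subtask per URL; task_id and configuration are copied unchanged.
    return [replace(task, urls=(url,)) for url in task.urls]


parent = CrawlerTask("task-1", 2, ("http://a.example/", "http://b.example/"))
subtasks = split_task(parent)
```

Because every subtask keeps the parent's identifier, results produced by different subtasks can later be attributed to the same batch.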
The temporary crawler framework of the crawler program contains a serializer for temporary crawler task information (the serializer is a processing module of the monitoring agent), which is mainly used to serialize and deserialize the information in a crawler task. When the serializer serializes the data of an injected crawler task, each Url object in the UrlList field collection of the crawler task is converted to a string and placed in a JSON array (JArray) whose key is url. When the task is deserialized, the contents under the key url are added one by one to the UrlList attribute collection of the crawler task.
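A minimal sketch of this round trip is given below, using Python's standard `json` module in place of the JArray type named in the embodiment. The function names and the flat layout of the configuration are assumptions; only the use of a JSON array stored under the key url is taken from the text.

```python
import json
from typing import Dict, List, Tuple


def serialize_task(config: Dict[str, object], url_list: List[str]) -> str:
    # Each URL is written as a string into a JSON array stored under "url".
    payload = dict(config)
    payload["url"] = [str(u) for u in url_list]
    return json.dumps(payload)


def deserialize_task(task_info: str) -> Tuple[Dict[str, object], List[str]]:
    payload = json.loads(task_info)
    # Entries under "url" are added back to the URL list one by one.
    url_list: List[str] = []
    for entry in payload.pop("url"):
        url_list.append(entry)
    return payload, url_list


info = serialize_task({"depth": 2}, ["http://a.example/", "http://b.example/"])
config, urls = deserialize_task(info)
```

The sketch produces JSON text; an implementation that needs a byte string for transport could encode that text, for example as UTF-8.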
According to the above embodiment of the present invention, executing the multiple subtasks to obtain the crawl results of the multiple URLs includes: serializing each subtask and sending each serialized subtask to a pre-created crawl message queue; and starting the crawler program and executing each subtask in the crawl message queue to obtain the crawl result of the URL corresponding to each subtask. With this embodiment, each subtask can be serialized and the serialized subtask sent to the crawl message queue; the crawler program process, the preprocessor process, and the task finishing program process are then started to crawl data for each subtask and obtain the crawl result of the URL corresponding to each subtask.
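For illustration only, the dispatch of serialized subtasks through a crawl message queue might be sketched with Python's in-process `queue.Queue`. A real deployment would presumably use a message broker shared between processes; the function names, the subtask layout, and the stand-in crawl function are assumptions of this sketch.

```python
import json
import queue
from typing import Callable, Dict, List


def enqueue_subtasks(subtasks: List[Dict[str, object]],
                     crawl_queue: "queue.Queue[str]") -> None:
    # Each subtask is serialized before it is sent to the queue.
    for subtask in subtasks:
        crawl_queue.put(json.dumps(subtask))


def drain(crawl_queue: "queue.Queue[str]",
          crawl: Callable[[str], str]) -> Dict[str, str]:
    # Stand-in for the crawler process: read messages until the queue is empty.
    results: Dict[str, str] = {}
    while True:
        try:
            message = crawl_queue.get_nowait()
        except queue.Empty:
            break
        subtask = json.loads(message)
        results[subtask["url"]] = crawl(subtask["url"])
    return results


crawl_queue = queue.Queue()  # created before any subtask is sent
enqueue_subtasks(
    [{"task_id": "task-1", "url": "http://a.example/"},
     {"task_id": "task-1", "url": "http://b.example/"}],
    crawl_queue,
)
results = drain(crawl_queue, lambda url: "result for " + url)
```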
The above embodiment of the present invention is described in detail below with reference to Fig. 2. As shown in Fig. 2, the crawler task is injected into the crawler program through the Web API. The temporary crawler framework of the crawler program is provided with a monitoring agent and a crawler execution program.
The monitoring agent includes: a monitoring agent message queue, a crawler message queue resource management module, a crawler process management module, and a ZooKeeper (distributed application coordination service) management module.
The monitoring agent message queue is mainly used to receive injected crawler tasks. The crawler message queue resource management module is mainly used to create the crawler task message queues of a crawler task, for example the crawl message queue (Crawling Queue), the preprocessing message queue (Preparation Queue), and the finishing message queue (that is, the storage message queue, Storage Queue). In this embodiment, extracted links to URLs to crawl can be placed in the preprocessing message queue; the preprocessor reads links from the preprocessing message queue and, once preprocessing is complete, puts them back into the crawl message queue. The crawler process management module starts, destroys, and monitors the processes associated with a crawler task. The ZooKeeper management module records and monitors the current execution state of a crawler task.
In the above embodiment, the monitoring agent acts as a master scheduler that manages and monitors all of these resources.
In the temporary crawler framework, the serializer serializes the injected crawler task to obtain task information, and the crawler task is injected into the crawler program. After the monitoring agent message queue receives the task information of the injected crawler task, the monitoring agent calls the crawler message queue resource management module to create the three message queues described above. After the monitoring agent obtains the message from the monitoring agent message queue, the serializer deserializes the task information into the crawler task, and each Url object (that is, each URL to crawl mentioned above) in the UrlList element collection of the crawler task is split off into a new crawler task (the subtasks mentioned above). The configuration information and the unique identifier of each new task are the same as those of the originally injected crawler task. Each new crawler task is then serialized and sent to the crawl message queue that has been created.
As shown in Fig. 2, after the split crawler tasks are injected into the crawl message queue, the monitoring agent module starts three crawler processes to crawl the data: the crawler program process, the preprocessor process, and the task finishing program process. Specifically, the crawler program process reads messages from the crawl message queue, the preprocessor process reads messages from the preprocessing message queue, and the task finishing program process reads messages from the finishing message queue. After crawling is complete, the crawl results of the multiple URLs are recorded in the same result file according to the identifier (task ID) corresponding to each URL.
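One possible way to record results by task ID is sketched below. The file naming scheme, the JSON Lines layout, and the function name `record_result` are assumptions of this sketch and are not specified by the embodiment, which only states that results sharing an identifier end up in the same result file.

```python
import json
import os
import tempfile


def record_result(result_dir: str, task_id: str, url: str, result: str) -> str:
    # The shared task ID selects the file, so every subtask of the same
    # batch appends to the same result file.
    path = os.path.join(result_dir, f"{task_id}.jsonl")
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps({"url": url, "result": result}) + "\n")
    return path


result_dir = tempfile.mkdtemp()
paths = {
    record_result(result_dir, "task-1", url, "content of " + url)
    for url in ("http://a.example/", "http://b.example/")
}
with open(next(iter(paths)), encoding="utf-8") as handle:
    lines = [json.loads(line) for line in handle]
```

Since the file is chosen by identifier alone, no separate merge step is needed after crawling, which is the inefficiency the background section attributes to the prior art.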
Here, serialization is the process of converting a data structure or object into a binary string, and deserialization is the process of converting the binary string produced by serialization back into a data structure or object.
With the above embodiment, multiple crawler tasks with the same configuration information are injected into the crawler program as a single crawler task. This realizes intelligent batch injection of crawl tasks, solves the batch injection of multiple crawl tasks with the same configuration information, makes the ways of injecting crawler tasks more diverse and intelligent, and greatly improves task injection efficiency.
According to another aspect of the embodiments of the present invention, an apparatus for processing batch crawler tasks is also provided. As shown in Fig. 3, the apparatus may include: a first obtaining unit 31, configured to obtain multiple URLs to crawl that share the same configuration information; a generating unit 33, configured to generate a crawler task based on the configuration information and the multiple URLs; an injecting unit 35, configured to inject the crawler task into a crawler program; and an executing unit 37, configured to execute the crawler task with the crawler program to obtain the crawl results of the multiple URLs.
With the above apparatus, after multiple URLs with the same configuration information are obtained, a crawler task is generated based on the configuration information and the multiple URLs, and the crawler task containing the multiple URLs is injected into the crawler program. Multiple subtasks are thereby injected into the crawler program together rather than one by one, which achieves batch injection of URLs that share the same configuration information and greatly improves task injection efficiency. Further, after the crawler task is injected into the crawler program, the crawler program executes it to obtain the crawl results of the multiple URLs, and these results are recorded in the result file, so that the crawl results of URLs with the same configuration information can be recorded directly in the same result file, which is simple to operate and efficient. The above embodiment solves the prior-art problem that processing a batch of crawl tasks with the same configuration information is inefficient, and achieves the effect of improving that efficiency.
In the above embodiment, the apparatus further includes a recording unit 39, configured to record the crawl results of the multiple URLs in the same result file after the crawler task is executed with the crawler program and the crawl results of the multiple URLs are obtained.
The recording unit includes: a recording module, configured to record the crawl results of the multiple URLs in the same result file according to the identifier corresponding to each URL.
Further, the generating unit includes: a determining module, configured to use the configuration information as the configuration information of the crawler task; and a first injecting module, configured to inject the multiple URLs into a designated field of the crawler task, where the designated field is an element collection containing multiple element objects, and each element object holds one URL injected into the crawler task.
Specifically, crawl information can be configured for each crawler task before crawling, and this configuration information can be added to the corresponding attributes (or fields) of the crawler task. When the crawler task is generated from multiple URLs, the configuration information is added to the corresponding fields and the URLs are added to the designated field UrlList of the crawler task. This field is a generic List collection (the element collection mentioned above), and each element of the collection corresponds to one Url object.
In the above embodiment, the task may be injected into the crawler through a Web API module (a web application programming interface).
Further, the executing unit includes: a splitting module, configured to split the crawler task into multiple subtasks, where each subtask corresponds to one URL; and an executing module, configured to execute the multiple subtasks to obtain the crawl results of the multiple URLs.
Further, the injecting unit includes: an injecting module, configured to serialize the crawler task to obtain task information and inject the task information into the crawler program. The splitting module includes: a deserializing module, configured to deserialize the task information in the crawler program into the crawler task; and a splitting submodule, configured to split the crawler task into multiple subtasks based on the multiple URLs corresponding to the crawler task, so that each subtask corresponds to one URL.
The temporary crawler framework of the crawler program contains a serializer for temporary crawler task information (the serializer is a processing module of the monitoring agent), which is mainly used to serialize and deserialize the information in a crawler task. When the serializer serializes the data of an injected crawler task, each Url object in the UrlList field collection of the injected task is converted to a string and placed in a JSON array (JArray) whose key is url. When the task is deserialized, the contents under the key url are added one by one to the UrlList attribute collection of the crawler task.
Further, the executing module is specifically configured to: serialize each subtask and send each serialized subtask to a pre-created crawl message queue; and start the crawler program and execute each subtask in the crawl message queue to obtain the crawl result of the URL corresponding to each subtask.
The above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.
Claims (10)
1. a kind of processing method of batch reptile task, which is characterized in that including:
Obtaining, there are same configuration the multiple of information to crawl network address;
Reptile task is generated based on the configuration information and the multiple network address that crawls;
The reptile task is injected into crawlers;
Perform the reptile task using the crawlers, obtain it is the multiple crawl network address crawl result.
2. The processing method according to claim 1, characterized in that, after executing the crawler task using the crawler program to obtain the crawl results for the multiple crawl URLs, the method further comprises: recording the crawl results of the multiple crawl URLs in the same result file.
3. The processing method according to claim 2, characterized in that each crawl URL corresponds to an identifier, and recording the crawl results of the multiple crawl URLs in the same result file comprises:
recording, according to the identifier corresponding to each crawl URL, the crawl results of the multiple crawl URLs in the same result file.
4. The processing method according to any one of claims 1-3, characterized in that generating the crawler task based on the configuration information and the multiple crawl URLs comprises:
using the configuration information as the configuration information of the crawler task;
injecting the multiple crawl URLs into a specific field of the crawler task, wherein the specific field is an element collection, the element collection includes multiple element objects, and each element object stores one crawl URL injected into the crawler task.
5. The processing method according to any one of claims 1-3, characterized in that executing the crawler task using the crawler program to obtain the crawl results for the multiple crawl URLs comprises:
splitting the crawler task into multiple subtasks, wherein each subtask corresponds to one crawl URL;
executing the multiple subtasks to obtain the crawl results for the multiple crawl URLs.
6. The processing method according to claim 5, characterized in that injecting the crawler task into the crawler program comprises:
serializing the crawler task to obtain task information, and injecting the task information into the crawler program;
and splitting the crawler task into multiple subtasks comprises: deserializing the task information in the crawler program into the crawler task; and splitting the crawler task into the multiple subtasks based on the multiple crawl URLs corresponding to the crawler task, so that each subtask corresponds to one crawl URL.
7. The processing method according to claim 5, characterized in that executing the multiple subtasks to obtain the crawl results for the multiple crawl URLs comprises:
serializing each subtask, and sending each serialized subtask to a pre-created crawl message queue;
starting the crawler program and executing each subtask in the crawl message queue to obtain the crawl result of the crawl URL corresponding to each subtask.
8. A processing apparatus for batch crawler tasks, characterized by comprising:
a first acquisition unit, configured to obtain multiple crawl URLs that have the same configuration information;
a generation unit, configured to generate a crawler task based on the configuration information and the multiple crawl URLs;
an injection unit, configured to inject the crawler task into a crawler program;
an execution unit, configured to execute the crawler task using the crawler program to obtain crawl results for the multiple crawl URLs.
9. The processing apparatus according to claim 8, characterized in that the processing apparatus further comprises:
a recording unit, configured to record the crawl results of the multiple crawl URLs in the same result file after the crawler task has been executed using the crawler program and the crawl results for the multiple crawl URLs have been obtained.
10. The processing apparatus according to claim 9, characterized in that the recording unit comprises:
a recording module, configured to record, according to the identifier corresponding to each crawl URL, the crawl results of the multiple crawl URLs in the same result file.
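The flow described in claims 1-7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `CrawlTask`, `generate_task`, `serialize`, `split`, and `run_batch` are hypothetical, JSON stands in for the unspecified serialization format, an in-process `queue.Queue` stands in for the crawl message queue, and `fetch` is a caller-supplied function that performs the actual crawl.

```python
import json
import queue
from dataclasses import dataclass, field, asdict


@dataclass
class CrawlTask:
    """Hypothetical task shape: one shared configuration plus many URLs."""
    config: dict
    # Stand-in for the "specific field" of claim 4: one element object per URL.
    urls: list = field(default_factory=list)


def generate_task(config, urls):
    """Build one crawler task from the shared configuration and the crawl URLs."""
    # Each element object carries an identifier (claim 3) alongside its URL.
    elements = [{"id": i, "url": u} for i, u in enumerate(urls)]
    return CrawlTask(config=config, urls=elements)


def serialize(task):
    """Serialize a task into task information (JSON is an assumption)."""
    return json.dumps(asdict(task))


def split(task_info):
    """Deserialize the task information and split it into one subtask per URL."""
    task = CrawlTask(**json.loads(task_info))
    return [CrawlTask(config=task.config, urls=[element]) for element in task.urls]


def run_batch(config, urls, fetch):
    """Run the pipeline and return results keyed by URL identifier."""
    task_info = serialize(generate_task(config, urls))

    # Serialize each subtask and send it to a pre-created crawl message queue.
    crawl_queue = queue.Queue()
    for subtask in split(task_info):
        crawl_queue.put(serialize(subtask))

    # The crawler drains the queue and executes each subtask in turn.
    results = {}
    while not crawl_queue.empty():
        subtask = CrawlTask(**json.loads(crawl_queue.get()))
        element = subtask.urls[0]
        results[element["id"]] = {
            "url": element["url"],
            "result": fetch(element["url"], subtask.config),
        }
    return results
```

Because every result is keyed by the identifier of its URL, the returned mapping can be written out as a single result file, which corresponds to the recording step of claims 2 and 3. A real deployment would presumably replace the in-process queue with a message broker and run several crawler workers in parallel.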
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611261546.8A CN108268498B (en) | 2016-12-30 | 2016-12-30 | Processing method and device for batch crawler tasks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611261546.8A CN108268498B (en) | 2016-12-30 | 2016-12-30 | Processing method and device for batch crawler tasks |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108268498A (en) | 2018-07-10 |
CN108268498B (en) | 2021-06-22 |
Family
ID=62753741
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611261546.8A, Active, CN108268498B (en) | 2016-12-30 | 2016-12-30 | Processing method and device for batch crawler tasks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108268498B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101452463A (en) * | 2007-12-05 | 2009-06-10 | 浙江大学 | Method and apparatus for directionally grabbing page resource |
US20130144834A1 (en) * | 2008-07-21 | 2013-06-06 | Google Inc. | Uniform resource locator canonicalization |
US20160147749A1 (en) * | 2011-05-04 | 2016-05-26 | Yahoo! Inc. | Dynamically determining the relatedness of web objects |
CN102184227A (en) * | 2011-05-10 | 2011-09-14 | 北京邮电大学 | General crawler engine system used for WEB service and working method thereof |
US20130117252A1 (en) * | 2011-11-09 | 2013-05-09 | Google Inc. | Large-scale real-time fetch service |
CN102999549A (en) * | 2012-09-25 | 2013-03-27 | 金博 | Method for realizing web crawler tasks |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
US20160171104A1 (en) * | 2013-09-30 | 2016-06-16 | International Business Machines Corporation | Detecting multistep operations when interacting with web applications |
CN103745017A (en) * | 2014-02-10 | 2014-04-23 | 北界创想(北京)软件有限公司 | Information capturing device and method |
CN105279272A (en) * | 2015-10-30 | 2016-01-27 | 南京未来网络产业创新有限公司 | Content aggregation method based on distributed web crawlers |
CN105868258A (en) * | 2015-12-28 | 2016-08-17 | 乐视网信息技术(北京)股份有限公司 | Crawler system |
Non-Patent Citations (1)
Title |
---|
邹海亮、孙莉: ""可定制的聚焦网络爬虫"", 《电子科技》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110020066A (en) * | 2017-07-31 | 2019-07-16 | 北京国双科技有限公司 | A kind of method and device of past crawler platform note task |
CN110020066B (en) * | 2017-07-31 | 2021-09-07 | 北京国双科技有限公司 | Method and device for annotating tasks to crawler platform |
CN111125478A (en) * | 2018-10-30 | 2020-05-08 | 北京国双科技有限公司 | Data crawling method and device |
CN111125478B (en) * | 2018-10-30 | 2023-05-12 | 北京国双科技有限公司 | Data crawling method and device |
CN109951739A (en) * | 2019-03-27 | 2019-06-28 | 北京市博汇科技股份有限公司 | Video traffic processing method, device and electronic equipment |
CN109951739B (en) * | 2019-03-27 | 2021-06-08 | 北京市博汇科技股份有限公司 | Video service processing method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN108268498B (en) | 2021-06-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102291842B1 (en) | | Techniques for file sharing |
CN109639740A (en) | | A kind of login state sharing method and device based on device id |
JP2019518257A (en) | | State control method and apparatus |
CN106708858A (en) | | Information recommendation method and device |
CN106778345A (en) | | The treating method and apparatus of the data based on operating right |
CN106981015A (en) | | The implementation method of interactive present |
CN108268498A (en) | | The treating method and apparatus of batch reptile task |
CN105844146B (en) | | Method and device for protecting driver and electronic equipment |
CN108090091A (en) | | Web page crawl method and apparatus |
CN105302815B (en) | | The filter method and device of the uniform resource position mark URL of webpage |
CN108021400A (en) | | Data processing method and device, computer-readable storage medium and equipment |
CN105022815A (en) | | Information interception method and device |
CN109460676A (en) | | A kind of desensitization method of blended data, desensitization device and desensitization equipment |
CN109582581A (en) | | A kind of result based on crowdsourcing task determines method and relevant device |
CN108073703A (en) | | A kind of comment information acquisition methods, device, equipment and storage medium |
CN106254364A (en) | | Computer desktop service access apparatus under a kind of Multi net voting isolation environment and method |
CN109190405A (en) | | A kind of government affairs big data desensitization process method and device |
CN104980473B (en) | | UI resource loading method and system |
CN106844467A (en) | | Method for exhibiting data and device |
CN109656922A (en) | | Data processing method and device |
CN104461709B (en) | | The control method and device of task scheduling |
CN112448910B (en) | | Social engineering honeypot system, honeypot system deployment method, and storage medium |
CN112541087A (en) | | Cross-language knowledge graph construction method and device based on encyclopedia |
CN107103099A (en) | | Main browser page return method and device |
CN106815183A (en) | | The generation method and device of media content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing. Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing. Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
| GR01 | Patent grant | |