CN108268498A - Method and apparatus for processing batch crawler tasks - Google Patents

Publication number
CN108268498A
Authority
CN
China
Prior art keywords
URL
crawl
crawler task
task
crawler
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611261546.8A
Other languages
Chinese (zh)
Other versions
CN108268498B (en)
Inventor
朱长坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201611261546.8A priority Critical patent/CN108268498B/en
Publication of CN108268498A publication Critical patent/CN108268498A/en
Application granted granted Critical
Publication of CN108268498B publication Critical patent/CN108268498B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and apparatus for processing batch crawler tasks. The method includes: obtaining multiple URLs to crawl that share the same configuration information; generating a crawler task based on the configuration information and the multiple URLs; injecting the crawler task into a crawler program; and executing the crawler task with the crawler program to obtain the crawl results for the multiple URLs. The invention solves the technical problem that processing batch crawl tasks with the same configuration information is inefficient.

Description

Method and apparatus for processing batch crawler tasks
Technical field
The present invention relates to the field of crawler task processing, and in particular to a method and apparatus for processing batch crawler tasks.
Background
In existing temporary crawler technology, before each website (Url, network resource address) is crawled, the relevant crawl configuration information is set, such as the crawl depth, the number of pages to crawl, and whether the page is a directory page. The crawler program crawls according to this configuration information and finally stores the result of crawling the website in a database or in a file. In this process, however, one URL corresponds to only one task, and one task corresponds to only one stored result. When a batch of tasks with the same configuration information needs to be crawled, the injection process must be repeated for each task, which is inefficient for the crawler user. In addition, if the user needs the crawl results of multiple tasks with the same configuration information to be stored in a single result, the operation is quite cumbersome: the user must first find the unique identifier of each task, then parse the crawl results of these tasks, and finally merge them into one result. This way of crawling is clearly inefficient.
No effective solution has yet been proposed for the above problem that processing batch crawl tasks with the same configuration information is inefficient.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for processing batch crawler tasks, to at least solve the technical problem that processing batch crawl tasks with the same configuration information is inefficient.
According to one aspect of the embodiments of the present invention, a method for processing batch crawler tasks is provided. The method includes: obtaining multiple URLs to crawl that share the same configuration information; generating a crawler task based on the configuration information and the multiple URLs; injecting the crawler task into a crawler program; and executing the crawler task with the crawler program to obtain the crawl results for the multiple URLs.
Further, after the crawler task is executed with the crawler program and the crawl results for the multiple URLs are obtained, the method further includes: recording the crawl results for the multiple URLs in the same result file.
Further, each URL to crawl corresponds to an identifier, and recording the crawl results for the multiple URLs in the same result file includes: recording the crawl results for the multiple URLs in the same result file according to the identifier corresponding to each URL.
Further, generating the crawler task based on the configuration information and the multiple URLs includes: using the configuration information as the configuration information of the crawler task; and injecting the multiple URLs into a designated field of the crawler task, where the designated field is an element set, the element set includes multiple element objects, and each element object stores one URL injected into the crawler task.
Further, executing the crawler task with the crawler program to obtain the crawl results for the multiple URLs includes: splitting the crawler task into multiple subtasks, where each subtask corresponds to one URL to crawl; and executing the multiple subtasks to obtain the crawl results for the multiple URLs.
Further, injecting the crawler task into the crawler program includes: serializing the crawler task to obtain task information, and injecting the task information into the crawler program. Splitting the crawler task into multiple subtasks includes: deserializing the task information in the crawler program into the crawler task; and splitting the crawler task into multiple subtasks based on the multiple URLs corresponding to the crawler task, so that each subtask corresponds to one URL.
Further, executing the multiple subtasks to obtain the crawl results for the multiple URLs includes: serializing each subtask, and sending each serialized subtask to a crawl message queue created in advance; and starting the crawler program and executing each subtask in the crawl message queue to obtain the crawl result for the URL corresponding to each subtask.
According to another aspect of the embodiments of the present invention, an apparatus for processing batch crawler tasks is also provided. The apparatus may include: a first acquisition unit, configured to obtain multiple URLs to crawl that share the same configuration information; a generation unit, configured to generate a crawler task based on the configuration information and the multiple URLs; an injection unit, configured to inject the crawler task into a crawler program; and an execution unit, configured to execute the crawler task with the crawler program to obtain the crawl results for the multiple URLs.
Further, the processing apparatus includes: a recording unit, configured to record the crawl results for the multiple URLs in the same result file after the crawler task is executed with the crawler program and the crawl results for the multiple URLs are obtained.
Further, the recording unit includes: a recording module, configured to record the crawl results for the multiple URLs in the same result file according to the identifier corresponding to each URL.
In embodiments of the present invention, after multiple URLs to crawl that share the same configuration information are obtained, a crawler task is generated based on the configuration information and the multiple URLs, and the crawler task containing the multiple URLs is injected into the crawler program. Multiple subtasks can thus be injected into the crawler program together rather than one by one, which achieves batch injection of URLs with the same configuration information and greatly improves task injection efficiency. Further, after the crawler task is injected into the crawler program, the crawler program executes the crawler task and obtains the crawl results for the multiple URLs, so the crawl results for URLs with the same configuration information can be recorded directly in the same result file; the operation is simple and efficient. The above embodiments solve the prior-art problem that processing batch crawl tasks with the same configuration information is inefficient, and improve the efficiency of processing such tasks.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of the method for processing batch crawler tasks according to an embodiment of the present invention;
Fig. 2 is a data flow diagram of the method for processing batch crawler tasks according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the apparatus for processing batch crawler tasks according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
According to an embodiment of the present invention, a method embodiment of a method for processing batch crawler tasks is provided. It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system, such as one running a set of computer-executable instructions. Although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in an order different from the one given here.
Fig. 1 is a flowchart of the method for processing batch crawler tasks according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain multiple URLs to crawl that share the same configuration information;
Step S104: generate a crawler task based on the configuration information and the multiple URLs;
Step S106: inject the crawler task into a crawler program;
Step S108: execute the crawler task with the crawler program to obtain the crawl results for the multiple URLs.
Through the above steps, after multiple URLs to crawl that share the same configuration information are obtained, a crawler task is generated based on the configuration information and the multiple URLs, and the crawler tasks corresponding to the URLs with identical configuration information are injected into the crawler program as a batch rather than one by one, which greatly improves the injection efficiency of crawler tasks.
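The four steps S102 to S108 can be sketched as follows. This is a minimal illustration only, not the patent's implementation: the names `CrawlerProgram`, `generate_task`, and `crawl`, the dictionary layout of the task, and the example URLs are all assumptions, and the actual page fetch is replaced by a placeholder string.

```python
def crawl(url, config):
    # Placeholder for a real fetch; returns a description instead of page content.
    return f"{url} (depth={config['depth']})"


def generate_task(config, urls):
    # S104: one task carries the shared configuration and every URL to crawl.
    return {"config": dict(config), "urls": list(urls)}


class CrawlerProgram:
    """Hypothetical stand-in for the crawler program."""

    def __init__(self):
        self.injections = 0  # counts how many injection calls were needed
        self.tasks = []

    def inject(self, task):
        # S106: a single call injects the whole batch.
        self.injections += 1
        self.tasks.append(task)

    def run(self):
        # S108: produce one crawl result per URL.
        return {
            url: crawl(url, task["config"])
            for task in self.tasks
            for url in task["urls"]
        }


# S102: URLs that share the same configuration information (example values).
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
task = generate_task({"depth": 2, "max_pages": 10}, urls)

program = CrawlerProgram()
program.inject(task)
results = program.run()
```

In this sketch three URLs reach the crawler through one injection call, which is the efficiency gain the steps describe.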
According to the embodiment shown in Fig. 1, after step S108 is performed, step S110 may also be performed: recording the crawl results for the multiple URLs in the same result file.
Optionally, each URL to crawl corresponds to an identifier, and recording the crawl results for the multiple URLs in the same result file includes: recording the crawl results for the multiple URLs in the same result file according to the identifier corresponding to each URL. With this embodiment, the crawl results can be recorded quickly.
Further, after the crawler task is injected into the crawler program, the crawler program executes the crawler task and obtains the crawl results for the multiple URLs, and these results are recorded in a result file. The crawl results for URLs with the same configuration information can therefore be recorded directly in the same result file; the operation is simple and efficient. The above embodiments solve the prior-art problem that processing batch crawl tasks with the same configuration information is inefficient, and improve the efficiency of processing such tasks.
In the above embodiments, configuration information can be set for each crawler task before crawling, and this information can be added to the corresponding attribute fields of the crawler task. Generating the crawler task based on the configuration information and the multiple URLs may include: using the configuration information as the configuration information of the crawler task; and injecting the multiple URLs into a designated field of the crawler task, where the designated field is an element set that includes multiple element objects, each of which stores one URL injected into the crawler task.
Specifically, crawl information can be configured for each crawler task before crawling, and this configuration information can be added to the corresponding attributes (or fields) of the crawler task. When the crawler task is generated from multiple URLs, the configuration information is added to the corresponding fields and the multiple URLs are added to the designated field UrlList of the crawler task. This field is a generic List collection (the element set mentioned above), and each element of the set corresponds to one Url object.
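A crawler task with configuration fields and a UrlList-style field could be modeled as below. This is a sketch under assumptions: the class names `Url` and `CrawlTask`, the field names `depth`, `max_pages`, and `is_directory_page` (taken from the configuration examples in the Background), and the helper `build_task` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Url:
    """One element object: holds a single URL injected into the task."""
    address: str


@dataclass
class CrawlTask:
    """Hypothetical task shape: shared configuration plus a list of Url objects."""
    task_id: str
    depth: int
    max_pages: int
    is_directory_page: bool
    url_list: List[Url] = field(default_factory=list)  # plays the role of UrlList


def build_task(task_id, depth, max_pages, is_directory_page, addresses):
    # The configuration goes into the task's own fields ...
    task = CrawlTask(task_id, depth, max_pages, is_directory_page)
    # ... and each URL is added to the designated field as its own element.
    for address in addresses:
        task.url_list.append(Url(address))
    return task


task = build_task(
    "task-001", 2, 50, False,
    ["http://example.com/a", "http://example.com/b"],
)
```

Keeping the configuration on the task and only the addresses in the list means the configuration is stated once however many URLs are added.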
In the above embodiments, the crawler task can be injected into the crawler program through a WebApi module (Web application programming interface).
Injecting the crawler task into the crawler program includes: serializing the crawler task to obtain task information, and injecting the task information into the crawler program. Specifically, a serializer can serialize the information of the crawler task to obtain the task information, so that multiple pieces of task information with the same configuration information are injected as a batch.
Further, executing the crawler task with the crawler program to obtain the crawl results for the multiple URLs includes: deserializing the task information in the crawler program into the crawler task; and splitting the crawler task into multiple subtasks based on the multiple URLs corresponding to the crawler task, so that each subtask corresponds to one URL. In the above embodiments, the crawler task is split into multiple subtasks, each corresponding to one URL, and the multiple subtasks are executed to obtain the crawl results for the multiple URLs.
Specifically, splitting the crawler task into multiple subtasks includes: deserializing the task information into the crawler task; and splitting the multiple element objects in the element set of the crawler task into multiple subtasks.
The temporary crawler framework of the crawler program contains a serializer for temporary crawler task information (the serializer is a processing module of the monitoring agent). It is mainly used to serialize and deserialize the information in crawler tasks. When the serializer serializes the data of an injected crawler task, each Url object in the UrlList field of the crawler task is converted to a string and placed in a JSON array (JArray) whose key is url. When the task is deserialized, the entries under the key url are added one by one to the UrlList attribute of the crawler task.
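The serializer behaviour described above (URLs written to a JSON array under the key url, then added back one by one on deserialization) can be approximated with Python's standard json module. This is only a sketch: the source mentions a JArray but does not give the message layout, so the dictionary structure, the `task_id` key, and the function names `serialize_task` and `deserialize_task` are assumptions.

```python
import json


def serialize_task(task):
    # Flatten the task into one JSON object; configuration keys are copied as-is.
    payload = dict(task["config"])
    payload["task_id"] = task["task_id"]
    # Each URL is converted to a string and placed in an array under the key "url".
    payload["url"] = [str(address) for address in task["url_list"]]
    return json.dumps(payload)


def deserialize_task(message):
    payload = json.loads(message)
    task = {"task_id": payload.pop("task_id"), "url_list": [], "config": {}}
    # Entries under the key "url" are added to the URL list one by one.
    for address in payload.pop("url"):
        task["url_list"].append(address)
    # Whatever remains is treated as configuration (an assumption of this sketch).
    task["config"] = payload
    return task


task = {
    "task_id": "task-001",
    "config": {"depth": 2, "max_pages": 50},
    "url_list": ["http://example.com/a", "http://example.com/b"],
}
message = serialize_task(task)
restored = deserialize_task(message)
```

The round trip returns a task equal to the original, so the batch of URLs survives injection as a single message.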
According to the above embodiments of the present invention, executing the multiple subtasks to obtain the crawl results for the multiple URLs includes: serializing each subtask, and sending each serialized subtask to a crawl message queue created in advance; and starting the crawler program and executing each subtask in the crawl message queue to obtain the crawl result for the URL corresponding to each subtask. With this embodiment, each subtask can be serialized and the serialized subtask sent to the crawl message queue; the crawler program process, the preprocessor process, and the task wrap-up program process are then started to crawl data for each subtask and obtain the crawl result for the URL corresponding to each subtask.
The above embodiments of the present invention are described in detail below with reference to Fig. 2. As shown in Fig. 2, the crawler task is injected into the crawler program through the Web API. The temporary crawler framework of the crawler program is provided with a monitoring agent and a crawler execution program.
The monitoring agent includes: a monitoring agent message queue, a crawler message queue resource management module, a crawler process management module, and a Zookeeper (distributed application coordination service) management module.
The monitoring agent message queue is mainly used to receive injected crawler tasks. The crawler message queue resource management module is mainly used to create the crawler task message queues of a crawler task, namely the crawl message queue (Crawling Queue), the preprocessing message queue (Preparation Queue), and the wrap-up message queue (that is, the storage message queue, Storage Queue). In this embodiment, links extracted from the crawled URLs can be placed in the preprocessing message queue; the preprocessor reads links from the preprocessing message queue and, after preprocessing is complete, returns the links to the crawl message queue. The crawler process management module is used to start, destroy, and monitor the processes associated with a crawler task. The Zookeeper management module is used to record and monitor the current execution state of a crawler task.
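The interplay of the three queues can be illustrated with in-process queues from Python's standard library. This is a loose sketch, not the patent's architecture: a real deployment would use a message broker and separate processes, the source does not say what preprocessing does (deduplication here is an assumption), and it does not say what is placed on the storage queue (crawled URLs here are an assumption). The function names are hypothetical.

```python
from queue import Queue

crawling_queue = Queue()     # Crawling Queue: URLs waiting to be crawled
preparation_queue = Queue()  # Preparation Queue: extracted links awaiting preprocessing
storage_queue = Queue()      # Storage Queue: items for the wrap-up stage


def crawl_step(extract_links):
    # Take one URL, hand it to the wrap-up stage, and queue its extracted links.
    url = crawling_queue.get()
    storage_queue.put(url)
    for link in extract_links(url):
        preparation_queue.put(link)


def preprocess_step(seen):
    # Read one link; if it is new, return it to the crawl queue.
    link = preparation_queue.get()
    if link not in seen:
        seen.add(link)
        crawling_queue.put(link)


# Example link graph (invented data): the start page links to /a twice.
links = {"http://example.com/": ["http://example.com/a", "http://example.com/a"]}
seen = {"http://example.com/"}

crawling_queue.put("http://example.com/")
crawl_step(lambda url: links.get(url, []))
while not preparation_queue.empty():
    preprocess_step(seen)
```

After this run the crawl queue holds the single new link, which shows the loop between the crawl and preprocessing queues that the paragraph describes.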
In the above embodiments, the monitoring agent acts as a general scheduler that manages and monitors resources of all kinds. In the temporary crawler framework, the serializer serializes the injected crawler task to obtain task information, and the crawler task is injected into the crawler program. After the monitoring agent message queue receives the task information of the injected crawler task, the monitoring agent calls the crawler message queue resource management module to create the three message queues described above. After the monitoring agent obtains the message from the monitoring agent message queue, the serializer deserializes the task information into a crawler task. The Url objects (the URLs to crawl mentioned above) in the UrlList element set of the crawler task are then split, one by one, into new crawler tasks (the multiple subtasks mentioned above). The configuration information and the unique identifier of each new task are identical to those of the originally injected crawler task. Each new crawler task is then serialized and sent to the crawl message queue that has been created.
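The split step can be sketched as follows: each URL in the batch task becomes its own subtask that keeps the parent's configuration and unique identifier, and each subtask is serialized before it is sent to the crawl queue. The dictionary layout, the function names `split_task` and `enqueue_subtasks`, and the use of an in-process queue in place of a message broker are assumptions made for illustration.

```python
import json
from queue import Queue


def split_task(task):
    # One subtask per URL; configuration and task_id are copied from the parent.
    return [
        {
            "task_id": task["task_id"],
            "config": dict(task["config"]),
            "url_list": [address],
        }
        for address in task["url_list"]
    ]


def enqueue_subtasks(task, crawling_queue):
    # Serialize each subtask and send it to the (already created) crawl queue.
    for subtask in split_task(task):
        crawling_queue.put(json.dumps(subtask))


task = {
    "task_id": "task-001",
    "config": {"depth": 2, "max_pages": 50},
    "url_list": [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.com/c",
    ],
}

crawling_queue = Queue()
enqueue_subtasks(task, crawling_queue)

# Drain the queue to inspect what a crawler process would receive.
messages = [crawling_queue.get() for _ in range(crawling_queue.qsize())]
subtasks = [json.loads(message) for message in messages]
```

Because every subtask carries the parent's identifier, results produced later can be traced back to the one batch task that was injected.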
As shown in Fig. 2, after the split crawler tasks are injected into the crawl message queue, the monitoring agent module starts three crawler processes to crawl data: the crawler program process, the preprocessor process, and the task wrap-up program process. Specifically, the crawler program process reads messages from the crawl message queue, the preprocessor process reads messages from the preprocessing message queue, and the task wrap-up program process reads messages from the wrap-up message queue. After crawling is complete, the crawl results for the multiple URLs are recorded in the same result file according to the identifier (task ID number) corresponding to each URL.
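Recording results by task ID can be sketched by deriving the result file name from the identifier, so that every subtask sharing that ID appends to the same file. The file naming scheme, the JSON-lines record format, and the function name `record_result` are assumptions of this sketch; the source only states that results are recorded in the same result file according to the identifier.

```python
import json
import os
import tempfile


def record_result(result_dir, task_id, url, content):
    # All results with the same task_id are appended to one file.
    path = os.path.join(result_dir, f"{task_id}.jsonl")
    with open(path, "a", encoding="utf-8") as result_file:
        result_file.write(json.dumps({"url": url, "content": content}) + "\n")
    return path


result_dir = tempfile.mkdtemp()
path_a = record_result(result_dir, "task-001", "http://example.com/a", "page A")
path_b = record_result(result_dir, "task-001", "http://example.com/b", "page B")

# Read the shared file back to confirm both results landed in it.
with open(path_a, encoding="utf-8") as result_file:
    records = [json.loads(line) for line in result_file]
```

With this layout no separate merge step is needed: the shared identifier alone decides where each result is written.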
Here, serialization is the process of converting a data structure or object into a binary string, and deserialization is the process of converting the binary string produced during serialization back into a data structure or object.
With the above embodiments, multiple crawler tasks with the same configuration information are injected into the crawler program as a single crawler task. This achieves intelligent batch injection of crawl tasks that share the same configuration information, makes the way crawler tasks are injected more diverse and intelligent, and greatly improves task injection efficiency.
According to another aspect of the embodiments of the present invention, an apparatus for processing batch crawler tasks is also provided. As shown in Fig. 3, the apparatus may include: a first acquisition unit 31, configured to obtain multiple URLs to crawl that share the same configuration information; a generation unit 33, configured to generate a crawler task based on the configuration information and the multiple URLs; an injection unit 35, configured to inject the crawler task into a crawler program; and an execution unit 37, configured to execute the crawler task with the crawler program to obtain the crawl results for the multiple URLs.
With the above embodiment, after multiple URLs to crawl that share the same configuration information are obtained, a crawler task is generated based on the configuration information and the multiple URLs, and the crawler task containing the multiple URLs is injected into the crawler program. Multiple subtasks can thus be injected into the crawler program together rather than one by one, which achieves batch injection of URLs with the same configuration information and greatly improves task injection efficiency. Further, after the crawler task is injected into the crawler program, the crawler program executes the crawler task and obtains the crawl results for the multiple URLs, and these results are recorded in a result file. The crawl results for URLs with the same configuration information can therefore be recorded directly in the same result file; the operation is simple and efficient. The above embodiments solve the prior-art problem that processing batch crawl tasks with the same configuration information is inefficient, and improve the efficiency of processing such tasks.
In the above embodiments, the apparatus further includes a recording unit 39, configured to record the crawl results for the multiple URLs in the same result file after the crawler task is executed with the crawler program and the crawl results for the multiple URLs are obtained.
The recording unit includes: a recording module, configured to record the crawl results for the multiple URLs in the same result file according to the identifier corresponding to each URL.
Further, the generation unit includes: a determining module, configured to use the configuration information as the configuration information of the crawler task; and a first injection module, configured to inject the multiple URLs into a designated field of the crawler task, where the designated field is an element set that includes multiple element objects, each of which stores one URL injected into the crawler task.
Specifically, crawl information can be configured for each crawler task before crawling, and this configuration information can be added to the corresponding attributes (or fields) of the crawler task. When the crawler task is generated from multiple URLs, the configuration information is added to the corresponding fields and the multiple URLs are added to the designated field UrlList of the crawler task. This field is a generic List collection (the element set mentioned above), and each element of the set corresponds to one Url object.
In the above embodiments, the task can be injected into the crawler through a WebApi module (Web application programming interface).
Further, the execution unit includes: a splitting module, configured to split the crawler task into multiple subtasks, where each subtask corresponds to one URL to crawl; and an execution module, configured to execute the multiple subtasks to obtain the crawl results for the multiple URLs.
Further, the injection unit includes: an injection module, configured to serialize the crawler task to obtain task information and inject the task information into the crawler program. The splitting module includes: a deserialization module, configured to deserialize the task information in the crawler program into the crawler task; and a splitting submodule, configured to split the crawler task into multiple subtasks based on the multiple URLs corresponding to the crawler task, so that each subtask corresponds to one URL.
The temporary crawler framework of the crawler program contains a serializer for temporary crawler task information (the serializer is a processing module of the monitoring agent). It is mainly used to serialize and deserialize the information in crawler tasks. When the serializer serializes the data of an injected crawler task, each Url object in the UrlList field of the injected task is converted to a string and placed in a JSON array (JArray) whose key is url. When the task is deserialized, the entries under the key url are added one by one to the UrlList attribute of the crawler task.
Further, the execution module is specifically configured to: serialize each subtask, and send each serialized subtask to a crawl message queue created in advance; and start the crawler program and execute each subtask in the crawl message queue to obtain the crawl result for the URL corresponding to each subtask.
The embodiments of the present invention are for illustration only, do not represent the quality of embodiment.
In the above embodiment of the present invention, all emphasize particularly on different fields to the description of each embodiment, do not have in some embodiment The part of detailed description may refer to the associated description of other embodiment.
In several embodiments provided herein, it should be understood that disclosed technology contents can pass through others Mode is realized.Wherein, the apparatus embodiments described above are merely exemplary, such as the division of the unit, Ke Yiwei A kind of division of logic function, can there is an other dividing mode in actual implementation, for example, multiple units or component can combine or Person is desirably integrated into another system or some features can be ignored or does not perform.Another point, shown or discussed is mutual Between coupling, direct-coupling or communication connection can be INDIRECT COUPLING or communication link by some interfaces, unit or module It connects, can be electrical or other forms.
The unit illustrated as separating component may or may not be physically separate, be shown as unit The component shown may or may not be physical unit, you can be located at a place or can also be distributed to multiple On unit.Some or all of unit therein can be selected according to the actual needs to realize the purpose of this embodiment scheme.
In addition, each functional unit in each embodiment of the present invention can be integrated in a processing unit, it can also That each unit is individually physically present, can also two or more units integrate in a unit.Above-mentioned integrated list The form that hardware had both may be used in member is realized, can also be realized in the form of SFU software functional unit.
If the integrated unit is realized in the form of SFU software functional unit and is independent product sale or uses When, it can be stored in a computer read/write memory medium.Based on such understanding, technical scheme of the present invention is substantially The part to contribute in other words to the prior art or all or part of the technical solution can be in the form of software products It embodies, which is stored in a storage medium, is used including some instructions so that a computer Equipment (can be personal computer, server or network equipment etc.) perform each embodiment the method for the present invention whole or Part steps.And aforementioned storage medium includes:USB flash disk, read-only memory (ROM, Read-Only Memory), arbitrary access are deposited Reservoir (RAM, Random Access Memory), mobile hard disk, magnetic disc or CD etc. are various can to store program code Medium.
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for processing batch crawler tasks, characterized by comprising:
obtaining a plurality of crawl URLs that have the same configuration information;
generating a crawler task based on the configuration information and the plurality of crawl URLs;
injecting the crawler task into a crawler program;
executing the crawler task using the crawler program to obtain crawl results for the plurality of crawl URLs.
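The four steps of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `CrawlerTask`, `generate_task`, `CrawlerProgram`, and `fake_fetch` are hypothetical, and the fetch function is a stand-in that performs no network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CrawlerTask:
    """One task carrying a shared configuration and many crawl URLs (hypothetical)."""
    config: Dict[str, str]
    urls: List[str] = field(default_factory=list)


def generate_task(config: Dict[str, str], urls: List[str]) -> CrawlerTask:
    """Build a single crawler task from one configuration and a batch of URLs."""
    return CrawlerTask(config=dict(config), urls=list(urls))


class CrawlerProgram:
    """Hypothetical crawler program that accepts injected tasks and runs them."""

    def __init__(self, fetch: Callable[[str, Dict[str, str]], str]) -> None:
        self._fetch = fetch
        self._tasks: List[CrawlerTask] = []

    def inject(self, task: CrawlerTask) -> None:
        """Hand a task to the crawler program."""
        self._tasks.append(task)

    def run(self) -> Dict[str, str]:
        """Execute every injected task and map each URL to its crawl result."""
        results: Dict[str, str] = {}
        for task in self._tasks:
            for url in task.urls:
                results[url] = self._fetch(url, task.config)
        return results


def fake_fetch(url: str, config: Dict[str, str]) -> str:
    """Stand-in for a real download; it only echoes its inputs."""
    return f"{config['parser']}:{url}"


# Example values (assumed): two URLs that share one configuration.
program = CrawlerProgram(fake_fetch)
program.inject(generate_task({"parser": "html"},
                             ["http://a.example", "http://b.example"]))
crawl_results = program.run()
```

The point of the sketch is that the configuration is supplied once for the whole batch, while the result map still has one entry per URL.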
2. The processing method according to claim 1, characterized in that, after executing the crawler task using the crawler program to obtain the crawl results for the plurality of crawl URLs, the method further comprises: recording the crawl results for the plurality of crawl URLs in the same result file.
3. The processing method according to claim 2, characterized in that each crawl URL corresponds to an identifier, and recording the crawl results for the plurality of crawl URLs in the same result file comprises:
recording the crawl results for the plurality of crawl URLs in the same result file according to the identifier corresponding to each crawl URL.
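One possible way to record per-URL results in a single file keyed by identifier is sketched below. The JSON Lines layout, the function name `record_results`, and the identifier values are assumptions made for illustration; the claims do not specify a file format.

```python
import io
import json
from typing import Dict, TextIO


def record_results(results: Dict[str, str],
                   url_ids: Dict[str, str],
                   result_file: TextIO) -> None:
    """Write one line per crawl URL into the same file, tagged with the URL's identifier."""
    for url, result in results.items():
        line = {"id": url_ids[url], "url": url, "result": result}
        result_file.write(json.dumps(line) + "\n")


# Example values (assumed). An in-memory buffer stands in for the result file.
result_buffer = io.StringIO()
record_results(
    {"http://a.example": "page A", "http://b.example": "page B"},
    {"http://a.example": "id-1", "http://b.example": "id-2"},
    result_buffer,
)
recorded_lines = [json.loads(line) for line in result_buffer.getvalue().splitlines()]
```

Because every line carries its identifier, results for different URLs remain distinguishable even though they share one file.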
4. The processing method according to any one of claims 1-3, characterized in that generating the crawler task based on the configuration information and the plurality of crawl URLs comprises:
using the configuration information as the configuration information of the crawler task;
injecting the plurality of crawl URLs into a specific field of the crawler task, wherein the specific field is an element set that comprises a plurality of element objects, and each element object stores one crawl URL injected into the crawler task.
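The "specific field" of claim 4 can be pictured as a list-valued field whose elements each hold one URL. The sketch below assumes hypothetical names (`UrlElement`, `BatchTask`, `url_elements`, `build_task`); the patent does not name the field or the element type.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UrlElement:
    """One element object; it stores exactly one crawl URL."""
    url: str


@dataclass
class BatchTask:
    """Crawler task whose URL field is a collection of element objects."""
    config: Dict[str, str]
    url_elements: List[UrlElement] = field(default_factory=list)


def build_task(config: Dict[str, str], urls: List[str]) -> BatchTask:
    """Reuse the shared configuration and inject every URL into the element set."""
    task = BatchTask(config=dict(config))
    task.url_elements.extend(UrlElement(url) for url in urls)
    return task


# Example values (assumed).
batch_task = build_task({"depth": "1"}, ["http://a.example", "http://b.example"])
```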
5. The processing method according to any one of claims 1-3, characterized in that executing the crawler task using the crawler program to obtain the crawl results for the plurality of crawl URLs comprises:
splitting the crawler task into a plurality of subtasks, wherein each subtask corresponds to one crawl URL;
executing the plurality of subtasks to obtain the crawl results for the plurality of crawl URLs.
6. The processing method according to claim 5, characterized in that injecting the crawler task into the crawler program comprises: serializing the crawler task to obtain task information, and injecting the task information into the crawler program;
splitting the crawler task into a plurality of subtasks comprises: deserializing the task information in the crawler program into the crawler task; and splitting the crawler task into the plurality of subtasks based on the plurality of crawl URLs corresponding to the crawler task, so that each subtask corresponds to one crawl URL.
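The serialize, deserialize, and split sequence of claims 5 and 6 might look like the following. JSON is used here only as an assumed serialization format, and `serialize_task`, `split_task`, and the dictionary keys are hypothetical.

```python
import json
from typing import Dict, List


def serialize_task(task: Dict) -> str:
    """Turn the crawler task into task information that can be injected as text."""
    return json.dumps(task, sort_keys=True)


def split_task(task_info: str) -> List[Dict]:
    """Deserialize the task information, then emit one subtask per crawl URL."""
    task = json.loads(task_info)
    return [{"config": task["config"], "url": url} for url in task["urls"]]


# Example values (assumed).
task_info = serialize_task({
    "config": {"parser": "html"},
    "urls": ["http://a.example", "http://b.example"],
})
subtasks = split_task(task_info)
```

Each subtask carries its own copy of the shared configuration, so it can be executed independently of its siblings.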
7. The processing method according to claim 5, characterized in that executing the plurality of subtasks to obtain the crawl results for the plurality of crawl URLs comprises:
serializing each subtask, and sending each serialized subtask to a pre-created crawl message queue;
starting the crawler program and executing each subtask in the crawl message queue to obtain the crawl result of the crawl URL corresponding to each subtask.
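The queue-based execution in claim 7 can be sketched with Python's in-process `queue.Queue` standing in for the crawl message queue. A real deployment would likely use a message broker; that choice, the function names, and the stand-in fetch function are all assumptions.

```python
import json
import queue
from typing import Callable, Dict, List


def enqueue_subtasks(subtasks: List[Dict], crawl_queue: "queue.Queue[str]") -> None:
    """Serialize each subtask and send it to the pre-created crawl queue."""
    for subtask in subtasks:
        crawl_queue.put(json.dumps(subtask))


def run_crawler(crawl_queue: "queue.Queue[str]",
                fetch: Callable[[str, Dict], str]) -> Dict[str, str]:
    """Drain the queue, executing one subtask per message."""
    results: Dict[str, str] = {}
    while True:
        try:
            message = crawl_queue.get_nowait()
        except queue.Empty:
            break  # no subtasks left
        subtask = json.loads(message)
        results[subtask["url"]] = fetch(subtask["url"], subtask["config"])
        crawl_queue.task_done()
    return results


# Example values (assumed); the fetch function only echoes its input.
crawl_queue: "queue.Queue[str]" = queue.Queue()  # created before any subtask is sent
enqueue_subtasks(
    [{"config": {"parser": "html"}, "url": "http://a.example"},
     {"config": {"parser": "html"}, "url": "http://b.example"}],
    crawl_queue,
)
queue_results = run_crawler(crawl_queue, lambda url, config: "fetched " + url)
```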
8. A device for processing batch crawler tasks, characterized by comprising:
a first acquisition unit, configured to obtain a plurality of crawl URLs that have the same configuration information;
a generation unit, configured to generate a crawler task based on the configuration information and the plurality of crawl URLs;
an injection unit, configured to inject the crawler task into a crawler program;
an execution unit, configured to execute the crawler task using the crawler program to obtain crawl results for the plurality of crawl URLs.
9. The processing device according to claim 8, characterized in that the processing device comprises:
a recording unit, configured to record the crawl results for the plurality of crawl URLs in the same result file after the crawler task is executed using the crawler program and the crawl results for the plurality of crawl URLs are obtained.
10. The processing device according to claim 9, characterized in that the recording unit comprises:
a recording module, configured to record the crawl results for the plurality of crawl URLs in the same result file according to the identifier corresponding to each crawl URL.
CN201611261546.8A 2016-12-30 2016-12-30 Processing method and device for batch crawler tasks Active CN108268498B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611261546.8A CN108268498B (en) 2016-12-30 2016-12-30 Processing method and device for batch crawler tasks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611261546.8A CN108268498B (en) 2016-12-30 2016-12-30 Processing method and device for batch crawler tasks

Publications (2)

Publication Number Publication Date
CN108268498A (en) 2018-07-10
CN108268498B (en) 2021-06-22

Family

ID=62753741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611261546.8A Active CN108268498B (en) 2016-12-30 2016-12-30 Processing method and device for batch crawler tasks

Country Status (1)

Country Link
CN (1) CN108268498B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452463A (en) * 2007-12-05 2009-06-10 浙江大学 Method and apparatus for directionally grabbing page resource
US20130144834A1 (en) * 2008-07-21 2013-06-06 Google Inc. Uniform resource locator canonicalization
US20160147749A1 (en) * 2011-05-04 2016-05-26 Yahoo! Inc. Dynamically determining the relatedness of web objects
CN102184227A (en) * 2011-05-10 2011-09-14 北京邮电大学 General crawler engine system used for WEB service and working method thereof
US20130117252A1 (en) * 2011-11-09 2013-05-09 Google Inc. Large-scale real-time fetch service
CN102999549A (en) * 2012-09-25 2013-03-27 金博 Method for realizing web crawler tasks
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
US20160171104A1 (en) * 2013-09-30 2016-06-16 International Business Machines Corporation Detecting multistep operations when interacting with web applications
CN103745017A (en) * 2014-02-10 2014-04-23 北界创想(北京)软件有限公司 Information capturing device and method
CN105279272A (en) * 2015-10-30 2016-01-27 南京未来网络产业创新有限公司 Content aggregation method based on distributed web crawlers
CN105868258A (en) * 2015-12-28 2016-08-17 乐视网信息技术(北京)股份有限公司 Crawler system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zou Hailiang, Sun Li: "Customizable focused web crawler" (可定制的聚焦网络爬虫), Electronic Science and Technology (《电子科技》) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020066A (en) * 2017-07-31 2019-07-16 北京国双科技有限公司 A kind of method and device of past crawler platform note task
CN110020066B (en) * 2017-07-31 2021-09-07 北京国双科技有限公司 Method and device for annotating tasks to crawler platform
CN111125478A (en) * 2018-10-30 2020-05-08 北京国双科技有限公司 Data crawling method and device
CN111125478B (en) * 2018-10-30 2023-05-12 北京国双科技有限公司 Data crawling method and device
CN109951739A (en) * 2019-03-27 2019-06-28 北京市博汇科技股份有限公司 Video traffic processing method, device and electronic equipment
CN109951739B (en) * 2019-03-27 2021-06-08 北京市博汇科技股份有限公司 Video service processing method and device and electronic equipment

Also Published As

Publication number Publication date
CN108268498B (en) 2021-06-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant