CN107305548A - Task allocation method and device for controlling web crawlers - Google Patents

Task allocation method and device for controlling web crawlers

Info

Publication number
CN107305548A
Authority
CN
China
Prior art keywords
thread
task
semaphore
parallel count
multithreading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610243866.4A
Other languages
Chinese (zh)
Other versions
CN107305548B (en
Inventor
杨杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201610243866.4A
Publication of CN107305548A
Application granted
Publication of CN107305548B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a task allocation method and device for controlling a web crawler. The web crawler executes tasks using multiple threads, and the threads are stored in a thread pool in advance. The method includes: judging, by means of a semaphore, whether the number of task threads among the multiple threads has reached the maximum number of parallel tasks, where the initial value of the semaphore is the maximum number of parallel tasks and a task thread is a thread that has passed the semaphore; when it is judged by the semaphore that the number of task threads has reached the maximum number of parallel tasks, preventing the threads in the thread pool from obtaining tasks from the URL queue; and when it is judged by the semaphore that the number of task threads has not reached the maximum number of parallel tasks, controlling a thread in the thread pool to obtain a task from the URL queue. The application solves the technical problem in the related art that allocating web crawler tasks through an intermediate controller makes the crawler system relatively complex.

Description

Task allocation method and device for controlling web crawlers
Technical field
The application relates to the Internet field, and in particular to a task allocation method and device for controlling a web crawler.
Background
When crawling web pages, a web crawler can start from the URLs (Uniform Resource Locators) of one or several initial pages, extract all URLs on the initial pages, and put them into a URL queue. The web crawler then obtains a new URL from the URL queue and continues crawling. Existing methods allocate tasks to the web crawler through an intermediate controller, so task allocation depends heavily on that controller. When the intermediate controller behaves abnormally, the web crawler may be allocated no tasks or too many tasks. If the web crawler is allocated no tasks, it stays idle and machine resources are wasted; if it executes too many tasks at the same time, it may crash, causing the loss of tasks and data and more serious consequences.
Specifically, the existing method of allocating web crawler tasks through an intermediate controller has the following disadvantages. First, a separate intermediate controller program must be written to allocate crawl tasks to the web crawler, which makes the whole crawler system complex and hard to maintain. Second, the allocation of web crawler tasks depends heavily on the intermediate controller program; when that program crashes abnormally, data may be lost or machine resources wasted.
No effective solution has yet been proposed for the problem in the related art that allocating web crawler tasks through an intermediate controller makes the crawler system relatively complex.
Summary of the invention
The main purpose of the application is to provide a task allocation method and device for controlling a web crawler, so as to solve the problem in the related art that allocating web crawler tasks through an intermediate controller makes the crawler system relatively complex.
To achieve the above purpose, according to one aspect of the application, a task allocation method for controlling a web crawler is provided. The web crawler executes tasks using multiple threads, and the threads are stored in a thread pool in advance. The method includes: judging, by means of a semaphore, whether the number of task threads among the multiple threads has reached the maximum number of parallel tasks, where the initial value of the semaphore is the maximum number of parallel tasks and a task thread is a thread that has passed the semaphore; when it is judged by the semaphore that the number of task threads has reached the maximum number of parallel tasks, preventing the threads in the thread pool from obtaining tasks from the URL queue; and when it is judged by the semaphore that the number of task threads has not reached the maximum number of parallel tasks, controlling a thread in the thread pool to obtain a task from the URL queue.
Further, judging by the semaphore whether the number of task threads has reached the maximum number of parallel tasks includes: judging whether the value of the semaphore is 0; when the value of the semaphore is 0, determining that the number of task threads has reached the maximum number of parallel tasks; and when the value of the semaphore is not 0, determining that the number of task threads has not reached the maximum number of parallel tasks.
Further, after controlling a thread in the thread pool to obtain a task from the URL queue when the number of task threads has not reached the maximum number of parallel tasks, the method also includes: decrementing the value of the semaphore by 1. When the task of a task thread is completed or cancelled, the method also includes: incrementing the value of the semaphore by 1.
Further, after preventing the threads in the thread pool from obtaining tasks from the URL queue when the number of task threads has reached the maximum number of parallel tasks, the method also includes: controlling the threads in the thread pool to enter a waiting state.
Further, before judging by the semaphore whether the number of task threads has reached the maximum number of parallel tasks, the method also includes: when the web crawler starts, reading the maximum number of parallel tasks from a database in which it is stored in advance; and assigning the maximum number of parallel tasks to the semaphore as its initial value.
To achieve the above purpose, according to another aspect of the application, a task allocation device for controlling a web crawler is provided. The web crawler executes tasks using multiple threads, and the threads are stored in a thread pool in advance. The device includes: a judging unit, configured to judge by means of a semaphore whether the number of task threads among the multiple threads has reached the maximum number of parallel tasks, where the initial value of the semaphore is the maximum number of parallel tasks and a task thread is a thread that has passed the semaphore; a preventing unit, configured to prevent the threads in the thread pool from obtaining tasks from the URL queue when it is judged by the semaphore that the number of task threads has reached the maximum number of parallel tasks; and a first control unit, configured to control a thread in the thread pool to obtain a task from the URL queue when it is judged by the semaphore that the number of task threads has not reached the maximum number of parallel tasks.
Further, the judging unit includes: a judging module, configured to judge whether the value of the semaphore is 0; a first determining module, configured to determine that the number of task threads has reached the maximum number of parallel tasks when the value of the semaphore is 0; and a second determining module, configured to determine that the number of task threads has not reached the maximum number of parallel tasks when the value of the semaphore is not 0.
Further, the device also includes: a decrementing unit, configured to decrement the value of the semaphore by 1 after a thread in the thread pool is controlled to obtain a task from the URL queue when the number of task threads has not reached the maximum number of parallel tasks; and an incrementing unit, configured to increment the value of the semaphore by 1 when the task of a task thread is completed or cancelled.
Further, the device also includes: a second control unit, configured to control the threads in the thread pool to enter a waiting state after the threads in the thread pool are prevented from obtaining tasks from the URL queue when the number of task threads has reached the maximum number of parallel tasks.
Further, the device also includes: a reading unit, configured to read the maximum number of parallel tasks from a database when the web crawler starts, where the maximum number of parallel tasks is stored in the database in advance; and an assignment unit, configured to assign the maximum number of parallel tasks to the semaphore as its initial value.
In the application, the web crawler executes tasks using multiple threads that are stored in a thread pool in advance. A semaphore, whose initial value is the maximum number of parallel tasks, is used to judge whether the number of task threads (threads that have passed the semaphore) has reached the maximum number of parallel tasks. When the number of task threads has reached the maximum, the threads in the thread pool are prevented from obtaining tasks from the URL queue; when it has not, a thread in the thread pool is controlled to obtain a task from the URL queue. Because the number of crawl tasks is controlled by the semaphore, the structure of the crawler system is simplified and the web crawler is kept from executing too many or too few tasks. This solves the problem in the related art that allocating web crawler tasks through an intermediate controller makes the crawler system relatively complex, and thereby achieves the effect of simplifying the structure of the crawler system.
Brief description of the drawings
The accompanying drawings, which form a part of the application, are provided for further understanding of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation on it. In the drawings:
Fig. 1 is a flow chart of the task allocation method for controlling a web crawler according to an embodiment of the application; and
Fig. 2 is a schematic diagram of the task allocation device for controlling a web crawler according to an embodiment of the application.
Detailed description
It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with each other. The application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of the application, the technical solutions in the embodiments of the application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the application without creative effort shall fall within the scope of protection of the application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way may be interchanged where appropriate, so that the embodiments described herein can be implemented. In addition, the terms "comprising" and "having" and any variations of them are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
For ease of description, some concepts and terms involved in the application are explained below:
Crawler system: a system for crawling web page content, in which a web crawler is deployed.
Web crawler: a program or script that automatically captures information on the Internet according to preset rules.
According to an embodiment of the application, a task allocation method for controlling a web crawler is provided. The web crawler of the embodiment executes tasks using multiple threads, and the threads are stored in a thread pool in advance. Fig. 1 is a flow chart of the task allocation method for controlling a web crawler according to the embodiment. As shown in Fig. 1, the method includes the following steps S102 to S106:
Step S102: judge, by means of a semaphore, whether the number of task threads among the multiple threads has reached the maximum number of parallel tasks, where the initial value of the semaphore is the maximum number of parallel tasks and a task thread is a thread that has passed the semaphore.
The semaphore of the embodiment is a counter-based multithread synchronization mechanism. In a multithreaded environment, the semaphore coordinates the threads to ensure that each thread uses public resources correctly. Specifically, the semaphore controls access to a shared resource through a counter. The value of the semaphore is a non-negative integer, and every thread that passes the semaphore decrements that integer by 1. If the value of the semaphore is greater than 0, access is allowed; if the value of the semaphore is 0, access is prohibited, and all threads trying to pass the semaphore wait.
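The counter behaviour described above can be illustrated with a minimal sketch using Python's standard `threading.Semaphore`. The choice of Python and the limit of 2 are assumptions made for illustration; the application does not name a language or library.

```python
import threading

# Assumed small limit for illustration; the description later uses 500 as an example.
MAX_PARALLEL_TASKS = 2

semaphore = threading.Semaphore(MAX_PARALLEL_TASKS)

# Each successful acquire() decrements the internal counter by 1.
first = semaphore.acquire(blocking=False)    # counter 2 -> 1, access allowed
second = semaphore.acquire(blocking=False)   # counter 1 -> 0, access allowed

# The counter is now 0, so a further non-blocking attempt is refused.
third = semaphore.acquire(blocking=False)

# release() increments the counter, so access is allowed again.
semaphore.release()                          # counter 0 -> 1
fourth = semaphore.acquire(blocking=False)   # counter 1 -> 0

print(first, second, third, fourth)
```

With a blocking `acquire()` the refused thread would wait instead of returning `False`, which corresponds to the waiting state described in the text.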
The maximum number of parallel tasks in the embodiment is the largest number of tasks the web crawler may execute in parallel. It can be set according to the performance of the crawler system, for example 500, meaning that at most 500 tasks can be processed in parallel. The embodiment uses the maximum number of parallel tasks as the initial value of the semaphore, which is also the maximum value of the semaphore. Optionally, the maximum number of parallel tasks can be specified in the program in advance or obtained from a database while the program is running. Preferably, for ease of later maintenance and modification, before judging by the semaphore whether the number of task threads has reached the maximum number of parallel tasks, the method also includes: when the web crawler starts, reading the maximum number of parallel tasks from a database in which it is stored in advance; and assigning the maximum number of parallel tasks to the semaphore as its initial value.
The database of the embodiment can be a database in the crawler system, for example a remote database that stores the configuration items of the crawler system. Specifically, the field and value of the maximum number of parallel tasks can be configured in the database in advance; when the web crawler starts, the value is obtained from the database by that field and assigned to the semaphore. Because the web crawler reads the value from the database, a user can change the maximum number of parallel tasks at any time as needed. In addition, web crawlers are usually deployed on many machines, and reading the value from the database makes it easy to change the maximum number of parallel tasks for the web crawlers on all of those machines quickly.
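A minimal sketch of this start-up step follows. An in-memory SQLite database stands in for the remote configuration database, and the table name `crawler_config`, the field name `max_parallel_tasks` and the helper `load_max_parallel_tasks` are hypothetical names chosen for illustration; none of them come from the application.

```python
import sqlite3
import threading

def load_max_parallel_tasks(conn, field="max_parallel_tasks"):
    """Read the configured limit by field name (hypothetical schema)."""
    row = conn.execute(
        "SELECT value FROM crawler_config WHERE name = ?", (field,)
    ).fetchone()
    if row is None:
        raise KeyError(f"configuration field {field!r} not found")
    value = int(row[0])
    if value <= 0:
        raise ValueError("maximum number of parallel tasks must be positive")
    return value

# Stand-in for the remote configuration database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawler_config (name TEXT PRIMARY KEY, value TEXT)")
conn.execute(
    "INSERT INTO crawler_config VALUES (?, ?)", ("max_parallel_tasks", "500")
)

# At crawler start-up: read the value and use it as the semaphore's initial value.
max_parallel = load_max_parallel_tasks(conn)
semaphore = threading.BoundedSemaphore(max_parallel)
```

`BoundedSemaphore` is used here because the text states that the initial value is also the maximum value of the semaphore; it raises `ValueError` if released more times than it was acquired.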
In the embodiment, the web crawler executes tasks using multiple threads, so every thread of the web crawler must pass the semaphore when it executes a task. Specifically, when a thread goes to execute a task, it decrements the value of the semaphore by 1. When the number of task threads reaches the maximum number of parallel tasks, the value of the semaphore has been reduced to 0, and other threads must wait until the value of the semaphore is greater than 0, that is, until the semaphore is released.
Optionally, judging by the semaphore whether the number of task threads has reached the maximum number of parallel tasks includes: judging whether the value of the semaphore is 0; when the value of the semaphore is 0, determining that the number of task threads has reached the maximum number of parallel tasks; and when the value of the semaphore is not 0, determining that the number of task threads has not reached the maximum number of parallel tasks.
As described above, when the number of task threads reaches the maximum number of parallel tasks, the value of the semaphore has been reduced to 0. The value of the semaphore can therefore be used to judge whether the number of task threads has reached the maximum number of parallel tasks: a value of 0 means the maximum has been reached, and a value greater than 0 means it has not.
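The zero check can be sketched as follows. Python's `threading.Semaphore` does not expose its counter through a public attribute, so this sketch assumes a small hand-written `CountingSemaphore` class; the class and the helper `limit_reached` are illustrative names, not part of the application.

```python
import threading

class CountingSemaphore:
    """Counting semaphore that exposes its current value (illustrative)."""

    def __init__(self, initial):
        if initial < 0:
            raise ValueError("semaphore value must be non-negative")
        self._value = initial
        self._cond = threading.Condition()

    @property
    def value(self):
        with self._cond:
            return self._value

    def acquire(self):
        with self._cond:
            while self._value == 0:      # value 0: access prohibited, wait
                self._cond.wait()
            self._value -= 1             # passing the semaphore subtracts 1

    def release(self):
        with self._cond:
            self._value += 1             # finishing a task adds 1
            self._cond.notify()

def limit_reached(sem):
    """True when the number of task threads equals the configured maximum."""
    return sem.value == 0

sem = CountingSemaphore(2)
before = limit_reached(sem)      # two permits free
sem.acquire()
sem.acquire()
at_limit = limit_reached(sem)    # value reduced to 0
sem.release()
after = limit_reached(sem)       # one permit free again
```

In a real multithreaded program, checking the value and then acquiring in two separate steps is racy; `acquire()` performs the check and the decrement atomically, which is why the check is normally left implicit.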
Step S104: when it is judged by the semaphore that the number of task threads has reached the maximum number of parallel tasks, prevent the threads in the thread pool from obtaining tasks from the URL queue.
To prevent the web crawler from crashing because it executes too many tasks, when the number of task threads has reached the maximum number of parallel tasks, the threads in the thread pool are prevented from obtaining tasks from the URL queue; that is, no new thread is taken from the thread pool to obtain and execute a task from the URL queue. The URL queue is a queue that stores URLs to be crawled, and the web crawler crawls web pages by obtaining URLs to be crawled from it. Optionally, after the threads in the thread pool are prevented from obtaining tasks from the URL queue, the method also includes: controlling the threads in the thread pool to enter a waiting state.
Specifically, when the value of the semaphore is 0, all threads in the thread pool wait until some task thread has finished or cancelled its task and the semaphore is released (that is, the value of the semaphore is greater than 0). Only then is a waiting thread allowed to obtain a task from the URL queue and execute it.
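The waiting behaviour can be observed with a short sketch, again assuming Python's `threading` module. A limit of 1 is used so that a single held permit brings the semaphore to 0; the `got_task` event is a stand-in for fetching from the URL queue.

```python
import threading

semaphore = threading.Semaphore(1)
semaphore.acquire()            # a task thread holds the only permit; value is 0

got_task = threading.Event()   # stands in for "obtained a task from the URL queue"

def waiting_worker():
    semaphore.acquire()        # blocks while the value is 0 (waiting state)
    try:
        got_task.set()
    finally:
        semaphore.release()

worker = threading.Thread(target=waiting_worker)
worker.start()

# While the permit is held, the worker cannot proceed.
blocked = not got_task.wait(timeout=0.2)

# The running task finishes or is cancelled: the semaphore is released.
semaphore.release()
worker.join(timeout=5)
proceeded = got_task.is_set()

print(blocked, proceeded)
```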
Step S106: when it is judged by the semaphore that the number of task threads has not reached the maximum number of parallel tasks, control a thread in the thread pool to obtain a task from the URL queue.
To prevent the web crawler from executing too few tasks and wasting machine resources, when the number of task threads has not reached the maximum number of parallel tasks, a thread in the thread pool is controlled to obtain a task from the URL queue; that is, a thread is taken from the thread pool and obtains a task from the URL queue. Optionally, after the thread in the thread pool is controlled to obtain a task from the URL queue, the method also includes: decrementing the value of the semaphore by 1.
It should be noted that when the task of a task thread is completed or cancelled, the method also includes: incrementing the value of the semaphore by 1.
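The decrement and increment steps can be sketched for a single thread as follows. The function `run_one_task`, the exception `TaskCancelled` and the stand-in `fake_crawl` are hypothetical names for illustration. The sketch acquires the semaphore before fetching the URL, which is the usual way to combine the check and the decrement atomically; the text describes the decrement as following the fetch, so this ordering is an assumption.

```python
import queue
import threading

class TaskCancelled(Exception):
    """Raised by a crawl function to signal cancellation (illustrative)."""

def run_one_task(semaphore, url_queue, crawl):
    semaphore.acquire()                    # value - 1: thread becomes a task thread
    try:
        try:
            url = url_queue.get_nowait()   # obtain a task from the URL queue
        except queue.Empty:
            return None
        return crawl(url)
    finally:
        semaphore.release()                # value + 1 on completion or cancellation

def fake_crawl(url):
    # Stand-in for downloading a page; cancels when the URL says so.
    if "cancel" in url:
        raise TaskCancelled(url)
    return f"crawled {url}"

urls = queue.Queue()
urls.put("http://example.com/a")
urls.put("http://example.com/cancel")

sem = threading.BoundedSemaphore(2)
result = run_one_task(sem, urls, fake_crawl)       # completes normally

try:
    run_one_task(sem, urls, fake_crawl)            # cancelled mid-task
    cancelled = False
except TaskCancelled:
    cancelled = True

# Both permits were returned, so exactly two non-blocking acquires succeed.
permits_free = [sem.acquire(blocking=False) for _ in range(3)]
```

Placing `release()` in a `finally` block is one way to make sure the value is restored whether the task finishes, is cancelled or fails.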
In the embodiment, the web crawler executes tasks using multiple threads, and a semaphore whose initial value is the maximum number of parallel tasks is used to judge whether the number of task threads (threads that have passed the semaphore) has reached that maximum. When the maximum has been reached, the threads in the thread pool are prevented from obtaining tasks from the URL queue; when it has not, a thread in the thread pool is controlled to obtain a task from the URL queue. Because the number of crawl tasks is controlled by the semaphore, no additional intermediate controller is needed. This simplifies the structure of the crawler system and keeps the web crawler from executing too many or too few tasks, which solves the problem in the related art that allocating web crawler tasks through an intermediate controller makes the crawler system relatively complex, and thereby achieves the effect of simplifying the structure of the crawler system.
A task allocation method for controlling a web crawler according to another embodiment of the application includes the following steps:
Step S202: configure the maximum number of parallel tasks in a remote database.
Specifically, the field and value of the maximum number of parallel tasks are configured in the remote database.
Step S204: the web crawler initializes the maximum number of parallel tasks.
Specifically, when the web crawler starts, the value of the maximum number of parallel tasks is obtained from the remote database by its field.
Step S206: assign the value of the maximum number of parallel tasks to the semaphore, and control the task allocation of the web crawler through the semaphore.
The semaphore of the embodiment is a counter-based multithread synchronization mechanism. In a multithreaded environment, the semaphore coordinates the threads to ensure that each thread uses public resources correctly. The semaphore controls access to a shared resource through a counter. The value of the semaphore is a non-negative integer, and every thread that passes the semaphore decrements that integer by 1. Specifically, if the value of the semaphore is greater than 0, access is allowed and the value of the semaphore is decremented by 1; if the value of the semaphore is 0, access is prohibited, and all threads attempting to pass the semaphore wait.
Specifically, in the crawler framework, the web crawler executes tasks using multiple threads, and all threads are placed in a thread pool. After a task thread has finished a task, a thread is taken from the thread pool to execute the next task. Every thread must pass the semaphore to execute a task. When a thread goes to execute a task, it decrements the value of the semaphore by 1. When the number of task threads reaches the maximum number of parallel tasks, the value of the semaphore has been reduced to 0, and the other threads (that is, the threads in the thread pool) must wait until the value of the semaphore is greater than 0, that is, until the semaphore is released. When the task of some task thread is completed or cancelled, the semaphore is released, that is, its value is incremented by 1, and a new thread (for example, a thread waiting in the thread pool) can obtain a task from the URL queue and execute it. The embodiment thus controls the number of tasks the web crawler crawls through the semaphore and keeps the crawler from executing too many or too few tasks.
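Putting the pieces together, the flow above can be sketched with a thread pool, a URL queue and a semaphore. The limits, the example URLs and the `time.sleep` call standing in for a page download are assumptions; the sketch records the peak number of simultaneous task threads to show that it does not exceed the semaphore's initial value.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_TASKS = 3   # assumed limit; would normally come from configuration
POOL_SIZE = 8            # assumed thread-pool size, larger than the limit

semaphore = threading.Semaphore(MAX_PARALLEL_TASKS)
url_queue = queue.Queue()
for i in range(20):
    url_queue.put(f"http://example.com/page/{i}")

lock = threading.Lock()
stats = {"running": 0, "peak": 0}
crawled = []

def worker():
    while True:
        semaphore.acquire()                  # wait here while the value is 0
        try:
            try:
                url = url_queue.get_nowait()
            except queue.Empty:
                return                       # no tasks left; finally still releases
            with lock:
                stats["running"] += 1
                stats["peak"] = max(stats["peak"], stats["running"])
            time.sleep(0.01)                 # stands in for downloading the page
            with lock:
                stats["running"] -= 1
                crawled.append(url)
        finally:
            semaphore.release()              # task done: value + 1

# All threads live in the pool; the semaphore decides how many work at once.
with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    for _ in range(POOL_SIZE):
        pool.submit(worker)

print(len(crawled), stats["peak"])
```

No separate controller process appears in this sketch: each pool thread decides for itself whether it may fetch a task, based only on the shared semaphore.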
As can be seen from the above description, the embodiment configures the maximum number of parallel crawl tasks of the web crawler in a remote database, uses it as the maximum value of the semaphore, and controls the number of crawl tasks through the semaphore. This eliminates the intermediate controller and simplifies the crawler system. The web crawler can also obtain an appropriate number of tasks from the URL queue by itself, so the whole crawler system becomes easier to operate and maintain, and the web crawler can crawl data normally. This avoids the situation in which an abnormal crawler task monitor causes the crawler to crawl too many tasks at once or to sit idle for a long time, which in turn could crash the web crawler and cause data loss or wasted machine resources.
It should be noted that the steps shown in the flow chart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flow chart, in some cases the steps shown or described may be executed in a different order.
According to another aspect of the embodiments of the application, a task allocation device for controlling a web crawler is provided. The device can be used to execute the task allocation method for controlling a web crawler of the embodiments of the application, and that method can likewise be executed by this device.
It should be noted that the web crawler of the embodiment executes tasks using multiple threads. Fig. 2 is a schematic diagram of the task allocation device for controlling a web crawler according to an embodiment of the application. As shown in Fig. 2, the device includes: a judging unit 10, a preventing unit 20 and a first control unit 30.
The judging unit 10 is configured to judge by means of a semaphore whether the number of task threads among the multiple threads has reached the maximum number of parallel tasks, where the initial value of the semaphore is the maximum number of parallel tasks and a task thread is a thread that has passed the semaphore.
Optionally, the judging unit 10 includes: a judging module, configured to judge whether the value of the semaphore is 0; a first determining module, configured to determine that the number of task threads has reached the maximum number of parallel tasks when the value of the semaphore is 0; and a second determining module, configured to determine that the number of task threads has not reached the maximum number of parallel tasks when the value of the semaphore is not 0.
The preventing unit 20 is configured to prevent the threads in the thread pool from obtaining tasks from the URL queue when it is judged by the semaphore that the number of task threads has reached the maximum number of parallel tasks.
The first control unit 30 is configured to control a thread in the thread pool to obtain a task from the URL queue when it is judged by the semaphore that the number of task threads has not reached the maximum number of parallel tasks.
In the embodiment, the web crawler executes tasks using multiple threads. The judging unit 10 judges by means of a semaphore whether the number of task threads has reached the maximum number of parallel tasks, where the initial value of the semaphore is the maximum number of parallel tasks and a task thread is a thread that has passed the semaphore. The preventing unit 20 prevents the threads in the thread pool from obtaining tasks from the URL queue when the maximum has been reached, and the first control unit 30 controls a thread in the thread pool to obtain a task from the URL queue when it has not. Because the number of crawl tasks is controlled by the semaphore, the structure of the crawler system is simplified and the web crawler is kept from executing too many or too few tasks. This solves the problem in the related art that allocating web crawler tasks through an intermediate controller makes the crawler system relatively complex, and thereby achieves the effect of simplifying the structure of the crawler system.
Optionally, the device also includes: a decrementing unit, configured to decrement the value of the semaphore by 1 after a thread in the thread pool is controlled to obtain a task from the URL queue when the number of task threads has not reached the maximum number of parallel tasks; and an incrementing unit, configured to increment the value of the semaphore by 1 when the task of a task thread is completed or cancelled.
Optionally, the device also includes: a second control unit, configured to control the threads in the thread pool to enter a waiting state after the threads in the thread pool are prevented from obtaining tasks from the URL queue when the number of task threads has reached the maximum number of parallel tasks.
Preferably, the device also includes: a reading unit, configured to read the maximum number of parallel tasks from a database when the web crawler starts, where the maximum number of parallel tasks is stored in the database in advance; and an assignment unit, configured to assign the maximum number of parallel tasks to the semaphore as its initial value.
The task allocation apparatus of the control web crawlers includes processor and memory, and above-mentioned judging unit, prevention are single Member and the first control unit etc. in memory, are stored in memory as program unit storage by computing device Said procedure unit realize corresponding function.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the task allocation of the web crawler is controlled by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: determining, by means of a semaphore, whether the number of task threads among the multiple threads has reached the maximum number of parallel tasks, where the initial value of the semaphore is the maximum number of parallel tasks and a task thread is a thread that has passed the semaphore; when it is determined by means of the semaphore that the number of task threads among the multiple threads has reached the maximum number of parallel tasks, preventing the threads in the thread pool from obtaining tasks from the URL queue; and when it is determined by means of the semaphore that the number of task threads among the multiple threads has not reached the maximum number of parallel tasks, controlling the threads in the thread pool to obtain tasks from the URL queue.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present application, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present application. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present application.

Claims (10)

1. A task allocation method for controlling a web crawler, characterized in that the web crawler executes tasks using multiple threads, the multiple threads being stored in a thread pool in advance, the method comprising:
determining, by means of a semaphore, whether the number of task threads among the multiple threads has reached a maximum number of parallel tasks, wherein the initial value of the semaphore is the maximum number of parallel tasks;
when it is determined by means of the semaphore that the number of task threads among the multiple threads has reached the maximum number of parallel tasks, preventing the threads in the thread pool from obtaining tasks from a URL queue; and
when it is determined by means of the semaphore that the number of task threads among the multiple threads has not reached the maximum number of parallel tasks, controlling the threads in the thread pool to obtain tasks from the URL queue.
2. The method according to claim 1, characterized in that determining, by means of the semaphore, whether the number of task threads among the multiple threads has reached the maximum number of parallel tasks comprises:
determining whether the value of the semaphore is 0;
when the value of the semaphore is determined to be 0, determining that the number of task threads among the multiple threads has reached the maximum number of parallel tasks; and
when the value of the semaphore is determined not to be 0, determining that the number of task threads among the multiple threads has not reached the maximum number of parallel tasks.
3. The method according to claim 1 or 2, characterized in that
after the threads in the thread pool are controlled to obtain tasks from the URL queue when it is determined by means of the semaphore that the number of task threads among the multiple threads has not reached the maximum number of parallel tasks, the method further comprises: subtracting 1 from the value of the semaphore; and
when the task of a task thread is completed or cancelled, the method further comprises: adding 1 to the value of the semaphore.
4. The method according to claim 1 or 2, characterized in that, after the threads in the thread pool are prevented from obtaining tasks from the URL queue when it is determined by means of the semaphore that the number of task threads among the multiple threads has reached the maximum number of parallel tasks, the method further comprises: controlling the threads in the thread pool to enter a wait state.
5. The method according to claim 1 or 2, characterized in that, before determining by means of the semaphore whether the number of task threads among the multiple threads has reached the maximum number of parallel tasks, the method further comprises:
when the web crawler starts, reading the maximum number of parallel tasks from a database, wherein the maximum number of parallel tasks is stored in the database in advance; and
assigning the maximum number of parallel tasks to the semaphore as the initial value of the semaphore.
6. A task allocation device for controlling a web crawler, characterized in that the web crawler executes tasks using multiple threads, the multiple threads being stored in a thread pool in advance, the device comprising:
a judging unit, configured to determine, by means of a semaphore, whether the number of task threads among the multiple threads has reached a maximum number of parallel tasks, wherein the initial value of the semaphore is the maximum number of parallel tasks, and a task thread is a thread that has passed the semaphore;
a preventing unit, configured to prevent the threads in the thread pool from obtaining tasks from a URL queue when it is determined by means of the semaphore that the number of task threads among the multiple threads has reached the maximum number of parallel tasks; and
a first control unit, configured to control the threads in the thread pool to obtain tasks from the URL queue when it is determined by means of the semaphore that the number of task threads among the multiple threads has not reached the maximum number of parallel tasks.
7. The device according to claim 6, characterized in that the judging unit comprises:
a judging module, configured to determine whether the value of the semaphore is 0;
a first determining module, configured to determine, when the value of the semaphore is determined to be 0, that the number of task threads among the multiple threads has reached the maximum number of parallel tasks; and
a second determining module, configured to determine, when the value of the semaphore is determined not to be 0, that the number of task threads among the multiple threads has not reached the maximum number of parallel tasks.
8. The device according to claim 6 or 7, characterized in that the device further comprises: a decrementing unit, configured to subtract 1 from the value of the semaphore after the threads in the thread pool are controlled to obtain tasks from the URL queue when it is determined by means of the semaphore that the number of task threads among the multiple threads has not reached the maximum number of parallel tasks; and an incrementing unit, configured to add 1 to the value of the semaphore when the task of a task thread is completed or cancelled.
9. The device according to claim 6 or 7, characterized in that the device further comprises: a second control unit, configured to control the threads in the thread pool to enter a wait state after the threads in the thread pool are prevented from obtaining tasks from the URL queue when it is determined by means of the semaphore that the number of task threads among the multiple threads has reached the maximum number of parallel tasks.
10. The device according to claim 6 or 7, characterized in that the device further comprises:
a reading unit, configured to read the maximum number of parallel tasks from a database when the web crawler starts, wherein the maximum number of parallel tasks is stored in the database in advance; and
an assignment unit, configured to assign the maximum number of parallel tasks to the semaphore as the initial value of the semaphore.
CN201610243866.4A 2016-04-18 2016-04-18 Task allocation method and device for controlling web crawler Active CN107305548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610243866.4A CN107305548B (en) 2016-04-18 2016-04-18 Task allocation method and device for controlling web crawler

Publications (2)

Publication Number Publication Date
CN107305548A CN107305548A (en) 2017-10-31
CN107305548B CN107305548B (en) 2020-02-28

Family

ID=60152696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610243866.4A Active CN107305548B (en) 2016-04-18 2016-04-18 Task allocation method and device for controlling web crawler

Country Status (1)

Country Link
CN (1) CN107305548B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885590A (en) * 2017-11-30 2018-04-06 百度在线网络技术(北京)有限公司 Task processing method and device for smart machine
CN109783229A (en) * 2018-12-17 2019-05-21 平安普惠企业管理有限公司 The method and device of thread resources distribution
CN109840149A (en) * 2019-02-14 2019-06-04 百度在线网络技术(北京)有限公司 Method for scheduling task, device, equipment and storage medium
CN109857547A (en) * 2019-01-04 2019-06-07 平安科技(深圳)有限公司 A kind of thread distribution method, device and terminal device
CN110187957A (en) * 2019-05-27 2019-08-30 北京奇艺世纪科技有限公司 A kind of queuing strategy of downloading task, device and electronic equipment
CN117807294A (en) * 2024-02-28 2024-04-02 深圳市豪斯莱科技有限公司 Multithread web crawler scheduling management method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101070184B1 (en) * 2011-02-24 2011-10-07 주식회사 윈스테크넷 System and method for blocking execution of malicious code by automatically crawling and analyzing malicious code through multi-thread site-crawler, and by interworking with network security device
CN103902386A (en) * 2014-04-11 2014-07-02 复旦大学 Multi-thread network crawler processing method based on connection proxy optimal management
CN104050037A (en) * 2014-06-13 2014-09-17 淮阴工学院 Implementation method for directional crawler based on assigned e-commerce website

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885590A (en) * 2017-11-30 2018-04-06 百度在线网络技术(北京)有限公司 Task processing method and device for smart machine
US11188380B2 (en) 2017-11-30 2021-11-30 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for processing task in smart device
CN109783229A (en) * 2018-12-17 2019-05-21 平安普惠企业管理有限公司 The method and device of thread resources distribution
CN109857547A (en) * 2019-01-04 2019-06-07 平安科技(深圳)有限公司 A kind of thread distribution method, device and terminal device
CN109840149A (en) * 2019-02-14 2019-06-04 百度在线网络技术(北京)有限公司 Method for scheduling task, device, equipment and storage medium
CN109840149B (en) * 2019-02-14 2021-07-30 百度在线网络技术(北京)有限公司 Task scheduling method, device, equipment and storage medium
CN110187957A (en) * 2019-05-27 2019-08-30 北京奇艺世纪科技有限公司 A kind of queuing strategy of downloading task, device and electronic equipment
CN110187957B (en) * 2019-05-27 2022-06-03 北京奇艺世纪科技有限公司 Queuing method and device for downloading tasks and electronic equipment
CN117807294A (en) * 2024-02-28 2024-04-02 深圳市豪斯莱科技有限公司 Multithread web crawler scheduling management method and system
CN117807294B (en) * 2024-02-28 2024-05-28 深圳市豪斯莱科技有限公司 Multithread web crawler scheduling management method and system

Also Published As

Publication number Publication date
CN107305548B (en) 2020-02-28

Similar Documents

Publication Publication Date Title
CN107305548A (en) Control the method for allocating tasks and device of web crawlers
US20210099425A1 (en) Micro-segmentation of virtual computing elements
CN104239139B (en) Method, device and terminal for processing boot-strap self-starting project
DE112020000123T5 (en) PATCH MANAGEMENT IN A HYBRID DATA MANAGEMENT ENVIRONMENT
CN106529682A (en) Method and apparatus for processing deep learning task in big-data cluster
CN107018091A (en) The dispatching method and device of resource request
EP2642395B1 (en) Method and apparatus for executing work flow scripts
CN107609150A (en) A kind of interactive network reptile creation method chosen based on page elements and system
CN107526579A (en) A kind of application program page development management method and device
CN109388702B (en) Reading interaction method, electronic equipment and computer storage medium
CN106878042A (en) Container resource regulating method and system based on SLA
CN111209067A (en) Multimedia resource processing method and device, storage medium and computing equipment
CN107239563A (en) Public feelings information dynamic monitoring and controlling method
CN104811461B (en) Data push method and device
CN107357640A (en) Request processing method and device, the electronic equipment in multi-thread data storehouse
CN106897807A (en) A kind of business risk control method and equipment
CN104504004B (en) The sharing method and device shared for website
CN110064198A (en) Processing method and processing device, storage medium and the electronic device of resource
US20170168995A1 (en) Block configuration, a method of presenting, servers, terminal equipment and communications systems
CN105955747B (en) A kind of operation method of security software, relevant apparatus and electronic equipment
CN106294395B (en) A kind of method and device of task processing
CN106598726A (en) Multi-task management system and distributed deployment method thereof
CN106844467A (en) Method for exhibiting data and device
CN105989151A (en) Webpage crawling method and apparatus
CN109542617A (en) The processing method and processing device of system resource

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant