CN105302527A

CN105302527A - Thread organization method

Info

Publication number: CN105302527A
Application number: CN201510716958.5A
Authority: CN
Inventors: 马应龙; 高延太
Original assignee: North China Electric Power University
Current assignee: North China Electric Power University
Priority date: 2015-10-29
Filing date: 2015-10-29
Publication date: 2016-02-03
Anticipated expiration: 2035-10-29
Also published as: CN105302527B

Abstract

The invention discloses a thread organization method comprising the following steps. A: on the mount point loop, a thread monitors the operation of its next thread and checks whether a fault exists; if a fault exists, the problem thread is removed from the mount point loop and sent to the thread recovery area, and it is determined whether this is a repeat (non-first) occurrence of the fault; if it is a repeat occurrence, the task of the next thread is marked as a problem task and is no longer executed; if the fault is occurring for the first time, the method proceeds directly to step B. B: it is determined whether a global task needs to be executed; if a recovery task exists, the recovery task is executed to reclaim all threads in the thread recovery area, and tasks not marked as problem tasks are added to a task revival list; if no recovery task exists, the method proceeds to step C. C: it is determined whether the task revival list contains a task that needs to be revived, and if so, a new thread is generated to execute that task. The thread organization method can improve the operational stability of web crawler threads.

Description

Thread organization method

Technical field

The present invention relates to the field of computer computing-resource scheduling, and in particular to the field of applied computer network computing technology.

Background technology

A web crawler is a program or script that automatically fetches web information according to certain rules. Depending on the designer's purpose, a web crawler may be implemented in various forms. Implementations often use multithreading, and the threads often contain complex loop computations; both circumstances make it necessary, in web crawler design, to manage the running state of individual threads.

A computer thread, for its part, is the smallest unit of a program's execution flow and the basic unit that the system schedules and dispatches independently. A thread owns no system resources of its own apart from the few that are indispensable while it runs, but it shares all the resources owned by its process with the other threads belonging to the same process.

A very important component of multithreading is the daemon thread. The difference between a daemon thread and an ordinary thread is that once all ordinary threads in a process have finished running, the process terminates regardless of whether any daemon thread is still running. Daemon threads are usually used to run auxiliary functions for ordinary threads.
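The daemon-thread behavior described above can be illustrated with a minimal Python sketch. Python is used here only for illustration, since the text does not name an implementation language, and the names `helper` and `start_helper` are hypothetical.

```python
import threading
import time


def helper():
    """Auxiliary work that never returns on its own."""
    while True:
        time.sleep(0.01)


def start_helper():
    """Start helper() as a daemon thread and return the thread object."""
    # daemon=True marks the thread as a daemon: it does not keep the
    # process alive once every ordinary (non-daemon) thread has finished.
    t = threading.Thread(target=helper, daemon=True, name="helper")
    t.start()
    return t
```

If a script calls `start_helper()` and its main thread then returns, the interpreter exits even though `helper` is still looping, because only non-daemon threads keep the process alive.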

A stable nucleus is the name for a centralized operating mode in which the daemon-thread function is built into the body of the running threads, so that the threads guard one another and all threads in the process thereby operate as a whole.

The stable nucleus differs from a traditional concurrent program in that it builds the work of the daemon thread into the ordinary worker threads. On the one hand this avoids the contradiction that the daemon thread itself runs unsupervised; on the other hand it reduces the delay caused by frequent switching between threads under high concurrency.

As shown in Figure 1, the stable nucleus consists of two parts: the mount point loop and the thread recovery area. A mount point is the run-marker object corresponding to each thread and contains information about the thread's operation. Operations such as checking a thread are all carried out through its mount point as an intermediary.

The structure of the mount point loop is shown in Figure 2, where 1, 2, 3, 4, 5, 6, 7 and 8 are the mount points corresponding to 8 threads. These mount points are logically linked in a chained predecessor-successor relationship, joined end to end to form a loop.

The thread recovery area, shown in Figure 3, can be formed from a chained queue of mount points, and it temporarily holds only the mount points of threads that have gone wrong. If the mount points of threads 1, 2 and 3 are placed into the recovery area, the recovery area has the structure shown in Figure 3. During normal operation, mount points frequently move between the loop and the thread recovery area; once a thread's fault is confirmed, the thread is finally shut down and reclaimed in the thread recovery area.
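The two structures described above, a circular chain of mount points and a queue-like recovery area, might be sketched as follows. This is only one possible representation: the class names `MountPoint` and `StableCore` and their methods are assumptions made for illustration and do not come from the patent.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class MountPoint:
    """Run-marker object for one thread (fields are illustrative)."""
    name: str
    task: str
    next: Optional["MountPoint"] = field(default=None, repr=False)


class StableCore:
    """Mount point loop plus thread recovery area (hypothetical layout)."""

    def __init__(self):
        self.head = None         # any mount point on the loop
        self.recovery = deque()  # mount points of faulty threads, FIFO

    def attach(self, mp):
        """Link a mount point in at the end of the loop."""
        if self.head is None:
            mp.next = mp
            self.head = mp
            return
        tail = self.head
        while tail.next is not self.head:
            tail = tail.next
        tail.next = mp
        mp.next = self.head

    def detach_next(self, mp):
        """Unlink the successor of mp and park it in the recovery area."""
        victim = mp.next
        if victim is mp:
            raise ValueError("a mount point cannot remove itself")
        mp.next = victim.next
        if self.head is victim:
            self.head = victim.next
        victim.next = None
        self.recovery.append(victim)
        return victim

    def ring_names(self):
        """Names of the mount points on the loop, starting at head."""
        names, node = [], self.head
        while node is not None:
            names.append(node.name)
            node = node.next
            if node is self.head:
                break
        return names
```

Attaching eight mount points named 1 to 8 and then calling `detach_next` three times on mount point 8 leaves 4 to 8 on the loop and 1, 2, 3 in the recovery area, which mirrors the situation described for Figures 2 and 3.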

While running, a multithreaded web crawler often falls into unsafe code because of changes in network information and frequent switching of crawling strategies, which leaves it in a "seemingly dead" (hung) state. Troubleshooting such cases is usually time-consuming and laborious, and the fixes do not generalize.

Summary of the invention

In view of this, the object of the present invention is to overcome problems in the prior art such as a multithreaded web crawler hanging ("seemingly dead") during operation because changes in network information and frequent switching of crawling strategies lead it into unsafe code. A thread organization method is proposed that uses a general-purpose stable nucleus technique to improve the operational stability of the web crawler; it can greatly improve the operational stability of web crawler threads.

To achieve this object, the present invention adopts the following technical solution.

A thread organization method, the method comprising the steps of:

A: on the mount point loop, a thread monitors the operation of its next thread and checks whether a fault exists; if a fault exists, the problem thread is removed from the mount point loop and sent to the thread recovery area, and it is determined whether this is a repeat (non-first) occurrence of the fault,

wherein, if the fault is a repeat occurrence, the task of the next thread is marked as a problem task, and the problem task is no longer executed,

and if the fault is occurring for the first time, the method proceeds directly to step B;

B: it is determined whether a global task needs to be executed,

if a recovery task exists, the recovery task is executed, all threads in the thread recovery area are reclaimed, and tasks not marked as problem tasks are added to the task revival list,

if no recovery task exists, the method proceeds to step C;

C: it is determined whether the task revival list contains a task that needs to be revived, and if so, a new thread is generated to execute that task.
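Steps A to C can be sketched as a small single-threaded simulation, which makes the bookkeeping easier to follow. This is a hedged illustration rather than the patented implementation: the names `Core`, `step_a`, `step_b`, `step_c` and the `is_faulty` callback are hypothetical, real threads are replaced by names, and the choice of which healthy thread performs the global recovery task is left out.

```python
from collections import deque


class Core:
    """Bookkeeping for the loop, the recovery area and task lists."""

    def __init__(self, ring):
        self.ring = list(ring)      # thread names in loop order
        self.task_of = {}           # thread name -> task id
        self.recovery = deque()     # faulty threads awaiting reclamation
        self.fail_count = {}        # task id -> number of observed faults
        self.problem_tasks = set()  # tasks that must not be retried
        self.revival = deque()      # tasks waiting to be revived


def step_a(core, me, is_faulty):
    """Step A: thread `me` checks its successor on the loop."""
    if len(core.ring) < 2:
        return
    nxt = core.ring[(core.ring.index(me) + 1) % len(core.ring)]
    if not is_faulty(nxt):
        return
    core.ring.remove(nxt)           # take the thread off the loop
    core.recovery.append(nxt)       # and park it in the recovery area
    task = core.task_of[nxt]
    core.fail_count[task] = core.fail_count.get(task, 0) + 1
    if core.fail_count[task] > 1:   # repeat fault: mark as problem task
        core.problem_tasks.add(task)


def step_b(core):
    """Step B: reclaim parked threads; queue unmarked tasks for revival."""
    while core.recovery:
        dead = core.recovery.popleft()
        task = core.task_of.pop(dead)
        if task not in core.problem_tasks:
            core.revival.append(task)


def step_c(core, new_name):
    """Step C: start one new thread for a task awaiting revival, if any."""
    if not core.revival:
        return None
    task = core.revival.popleft()
    core.ring.append(new_name)
    core.task_of[new_name] = task
    return new_name
```

In this sketch a task is retried once after its first fault and dropped after the second, which corresponds to the distinction between a first and a repeat occurrence in step A.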

In addition, the method further comprises, before step A:

A01: starting the multithreaded crawler system;

A02: initializing the connection between the multithreaded crawler and the database, and checking whether the database required for the crawler to run is valid;

A03: the system loading shared resources;

A04: the system generating multiple threads in a thread pool, each thread independently parsing out secondary entries (links to second-level pages);

A05: each thread sending the secondary entries it has obtained into a common filter, which selects the non-duplicate secondary entries;

A06: generating new threads in the thread pool to crawl and parse the secondary entries, and storing the results in the database.

A distributed consensus method is used to determine the global computation thread responsible for the recovery task of the next thread.

A common Bloom filter is used to select the non-duplicate secondary entries.
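To illustrate how a shared Bloom filter can select non-duplicate entries, here is a minimal Python sketch. The bit-array size, the number of hash functions, the use of SHA-256 with double hashing, and the names `BloomFilter` and `select_new` are all assumptions made for this example; the patent does not specify them. A filter shared between threads would additionally need a lock around `select_new`, which is omitted here.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over strings (parameters are illustrative)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


def select_new(seen, urls):
    """Return the URLs not yet recorded in `seen`, recording them as we go."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A Bloom filter never reports a stored URL as new, but it can occasionally report a new URL as already seen (a false positive), so a crawler using it may skip a small fraction of pages in exchange for very compact memory use.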

By adopting the thread organization method of the present invention, which uses a stable nucleus structure built on mutual communication between threads, a process can, by constructing a stable nucleus, automatically monitor the operation of its key worker threads, which greatly improves the operational stability of the worker threads. At the same time, using the standardized calling interface provided by the stable nucleus avoids the time and effort of separately designing a daemon thread for each key worker thread, improving software development efficiency.

Brief description of the drawings

Fig. 1 is a structural schematic diagram of the stable nucleus.

Fig. 2 is a structural schematic diagram of the mount point loop in the stable nucleus.

Fig. 3 is a structural schematic diagram of the thread recovery area in the stable nucleus.

Fig. 4 is a schematic flowchart of the thread organization method in a specific embodiment of the invention.

Fig. 5 is a schematic flowchart of the thread organization method in a specific embodiment of the invention.

Figs. 6a-6d are schematic diagrams of the mount point loop and the thread recovery area of the stable nucleus in a specific embodiment of the invention.

Detailed description

The present invention is described in detail below with reference to the accompanying drawings.

Detailed example embodiments are disclosed below. However, the specific structural and functional details disclosed herein are merely for the purpose of describing the example embodiments.

It should be understood, however, that the present invention is not limited to the specific example embodiments disclosed, but covers all modifications, equivalents and alternatives falling within the scope of the disclosure. Throughout the description of the drawings, identical reference numerals denote identical elements.

It should also be understood that the term "and/or" as used herein includes any and all combinations of one or more of the associated listed items. It should further be understood that when a component or unit is referred to as being "connected" or "coupled" to another component or unit, it can be directly connected or coupled to the other component or unit, or intervening components or units may be present. Other words used to describe the relationship between components or units should be interpreted in the same manner (for example, "between" versus "directly between", "adjacent" versus "directly adjacent", etc.).

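As shown in Figures 4-5, the invention discloses a thread organization method, the method comprising the following steps:

A: on the mount point loop, a thread monitors the operation of its next thread and checks whether a fault exists; if a fault exists, the problem thread is removed from the mount point loop and sent to the thread recovery area, and it is determined whether this is a repeat (non-first) occurrence of the fault,

wherein, if the fault is a repeat occurrence, the task of the next thread is marked as a problem task, and the problem task is no longer executed,

and if the fault is occurring for the first time, the method proceeds directly to step B;

B: it is determined whether a global task needs to be executed,

if a recovery task exists, the recovery task is executed, all threads in the thread recovery area are reclaimed, and tasks not marked as problem tasks are added to the task revival list,

if no recovery task exists, the method proceeds to step C;

C: it is determined whether the task revival list contains a task that needs to be revived, and if so, a new thread is generated to execute that task.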

Because the present invention uses a stable nucleus structure built on mutual communication between threads, a process can, by constructing a stable nucleus, automatically monitor the operation of its key worker threads, which greatly improves the operational stability of the worker threads. At the same time, using the standardized calling interface provided by the stable nucleus avoids the time and effort of separately designing a daemon thread for each key worker thread, improving software development efficiency.

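The method further comprises, before step A:

A01: starting the multithreaded crawler system;

A02: initializing the connection between the multithreaded crawler and the database, and checking whether the database required for the crawler to run is valid;

A03: the system loading shared resources;

A04: the system generating multiple threads in a thread pool, each thread independently parsing out secondary entries;

A05: each thread sending the secondary entries it has obtained into a common filter, which selects the non-duplicate secondary entries;

A06: generating new threads in the thread pool to crawl and parse the secondary entries, and storing the results in the database.

In particular, a distributed consensus method is used to determine the global computation thread responsible for the recovery task of the next thread.

In addition, a common Bloom filter is used to select the non-duplicate secondary entries.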

The technical effect of the present invention is illustrated below by a concrete example.

In this example, we set the entry page to the home page of the post list of a forum on Baidu Tieba:

http://tieba.baidu.com/f?kw=%E9%AD%85%E6%97%8F&ie=utf-8

This page is written into the database, each feature of its parsing template is also written into the database, and the page-information storage table for this page is generated. This completes the preparation before system start-up.

Step 1.1: start the system.

Step 1.2: the system checks the database; the database contains the entry page, the page parsing template and the page-information storage table, so the system starts successfully.

Step 1.3: the system initializes the common Bloom filter and generates the core framework of the stable nucleus, after which it begins generating and running crawler threads.

Step 1.4: a crawler thread A is generated from the thread pool; A loads the example entry page from the database and crawls it, and after fetching the page parses out four secondary page links:

Page α: http://tieba.baidu.com/p/3953810314

Page β: http://tieba.baidu.com/p/3969039572

Page γ: http://tieba.baidu.com/p/3969049452

Page δ: http://tieba.baidu.com/p/3969020668

Step 1.5: these four page links are sent into the common Bloom filter for filtering; the check finds that the three pages α, β and δ have not yet been crawled.

Step 1.6: three independent threads B, C and D are generated from the thread pool to crawl these three pages respectively, after which the crawl results are stored in the database.

In the example above, as shown in Figure 6a, in the final operation phase of the system four threads A, B, C and D run simultaneously, and the stable nucleus is made up of these four threads. The following describes how a fault occurring in this phase is handled:

Suppose that in this phase threads A, C and D all run normally, while thread B, when parsing the secondary page α assigned to it, encounters missing content in part of the page for which the parsing template has no corresponding handling. Thread B therefore falls into an infinite loop; it is now in an abnormal running state and can neither stop running nor send any signal to the outside.

At this moment thread A has just finished crawling its entry page. It enters step A, checks thread B, its successor on the loop, and finds that thread B is stuck in an infinite loop or blocked state. Thread A therefore marks thread B as dead, removes it from the loop of the stable nucleus and sends it to the recovery area. The stable nucleus is now in the state where the mount point loop contains the three threads A, C and D and the thread recovery area contains thread B, as shown in Figure 6b.
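The text does not say how thread A recognizes that thread B is stuck. One common approach, shown here purely as an assumption, is a heartbeat: each worker records a timestamp in its mount point whenever it makes progress, and the checking thread treats a stale timestamp as a fault. The names `MountPoint`, `beat` and `check_next`, and the timeout value, are hypothetical.

```python
import time


class MountPoint:
    """Mount point carrying a last-progress timestamp (assumed design)."""

    def __init__(self, name, clock=time.monotonic):
        self.name = name
        self._clock = clock
        self.last_beat = clock()
        self.dead = False

    def beat(self):
        """Called by the owning thread each time it completes a unit of work."""
        self.last_beat = self._clock()


def check_next(next_mp, timeout, clock=time.monotonic):
    """Mark the successor dead if it has not beaten within `timeout` seconds."""
    if clock() - next_mp.last_beat > timeout:
        next_mp.dead = True
    return next_mp.dead
```

A heartbeat of this kind only detects a hang if `beat` is called from outside the loop that can get stuck; a loop that keeps calling `beat` while making no real progress would go unnoticed, so the placement of the call matters.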

Because the thread recovery area now holds thread B awaiting reclamation, a global recovery task is created. This recovery task is relatively time-consuming, so its load must be distributed among the three healthy threads A, C and D. Suppose that over the following period threads A, C and D each run step B to determine whether it is their turn to perform the global computation. The computation-task allocation algorithm of the stable nucleus determines that threads A and C do not need to perform the global computation while thread D does, so the recovery task for thread B is carried out by thread D as the global computation.

Thread D enters step C and performs the recovery task. It finds that thread B in the thread recovery area is still running, so thread D forcibly shuts thread B down, causing thread B to release its resources, and deletes it from the system. In addition, during the reclamation check thread D finds that the secondary page α that thread B was to crawl has still not been crawled, and that this was the first execution of the task, so the error may have been caused by external factors; it therefore generates a new thread E to crawl secondary page α again. The stable nucleus is now in the state where the mount point loop contains the four threads A, C, D and E and the thread recovery area is empty, as shown in Figure 6c.

Because thread B's fault arose from the parsing template for secondary page α lacking a way to handle this particular missing-content problem, thread E, after downloading secondary page α, suffers the same fault as B. Thread E then repeats what happened to thread B and is sent to the thread recovery area. The difference is that when step A is performed on thread E, it is found that this is the second time execution on secondary page α has failed, so no further crawl task for secondary page α is generated. The stable nucleus is now in the state where the mount point loop contains the three threads A, C and D and the thread recovery area is empty, as shown in Figure 6d.

It should be noted that the above embodiment is merely a preferred embodiment of the present invention and must not be construed as limiting the scope of protection of the invention; any minor change or modification made to the present invention without departing from its concept falls within the scope of protection of the invention.

Claims

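1. A thread organization method, the method comprising the steps of:

A: on the mount point loop, a thread monitors the operation of its next thread and checks whether a fault exists; if a fault exists, the problem thread is removed from the mount point loop and sent to the thread recovery area, and it is determined whether this is a repeat (non-first) occurrence of the fault,

wherein, if the fault is a repeat occurrence, the task of the next thread is marked as a problem task, and the problem task is no longer executed,

and if the fault is occurring for the first time, the method proceeds directly to step B;

B: it is determined whether a global task needs to be executed,

if a recovery task exists, the recovery task is executed, all threads in the thread recovery area are reclaimed, and tasks not marked as problem tasks are added to the task revival list,

if no recovery task exists, the method proceeds to step C;

C: it is determined whether the task revival list contains a task that needs to be revived, and if so, a new thread is generated to execute that task.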

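2. The thread organization method according to claim 1, characterized in that the method further comprises, before step A:

A01: starting the multithreaded crawler system;

A02: initializing the connection between the multithreaded crawler and the database, and checking whether the database required for the crawler to run is valid;

A03: the system loading shared resources;

A04: the system generating multiple threads in a thread pool, each thread independently parsing out secondary entries;

A05: each thread sending the secondary entries it has obtained into a common filter, which selects the non-duplicate secondary entries;

A06: generating new threads in the thread pool to crawl and parse the secondary entries, and storing the results in the database.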

3. The thread organization method according to claim 1, characterized in that a distributed consensus method is used to determine the global computation thread responsible for the recovery task of the next thread.

4. The thread organization method according to claim 2, characterized in that a common Bloom filter is used to select the non-duplicate secondary entries.