CN105302527A - Thread organization method - Google Patents

Publication number
CN105302527A
Authority
CN
China
Prior art keywords
thread, task, recovery, fault, threads
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510716958.5A
Other languages
Chinese (zh)
Other versions
CN105302527B (en)
Inventor
马应龙
高延太
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University
Priority to CN201510716958.5A
Publication of CN105302527A
Application granted
Publication of CN105302527B
Legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a thread organization method. The method comprises the following steps. A: a thread on the mount point loop monitors the operation of its next thread and checks whether it has a fault; if a fault exists, the faulty thread is removed from the mount point loop and sent to the thread recovery area, and it is determined whether this is the first occurrence of the fault; if it is not the first occurrence, the task of the next thread is marked as a problem task and is no longer executed; if it is the first occurrence, the method proceeds directly to step B. B: it is determined whether a global task needs to be executed; if a recovery task exists, the recovery task is executed to reclaim all threads in the thread recovery area, and every task without a problem-task mark is added to the resurrection task list; if no recovery task exists, the method proceeds to step C. C: it is determined whether the resurrection task list contains a task that needs to be resurrected, and if so, a new thread is generated to execute that task. The thread organization method can improve the operational stability of web crawler threads.

Description

Thread organization method
Technical field
The present invention relates to the field of computer resource scheduling, and in particular to applied network computing technology.
Background
A web crawler is a program or script that automatically fetches web information according to certain rules. Depending on the designer's purpose, a web crawler may be implemented in various forms. Implementations commonly use multithreading, and the thread logic often contains complex loops; for both reasons, crawler design needs to manage the running state of each individual thread.
A thread, in turn, is the smallest unit of a program's execution flow and the basic unit that the system schedules and dispatches independently. A thread owns no system resources of its own apart from the few that are indispensable while it runs, but it shares all the resources of its process with the other threads of that process.
An important component of multithreading is the daemon thread. A daemon thread differs from an ordinary thread in that once all ordinary threads in a process have finished, the process terminates regardless of whether any daemon thread is still running. Daemon threads are usually used to run auxiliary functions for ordinary threads.
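The daemon-thread behaviour described above can be observed with Python's standard `threading` module. The following is only a minimal sketch and not part of the patent; the helper function and variable names are illustrative assumptions.

```python
import threading
import time

def background_helper(stop_event):
    # Auxiliary loop: it never finishes on its own.
    while not stop_event.is_set():
        time.sleep(0.01)

stop = threading.Event()
# daemon=True means this thread does not keep the process alive.
helper = threading.Thread(target=background_helper, args=(stop,), daemon=True)
worker = threading.Thread(target=lambda: time.sleep(0.05))

helper.start()
worker.start()
worker.join()                              # the ordinary worker has finished

helper_is_daemon = helper.daemon
helper_still_running = helper.is_alive()   # the daemon did not stop by itself

stop.set()                                 # tidy shutdown, for the demonstration only
helper.join()
```

If the script simply ended after `worker.join()`, the interpreter would exit without waiting for `helper`, which is the property the paragraph describes. It also shows the weakness the stable nucleus addresses: nothing supervises the daemon thread itself.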
A stable nucleus builds the guarding function of a daemon thread into the body of each running thread, so that the threads guard one another and all threads in the process together form a centralized mode of operation.
The stable nucleus differs from a traditional concurrent program in that the work of the daemon thread is built into the ordinary worker threads. On the one hand this removes the problem that the daemon thread itself runs unsupervised; on the other hand it reduces the delay caused by frequent switching between threads under high concurrency.
As shown in Fig. 1, the stable nucleus consists of two parts: the mount point loop and the thread recovery area. A mount point is the run-marker object corresponding to a thread and holds information about that thread's operation. Operations on a thread, such as inspection, are all carried out through its mount point as intermediary.
The structure of the mount point loop is shown in Fig. 2, where 1, 2, 3, 4, 5, 6, 7 and 8 are the mount points of eight threads. The mount points are logically linked in a chain, each to the next, and the chain is closed to form a loop.
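One way to picture the mount point loop is as a circular singly linked list. The sketch below is an illustration under that assumption; the patent does not prescribe a data structure, and the class and function names are hypothetical.

```python
class MountPoint:
    """Run-marker object for one thread (hypothetical minimal form)."""

    def __init__(self, thread_name):
        self.thread_name = thread_name
        self.next = None          # the mount point this one precedes

def build_loop(names):
    """Link mount points in order and close the chain into a ring."""
    points = [MountPoint(name) for name in names]
    for i, point in enumerate(points):
        point.next = points[(i + 1) % len(points)]
    return points[0]

def walk(start):
    """Return thread names in ring order, beginning at `start`."""
    names, node = [], start
    while True:
        names.append(node.thread_name)
        node = node.next
        if node is start:
            return names

head = build_loop([str(i) for i in range(1, 9)])   # mount points 1 to 8
order = walk(head)
```

Because the chain is closed, every mount point has exactly one successor, so every thread has exactly one neighbour to monitor and is itself monitored by exactly one neighbour.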
The thread recovery area, shown in Fig. 3, can be implemented as a chained queue of mount points that temporarily holds only the mount points of threads that have gone wrong. If the mount points of threads 1, 2 and 3 are placed in the recovery area, its structure is as shown in Fig. 3. During normal operation, mount points frequently move between the mount point loop and the thread recovery area; once a thread's fault is confirmed, the thread is finally shut down and reclaimed in the thread recovery area.
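The movement of mount points from the loop into the recovery area can be sketched as follows. This is an illustrative model only: the loop is held as a plain list read in ring order, the recovery area as a `collections.deque`, and the function names are assumptions.

```python
from collections import deque

loop = ["1", "2", "3", "4", "5", "6", "7", "8"]   # mount point loop, in ring order
recovery_area = deque()                            # chained queue of faulty mount points

def send_to_recovery(name):
    """Unhook a mount point from the loop and queue it in the recovery area."""
    loop.remove(name)
    recovery_area.append(name)

def next_in_loop(name):
    """Successor of `name` in the ring; the last entry wraps to the first."""
    return loop[(loop.index(name) + 1) % len(loop)]

# Reproduce the situation described for Fig. 3: threads 1, 2 and 3 have gone wrong.
for faulty in ("1", "2", "3"):
    send_to_recovery(faulty)
```

After the three moves the loop closes over the remaining mount points, so thread 8 now monitors thread 4.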
While running, a multi-threaded web crawler often falls into unsafe code and appears to hang (a "seemingly dead" state), because network information changes and crawling strategies switch frequently. Diagnosing such cases is time-consuming and laborious, and the diagnosis does not generalize.
Summary of the invention
In view of this, the object of the invention is to overcome the problem in the prior art that a multi-threaded web crawler, while running, often falls into unsafe code and appears to hang because network information changes and crawling strategies switch frequently. A thread organization method is proposed that uses a general-purpose stable nucleus technique to greatly improve the operational stability of web crawler threads.
To achieve this object, the invention adopts the following technical solution.
A thread organization method, the method comprising the steps of:
A: on the mount point loop, a thread monitors the operation of its next thread and checks whether it has a fault; if a fault exists, the faulty thread is removed from the mount point loop and sent to the thread recovery area, and it is determined whether this is the first occurrence of the fault,
wherein if the fault is not occurring for the first time, the task of said next thread is marked as a problem task, and the problem task is no longer executed,
and if the fault is occurring for the first time, the method proceeds directly to step B;
B: it is determined whether a global task needs to be executed,
if a recovery task exists, the recovery task is executed, all threads in the thread recovery area are reclaimed, and every task without a problem-task mark is added to the resurrection task list,
if no recovery task exists, the method proceeds to step C;
C: it is determined whether the resurrection task list contains a task that needs to be resurrected, and if so, a new thread is generated to execute that task.
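The control flow of steps A to C can be modelled without real threads. The sketch below is a single-threaded simulation under stated assumptions: the class name, the method names, the `is_faulty` callback and the `new-N` thread names are all hypothetical, and fault detection is supplied by the caller rather than implemented.

```python
from collections import deque

class StableNucleus:
    """Single-threaded model of steps A to C; every name here is illustrative."""

    def __init__(self, assignments):
        self.loop = list(assignments)      # [(thread, task), ...] in ring order
        self.recovery_area = deque()       # faulty (thread, task) pairs
        self.fault_count = {}              # task -> number of observed faults
        self.problem_tasks = set()         # tasks that must not run again
        self.resurrection_list = deque()   # tasks waiting for a new thread
        self._spawned = 0

    def step_a(self, monitor, is_faulty):
        """The monitor checks its successor and unhooks it if it is faulty."""
        names = [thread for thread, _ in self.loop]
        i = names.index(monitor)
        thread, task = self.loop[(i + 1) % len(self.loop)]
        if thread == monitor or not is_faulty(thread):
            return
        self.loop.remove((thread, task))
        self.recovery_area.append((thread, task))
        self.fault_count[task] = self.fault_count.get(task, 0) + 1
        if self.fault_count[task] > 1:     # not the first fault for this task
            self.problem_tasks.add(task)

    def step_b(self):
        """Recovery task: reclaim every thread held in the recovery area."""
        while self.recovery_area:
            _thread, task = self.recovery_area.popleft()
            if task not in self.problem_tasks:
                self.resurrection_list.append(task)

    def step_c(self):
        """Generate a new thread for each task waiting to be resurrected."""
        while self.resurrection_list:
            task = self.resurrection_list.popleft()
            self._spawned += 1
            self.loop.append((f"new-{self._spawned}", task))

nucleus = StableNucleus([("A", "entry"), ("B", "alpha"), ("C", "beta"), ("D", "delta")])

# First fault on task "alpha": the task is resurrected on a new thread.
nucleus.step_a("A", lambda thread: thread == "B")
nucleus.step_b()
nucleus.step_c()
after_first = [thread for thread, _ in nucleus.loop]

# Second fault on the same task: it is marked as a problem task and dropped.
nucleus.step_a("D", lambda thread: thread == "new-1")
nucleus.step_b()
nucleus.step_c()
after_second = [thread for thread, _ in nucleus.loop]
```

The per-task fault counter is what separates a first failure, which may be transient and is retried once, from a repeated failure, which is treated as a defect of the task itself.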
In addition, before step A the method further comprises:
A01: starting the multi-threaded crawler system;
A02: initializing the connection between the multi-threaded crawler and the database, and checking whether the database required by the crawler is valid;
A03: the system loads shared resources;
A04: the system generates multiple threads in the thread pool, and each thread independently parses out secondary entry links;
A05: each thread sends the secondary entry links it has obtained to a public filter, which selects the links that are not duplicates;
A06: new threads are generated in the thread pool to crawl and parse the secondary entry links, and the results are stored in the database.
A distributed consensus method is used to determine the global computation thread responsible for the recovery task of said next thread.
A public Bloom filter is used to select the secondary entry links that are not duplicates.
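A Bloom filter that removes duplicate links can be sketched in a few lines. The patent does not give the filter's parameters or hash functions, so the bit-array size, the number of hashes and the use of salted SHA-256 digests below are assumptions, as are all names.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; size and hash scheme are illustrative choices."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def select_new(links, seen):
    """Return the links not yet recorded in `seen`, recording them on the way."""
    fresh = []
    for link in links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh

seen = BloomFilter()
first_batch = select_new(["link-1", "link-2", "link-1"], seen)
second_batch = select_new(["link-2", "link-3"], seen)
```

A Bloom filter never reports a recorded link as new, but it can occasionally report a new link as already seen (a false positive), so a crawler using it may skip a small fraction of pages in exchange for a fixed, small memory footprint. When several threads share one filter, calls to `select_new` would also need a lock.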
With the thread organization method of the invention, which uses a stable nucleus structure built on communication between threads, a process can construct a stable nucleus to monitor automatically the operation of its key worker threads, which greatly improves the operational stability of the worker threads. Using the standardized calling interface provided by the stable nucleus also saves the time and effort of designing a separate daemon thread for each key worker thread, which improves software development efficiency.
Brief description of the drawings
Fig. 1 is a structural diagram of the stable nucleus.
Fig. 2 is a structural diagram of the mount point loop in the stable nucleus.
Fig. 3 is a structural diagram of the thread recovery area in the stable nucleus.
Fig. 4 is a flow chart of the thread organization method in a specific embodiment of the invention.
Fig. 5 is a flow chart of the thread organization method in a specific embodiment of the invention.
Figs. 6a-6d are schematic diagrams of the mount point loop and the thread recovery area of the stable nucleus in a specific embodiment of the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to the accompanying drawings.
Detailed example embodiments are disclosed below. The specific structural and functional details disclosed herein, however, serve only to describe the example embodiments.
It should be understood that the invention is not limited to the specific example embodiments disclosed, but covers all modifications, equivalents and alternatives falling within the scope of this disclosure. Throughout the description of the drawings, identical reference numerals denote identical elements.
It should also be understood that the term "and/or" as used herein includes any and all combinations of one or more of the associated listed items. Further, when a component or unit is referred to as being "connected" or "coupled" to another component or unit, it may be directly connected or coupled to that component or unit, or intermediate components or units may be present. Other words used to describe the relationship between components or units should be interpreted in the same way (for example, "between" versus "directly between", "adjacent" versus "directly adjacent", and so on).
As shown in Figs. 4-5, the invention discloses a thread organization method comprising the following steps:
A: on the mount point loop, a thread monitors the operation of its next thread and checks whether it has a fault; if a fault exists, the faulty thread is removed from the mount point loop and sent to the thread recovery area, and it is determined whether this is the first occurrence of the fault,
wherein if the fault is not occurring for the first time, the task of said next thread is marked as a problem task, and the problem task is no longer executed,
and if the fault is occurring for the first time, the method proceeds directly to step B;
B: it is determined whether a global task needs to be executed,
if a recovery task exists, the recovery task is executed, all threads in the thread recovery area are reclaimed, and every task without a problem-task mark is added to the resurrection task list,
if no recovery task exists, the method proceeds to step C;
C: it is determined whether the resurrection task list contains a task that needs to be resurrected, and if so, a new thread is generated to execute that task.
Because the invention uses a stable nucleus structure built on communication between threads, a process can construct a stable nucleus to monitor automatically the operation of its key worker threads, which greatly improves the operational stability of the worker threads. Using the standardized calling interface provided by the stable nucleus also saves the time and effort of designing a separate daemon thread for each key worker thread, which improves software development efficiency.
Before step A, the method further comprises:
A01: starting the multi-threaded crawler system;
A02: initializing the connection between the multi-threaded crawler and the database, and checking whether the database required by the crawler is valid;
A03: the system loads shared resources;
A04: the system generates multiple threads in the thread pool, and each thread independently parses out secondary entry links;
A05: each thread sends the secondary entry links it has obtained to a public filter, which selects the links that are not duplicates;
A06: new threads are generated in the thread pool to crawl and parse the secondary entry links, and the results are stored in the database.
In particular, a distributed consensus method is used to determine the global computation thread responsible for the recovery task of said next thread.
In addition, a public Bloom filter is used to select the secondary entry links that are not duplicates.
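Steps A04 to A06 describe a parse, filter and crawl pipeline on a thread pool. The sketch below illustrates that shape with `concurrent.futures`; it is not the patent's implementation. The in-memory `PAGES` table stands in for real downloads, a lock-protected `set` stands in for the public Bloom filter, a dictionary stands in for the database, and every name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Hypothetical in-memory "web": entry page -> links found on it.
PAGES = {
    "entry-1": ["p1", "p2"],
    "entry-2": ["p2", "p3"],
}

seen = set()                     # stand-in for the public filter
seen_lock = threading.Lock()
database = {}                    # stand-in for the result database
db_lock = threading.Lock()

def parse_entry(page):
    """A04: parse an entry page into its secondary links."""
    return PAGES.get(page, [])

def filter_new(links):
    """A05: keep only the links that no thread has submitted before."""
    fresh = []
    with seen_lock:
        for link in links:
            if link not in seen:
                seen.add(link)
                fresh.append(link)
    return fresh

def parse_and_filter(page):
    return filter_new(parse_entry(page))

def crawl_and_store(link):
    """A06: crawl and parse one secondary link, then store the result."""
    result = f"content of {link}"    # placeholder for a real download
    with db_lock:
        database[link] = result

with ThreadPoolExecutor(max_workers=4) as pool:
    fresh_links = [link
                   for links in pool.map(parse_and_filter, ["entry-1", "entry-2"])
                   for link in links]
    list(pool.map(crawl_and_store, fresh_links))   # wait for all crawls to finish
```

The link `p2` appears on both entry pages but is crawled only once, because whichever worker reaches the filter second finds it already recorded.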
The technical effect of the invention is illustrated below with a concrete example.
In this example, the entry page is set to the first page of the post list of one forum on Baidu Tieba:
http://tieba.baidu.com/f?kw=%E9%AD%85%E6%97%8F&ie=utf-8
This page is written to the database, the features of its parsing template are also written to the database, and a page information storage table is generated for the page. This completes the preparation before system startup.
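The query string of the entry URL above is percent-encoded UTF-8. As a small aside, the standard library's `urllib.parse` can decode it; the variable names below are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

entry_url = "http://tieba.baidu.com/f?kw=%E9%AD%85%E6%97%8F&ie=utf-8"

parts = urlsplit(entry_url)
query = parse_qs(parts.query, encoding="utf-8")

forum_keyword = query["kw"][0]       # the forum name, decoded from percent-encoding
declared_encoding = query["ie"][0]   # the encoding the URL declares for its input
```

Here `forum_keyword` is the two-character forum name 魅族 and `declared_encoding` is `utf-8`.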
Step 1.1: start the system.
Step 1.2: the system checks the database; the database contains the entry page, the page parsing template and the page information storage table, so the system starts successfully.
Step 1.3: the system initializes the public Bloom filter and generates the core framework of the stable nucleus, and then begins generating and running crawler threads.
Step 1.4: a crawler thread A is generated from the thread pool. A loads the example entry page from the database and crawls it; after fetching the page it parses out four secondary page links:
Page α: http://tieba.baidu.com/p/3953810314
Page β: http://tieba.baidu.com/p/3969039572
Page γ: http://tieba.baidu.com/p/3969049452
Page δ: http://tieba.baidu.com/p/3969020668
Step 1.5: the four page links are sent to the public Bloom filter for filtering; the check finds that three pages, α, β and δ, have not yet been crawled.
Step 1.6: three independent threads B, C and D are generated from the thread pool to crawl these three pages respectively, and the crawl results are then stored in the database.
In the example above, as shown in Fig. 6a, four threads A, B, C and D run simultaneously in the final running stage of the system, and the stable nucleus is then made up of these four threads. The following describes how a fault occurring at this stage is handled.
Suppose that at this stage threads A, C and D all run normally, while thread B, parsing the secondary page α assigned to it, encounters missing page content for which the parsing template has no handling. As a result thread B falls into an infinite loop: it is in an abnormal running state, cannot stop, and cannot send any signal outward.
At this moment thread A has just finished crawling its entry page. It enters step A, checks its next thread B, and finds that B is stuck in an infinite loop or blocked. Thread A therefore marks thread B as dead, removes it from the mount point loop and sends it to the recovery area. The stable nucleus state is now: the mount point loop contains the three threads A, C and D, and the thread recovery area contains thread B, as shown in Fig. 6b.
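The text does not say how thread A recognizes that thread B is stuck. One common technique, shown here purely as an assumption, is a heartbeat: each thread stamps its mount point whenever it makes progress, and the monitoring thread treats a stale stamp as a fault. The class, the function and the timeout value are all hypothetical.

```python
import time

class MountPoint:
    """Run-marker object carrying a heartbeat timestamp (illustrative)."""

    def __init__(self, thread_name):
        self.thread_name = thread_name
        self.last_beat = time.monotonic()
        self.dead = False

    def beat(self):
        """Called by the owning thread whenever it makes progress."""
        self.last_beat = time.monotonic()

def check_next(loop, recovery_area, monitor_index, timeout, now=None):
    """Step A for one monitor: mark and unhook a successor whose heartbeat is stale."""
    now = time.monotonic() if now is None else now
    target = loop[(monitor_index + 1) % len(loop)]
    if target is loop[monitor_index]:
        return None                    # a lone thread has no neighbour to check
    if now - target.last_beat <= timeout:
        return None                    # the successor is still making progress
    target.dead = True                 # the "dead" mark
    loop.remove(target)
    recovery_area.append(target)
    return target

loop = [MountPoint(name) for name in "ABCD"]
recovery_area = []
now = time.monotonic()
for mount_point in loop:
    mount_point.last_beat = now
loop[1].last_beat = now - 10.0         # B has been silent for ten seconds

removed = check_next(loop, recovery_area, monitor_index=0, timeout=5.0, now=now)
```

A heartbeat works even when the faulty thread cannot send any signal, because the monitor only needs the absence of updates. The timeout must be longer than the slowest legitimate step, otherwise healthy threads would be reclaimed.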
Because the thread recovery area now holds thread B, which needs to be reclaimed, a global recovery task is created. The recovery task is relatively time-consuming, so its load is shared among the three healthy threads A, C and D. Suppose that during the following period threads A, C and D each run step B and determine whether it is their turn to perform the global computation. The computation task distribution algorithm of the stable nucleus decides that threads A and C need not perform the global computation and that thread D must, so the recovery task for thread B is carried out by thread D as a global computation.
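The distribution algorithm that picks thread D is not described. As a deliberately simplified stand-in, and not a real distributed consensus protocol, the sketch below lets the healthy threads race for a non-blocking lock so that exactly one of them takes on the global computation in a given round. All names are hypothetical.

```python
import threading

election = threading.Lock()
winners = []
winners_lock = threading.Lock()
start = threading.Barrier(3)

def try_become_global(name):
    start.wait()                               # let all candidates race together
    if election.acquire(blocking=False):       # at most one caller can succeed
        with winners_lock:
            winners.append(name)
        # The winner would run the recovery task here. The election lock is
        # deliberately kept held until the round is over.

threads = [threading.Thread(target=try_become_global, args=(name,))
           for name in ("A", "C", "D")]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

election.release()                             # end of round; a new election may start
```

Which thread wins is not deterministic in this sketch; the only guarantee is that the other two return immediately and carry on with their own crawl tasks instead of blocking.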
Thread D enters step C and executes the recovery task. It finds that thread B in the thread recovery area is still running, so thread D forcibly shuts down thread B, which makes thread B release its resources, and deletes it from the system. During the recovery check, thread D also finds that the secondary page α that thread B was to crawl has still not been crawled and that this was the first execution of the task, so the error may have been caused by external factors; a new thread E is therefore generated to crawl secondary page α again. The stable nucleus state is now: the mount point loop contains the four threads A, C, D and E, and the thread recovery area is empty, as shown in Fig. 6c.
Thread B's fault arose because the parsing template for secondary page α lacks handling for the special case of missing content, so after downloading secondary page α thread E hits the same missing content and suffers the same fault as thread B. Thread E then goes through the same sequence as thread B and is sent to the thread recovery area. The difference is that when step A is performed on thread E, it is found that this is the second failed execution for secondary page α, so no further crawl task for secondary page α is generated. The stable nucleus state is now: the mount point loop contains the three threads A, C and D, and the thread recovery area is empty, as shown in Fig. 6d.
It should be noted that the embodiments above are merely preferred embodiments of the invention and shall not be construed as limiting its scope of protection. Any minor change or modification made to the invention without departing from its concept falls within the scope of protection of the invention.

Claims (4)

1. A thread organization method, the method comprising the steps of:
A: on the mount point loop, a thread monitors the operation of its next thread and checks whether it has a fault; if a fault exists, the faulty thread is removed from the mount point loop and sent to the thread recovery area, and it is determined whether this is the first occurrence of the fault,
wherein if the fault is not occurring for the first time, the task of said next thread is marked as a problem task, and the problem task is no longer executed,
and if the fault is occurring for the first time, the method proceeds directly to step B;
B: it is determined whether a global task needs to be executed,
if a recovery task exists, the recovery task is executed, all threads in the thread recovery area are reclaimed, and every task without a problem-task mark is added to the resurrection task list,
if no recovery task exists, the method proceeds to step C;
C: it is determined whether the resurrection task list contains a task that needs to be resurrected, and if so, a new thread is generated to execute that task.
2. The thread organization method according to claim 1, characterized in that before step A the method further comprises:
A01: starting the multi-threaded crawler system;
A02: initializing the connection between the multi-threaded crawler and the database, and checking whether the database required by the crawler is valid;
A03: the system loads shared resources;
A04: the system generates multiple threads in the thread pool, and each thread independently parses out secondary entry links;
A05: each thread sends the secondary entry links it has obtained to a public filter, which selects the links that are not duplicates;
A06: new threads are generated in the thread pool to crawl and parse the secondary entry links, and the results are stored in the database.
3. The thread organization method according to claim 1, characterized in that a distributed consensus method is used to determine the global computation thread responsible for the recovery task of said next thread.
4. The thread organization method according to claim 2, characterized in that a public Bloom filter is used to select the secondary entry links that are not duplicates.
CN201510716958.5A 2015-10-29 2015-10-29 Thread method for organizing Expired - Fee Related CN105302527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510716958.5A CN105302527B (en) 2015-10-29 2015-10-29 Thread method for organizing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510716958.5A CN105302527B (en) 2015-10-29 2015-10-29 Thread method for organizing

Publications (2)

Publication Number Publication Date
CN105302527A true CN105302527A (en) 2016-02-03
CN105302527B CN105302527B (en) 2018-01-19

Family

ID=55199831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510716958.5A Expired - Fee Related CN105302527B (en) 2015-10-29 2015-10-29 Thread method for organizing

Country Status (1)

Country Link
CN (1) CN105302527B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108096837A (en) * 2016-11-25 2018-06-01 盛趣信息技术(上海)有限公司 Game robot dynamic identifying method
CN109542632A (en) * 2018-11-30 2019-03-29 郑州云海信息技术有限公司 A kind of method and device handling access request
CN114546980A (en) * 2022-04-25 2022-05-27 成都云祺科技有限公司 Backup method, system and storage medium of NAS file system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030014471A1 (en) * 2001-07-12 2003-01-16 Nec Corporation Multi-thread execution method and parallel processor system
CN101446953A (en) * 2008-11-25 2009-06-03 北京邮电大学 Parallel associated notice board crawler system
CN102375758A (en) * 2010-08-20 2012-03-14 联芯科技有限公司 Method and device for preventing apparent death of browser of mobile communication equipment
US20130179730A1 (en) * 2012-01-09 2013-07-11 Samsung Electronics Co., Ltd. Apparatus and method for fault recovery
US20130332941A1 (en) * 2012-06-08 2013-12-12 Apple Inc. Adaptive Process Importance
CN103902386A (en) * 2014-04-11 2014-07-02 复旦大学 Multi-thread network crawler processing method based on connection proxy optimal management
CN103902452A (en) * 2014-04-01 2014-07-02 浙江大学 Self-repair algorithm for software multi-point faults

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DANIEL D等: "Blackboard and Multi-Agent Systems & the Future", 《COLLABORATING SOFTWARE》 *
张超等: "多线程网络爬虫的设计与实现", 《电脑开发与应用》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108096837A (en) * 2016-11-25 2018-06-01 盛趣信息技术(上海)有限公司 Game robot dynamic identifying method
CN109542632A (en) * 2018-11-30 2019-03-29 郑州云海信息技术有限公司 A kind of method and device handling access request
CN114546980A (en) * 2022-04-25 2022-05-27 成都云祺科技有限公司 Backup method, system and storage medium of NAS file system
CN114546980B (en) * 2022-04-25 2022-07-08 成都云祺科技有限公司 Backup method, system and storage medium of NAS file system

Also Published As

Publication number Publication date
CN105302527B (en) 2018-01-19

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180119

Termination date: 20181029

CF01 Termination of patent right due to non-payment of annual fee