CN1791034A

CN1791034A - Detecting method

Info

Publication number: CN1791034A
Application number: CN 200410104063
Authority: CN
Inventors: 张磊; 龚华; 钱剑
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2004-12-13
Filing date: 2004-12-13
Publication date: 2006-06-21
Anticipated expiration: 2024-12-13
Also published as: CN100359865C

Abstract

The invention relates to a detection method. A detection field containing a detection return code and a detection result message is defined in the HTTP protocol. The detecting unit sends the detected unit a detection processing request that carries the detection field and a processing policy; the detected unit performs the operations specified by the policy, fills the result into the detection field, and returns it to the detecting unit; the detecting unit then judges the health of the detected unit from the field content and acts accordingly. The method can probe the internal running state of a server more flexibly and more deeply, and can notify the system administrator or restart a failed system.

Description

A detection method

Technical field

The present invention relates to the field of communications, and in particular to a detection method based on a user-defined health check protocol.

Technical background

With the rapid development of the Internet, more and more websites have appeared, and every industry places ever higher demands on the reliability of its site servers. The most pressing demand is that a website run stably without going down and, if an abnormal incident does occur, recover in the shortest possible time so that users do not notice the disruption.

In practice, the stability of a website's servers depends on many components: the operating system itself, the application server container, basic service components, and external communication equipment (such as database servers, short message gateways, and short message service centers). After running for a long time the server inevitably tends to become less stable. Most commonly, resources such as CPU and memory can no longer be scheduled, and the system hangs, the server stops responding, or the site goes down.

Given this reality, if we set aside the influence of the operating system and of external equipment and look only at the server application itself, the existing techniques for ensuring system stability fall roughly into the following categories:

1. Choose a commercially mature application server container, such as IBM WebSphere or BEA WebLogic.

2. Use performance testing to make sure the application itself is free of common problems such as memory overflow.

3. Adopt a multi-node load balancing scheme. A load balancing device distributes traffic across several servers attached behind it. The device is also responsible for health checks on the servers: when it finds that one server has failed, it moves user traffic to the other healthy servers, so that service is uninterrupted and the failure is transparent to users.

4. Staff the server room, so that an operator restarts the system by hand as soon as it is found to be down.

At present the most mature and widely used technique is to rely on a load balancing device to share traffic among several servers; when one server fails, its traffic is moved to the others, and the failure is hidden from users.

The main load balancing device vendors include Cisco, Nortel (Alteon), Foundry, Extreme, F5, and Huawei of China.

Fig. 1 shows the network layout of a typical load balancing scheme. As Fig. 1 shows, the load balancing device sits in front of the servers and, following some load balancing algorithm, distributes user traffic sensibly across servers that hold identical content. Users are unaware that there are multiple nodes; to them it is as if they were visiting a single server.

The load balancing device simulates client access and periodically sends health check messages to the servers to check their running state. As soon as it finds that a server is not responding or is refusing service, it moves user traffic to the other, normal servers, which keeps the website running reliably and without interruption.

This technique, however, has an obvious drawback: a load balancing scheme is relatively expensive.

The diagram above shows a layout with a single load balancing node, so the load balancing device is itself a single point of failure; a more reliable layout needs two load balancing devices that back each other up.

A single load balancing device generally costs in the hundreds of thousands. Adding the cost of the multiple load-balanced servers makes the scheme expensive, so it is mostly used by commercial websites with very high reliability requirements whose service must not be interrupted.

In addition, a load balancing scheme cannot restart a server automatically; it can only move user traffic to other healthy servers. Once all servers have failed, the whole website is down and cannot recover.

The prior art also includes health checking over the Hypertext Transfer Protocol (HTTP), which is what most load balancing devices implement, but it judges server health only from the HTTP response status code. For example, HTTP specifies that a status code of 200 means the request succeeded, and a status code of 400 or above indicates an error (4xx for client errors, 5xx for server errors). In practice, however, the status code alone cannot determine server health accurately. For example, when the server hits a processing exception and shows the user an error page, the status code returned may still be 200, even though the website has in fact failed to serve the user. In such a case the status code by itself cannot give a complete picture of whether the server is working normally.

Summary of the invention

The object of the present invention is to provide a method that performs periodic health checks on a detected unit and effectively improves its reliability and its ability to recover from failure. To this end, the invention adopts the following technical scheme:

A detection method, characterized in that a detection field is defined in a protocol, and detection comprises the following steps:

Request sending step: the detecting unit sends a detection processing request to the detected unit; the request carries the detection field and a processing policy;

Detection step: the detected unit performs the corresponding operations according to the processing policy;

Response step: the detected unit fills the processing result into the detection field and sends it to the detecting unit;

Result analysis step: the detecting unit judges the health of the detected unit from the content of the detection field and acts accordingly.

The field content comprises a detection return code and a detection result message.

The detection return code takes one of the values detection succeeded, detection alarm, or detection failed.

In the method, the detection is performed periodically.

The processing policy includes requesting memory, releasing memory, write operations, read operations, and the like.

The request sending step further comprises:

reading the initial configuration parameters from a configuration file and setting up the detection processing request.

The configuration file also contains the script command for restarting the server, so that the detected unit can be restarted.

The result analysis step further comprises selecting, according to the health of the detected server, an operation such as treating the system as normal, raising an alarm, or restarting the system.

The method further comprises a step of generating a log file from the result of the operation.

The protocol is the Hypertext Transfer Protocol (HTTP).

On top of the health check techniques that vendors currently provide, the present invention brings the following additional benefits:

1. It provides an open, user-facing health check interface. Following the protocol defined by the invention, a website administrator can define, according to the characteristics of the system, the criteria for health check success and failure and the result information returned in different situations. The internal running state of the server system can thus be probed more flexibly and more deeply, and depending on the state found, the administrator is notified promptly or the system is restarted immediately once it is down.

2. The invention provides a complete design framework and process for realizing the above functions, which has been exercised in extensive simulated online operation and can ensure that the system runs stably and reliably.

3. The automatic restart function provided by the invention supplies the failure recovery that a load balancing scheme cannot, and can quickly bring a server that has gone down back to normal operation.

Description of drawings

Fig. 1 is a load balancing scheme of the prior art;

Fig. 2 is a flow chart of the present invention;

Fig. 3 is a schematic diagram of an embodiment of the invention;

Fig. 4 is a diagram of the relationships among the entities of the executable file in the embodiment;

Fig. 5 is a control flow diagram of the main run entity in the embodiment;

Fig. 6 is a flow chart of the detection thread in the embodiment.

Embodiment

Specific embodiments of the invention are described below with reference to the accompanying drawings. In this embodiment the detected unit is a server; its health is checked by means of the user-defined health check protocol, and corresponding operations are carried out according to the result.

This embodiment defines a detection field in the HTTP protocol. The field content comprises a detection return code and a detection result message; Table 1 gives the details of the field.

Table 1: HTTP header field

Item	Value	Meaning
HTTP header field name	Health-Check	
Field content format	ResultCode=[detection return code]; ResultInfo=[detection result message]	
ResultCode values	0	Detection succeeded
	1	Detection alarm
	Other	Detection failed
ResultInfo values	Null	No detection result message
	Arbitrary string	Detection result message

As Table 1 shows, we define a header field dedicated to health checking, "Health-Check", among the HTTP header fields to carry the result of a health check.

The content of this field is defined as: ResultCode=[detection return code]; ResultInfo=[detection result message]. It consists of two parts, the detection return code and the detection result message.

Meaning of the detection return code: 0 means the detection succeeded, 1 means a detection alarm, and any other value means the detection failed.

The detection result message carries the result information of this health check and, together with the detection return code, tells the user what happened in this check.
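As an illustration only, the field format above could be parsed on the detecting side roughly as follows. This is a minimal sketch under stated assumptions, not part of the method itself: the class name HealthCheckResult and its Status enum are hypothetical, a missing header is treated as a failure, and an empty ResultInfo is treated as "no message".

```java
import java.util.HashMap;
import java.util.Map;

/** Parsed value of the Health-Check header (hypothetical helper, not defined by the patent). */
final class HealthCheckResult {
    enum Status { SUCCESS, ALARM, FAILURE }

    final Status status;
    final String info; // null when no result message was supplied

    private HealthCheckResult(Status status, String info) {
        this.status = status;
        this.info = info;
    }

    /** Parses a value of the form "ResultCode=<code>; ResultInfo=<message>". */
    static HealthCheckResult parse(String headerValue) {
        if (headerValue == null) {
            // Assumption: a reply without the header counts as a failed check.
            return new HealthCheckResult(Status.FAILURE, null);
        }
        Map<String, String> parts = new HashMap<>();
        // Split into at most two pieces so the message itself may contain ';'.
        for (String piece : headerValue.split(";", 2)) {
            int eq = piece.indexOf('=');
            if (eq > 0) {
                parts.put(piece.substring(0, eq).trim(), piece.substring(eq + 1).trim());
            }
        }
        String code = parts.get("ResultCode");
        Status status;
        if ("0".equals(code)) {
            status = Status.SUCCESS;
        } else if ("1".equals(code)) {
            status = Status.ALARM;
        } else {
            status = Status.FAILURE; // any other value, including a missing code
        }
        String info = parts.get("ResultInfo");
        if (info != null && info.isEmpty()) {
            info = null;
        }
        return new HealthCheckResult(status, info);
    }
}
```

Splitting into at most two pieces keeps the parser tolerant of semicolons inside the result message, which the field definition does not rule out.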

Fig. 2 is the flow chart of this embodiment. As the figure shows, the method comprises the following steps:

Request sending step: the detecting unit sends a detection processing request to the detected unit; the request carries the detection field and a processing policy. The detection field is the field defined above, and the processing policy may be an operation such as requesting and releasing memory;

Detection step: the detected unit performs the corresponding operations according to the processing policy carried in the detection processing request;

In this embodiment, the web server program only needs to add the HTTP header field according to the health check protocol described above.

For example, a website that uses the method of the invention to monitor system stability checks its health by periodically requesting the page http://localhost:8080/index.jsp. In index.jsp the server application can specify different processing policies to judge its own health, and it fills in the return code and return message of the Health-Check field according to the protocol format in the table above. For instance, it can request a block of memory and then release it: if this succeeds, the server is healthy, and the success return code and a success message are set; if the memory cannot be obtained, memory has overflowed and the detection has failed, and the failure return code and a failure message are set. The detected server can likewise be told to perform a read or write operation and be judged by the result. The detecting unit determines the server's health by analyzing the content of this user-defined field and then, depending on the configured policy, raises an alarm, restarts the system, and so on.

In this way users can flexibly define their own criteria for whether the server is healthy, which removes the limitation of relying solely on the HTTP status code to judge server health.

In this embodiment, the field can be set very simply in a JSP file as follows:

<% response.setHeader("Health-Check", result); %>

Here result is a string in the fixed format defined by this protocol.
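To make the server side concrete, the memory processing policy described earlier (request a block of memory, release it, report the outcome) might look like the sketch below. It is an assumption-laden illustration: the class MemoryPolicy, its method names, the message texts, and the choice of 2 as the failure code are all hypothetical; the protocol only requires a code other than 0 or 1 for failure.

```java
/** Server-side self-check sketch; class, method names and messages are illustrative only. */
final class MemoryPolicy {
    /** Builds the header value in the "ResultCode=...; ResultInfo=..." format of Table 1. */
    static String format(int resultCode, String resultInfo) {
        return "ResultCode=" + resultCode + "; ResultInfo=" + (resultInfo == null ? "" : resultInfo);
    }

    /** Requests a block of memory, releases it, and reports the outcome as a header value. */
    static String run(int bytes) {
        if (bytes <= 0) {
            throw new IllegalArgumentException("bytes must be positive");
        }
        try {
            byte[] block = new byte[bytes]; // "request memory"
            block[bytes - 1] = 1;           // touch the block so the allocation is really used
            block = null;                   // "release memory": the block becomes garbage-collectable
            return format(0, "memory check passed");
        } catch (OutOfMemoryError e) {
            // Assumption: 2 is used as the failure code; any value other than 0 or 1 means failure.
            return format(2, "memory allocation failed");
        }
    }
}
```

In a JSP the returned string would take the place of result, for example response.setHeader("Health-Check", MemoryPolicy.run(1 << 20)). Catching OutOfMemoryError is generally discouraged in application code; it is used here only because the policy is specifically probing for that condition.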

The method of the invention can be implemented in hardware or in software; a software implementation is described below to show that the invention is practicable. Fig. 3 is a schematic diagram of the software implementation. As the figure shows, in the software implementation the technique of this embodiment is packaged as a software program, referred to here as the watchdog program.

We first introduce the file structure. A complete, standalone watchdog program comprises the following four files:

an executable file, a configuration file, a documentation file, and a log file. Table 2 describes each file.

Table 2

File name	Description
Executable file	The executable that runs the watchdog program
Configuration file	Defines all configuration parameters of the watchdog
Documentation file	Explains the watchdog configuration parameters
Log file	The watchdog's log records

The executable file in turn consists of the four program entities shown in Table 3:

Table 3

File name	Description
Main	Main run entity of the watchdog
WatchDog	Execution entity for the watchdog control flow
HeartBeatThread	Watchdog heartbeat thread entity, dedicated to periodic health checks
RunThread	Captures the printed output of the child process started by the watchdog

The main relationships among these four program entities are shown in Fig. 4:

As shown in Fig. 4, Main controls the overall health check flow of the watchdog and repeatedly calls WatchDog methods to carry out the various control actions;

WatchDog encapsulates the various control methods, which Main and HeartBeatThread call to carry out control actions;

HeartBeatThread runs as an independent thread dedicated to periodic health checks, and passes each result to WatchDog to be stored and processed;

RunThread simply reads the output stream of the child process, so that after a restart the system's printed output can be redirected to the watchdog's console. After the system goes down and is restarted, the server application has been launched by the watchdog and is therefore a child process of the watchdog; only by having the watchdog capture this child process's output stream can the original console messages of the server application be redirected to the watchdog's console and shown to the administrator.
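The role described for RunThread can be sketched with the standard ProcessBuilder API. The class name OutputRelay and the method startAndRelay are assumptions made for illustration; the source does not show the actual RunThread code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;

/** Copies a child process's output to the watchdog console, line by line (illustrative sketch). */
final class OutputRelay implements Runnable {
    private final InputStream source;
    private final PrintStream console;

    OutputRelay(InputStream source, PrintStream console) {
        this.source = source;
        this.console = console;
    }

    @Override
    public void run() {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(source))) {
            String line;
            while ((line = reader.readLine()) != null) {
                console.println(line); // forward each line to the watchdog's own console
            }
        } catch (IOException e) {
            console.println("output relay stopped: " + e.getMessage());
        }
    }

    /** Starts the restart command and relays its merged stdout/stderr from a daemon thread. */
    static Process startAndRelay(String... command) throws IOException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // fold stderr into stdout so one relay is enough
        Process child = builder.start();
        Thread relay = new Thread(new OutputRelay(child.getInputStream(), System.out), "RunThread");
        relay.setDaemon(true);
        relay.start();
        return child;
    }
}
```

Reading the stream in a separate thread also keeps the child from blocking when its output buffer fills, which is a common pitfall when a parent process launches a long-running server.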

The control flow of the Main run entity is described next; see the control flow diagram in Fig. 5:

With reference to Fig. 5, the overall control flow of the Main run entity is as follows:

1. The watchdog program enters the main run entity as soon as it starts. It first performs initialization, reading all configuration parameters from the configuration file and saving them. These parameters are essential to health checking; where the user has not configured one, the default value is used. The main configuration parameters are listed in Table 4:

Table 4

Keyword	Description	Default value
watchdog/request-page	Health check request	
watchdog/heartbeat-interval	Interval between health checks	10 seconds
watchdog/max-error-times	Number of retries on failure; if this many consecutive checks all fail, the system is considered down	10 times
watchdog/heartbeat-timeout	How long to wait for a response; if no reply arrives within this time, the check is considered failed	60 seconds
watchdog/time-of-wait-restarted	Wait time after a restart; no health checks are performed during this period after the system restarts	90 seconds
watchdog/logfile	Log file name	watchdog
watchdog/cmd-str	Script command for restarting the system; executed when the system is found to be down	

2. During initialization an independent health check thread, HeartBeatThread, is created to perform the health checks. The main run entity then blocks and waits until the HeartBeatThread thread finishes a check and wakes it;

3. Once the HeartBeatThread thread finishes a check, it wakes the main run entity, which goes on to call WatchDog methods to set the current system state. Based on the result of this check and the results of previous checks, it decides whether the system as a whole is currently ACTIVE, DEAD, or ABNORMAL. ACTIVE means the system is normal; DEAD means the system is down; ABNORMAL is the intermediate state in which this check failed but the maximum number of failures has not yet been reached.

4. The current system state is then evaluated further. If the state is not ACTIVE, the error information is written to the log and an alarm is raised, or an email is sent to alert the system administrator. If the DEAD state has been reached, an external command is called to restart the system, and the restart is recorded in the log file. After the restart the watchdog waits for the restart wait time given in the configuration, and performs no health checks during this period.

5. When the above steps are complete, the main run entity returns to the wait state described in step 2 and waits for the HeartBeatThread thread to wake it to evaluate the next health check result; this cycle repeats indefinitely.
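The state decision in steps 3 and 4 amounts to counting consecutive failures against watchdog/max-error-times. The following is a minimal sketch of that logic; the class SystemStateTracker and its methods are hypothetical names, and how a detection alarm (ResultCode 1) maps onto success or failure is left to the caller because the text does not specify it.

```java
/** Tracks consecutive failed checks and derives the system state (illustrative sketch). */
final class SystemStateTracker {
    enum State { ACTIVE, ABNORMAL, DEAD }

    private final int maxErrorTimes; // corresponds to watchdog/max-error-times
    private int consecutiveFailures = 0;

    SystemStateTracker(int maxErrorTimes) {
        this.maxErrorTimes = maxErrorTimes;
    }

    /** Records one check result and returns the resulting system state. */
    State record(boolean checkSucceeded) {
        if (checkSucceeded) {
            consecutiveFailures = 0;
            return State.ACTIVE;
        }
        consecutiveFailures++;
        // DEAD once the configured number of consecutive failures is reached, ABNORMAL before that.
        return consecutiveFailures >= maxErrorTimes ? State.DEAD : State.ABNORMAL;
    }

    /** Clears the failure count; meant to be called after the restart command has been issued. */
    void resetAfterRestart() {
        consecutiveFailures = 0;
    }
}
```

With this shape, the main run entity would log and alarm whenever record returns anything other than ACTIVE, and run the restart command and then sleep for the restart wait time when it returns DEAD.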

The design of the independent health check thread HeartBeatThread is described next, as shown in Fig. 6:

HeartBeatThread runs an independent thread (which we call the heartbeat thread) that is dedicated to performing health checks at regular intervals and saving each result in WatchDog. The thread stays inside its run() method, which loops forever and performs a health check each period. The steps are as follows:

1. First wait for the check interval; the thread sleeps during this time. The default interval is 10 seconds.

2. Then set the result of this check in WatchDog to failure. The result is set to success only when the check actually succeeds, so if the thread exits abnormally at any later point, this check is counted as failed.

3. Create an HTTP connection to the server. This covers the complete HTTP request/response exchange and can be done directly with the network programming interface provided by the JDK. The core of the health check is carried out over this connection; the request target is the URL read from the configuration file when the watchdog was initialized (the value of watchdog/request-page). If the connection succeeds, continue with the following steps to parse the response; if it fails, retry 3 times, and if every attempt fails, exit directly with this check's result set to failure. The request also carries the session ID obtained from the previous check as a Cookie value, so that every request uses the same session ID and the server does not keep creating new health check sessions and wasting memory.

4. When the watchdog checks the server for the first time, the server will generally attach a Cookie containing a session ID (sessionID) to the HTTP response, so that the client can attach the same Cookie to its next request and thereby identify and keep the session. The watchdog therefore searches the values of all Cookies in the HTTP response headers at this point and, if it finds a session ID, saves it, so that, as described in step 3, the session ID can be sent to the server with the next health check message to keep the session consistent.

5. Find and parse the user-defined HTTP header field "Health-Check", extracting the user-defined return code "ResultCode" and the user-defined result information "ResultInfo".

6. Judge the result of this health check from the return code.

7. Set the result of this check and the result information in WatchDog.

8. Wake the Main main thread to process the result further.

9. Return to step 1, wait for the interval, and prepare for the next health check.
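Steps 3 to 5 above can be sketched with java.net.HttpURLConnection, the network interface that ships with the JDK. This is an illustration under assumptions rather than the watchdog's actual code: the class HeartbeatProbe and its members are hypothetical, the session cookie is assumed to be named JSESSIONID (the usual name in Java servlet containers; the source only says "session ID"), and the three retries on connection failure are omitted for brevity.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

/** One health check round trip over HTTP (illustrative sketch; names are hypothetical). */
final class HeartbeatProbe {
    /** Raw Health-Check header value plus the session cookie to send with the next request. */
    static final class Reply {
        final String healthCheck; // null if the server did not set the header
        final String cookie;      // "NAME=value", or null if no session is known yet

        Reply(String healthCheck, String cookie) {
            this.healthCheck = healthCheck;
            this.cookie = cookie;
        }
    }

    /** Returns "NAME=value" for the named cookie from Set-Cookie values, or null if absent. */
    static String sessionCookie(List<String> setCookieValues, String cookieName) {
        if (setCookieValues == null) {
            return null;
        }
        for (String value : setCookieValues) {
            String pair = value.split(";", 2)[0].trim(); // drop attributes such as Path or HttpOnly
            if (pair.startsWith(cookieName + "=")) {
                return pair;
            }
        }
        return null;
    }

    /** Sends one request to the check page, reusing the previous session cookie if there is one. */
    static Reply check(URL page, String cookie, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) page.openConnection();
        try {
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            if (cookie != null) {
                conn.setRequestProperty("Cookie", cookie);
            }
            conn.getResponseCode(); // sends the request and waits for the reply
            List<String> setCookies = null;
            for (Map.Entry<String, List<String>> header : conn.getHeaderFields().entrySet()) {
                if ("Set-Cookie".equalsIgnoreCase(header.getKey())) {
                    setCookies = header.getValue();
                }
            }
            String fresh = sessionCookie(setCookies, "JSESSIONID"); // assumed cookie name
            return new Reply(conn.getHeaderField("Health-Check"), fresh != null ? fresh : cookie);
        } finally {
            conn.disconnect();
        }
    }
}
```

The heartbeat loop would call check once per interval, keep the returned cookie for the next round, and hand the healthCheck string to whatever parses ResultCode and ResultInfo.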

Notes on this embodiment:

1. The health check protocol section gave an example of setting the value of the user-defined HTTP header field in a JSP. In practice this is not limited to JSP: the field can be set anywhere in the application that has access to the HTTP response, which gives users a convenient and flexible way to customize health checks.

2. The invention focuses on the definition and implementation of the user-defined health check protocol and provides a clear user interface description for it. The invention also supports, and is compatible with, health check criteria based on the HTTP status code. A user who wishes to check a website in the traditional status code manner only needs to modify the configuration file: turn off the user-defined health check protocol switch, and configure the range of status codes that indicate a normal server and the range that indicates a server failure.

3. Every website server program that uses the invention should preferably provide a script command that restarts the server, so that the watchdog knows how to restart the service when the server goes down. This script command is configured through the watchdog/cmd-str node in the configuration file.

4. The watchdog of the invention and the website server application run on the same server, with no restriction on the operating system. If, however, the user does not need the website service to be restarted and only needs alarm records and reminder emails, the watchdog can equally be used to monitor any remote server.

5. The acquisition and storage of the session ID described in steps 3 and 4 of the HeartBeatThread flow can also be made configurable. A user who wants every check to use the same session can turn the session keeping switch on; a user who wants the system to be checked through different sessions can turn it off. The watchdog thus provides a flexible session keeping mechanism.
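Several of these notes depend on values read from the configuration file, such as the restart command in watchdog/cmd-str. The sketch below shows one way such parameters could be loaded with the defaults listed in Table 4. It assumes a java.util.Properties source purely for convenience; the source does not state the configuration file format (the word "node" suggests it may be hierarchical), and the class WatchdogConfig and its field names are hypothetical.

```java
import java.util.Properties;

/** Watchdog settings with the Table 4 defaults (illustrative; the real file format is not given). */
final class WatchdogConfig {
    final String requestPage;           // watchdog/request-page, no default
    final int heartbeatIntervalSeconds; // watchdog/heartbeat-interval, default 10
    final int maxErrorTimes;            // watchdog/max-error-times, default 10
    final int heartbeatTimeoutSeconds;  // watchdog/heartbeat-timeout, default 60
    final int waitRestartedSeconds;     // watchdog/time-of-wait-restarted, default 90
    final String restartCommand;        // watchdog/cmd-str, null when not configured

    WatchdogConfig(Properties source) {
        requestPage = source.getProperty("watchdog/request-page");
        heartbeatIntervalSeconds = intOf(source, "watchdog/heartbeat-interval", 10);
        maxErrorTimes = intOf(source, "watchdog/max-error-times", 10);
        heartbeatTimeoutSeconds = intOf(source, "watchdog/heartbeat-timeout", 60);
        waitRestartedSeconds = intOf(source, "watchdog/time-of-wait-restarted", 90);
        restartCommand = source.getProperty("watchdog/cmd-str");
    }

    /** Reads an integer setting, falling back to the default when the key is absent. */
    private static int intOf(Properties source, String key, int fallback) {
        String raw = source.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    /** True when a restart command is configured; otherwise the watchdog can only log and alarm. */
    boolean canRestart() {
        return restartCommand != null && !restartCommand.trim().isEmpty();
    }
}
```

A configuration without watchdog/cmd-str matches the alarm-only mode of note 4, in which the watchdog monitors a remote server it cannot restart.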

The above are only some specific implementations of the invention; the scope of protection of the invention is not limited to them, and every scheme designed according to this principle falls within the scope of protection of the invention.

Claims

1. A detection method, characterized in that a detection field is defined in a protocol, and detection comprises the following steps:

2. The method of claim 1, characterized in that the field content comprises a detection return code and a detection result message.

3. The method of claim 1, characterized in that the detection return code takes one of the values detection succeeded, detection alarm, or detection failed.

4. The method of claim 1, characterized in that the detection is performed periodically.

5. The method of claim 1, characterized in that the processing policy includes requesting memory, releasing memory, write operations, read operations, and the like.

6. The method of claim 1, characterized in that the request sending step further comprises:

7. The method of claim 6, characterized in that the configuration file also contains the script command for restarting the server, so that the detected unit can be restarted.

8. The method of claim 1, characterized in that the result analysis step further comprises selecting, according to the health of the detected server, one of treating the system as normal, raising an alarm, or restarting the system.

9. The method of claim 1, characterized in that it further comprises a step of generating a log file from the result of the operation.

10. The method of claim 1, characterized in that the protocol is the Hypertext Transfer Protocol (HTTP).