CN105550100A - Method and system for automatic fault recovery of information system - Google Patents

Method and system for automatic fault recovery of information system

Publication number
CN105550100A
Authority
CN
China
Prior art keywords
warning information
described warning
infosystem
script
neural network
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510920960.4A
Other languages
Chinese (zh)
Inventor
闫龙川
张晓亮
崔硕
杨猛
毛一凡
刘冬梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Application filed by State Grid Corp of China SGCC and State Grid Information and Telecommunication Co Ltd
Priority to CN201510920960.4A
Publication of CN105550100A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment, for performance assessment
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14: Error detection or correction of the data by redundancy in operation
    • G06F11/1402: Saving, restoring, recovering or retrying
    • G06F11/1446: Point-in-time backing up or restoration of persistent data
    • G06F11/1448: Management of the data involved in backup or backup restore

Abstract

The invention discloses a method for automatic fault recovery of an information system, comprising the following steps of: acquiring numerical values of monitoring status indicators of the information system; comparing the numerical value of each monitoring status indicator with a corresponding predetermined status indicator range, and determining warning information according to a comparison result; according to the warning information, selecting a corresponding BP neural network status analysis program to perform status evaluation on the warning information; according to the status evaluation result, calling corresponding matched scripts, and executing a recovery command according to the matched scripts. The method realizes automatic recovery of the warning information for the information system; the invention also discloses a system for automatic fault recovery for an information system.

Description

A method and system for automatic fault recovery of an information system
Technical field
The present invention relates to the field of data processing, and in particular to a method and system for automatic fault recovery of an information system.
Background
With the widespread application of information technology across all industries, a large number of information systems have emerged: e-commerce websites and social media serving users worldwide; dispatching, monitoring and command systems in public-utility fields such as electric power, transportation and meteorology; and enterprise management information systems for marketing, finance and human resources. As these systems bring informatization, digitization and networking to production, operation and management services, data centers keep adding equipment to host them. To guarantee user access, information systems are generally required to run stably and without interruption 24 hours a day, 7 days a week. When an individual software or hardware fault or problem occurs, it must be handled and recovered quickly without affecting users, which places higher demands on system survivability and robustness.
At present, information systems generally adopt a cluster architecture with redundancy in both hardware and software, so that when a single node has a problem or fails, the overall operation of the system and the user experience are affected as little as possible. Users and operation and maintenance (O&M) personnel alike want problems and faults to be resolved and recovered quickly, so that the processing capacity and performance of the system and its use by users are not affected.
A large data center deploys dozens or even hundreds of information systems on tens of thousands to hundreds of thousands of servers. Manual handling of problems and faults cannot meet the requirements of system operation and business use. A technical method for automatically recovering from information system faults and problems is therefore needed to reduce manual intervention, improve the overall reliability and self-recovery capability of information systems, and raise the level of automation and intelligence of O&M work.
Summary of the invention
The object of the present invention is to provide a method and system for automatic fault recovery of an information system. The method automatically recovers from information system faults and problems, reduces manual intervention, improves the overall reliability and self-recovery capability of the information system, and raises the level of automation and intelligence of O&M work.
To solve the above technical problem, the present invention provides a method for automatic fault recovery of an information system, comprising:
Acquiring numerical values of monitoring status indicators of the information system;
Comparing the numerical value of each monitoring status indicator with a corresponding predetermined status indicator range, and determining warning information according to the comparison result;
According to the warning information, selecting a corresponding BP neural network status analysis program to perform status evaluation on the warning information;
According to the status evaluation result, calling a corresponding matching script, and executing a recovery command according to the matching script.
Wherein, selecting the corresponding BP neural network status analysis program according to the warning information to perform status evaluation on the warning information comprises:
Judging, according to the warning information, whether the warning information falls within the scope of the knowledge base;
If so, selecting the BP neural network status analysis program corresponding to the warning information to perform status evaluation on the warning information.
Wherein, calling the corresponding matching script according to the status evaluation result and executing the recovery command according to the matching script comprises:
S3: according to the status evaluation result, calling the corresponding matching script;
S31: judging whether the number of consecutive processing attempts for the warning information exceeds a corresponding threshold;
S32: if the threshold is not exceeded, executing the recovery command according to the matching script, and verifying whether the warning information has recovered;
S33: if it has recovered, ending;
S34: if it has not recovered, selecting the corresponding BP neural network status analysis program according to the warning information to perform status evaluation on the warning information again, and returning to S3.
Wherein, the method further comprises:
Recording a processing log of the automatic fault recovery of the information system.
Wherein, the method further comprises:
Periodically maintaining the BP neural network status analysis programs and the matching scripts according to the log of the automatic fault recovery system.
The present invention also provides a system for automatic fault recovery of an information system, comprising:
An acquisition module, configured to acquire numerical values of monitoring status indicators of the information system;
A warning information module, configured to compare the numerical value of each monitoring status indicator with a corresponding predetermined status indicator range, and to determine warning information according to the comparison result;
A status evaluation module, configured to select, according to the warning information, a corresponding BP neural network status analysis program to perform status evaluation on the warning information;
A recovery module, configured to call a corresponding matching script according to the status evaluation result, and to execute a recovery command according to the matching script.
Wherein, the status evaluation module comprises:
A scope judging unit, configured to judge, according to the warning information, whether the warning information falls within the scope of the knowledge base;
A status evaluation unit, configured to select, if so, the BP neural network status analysis program corresponding to the warning information to perform status evaluation on the warning information.
Wherein, the recovery module comprises:
A calling unit, configured to call the corresponding matching script according to the status evaluation result;
A first judging unit, configured to judge whether the number of consecutive processing attempts for the warning information exceeds a corresponding threshold;
An execution unit, configured to execute the recovery command according to the matching script if the threshold is not exceeded;
A verification unit, configured to verify whether the warning information has recovered;
If it has not recovered, the status evaluation module is triggered to select, according to the warning information, the corresponding BP neural network status analysis program to perform status evaluation on the warning information again.
Wherein, the system further comprises:
A log module, configured to record a processing log of the automatic fault recovery of the information system.
Wherein, the system further comprises:
A maintenance module, configured to periodically maintain the BP neural network status analysis programs and the matching scripts according to the log of the automatic fault recovery system.
The method and system for automatic fault recovery of an information system provided by the present invention comprise: acquiring numerical values of monitoring status indicators of the information system; comparing the numerical value of each monitoring status indicator with a corresponding predetermined status indicator range, and determining warning information according to the comparison result; according to the warning information, selecting a corresponding BP neural network status analysis program to perform status evaluation on the warning information; and, according to the status evaluation result, calling a corresponding matching script and executing a recovery command according to the matching script. The method automatically recovers from faults and problems of the information system, reduces manual intervention, improves the overall reliability and self-recovery capability of the information system, and raises the level of automation and intelligence of O&M work.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of the method for automatic fault recovery of an information system provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a typical information system architecture provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the handling mechanism for automatic fault recovery of an information system provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the method for automatic fault recovery of an information system provided by an embodiment of the present invention;
Fig. 5 is a structural block diagram of the system provided by an embodiment of the present invention.
Detailed description of the embodiments
The core of the present invention is to provide a method and system for automatic fault recovery of an information system. The method automatically recovers from information system faults and problems, reduces manual intervention, improves the overall reliability and self-recovery capability of the information system, and raises the level of automation and intelligence of O&M work.
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, which is a flowchart of the method for automatic fault recovery of an information system provided by an embodiment of the present invention, the method may comprise:
S100: acquiring numerical values of monitoring status indicators of the information system;
A typical information system consists of servers, databases, middleware, load-balancing software and hardware, and other components. Referring to Fig. 2, the load balancer is the entry point for external access to the system. It distributes external requests to different application servers, and the application servers access the database servers to operate on business data.
To handle information system faults automatically, the common failure conditions of information systems must first be understood, classified and summarized. To analyze typical failure conditions, a statistical analysis was carried out on the typical problems and faults that occurred during the recent operation of the information systems in a certain data center. By proportion, unavailable middleware services, insufficient disk storage space, insufficient database table space, insufficient server performance and hardware faults are the most typical problems in information system O&M, accounting for more than 80% of all problems handled daily. According to O&M experience, these problems and faults have distinct characteristics and typical handling methods. The handling procedures and rules can therefore be summarized and refined and, combined with the recorded system running status, formed into an O&M knowledge base. Finally, the O&M knowledge base guides the automatic recovery of faults.
To determine information system faults accurately, status indicators that characterize the above problems must be acquired. The status indicators to be monitored are determined from the fault information that the information system may produce, and the monitored values of these indicators are acquired. Monitoring may be performed in real time, different monitoring frequencies may be assigned according to the busy periods of the information system, or all status indicators may be monitored at one uniform frequency. The monitoring frequency is therefore determined by actual conditions and is not limited here.
By monitoring the information system, status indicators such as its running status, service status and system load can be acquired.
S110: comparing the numerical value of each monitoring status indicator with a corresponding predetermined status indicator range, and determining warning information according to the comparison result;
The monitored value of each status indicator is compared with the normal value range predetermined for that indicator. If the value lies within the normal range, the indicator is normal; if not, the indicator is abnormal. Abnormal status indicators are taken as warning information, and the warning information is passed to the automatic handling process of the information system. The detailed process is described in the following steps.
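As a rough illustration of step S110, the range comparison can be sketched in Python as follows. The indicator names and normal ranges are hypothetical placeholders chosen for the example, not values taken from the patent; a real deployment would load them from its monitoring configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One abnormal indicator, used here as a unit of warning information."""
    indicator: str
    value: float
    low: float
    high: float


# Hypothetical normal ranges (assumed for illustration only).
NORMAL_RANGES = {
    "disk_usage_pct": (0.0, 85.0),
    "tablespace_usage_pct": (0.0, 90.0),
    "cpu_load_pct": (0.0, 80.0),
}


def detect_alerts(readings, ranges=NORMAL_RANGES):
    """Return an Alert for every reading that falls outside its normal range."""
    alerts = []
    for name, value in readings.items():
        if name not in ranges:
            continue  # no predetermined range for this indicator, so skip it
        low, high = ranges[name]
        if not (low <= value <= high):
            alerts.append(Alert(name, value, low, high))
    return alerts


readings = {"disk_usage_pct": 92.5, "tablespace_usage_pct": 40.0, "cpu_load_pct": 55.0}
alerts = detect_alerts(readings)
print([a.indicator for a in alerts])
```

Only the disk usage reading lies outside its assumed range, so it is the only indicator reported as warning information.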
S120: according to the warning information, selecting a corresponding BP neural network status analysis program to perform status evaluation on the warning information;
According to the warning information, the corresponding BP neural network status analysis program is selected to perform status evaluation on the warning information. Typical warning information includes insufficient storage space, server hardware faults, business overload, an unresponsive database and insufficient database performance. Typical problems of information systems and the corresponding analysis techniques are shown in Table 1:
Table 1 Typical problems and analysis techniques
The method uses BP neural networks for status evaluation and capacity prediction of the information system. The BP (Back Propagation) network was proposed in 1986 by a group of scientists led by Rumelhart and McClelland. It is a multilayer feedforward network trained by the error back-propagation algorithm and is one of the most widely used neural network models. A BP network can learn and store a large number of input-output pattern mappings without the mathematical equations describing those mappings being known in advance. Its learning rule uses the steepest descent method, continually adjusting the weights and thresholds of the network through back propagation so as to minimize the sum of squared errors of the network.
Suppose there are $P$ training samples, that is, $P$ input-output pairs $(I_p, T_p)$, $p = 1, 2, \ldots, P$, where the input vector is $I_p = (i_{p1}, \ldots, i_{pm})^T$, the target output vector is $T_p = (t_{p1}, \ldots, t_{pn})^T$, and the (theoretical) network output vector is:
$$O_p = (o_{p1}, \ldots, o_{pn})^T \tag{1}$$
Let $w_{ij}$ denote the weight from the $j$-th component ($j = 1, \ldots, m$) of the input vector to the $i$-th component ($i = 1, \ldots, n$) of the output vector. The theoretical and actual values usually differ by some error. Network learning means repeatedly comparing $O_p$ with $T_p$ and modifying the parameters $w_{ij}$ according to the minimum principle, so that the sum of squared errors is minimized:
$$\min \sum_{i=1}^{n} (t_{pi} - o_{pi})^2, \quad p = 1, \ldots, P \tag{2}$$
Delta learning rule:
Let $\Delta w_{ij}$ denote the correction applied in one recursion step; then:
$$w_{ij} \leftarrow w_{ij} + \Delta w_{ij} \tag{3}$$
$$\Delta w_{ij} = \sum_{p=1}^{P} \eta\,(t_{pi} - o_{pi})\, i_{pj} = \sum_{p=1}^{P} \eta\,\delta_{pi}\, i_{pj} \tag{4}$$
$$\delta_{pi} = t_{pi} - o_{pi} \tag{5}$$
$\eta$ is called the learning rate.
Note: from equation (1), the output of the $i$-th neuron is:
$$o_{pi} = f\left(\sum_{j=1}^{m} w_{ij}\, i_{pj}\right), \quad i_{pm} = -1, \quad w_{im} = \text{threshold of the } i\text{-th neuron} \tag{6}$$
In particular, when $f$ is a linear function,
$$o_{pi} = a\left(\sum_{j=1}^{m} w_{ij}\, i_{pj}\right) + b \tag{7}$$
For the neural network described above, if every neuron is linear and the training criterion is taken as
$$E = \sum_{p=1}^{P} E_p \tag{8}$$
$$E_p = \frac{1}{2} \sum_{i=1}^{n} (t_{pi} - o_{pi})^2 \tag{9}$$
then the steepest descent (gradient) method for finding the minimum of $E$ is exactly the Delta learning rule.
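The Delta learning rule above can be tried out on a toy problem. The following Python sketch assumes linear neurons with $a = 1$ and $b = 0$ and follows the convention of equation (6), in which the last input component is fixed at -1 so that its weight acts as the threshold. The function name, learning rate, epoch count and training data are illustrative choices, not taken from the patent.

```python
def train_delta_rule(samples, n_outputs, eta=0.05, epochs=500):
    """Batch Delta-rule training of a single layer of linear neurons.

    samples: list of (inputs, targets) pairs; each inputs list ends with -1.0
    so that the last weight of every neuron plays the role of its threshold.
    """
    n_inputs = len(samples[0][0])
    w = [[0.0] * n_inputs for _ in range(n_outputs)]
    for _ in range(epochs):
        # Accumulate the correction over all P samples, as in equation (4).
        delta_w = [[0.0] * n_inputs for _ in range(n_outputs)]
        for inputs, targets in samples:
            for i in range(n_outputs):
                output = sum(w[i][j] * inputs[j] for j in range(n_inputs))
                delta = targets[i] - output  # equation (5)
                for j in range(n_inputs):
                    delta_w[i][j] += eta * delta * inputs[j]
        # Apply the accumulated correction to the weights.
        for i in range(n_outputs):
            for j in range(n_inputs):
                w[i][j] += delta_w[i][j]
    return w


# Toy data: target t = 2x - 1, which corresponds to weight 2 and threshold 1.
samples = [([float(x), -1.0], [2.0 * x - 1.0]) for x in range(4)]
weights = train_delta_rule(samples, n_outputs=1)
print(weights)
```

With these settings the weights approach 2 and 1. A larger learning rate can make the batch update diverge, which is why a small value is used here.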
The status evaluation method selects different status quantities depending on the object being evaluated. By acquiring the relevant warning data, the running status of servers, databases and middleware can be evaluated. The status analysis process is described below using database status evaluation as an example. The main evaluation content and indicators for database status are shown in Table 2.
Table 2 Database status evaluation indicators
Taking the question of whether a database table space needs to be expanded as an example, a neural network is trained to make this judgment. Let $i_{p1}, i_{p2}, i_{p3}, i_{p4}, i_{p5}$ denote five indicators of table space status: table space size, usage rate, used space, remaining space and daily growth. A total of 151 groups of historical indicator data for a database table space, each labelled with whether expansion was required, were collected; 100 groups were used to train the neural network and 51 groups were used to test its evaluation of table space status. A neural network with one hidden layer of 5 units was selected. The test and verification code, written in R, is as follows:
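library(nnet)  # provides class.ind() and nnet()
# Assumed context: `space` is the data frame of table space records (column 8 is dropped from the inputs); trainIndex/testIndex are defined elsewhere
ideal <- class.ind(space$Label)  # one-hot encode the class labels
# Train a network with one hidden layer of 5 units on the training rows
spaceANN <- nnet(space[trainIndex, -8], ideal[trainIndex, ], size = 5, softmax = TRUE)
# Predicted class is the column with the highest output; type = "class" requires a formula fit, so max.col() is used
testLabel <- colnames(ideal)[max.col(predict(spaceANN, space[testIndex, -8]))]
my_table <- table(space[testIndex, ]$Label, testLabel)  # confusion matrix
test_error <- 1 - sum(diag(my_table)) / sum(my_table)  # misclassification rate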
According to the test results, the neural network correctly classified 50 of the 51 groups of test data and misclassified 1 group, giving an accuracy of about 98% (50/51), which meets the practical needs of the system for evaluating information system status.
In addition, capacity prediction is a basic task in handling O&M faults and problems in a data center, and this method uses a BP neural network for capacity prediction. This example uses weekly growth data of a database table space to learn and predict the data growth of the following week. The BP neural network learns from 80% of the data, the remaining 20% is used for verification, and the results are compared with the ARIMA regression algorithm. As the experimental results in Table 3 show, the BP neural network predicts better than ARIMA regression analysis. Using neural networks for system status evaluation and capacity prediction overcomes the limitation of traditional methods that handle faults or problems according to a single threshold, and makes status analysis and problem handling more accurate.
Table 3 Capacity prediction experiment results
No.   Actual value   ARIMA prediction   BP neural network prediction
1     1363           1334.640           1334.889
2     1365           1348.701           1356.293
3     1386           1356.288           1357.923
4     1393           1368.191           1374.546
5     1400           1374.852           1379.911
6     1404           1387.261           1385.511
7     1416           1393.421           1389.267
8     1421           1406.588           1395.449
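One way to quantify the comparison in Table 3 is the mean absolute error (MAE) of each predictor over the eight verification points. The patent does not state which error measure was used, so MAE is an assumption made for this sketch; the numbers are copied from Table 3.

```python
# Values copied from Table 3.
actual = [1363, 1365, 1386, 1393, 1400, 1404, 1416, 1421]
arima = [1334.640, 1348.701, 1356.288, 1368.191, 1374.852, 1387.261, 1393.421, 1406.588]
bp = [1334.889, 1356.293, 1357.923, 1374.546, 1379.911, 1385.511, 1389.267, 1395.449]


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


mae_arima = mean_absolute_error(actual, arima)
mae_bp = mean_absolute_error(actual, bp)
print(f"ARIMA MAE: {mae_arima:.3f}")
print(f"BP MAE:    {mae_bp:.3f}")
```

On these eight points the BP network has the lower MAE (about 21.78 against about 22.26 for ARIMA). The margin is small, and ARIMA is closer to the actual value in rows 6 to 8, so a longer verification series would be needed to draw a firm conclusion.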
Each BP neural network status analysis program can be optimized according to how well automatic recovery works in practice, so as to ensure the accuracy and reliability of automatic recovery. Optionally, selecting the corresponding BP neural network status analysis program according to the warning information to perform status evaluation on the warning information may comprise:
Judging, according to the warning information, whether the warning information falls within the scope of the knowledge base;
If so, selecting the BP neural network status analysis program corresponding to the warning information to perform status evaluation on the warning information.
Not every fault can be recovered automatically. As the technology develops, the knowledge base for automatic recovery will become richer and more complete. While it is still incomplete, each piece of warning information must be checked to see whether it falls within the scope covered by the knowledge base. If it does, the BP neural network status analysis program corresponding to the warning information is selected to perform status evaluation on it. If it does not, an alarm message may be sent, or the responsible management personnel may be notified so that the manual handling process is started.
On this basis, experience can be summarized and the knowledge base gradually improved. As the automatic fault recovery system is used over time, the knowledge base enters a virtuous cycle, which ultimately makes automatic fault recovery more comprehensive and reliable. As the technology develops, the knowledge base can be updated and revised to keep it accurate.
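The knowledge base scope check can be pictured as a lookup from warning type to analysis program, with a fallback to manual handling. In the Python sketch below the warning type names are hypothetical, and a simple rule stands in for the trained BP neural network status analysis program so that the example stays self-contained.

```python
def tablespace_analyzer(metrics):
    # Stand-in rule for illustration only; the patent uses a trained
    # BP neural network status analysis program at this point.
    return "expand_required" if metrics["usage_pct"] > 90 else "normal"


# Hypothetical knowledge base: warning type -> status analysis program.
KNOWLEDGE_BASE = {"tablespace_full": tablespace_analyzer}


def evaluate_warning(warning_type, metrics, knowledge_base, notify):
    """Run the matching analysis program, or escalate if none is registered."""
    analyzer = knowledge_base.get(warning_type)
    if analyzer is None:
        # Outside the knowledge base scope: hand over to manual handling.
        notify(f"{warning_type}: outside knowledge base scope, manual handling required")
        return None
    return analyzer(metrics)


notifications = []
known = evaluate_warning("tablespace_full", {"usage_pct": 95}, KNOWLEDGE_BASE, notifications.append)
unknown = evaluate_warning("power_supply_fault", {}, KNOWLEDGE_BASE, notifications.append)
print(known, unknown, notifications)
```

Extending the knowledge base then amounts to registering another analysis program under a new warning type.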
S130: according to the status evaluation result, calling a corresponding matching script, and executing a recovery command according to the matching script.
For each kind of warning, a corresponding matching script can be provided to recover from it automatically. The matching script stores the step-by-step instructions normally used to handle the problem, and calling and executing those instructions completes the automatic fault recovery of the information system. Matching scripts may include middleware start scripts, stop scripts, restart scripts, database table space expansion scripts, database node restart scripts, F5 load-balancing device isolation scripts and so on. The corresponding script or job is selected and executed according to the status of the system, so that problems or faults found during operation are handled in time, common problems are processed quickly, and business functions are restored quickly.
Based on the above technical solution, the method for automatic fault recovery of an information system provided by the embodiments of the present invention uses BP neural networks for software and hardware status evaluation and system capacity prediction, establishes a set of matching scripts for automatic handling, and matches system status to scripts with the help of the problem handling knowledge base, thereby achieving automated recovery from typical problems in information system operation. This reduces manual intervention, improves the overall reliability and self-recovery capability of the information system, and raises the level of automation and intelligence of O&M work.
Based on the above technical solution, the automatic handling of faults and problems is a typical monitor-and-process closed-loop mechanism: the operating indicators of the information system are first monitored and status evaluation is performed; the rules matched through the knowledge base are then used to start the relevant recovery jobs and commands; finally, recovery of the business or clearing of the warning is verified. The handling mechanism is shown in Fig. 3. Any abnormality found is reported to the automatic handling process, which evaluates and judges the status of the information system and takes remote actions to handle the problem or fault and restore the business. If a problem or fault does not fall within the scope of the knowledge base, the number of automatic handling attempts is exceeded, handling fails, or the business or warning does not recover, the predetermined management personnel are notified. A schematic diagram of the automatic handling of faults and problems is shown in Fig. 4. The specific handling process may be as follows:
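The matching between evaluation results and recovery scripts can be thought of as a lookup table. The Python sketch below uses hypothetical warning types, evaluation results, script paths and host name; it only builds the command that would be run and deliberately does not execute anything.

```python
# Hypothetical table: (warning type, evaluation result) -> matching script.
SCRIPT_TABLE = {
    ("middleware_down", "restart_required"): "scripts/restart_middleware.sh",
    ("tablespace_full", "expand_required"): "scripts/expand_tablespace.sh",
    ("db_node_unresponsive", "restart_required"): "scripts/restart_db_node.sh",
    ("server_hardware_fault", "isolate_required"): "scripts/isolate_f5_member.sh",
}


def select_script(warning_type, evaluation):
    """Return the matching script path, or None if no script is registered."""
    return SCRIPT_TABLE.get((warning_type, evaluation))


def build_command(script_path, target_host):
    """Build an argument list that a job runner could pass to a subprocess."""
    return ["/bin/sh", script_path, target_host]


script = select_script("tablespace_full", "expand_required")
command = build_command(script, "db01.example.internal")
print(command)
```

Keeping the table as data rather than code means new scripts can be added during knowledge base maintenance without changing the dispatch logic.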
Based on the above technical solution, preferably, calling the corresponding matching script according to the status evaluation result and executing the recovery command according to the matching script may comprise:
S3: according to the status evaluation result, calling the corresponding matching script;
S31: judging whether the number of consecutive processing attempts for the warning information exceeds a corresponding threshold; if it does, the responsible management personnel may be notified and the manual handling process is started;
S32: if the threshold is not exceeded, executing the recovery command according to the matching script, and verifying whether the warning information has recovered;
S33: if it has recovered, ending;
S34: if it has not recovered, selecting the corresponding BP neural network status analysis program according to the warning information to perform status evaluation on the warning information again, and returning to S3.
The threshold may be set to different values for different kinds of warning information, or a single uniform threshold, for example 3, may be used.
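Steps S3 to S34 form a bounded retry loop. The Python sketch below shows one possible reading: the callables passed in are hypothetical stand-ins for the evaluation, script selection, execution, verification and notification components, and the threshold is interpreted as the maximum number of script executions before escalation, which is an assumption.

```python
def auto_recover(warning, evaluate, select_script, run_script,
                 is_recovered, notify, max_attempts=3):
    """Retry evaluation and recovery until the warning clears or the
    attempt limit is reached; return True on success, False on escalation."""
    attempts = 0
    state = evaluate(warning)
    while True:
        script = select_script(warning, state)      # S3: call the matching script
        if attempts >= max_attempts:                 # S31: attempt limit reached
            notify(f"{warning}: not recovered after {attempts} attempts")
            return False
        run_script(script)                           # S32: execute the recovery command
        attempts += 1
        if is_recovered(warning):                    # S33: warning has recovered
            return True
        state = evaluate(warning)                    # S34: re-evaluate, back to S3


# Simulated run in which the warning clears on the second attempt.
executed, notices = [], []
checks = iter([False, True])
recovered = auto_recover(
    "tablespace_full",
    evaluate=lambda w: "expand_required",
    select_script=lambda w, s: f"{w}:{s}",
    run_script=executed.append,
    is_recovered=lambda w: next(checks),
    notify=notices.append,
)

# Simulated run in which the warning never clears, so it is escalated.
executed_fail = []
failed = auto_recover(
    "middleware_down",
    evaluate=lambda w: "restart_required",
    select_script=lambda w, s: f"{w}:{s}",
    run_script=executed_fail.append,
    is_recovered=lambda w: False,
    notify=notices.append,
)
print(recovered, len(executed), failed, len(executed_fail), notices)
```

Passing the components in as callables keeps the loop independent of how evaluation and execution are implemented, and makes it straightforward to test with simulated outcomes as done here.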
Based on the above technical solution, and as can also be seen from Fig. 4, the method may further comprise:
Recording a processing log of the automatic fault recovery of the information system.
By reviewing and compiling statistics from the log, technicians can maintain the information system and can also update and improve the automatic recovery method. Optionally, the method may further comprise:
Periodically maintaining the BP neural network status analysis programs and the matching scripts according to the log of the automatic fault recovery system.
Maintaining the BP neural network status analysis programs and the matching scripts improves the reliability and accuracy of the system.
Based on the above technical solution, the method for automatic fault recovery of an information system provided by the embodiments of the present invention uses BP neural networks for software and hardware status evaluation and system capacity prediction, establishes a set of matching scripts for automatic handling, and matches system status to scripts with the help of the problem handling knowledge base, thereby achieving automated recovery from typical problems in information system operation. This reduces manual intervention, improves the overall reliability and self-recovery capability of the information system, and raises the level of automation and intelligence of O&M work. Maintaining the BP neural network status analysis programs and the matching scripts further improves the reliability and accuracy of the system.
The embodiments of the present invention provide a method for automatic fault recovery of an information system that can automatically recover from information system faults and problems and reduce manual intervention.
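As one possible way to use the processing log for periodic maintenance, recovery success rates can be aggregated per warning type and the weak spots flagged for review. The log record fields and the 0.8 cut-off in this Python sketch are assumptions made for illustration; the patent does not specify a log format.

```python
from collections import defaultdict


def recovery_success_rates(log_records):
    """Compute the fraction of successful automatic recoveries per warning type."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for record in log_records:
        totals[record["warning_type"]] += 1
        if record["recovered"]:
            successes[record["warning_type"]] += 1
    return {kind: successes[kind] / totals[kind] for kind in totals}


def needs_maintenance(rates, min_rate=0.8):
    """List warning types whose analysis program or script should be reviewed."""
    return sorted(kind for kind, rate in rates.items() if rate < min_rate)


# Hypothetical log records.
log_records = [
    {"warning_type": "tablespace_full", "recovered": True},
    {"warning_type": "tablespace_full", "recovered": True},
    {"warning_type": "middleware_down", "recovered": False},
    {"warning_type": "middleware_down", "recovered": True},
]
rates = recovery_success_rates(log_records)
flagged = needs_maintenance(rates)
print(rates, flagged)
```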
Be introduced the system of the infosystem automatically restoring fault that the embodiment of the present invention provides below, the system of infosystem automatically restoring fault described below can mutual corresponding reference with the method for above-described infosystem automatically restoring fault.
Please refer to Fig. 5, the structured flowchart of the Verification System of the system integration that Fig. 5 provides for the embodiment of the present invention; This system can comprise:
Acquisition module 100, for the numerical value of the monitor state index of obtaining information system;
Warning information module 200, for comparing the numerical value of monitor state index described in each with corresponding predetermined state indication range, and according to comparative result determination warning information;
State estimation module 300, for according to described warning information, selects corresponding BP neural network state analyser to carry out state estimation to described warning information;
Recover module 400, for according to condition evaluation results, transfer and mate script accordingly, and perform recovery order according to described coupling script.
Optionally, described state estimation module 300 comprises:
Range judging unit, for according to described warning information, judges whether described warning information belongs to knowledge base scope;
State evaluation unit, if for belonging to, then selects the BP neural network state analyser corresponding with described warning information to carry out state estimation to described warning information.
Optionally, the recovery module 400 comprises:
Retrieval unit, configured to retrieve the corresponding matching script according to the state assessment result;
First judging unit, configured to judge whether the number of consecutive handling attempts for the alert information exceeds the corresponding threshold;
Execution unit, configured to execute recovery commands according to the matching script if the threshold is not exceeded;
Verification unit, configured to verify whether the alert information has recovered;
If it has not recovered, the state assessment module is triggered to select, according to the alert information, the corresponding BP neural network state analyzer and perform state assessment on the alert information.
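The units above form a bounded retry loop: assess, retrieve the script, execute, verify, and re-assess if the alert persists. The sketch below is one possible reading, not the patented implementation. It assumes the threshold counts executed attempts, that scripts are callables, and that verification is a caller-supplied check; `recover`, `max_attempts`, and the example names are hypothetical.

```python
from typing import Callable, Dict


def recover(
    alert: str,
    assess: Callable[[str], str],
    scripts: Dict[str, Callable[[], None]],
    is_recovered: Callable[[str], bool],
    max_attempts: int = 3,
) -> bool:
    """Return True once the alert clears, False when the attempt limit is reached."""
    attempts = 0
    state = assess(alert)
    while True:
        script = scripts[state]        # retrieve the matching script
        if attempts >= max_attempts:   # consecutive-handling threshold reached
            return False
        script()                       # execute the recovery commands
        attempts += 1
        if is_recovered(alert):        # verify whether the alert has cleared
            return True
        state = assess(alert)          # not cleared: re-assess and try again


# Simulated fault that clears after the second script run.
calls = []


def restart_http() -> None:
    calls.append("restart")


result = recover(
    "http_unavailable",
    assess=lambda alert: "restart_service",
    scripts={"restart_service": restart_http},
    is_recovered=lambda alert: len(calls) >= 2,
)
print(result, len(calls))
```

The attempt limit is what keeps a script that cannot fix the fault from being re-run indefinitely; once it is reached the alert would have to be escalated some other way, which the text does not detail.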
Based on the above technical solution, the system may further comprise:
Log module, configured to record logs of the automatic fault recovery process of the information system.
Based on the above technical solution, the system may further comprise:
Maintenance module, configured to maintain the BP neural network state analyzers and the matching scripts periodically, according to the logs of the information system automatic fault recovery system.
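One way the log module and the maintenance module could fit together is sketched below: each recovery attempt is written as a JSON line, and a periodic review counts unrecovered attempts per script to suggest which scripts or analyzers need attention. The record fields, function names, and script file names are assumptions made for illustration.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable


def make_log_line(alert: str, state: str, script: str, recovered: bool) -> str:
    """Serialize one recovery attempt as a JSON line."""
    return json.dumps({
        "time": datetime.now(timezone.utc).isoformat(),
        "alert": alert,
        "state": state,
        "script": script,
        "recovered": recovered,
    })


def failed_attempts_by_script(log_lines: Iterable[str]) -> Counter:
    """Count unrecovered attempts per script to flag maintenance candidates."""
    failures = Counter()
    for line in log_lines:
        record = json.loads(line)
        if not record["recovered"]:
            failures[record["script"]] += 1
    return failures


# Hypothetical log entries.
log_lines = [
    make_log_line("http_unavailable", "service_down", "restart_http.sh", False),
    make_log_line("http_unavailable", "service_down", "restart_http.sh", True),
    make_log_line("tablespace_low", "needs_extension", "extend_tablespace.sql", True),
]
print(failed_attempts_by_script(log_lines))
```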
According to the above technical solutions, the system supports the automatic recovery handling of information system faults and problems. An integrated automated fault handling system for information systems can be developed with functional modules such as state indicator monitoring, state analysis and prediction, a knowledge base of typical faults and problems, operation and script management, and a workflow engine, in order to safeguard the stable operation of the data center. The system may comprise an interaction layer, a business layer, and a presentation layer. The interaction layer mainly deals with servers, load balancers, databases, application software, and the like; it obtains operating state indicator data, executes handling commands, and integrates data with external systems. The business layer mainly implements knowledge base management, state analysis, and the management of operations and matching scripts. The presentation layer is the interface for daily operation, system configuration, statistical analysis, and viewing system operating conditions and handling status.
The effect of the above system is verified below with a concrete example:
A test environment was built. The information system under test consisted of 1 F5 device and 4 Linux servers, of which 2 application servers ran WebLogic middleware and 2 database servers ran Oracle 11g databases. The automated fault handling system can be deployed on 1 Windows server.
The tests mainly covered 4 typical fault scenarios: handling of an insufficient-tablespace alarm, HTTP service unavailable, handling of an insufficient Linux server log space alarm, and server hardware device fault. Each scenario was tested 5 times, and every run executed successfully. For the execution results, refer to Table 4. The execution-time results show that the information system fault recovery method proposed here can meet practical needs: it improves the response time for handling faults and problems, quickly restores normal business use, improves the overall reliability of system operation, and improves the efficiency of O&M work.
Table 5: Automatic problem recovery test cases
The system's monitoring may sample the system state on a timer, every 5 minutes. According to the test cases, because the problem handling time is short, the problem recovery time is proportional to the monitoring sampling period. The sampling period and strategy can be adjusted dynamically according to actual operating needs or the characteristics of the problem, to meet the needs of the information system in actual operation.
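The relation between sampling period and recovery time can be illustrated with a small calculation. The model below is an assumption, not taken from the patent: a fault is noticed only at the next sample, faults begin at a uniformly random moment within a sampling interval, and the 20-second handling time is a made-up figure.

```python
from typing import Tuple


def recovery_time_bounds(
    sampling_period_s: float, handling_time_s: float
) -> Tuple[float, float, float]:
    """Return (best, average, worst) seconds from fault onset to recovery.

    Assumes detection happens at the next sample and that faults start at a
    uniformly random point inside a sampling interval.
    """
    best = handling_time_s
    average = sampling_period_s / 2 + handling_time_s
    worst = sampling_period_s + handling_time_s
    return best, average, worst


# 5-minute sampling as in the text; the 20 s handling time is hypothetical.
print(recovery_time_bounds(300, 20))
```

Under this model, when the handling time is short the sampling term dominates, which is consistent with the statement that recovery time scales with the sampling period.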
Automatic problem recovery can misjudge the system state or fail while executing a script. In routine use, the automatic problem recovery logs should therefore be analyzed regularly, and the errors or defects found in the automated handling process corrected. Problems that arise while the information system is running and in use should be analyzed in depth and resolved thoroughly at the technical level, so that they do not recur frequently.
Based on the above technical solution and on techniques related to automatic fault recovery of information systems, typical information system fault types are analyzed, system operating state and capacity are assessed by BP neural networks, system operating states and recovery operation rules are managed through an O&M knowledge base, and typical automated operation scripts are established. This achieves automated recovery from common faults and problems and improves, overall, the operating reliability of data center information systems and the efficiency of O&M work. The system overcomes the limitations of handling faults and problems with a single threshold. Further, through wider testing and application, a richer knowledge base and script set can be built, the platform functions completed, and the state data analysis methods and models optimized, continuously raising the intelligence and automation of data center O&M.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, see the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The method and system for automatic fault recovery of an information system provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. It should be noted that those of ordinary skill in the art may make improvements and modifications to the present invention without departing from its principles, and such improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims (10)

1. A method for automatic fault recovery of an information system, characterized by comprising:
obtaining the values of the monitored state indicators of the information system;
comparing the value of each monitored state indicator with its corresponding predetermined indicator range, and determining alert information from the comparison result;
selecting, according to the alert information, the corresponding BP neural network state analyzer to perform state assessment on the alert information;
retrieving the corresponding matching script according to the state assessment result, and executing recovery commands according to the matching script.
2. The method of claim 1, characterized in that selecting, according to the alert information, the corresponding BP neural network state analyzer to perform state assessment on the alert information comprises:
judging, according to the alert information, whether the alert information falls within the scope of the knowledge base;
if it does, selecting the BP neural network state analyzer corresponding to the alert information to perform state assessment on the alert information.
3. The method of claim 2, characterized in that retrieving the corresponding matching script according to the state assessment result, and executing recovery commands according to the matching script, comprises:
S3, retrieving the corresponding matching script according to the state assessment result;
S31, judging whether the number of consecutive handling attempts for the alert information exceeds the corresponding threshold;
S32, if the threshold is not exceeded, executing recovery commands according to the matching script, and verifying whether the alert information has recovered;
S33, if it has recovered, ending;
S34, if it has not recovered, selecting, according to the alert information, the corresponding BP neural network state analyzer to perform state assessment on the alert information, and returning to S3.
4. The method of any one of claims 1 to 3, characterized by further comprising:
recording logs of the automatic fault recovery process of the information system.
5. The method of claim 4, characterized by further comprising:
maintaining the BP neural network state analyzers and the matching scripts periodically, according to the logs of the information system automatic fault recovery system.
6. A system for automatic fault recovery of an information system, characterized by comprising:
an acquisition module, configured to obtain the values of the monitored state indicators of the information system;
an alert module, configured to compare the value of each monitored state indicator with its corresponding predetermined indicator range, and to determine alert information from the comparison result;
a state assessment module, configured to select, according to the alert information, the corresponding BP neural network state analyzer to perform state assessment on the alert information;
a recovery module, configured to retrieve the corresponding matching script according to the state assessment result, and to execute recovery commands according to the matching script.
7. The system of claim 6, characterized in that the state assessment module comprises:
a scope judging unit, configured to judge, according to the alert information, whether the alert information falls within the scope of the knowledge base;
a state assessment unit, configured to select, if it does, the BP neural network state analyzer corresponding to the alert information to perform state assessment on the alert information.
8. The system of claim 7, characterized in that the recovery module comprises:
a retrieval unit, configured to retrieve the corresponding matching script according to the state assessment result;
a first judging unit, configured to judge whether the number of consecutive handling attempts for the alert information exceeds the corresponding threshold;
an execution unit, configured to execute recovery commands according to the matching script if the threshold is not exceeded;
a verification unit, configured to verify whether the alert information has recovered;
if it has not recovered, the state assessment module is triggered to select, according to the alert information, the corresponding BP neural network state analyzer and perform state assessment on the alert information.
9. The system of any one of claims 6 to 8, characterized by further comprising:
a log module, configured to record logs of the automatic fault recovery process of the information system.
10. The system of claim 9, characterized by further comprising:
a maintenance module, configured to maintain the BP neural network state analyzers and the matching scripts periodically, according to the logs of the information system automatic fault recovery system.
CN201510920960.4A 2015-12-11 2015-12-11 Method and system for automatic fault recovery of information system Pending CN105550100A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510920960.4A CN105550100A (en) 2015-12-11 2015-12-11 Method and system for automatic fault recovery of information system

Publications (1)

Publication Number Publication Date
CN105550100A true CN105550100A (en) 2016-05-04

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060074604A1 (en) * 2004-09-24 2006-04-06 International Business Machines (Ibm) Corporation Identifying a state of a system using an artificial neural network generated model
CN102130783A (en) * 2011-01-24 2011-07-20 浪潮通信信息系统有限公司 Intelligent alarm monitoring method of neural network
CN103412805A (en) * 2013-07-31 2013-11-27 交通银行股份有限公司 IT (information technology) fault source diagnosis method and IT fault source diagnosis system
CN103699489A (en) * 2014-01-03 2014-04-02 中国人民解放军装甲兵工程学院 Software remote fault diagnosis and repair method based on knowledge base
CN104038373A (en) * 2014-05-30 2014-09-10 国家电网公司 Information early warning and self repairing system and method
CN104793607A (en) * 2015-04-20 2015-07-22 国家电网公司 Online fault diagnosis, health analysis and failure prediction system and online fault diagnosis, health analysis and failure prediction method for servers
CN104835103A (en) * 2015-05-11 2015-08-12 大连理工大学 Mobile network health evaluation method based on neural network and fuzzy comprehensive evaluation

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955864B (en) * 2016-04-26 2019-05-28 浪潮(北京)电子信息产业有限公司 Power failure processing method, power module, monitoring management module and server
CN105955864A (en) * 2016-04-26 2016-09-21 浪潮(北京)电子信息产业有限公司 Power supply fault processing method, power supply module, monitoring management module and server
CN107707392A (en) * 2017-09-26 2018-02-16 厦门集微科技有限公司 Passage restorative procedure and device, terminal
CN107612756A (en) * 2017-10-31 2018-01-19 广西宜州市联森网络科技有限公司 A kind of operation management system with intelligent trouble analyzing and processing function
CN107846314A (en) * 2017-10-31 2018-03-27 广西宜州市联森网络科技有限公司 A kind of intelligent operation management system
CN107862393A (en) * 2017-10-31 2018-03-30 广西宜州市联森网络科技有限公司 A kind of IT operation management system
CN108829785A (en) * 2018-05-31 2018-11-16 沈文策 The restorative procedure of bug list, device, electronic equipment and storage medium in database
CN109062082B (en) * 2018-07-17 2020-07-10 深圳市雅宝智能装备系统有限公司 Intelligent fault processing method, device and system
CN109062082A (en) * 2018-07-17 2018-12-21 深圳市万华汽车服务投资控股有限公司 A kind of intelligent trouble processing method, device and system
CN109450699A (en) * 2018-12-06 2019-03-08 合肥海诺恒信息科技有限公司 Integration firm IT operation management system and method
CN109728979A (en) * 2019-03-01 2019-05-07 国网新疆电力有限公司信息通信公司 Automatic warning system and method suitable for information O&M comprehensive supervision platform
CN110221975A (en) * 2019-05-28 2019-09-10 厦门美柚信息科技有限公司 Create the method and device of interface use-case automatic test script
CN110221975B (en) * 2019-05-28 2022-06-28 厦门美柚股份有限公司 Method and device for creating interface case automation test script
CN110389610A (en) * 2019-07-30 2019-10-29 陕西学前师范学院 Space light and temperature monitoring system based on Internet of Things
CN110569139A (en) * 2019-08-02 2019-12-13 中国船舶工业系统工程研究院 vitality guarantee system and method for information system
CN110569139B (en) * 2019-08-02 2023-04-14 中国船舶工业系统工程研究院 Vitality guarantee system and method for information system
CN111694706A (en) * 2020-05-08 2020-09-22 广州微算互联信息技术有限公司 Cloud mobile phone fault processing method and system and storage medium
CN111813605A (en) * 2020-07-20 2020-10-23 北京百度网讯科技有限公司 Disaster recovery method, platform, electronic device, and medium
CN113572637A (en) * 2021-07-16 2021-10-29 中盈优创资讯科技有限公司 Network fault automatic preprocessing method and device
WO2023104219A1 (en) * 2021-12-07 2023-06-15 广州地铁集团有限公司 Solution method based on internet of things rail transit for software and application fault self-healing
CN116048865A (en) * 2023-02-21 2023-05-02 海南电网有限责任公司信息通信分公司 Automatic verification method for failure elimination verification under automatic operation and maintenance

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160504