CN109634777A - A kind of sales service system O&M emergency disposal and quick recovery method - Google Patents

A kind of sales service system O&M emergency disposal and quick recovery method

Info

Publication number
CN109634777A
Authority
CN
China
Prior art keywords
database
data
file
application
lock
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811302525.5A
Other languages
Chinese (zh)
Inventor
胡楠
杜红军
刘树吉
乔林
刘颖
孙宝华
刘为
吴赫
周巧妮
徐立波
冉冉
李云鹏
李东洋
于元旗
曲睿婷
周大鹏
胡非
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Telecommunication Branch of State Grid Liaoning Electric Power Co Ltd
Original Assignee
Information and Telecommunication Branch of State Grid Liaoning Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Telecommunication Branch of State Grid Liaoning Electric Power Co Ltd
Priority to CN201811302525.5A
Publication of CN109634777A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G06F11/1469 Backup restoration techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an O&M emergency handling and rapid recovery method for a sales service system, characterized in that it covers the handling of the following situations: I. data center site disaster; II. data loss is discovered; III. high CPU usage on the database host; IV. slow application system performance; V. database archive files fill the storage space; VI. the space usage of a mount point on the UNIX system exceeds the early-warning threshold; VII. key business application strategy. The method standardizes the handling process for emergencies, enables rapid, orderly, and efficient fault investigation and resolution, shortens downtime as far as possible, ensures the safe and stable operation of the marketing basic data platform system, and raises the level of operation and maintenance.

Description

A kind of sales service system O&M emergency disposal and quick recovery method
Technical field
The present invention relates to the technical field of power marketing, and specifically to an O&M emergency handling and rapid recovery method for a sales service system.
Background art
As State Grid Liaoning Electric Power Co., Ltd. builds out and continuously improves its large-scale marketing system, the marketing analysis and decision-support system must be upgraded and changed in step with the marketing system in order to meet reporting requirements and to integrate resources effectively. While traditional reporting is continually strengthened, reports in the marketing analysis and decision-support system are updated and new reports are added. Providing information technology support for the large-scale marketing system and keeping the reports of the company's decision-support system running stably requires substantial manpower and reliable technical support, so that the company can meet each of its annual targets.
Because of the particular nature of the business served by the marketing basic data platform, maintaining the platform system is a key technical difficulty, and existing handling methods have many shortcomings. An emergency response mechanism requires organization, technology, and process to work together so that system events are handled in time.
On the organizational side, the power company needs to form an emergency response organization for the system. Its roles mainly include a leading group, those who discover system events, those who decide on handling steps, and those who carry out the technical actions. On the technical side, the response organization needs to register the system's information assets, establish emergency technical plans, and hold regular emergency drills.
Summary of the invention
The technical problem to be solved by the present invention is to provide an O&M emergency handling and rapid recovery method for a sales service system. It mainly concerns the power marketing business and applies to emergencies in which State Grid Corporation of China's marketing basic data platform system cannot be used normally for reasons such as network failure, server performance failure, or isolation device failure.
The technical solution adopted is as follows:
An O&M emergency handling and rapid recovery method for a sales service system, comprising the handling of the following situations:
I. Data center site disaster:
Phenomenon: flood, fire, earthquake, or a similar event paralyzes the physical facilities of the site. This case requires the support and protection of a disaster recovery system. The production database is seriously unavailable (the phenomenon, analysis, and handling are covered together below);
Phenomenon and analysis:
Phenomenon: the business system stalls as a whole.
Analysis: connections to the operating systems of both database hosts show that neither is available, or a check through the operating system finds that none of the disk array devices can be accessed.
Handling:
The severity of the fault must be analyzed and judged further before deciding whether to enable the query database generated by data replication as the production database.
If the decision is to use the query database, the following key steps are required:
1. Stop the database replication software's processes on the query database;
2. Stop the business applications;
3. Change every connection configuration through which the business applications access the original production database so that they access the query database instead;
4. Restart the applications;
II. Data loss is discovered:
Phenomenon and analysis:
Data loss is discovered in one of two situations:
Data loss is found while business staff are working. There are two possibilities: 1. a small number of rows is lost; 2. a large amount of data, or even an entire table, is lost;
Data is lost because of a host, database, or storage system failure. In either situation, manual intervention is required to complete recovery.
(1) Handling method for the loss of a small amount of data:
Application designers and developers check for data inconsistencies. The specific loss scenarios cannot all be described in this document. In general, analyze the severity and investigate whether the loss can be made up through business operations. If it can, complete the fix through business operations; otherwise, use the database recovery handling method;
(2) Handling method through database recovery:
Whether the loss is small or large, if the data cannot be restored through business operations, business processes, or similar means, recovery must be completed through the database recovery handling method;
Database recovery means using the established backup system to complete recovery through the backup software, mainly with point-in-time recovery: the data is restored to a moment before the loss occurred, and data updated after that moment must be re-entered by hand from faithful records or supplemented with data exported from a logical backup.
III. High CPU usage on the database host:
Phenomenon and analysis:
Macroscopic phenomenon: application response times are long, and customers report that the system is slow;
Microscopic phenomenon: running the top -h command on 10.231.XX.XX or 10.231.XX.XX shows the following;
The average (avg) CPU utilization for USER is high, as follows:
cpu states: (avg)
Detailed behavior: several oracle processes may each have CPU utilization close to 99%, or CPU utilization may be high for most processes;
Handling method:
1. Use top -h to identify the processes with high CPU usage and obtain the process ID (pid);
2. Use SQL tool statements to find the corresponding session and the SQL it is executing;
3. Log in to the database server as an administrator and kill this session;
Locks persist in the database for a long time (lock waits);
Emergency steps:
1. Check the lock situation with the SQL tool;
2. Observe which session initiated the lock;
3. Record the session's current SQL;
4. If the lock wait exceeds 10 minutes and the number of locks is small, the process can be killed;
5. If the number of locks is large and shows signs of continuing to grow, the locks should also be killed immediately;
6. Send the captured SQL to the technical expert group for analysis;
IV. Slow application system performance:
Handling method:
1. Check whether there are locks on the database; if there are, handle them according to the corresponding technical method;
2. Check whether CPU usage on the database host is high. If it is, look up the corresponding SQL and handle the session in the same way as for high database CPU usage;
3. Check whether the application server is abnormal (high CPU usage, abnormal garbage collection, wait queue > 0); if so, handle it according to the method for high application server CPU usage;
Follow-up:
Perform a Statspack performance analysis of the database;
Analyze the exported thread results to locate possible problems in the application, and report the results to the technical expert group, which analyzes and remedies the defect;
V. Database archive files fill the storage space:
There are two cases. In the first, the database archive files have already filled the storage space. This must be handled at once: log in to the server directly, move the archive files elsewhere, and then delete the corresponding archive files; if there is not enough time, delete archive files directly (take care not to delete all of the archives; keep at least the two most recent archive files);
In the second case, it is estimated that the archive files will fill the storage space before the next database backup. Here it is best to log in to the backup server and run a backup manually; if it is judged that there is not enough time, use the handling method of the first case;
The procedure for running a backup manually must be prepared in advance;
VI. Handling method when the space usage of a mount point on the UNIX system exceeds the early-warning threshold:
Phenomenon and analysis:
The system monitoring process reports that the space usage of a file system (mount point) on the UNIX system exceeds the warning value;
Possible causes: a. files produced by manual backups; b. application errors producing a large number of log files; c. operating system errors producing heap files;
Handling method:
1. Use the du -s * command under the mount point to check the size of each directory and file;
2. Enter the directories that occupy the most space and determine the sizes of their directories and files in the same way;
3. Determine which cause applies. If it is a, contact the relevant personnel to confirm whether the files can be deleted; if it is b or c, contact the system administrators or the relevant vendor for confirmation;
4. After confirming that deletion is allowed, first back up the files to be deleted to a local location if time and conditions permit;
5. When deleting with the rm command, confirm that the command options used are entirely correct. When deleting in bulk with wildcards, first verify the wildcard usage with the ls command to avoid mistakes;
VII. Key business application strategy:
When a database or application server fails and cannot be restored within a short time, the following emergency measures are recommended:
1. Summarize and explain the impact of the system failure on the application business and how normal operation will be restored;
2. Set the recovery order of the application business according to factors such as urgency, priority level, service windows, and service quality.
The advantages of the present invention are:
It standardizes the handling process for emergencies, enables rapid, orderly, and efficient fault investigation and resolution, shortens downtime as far as possible, ensures the safe and stable operation of the marketing basic data platform system, and raises the level of operation and maintenance.
Specific embodiment
To make the purpose, technical solution, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Embodiment:
An O&M emergency handling and rapid recovery method for a sales service system, comprising the handling of the following situations:
I. Data center site disaster:
Phenomenon: flood, fire, earthquake, or a similar event paralyzes the physical facilities of the site. This case requires the support and protection of a disaster recovery system. The production database is seriously unavailable (the phenomenon, analysis, and handling are covered together below);
Phenomenon and analysis:
Phenomenon: the business system stalls as a whole.
Analysis: connections to the operating systems of both database hosts show that neither is available, or a check through the operating system finds that none of the disk array devices can be accessed.
Handling:
The severity of the fault must be analyzed and judged further before deciding whether to enable the query database generated by data replication as the production database.
If the decision is to use the query database, the following key steps are required:
1. Stop the database replication software's processes on the query database;
2. Stop the business applications;
3. Change every connection configuration through which the business applications access the original production database so that they access the query database instead;
4. Restart the applications;
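Step 3 of the switch-over above can be illustrated with a small sketch. It assumes, purely for illustration, that each application's database connection is held as a simple name-to-address mapping; the function name `repoint_connections` and the addresses in the usage note are hypothetical and do not come from the patent.

```python
from typing import Dict


def repoint_connections(
    configs: Dict[str, str], production_dsn: str, query_dsn: str
) -> Dict[str, str]:
    """Return a copy of the connection map with every reference to the
    production database replaced by the query database.

    Entries that point elsewhere are left untouched, and the input map is
    not modified, so the original configuration can be kept for rollback.
    """
    return {
        app: (query_dsn if dsn == production_dsn else dsn)
        for app, dsn in configs.items()
    }
```

For example, `repoint_connections({"billing": "prod-db"}, "prod-db", "query-db")` yields a map in which `billing` points at `query-db`. In a real switch-over this would run only after the replication processes and the applications have been stopped (steps 1 and 2) and before the applications are restarted (step 4).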
II. Data loss is discovered:
Phenomenon and analysis:
Data loss is discovered in one of two situations:
Data loss is found while business staff are working. There are two possibilities: 1. a small number of rows is lost; 2. a large amount of data, or even an entire table, is lost;
Data is lost because of a host, database, or storage system failure. In either situation, manual intervention is required to complete recovery.
(1) Handling method for the loss of a small amount of data:
Application designers and developers check for data inconsistencies. The specific loss scenarios cannot all be described in this document. In general, analyze the severity and investigate whether the loss can be made up through business operations. If it can, complete the fix through business operations; otherwise, use the database recovery handling method;
(2) Handling method through database recovery:
Whether the loss is small or large, if the data cannot be restored through business operations, business processes, or similar means, recovery must be completed through the database recovery handling method;
Database recovery means using the established backup system to complete recovery through the backup software, mainly with point-in-time recovery: the data is restored to a moment before the loss occurred, and data updated after that moment must be re-entered by hand from faithful records or supplemented with data exported from a logical backup.
III. High CPU usage on the database host:
Phenomenon and analysis:
Macroscopic phenomenon: application response times are long, and customers report that the system is slow;
Microscopic phenomenon: running the top -h command on 10.231.XX.XX or 10.231.XX.XX shows the following;
The average (avg) CPU utilization for USER is high, as follows:
cpu states: (avg)
Detailed behavior: several oracle processes may each have CPU utilization close to 99%, or CPU utilization may be high for most processes;
Handling method:
1. Use top -h to identify the processes with high CPU usage and obtain the process ID (pid);
2. Use SQL tool statements to find the corresponding session and the SQL it is executing;
3. Log in to the database server as an administrator and kill this session;
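The patent does not give the SQL used in steps 2 and 3. Since the text refers to oracle processes, the following sketch assumes an Oracle database and uses the standard `v$process` and `v$session` views together with `ALTER SYSTEM KILL SESSION`; the constant and function names are hypothetical, and the statements are only built as strings here, not executed.

```python
# Query that maps an operating-system process ID (the pid obtained from top)
# to the database session it serves. :spid is a bind variable.
FIND_SESSION_SQL = """
SELECT s.sid, s.serial#, s.username, s.sql_id
FROM v$session s
JOIN v$process p ON s.paddr = p.addr
WHERE p.spid = :spid
"""


def kill_session_statement(sid: int, serial: int) -> str:
    """Build the statement an administrator would run to end one session.

    Both values are forced to integers so that nothing else can be
    injected into the statement text.
    """
    sid, serial = int(sid), int(serial)
    if sid < 0 or serial < 0:
        raise ValueError("sid and serial must be non-negative")
    return f"ALTER SYSTEM KILL SESSION '{sid},{serial}' IMMEDIATE"
```

The `sql_id` returned by the query identifies the statement the session is running, which is what would be recorded and passed on for analysis before the session is killed.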
Locks persist in the database for a long time (lock waits);
Emergency steps:
1. Check the lock situation with the SQL tool;
2. Observe which session initiated the lock;
3. Record the session's current SQL;
4. If the lock wait exceeds 10 minutes and the number of locks is small, the process can be killed;
5. If the number of locks is large and shows signs of continuing to grow, the locks should also be killed immediately;
6. Send the captured SQL to the technical expert group for analysis;
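The decision rules in steps 4 and 5 can be written down as a small function. The 10-minute limit comes from the text; the boundary between a "small" and a "large" number of locks is not stated in the patent, so `many_threshold` below is an assumed, adjustable parameter, and the function name is hypothetical.

```python
def should_kill(
    wait_minutes: float,
    lock_count: int,
    growing: bool,
    many_threshold: int = 10,  # assumption: the source gives no number
) -> bool:
    """Return True when the emergency steps call for killing the blocker."""
    many = lock_count >= many_threshold
    # Step 5: many locks that keep increasing are killed immediately.
    if many and growing:
        return True
    # Step 4: a long wait with only a few locks may be killed.
    if wait_minutes > 10 and not many:
        return True
    # Any other combination is not covered by the text; leave it to the operator.
    return False
```

In either case the session's current SQL would be recorded first (step 3) so that it can be sent to the technical expert group (step 6).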
IV. Slow application system performance:
Handling method:
1. Check whether there are locks on the database; if there are, handle them according to the corresponding technical method;
2. Check whether CPU usage on the database host is high. If it is, look up the corresponding SQL and handle the session in the same way as for high database CPU usage;
3. Check whether the application server is abnormal (high CPU usage, abnormal garbage collection, wait queue > 0); if so, handle it according to the method for high application server CPU usage;
Follow-up:
Perform a Statspack performance analysis of the database;
Analyze the exported thread results to locate possible problems in the application, and report the results to the technical expert group, which analyzes and remedies the defect;
V. Database archive files fill the storage space:
There are two cases. In the first, the database archive files have already filled the storage space. This must be handled at once: log in to the server directly, move the archive files elsewhere, and then delete the corresponding archive files; if there is not enough time, delete archive files directly (take care not to delete all of the archives; keep at least the two most recent archive files);
In the second case, it is estimated that the archive files will fill the storage space before the next database backup. Here it is best to log in to the backup server and run a backup manually; if it is judged that there is not enough time, use the handling method of the first case;
The procedure for running a backup manually must be prepared in advance;
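The rule of keeping at least the two most recent archive files can be made explicit with a short sketch. It assumes the archive files are available as (name, modification time) pairs; the function name is hypothetical, and the function only selects candidates without deleting anything.

```python
from typing import List, Tuple


def archives_safe_to_delete(
    files: List[Tuple[str, float]], keep: int = 2
) -> List[str]:
    """Return the archive files that may be removed, newest-first order
    preserved, while always retaining the `keep` most recent ones.

    Each input item is (file name, modification time in seconds).
    """
    if keep < 0:
        raise ValueError("keep must not be negative")
    newest_first = sorted(files, key=lambda item: item[1], reverse=True)
    return [name for name, _mtime in newest_first[keep:]]
```

With two or fewer archive files the result is empty, so the emergency clean-up can never remove the whole set.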
VI. Handling method when the space usage of a mount point on the UNIX system exceeds the early-warning threshold:
Phenomenon and analysis:
The system monitoring process reports that the space usage of a file system (mount point) on the UNIX system exceeds the warning value;
Possible causes: a. files produced by manual backups; b. application errors producing a large number of log files; c. operating system errors producing heap files;
Handling method:
1. Use the du -s * command under the mount point to check the size of each directory and file;
2. Enter the directories that occupy the most space and determine the sizes of their directories and files in the same way;
3. Determine which cause applies. If it is a, contact the relevant personnel to confirm whether the files can be deleted; if it is b or c, contact the system administrators or the relevant vendor for confirmation;
4. After confirming that deletion is allowed, first back up the files to be deleted to a local location if time and conditions permit;
5. When deleting with the rm command, confirm that the command options used are entirely correct. When deleting in bulk with wildcards, first verify the wildcard usage with the ls command to avoid mistakes;
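Steps 1 and 5 can be approximated in a script, shown here as a rough sketch rather than a replacement for the commands named in the text. `entry_sizes` sums file sizes under each top-level entry, which is similar in spirit to `du -s *` but reports apparent bytes rather than disk blocks; `preview_matches` lists what a wildcard would match, in the way `ls` is used before `rm`. Both function names are hypothetical.

```python
import fnmatch
import os
from typing import Dict, List


def entry_sizes(mount_point: str) -> Dict[str, int]:
    """Total apparent size in bytes of each top-level file or directory."""
    sizes: Dict[str, int] = {}
    for entry in os.scandir(mount_point):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    path = os.path.join(root, name)
                    if not os.path.islink(path):  # do not follow symlinks
                        total += os.path.getsize(path)
            sizes[entry.name] = total
        else:
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes


def preview_matches(directory: str, pattern: str) -> List[str]:
    """List the names a wildcard pattern matches, without deleting anything.

    Note: unlike a shell glob, fnmatch also matches names starting with a dot.
    """
    return sorted(
        name for name in os.listdir(directory) if fnmatch.fnmatch(name, pattern)
    )
```

Sorting the result of `entry_sizes` by value gives the directories to descend into in step 2, and reviewing the output of `preview_matches` before any deletion serves the same purpose as the `ls` check in step 5.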
VII. Key business application strategy:
When a database or application server fails and cannot be restored within a short time, the following emergency measures are recommended:
1. Summarize and explain the impact of the system failure on the application business and how normal operation will be restored;
2. Set the recovery order of the application business according to factors such as urgency, priority level, service windows, and service quality.
Although the present invention is disclosed above by way of a preferred embodiment, this is not intended to limit the scope of its implementation. Any person of ordinary skill in the art may make minor improvements without departing from the scope of the invention; that is, all equivalent improvements made according to the present invention shall be covered by the scope of the present invention.

Claims (1)

1. An O&M emergency handling and rapid recovery method for a sales service system, characterized in that it comprises the handling of the following situations:
I. Data center site disaster:
The severity of the fault must be analyzed and judged further before deciding whether to enable the query database generated by data replication as the production database;
If the decision is to use the query database, the following key steps are required:
(1) stop the database replication software's processes on the query database;
(2) stop the business applications;
(3) change every connection configuration through which the business applications access the original production database so that they access the query database instead;
(4) restart the applications;
II. Data loss is discovered:
(1) Handling method for the loss of a small amount of data:
Application designers and developers check for data inconsistencies. The specific loss scenarios cannot all be described in this document. In general, analyze the severity and investigate whether the loss can be made up through business operations. If it can, complete the fix through business operations; otherwise, use the database recovery handling method;
(2) Handling method through database recovery:
Whether the loss is small or large, if the data cannot be restored through business operations, business processes, or similar means, recovery must be completed through the database recovery handling method;
III. High CPU usage on the database host:
(1) use top -h to identify the processes with high CPU usage and obtain the process ID (pid);
(2) use SQL tool statements to find the corresponding session and the SQL it is executing;
(3) log in to the database server as an administrator and kill this session;
(4) locks persist in the database for a long time (lock waits);
Emergency steps:
(1) check the lock situation with the SQL tool;
(2) observe which session initiated the lock;
(3) record the session's current SQL;
(4) if the lock wait exceeds 10 minutes and the number of locks is small, the process can be killed;
(5) if the number of locks is large and shows signs of continuing to grow, the locks should also be killed immediately;
(6) send the captured SQL to the technical expert group for analysis;
IV. Slow application system performance:
(1) check whether there are locks on the database; if there are, handle them according to the corresponding technical method;
(2) check whether CPU usage on the database host is high. If it is, look up the corresponding SQL and handle the session in the same way as for high database CPU usage;
(3) check whether the application server is abnormal (high CPU usage, abnormal garbage collection, wait queue > 0); if so, handle it according to the method for high application server CPU usage;
(4) follow-up: perform a Statspack performance analysis of the database; analyze the exported thread results to locate possible problems in the application, and report the results to the technical expert group, which analyzes and remedies the defect;
V. Database archive files fill the storage space:
There are two cases. In the first, the database archive files have already filled the storage space. This must be handled at once: log in to the server directly, move the archive files elsewhere, and then delete the corresponding archive files; if there is not enough time, delete archive files directly;
In the second case, it is estimated that the archive files will fill the storage space before the next database backup. Here it is best to log in to the backup server and run a backup manually; if it is judged that there is not enough time, use the handling method of the first case;
The procedure for running a backup manually must be prepared in advance;
VI. Handling method when the space usage of a mount point on the UNIX system exceeds the early-warning threshold:
(1) use the du -s * command under the mount point to check the size of each directory and file;
(2) enter the directories that occupy the most space and determine the sizes of their directories and files in the same way;
(3) determine which cause applies. If it is a, contact the relevant personnel to confirm whether the files can be deleted; if it is b or c, contact the system administrators or the relevant vendor for confirmation;
(4) after confirming that deletion is allowed, first back up the files to be deleted to a local location if time and conditions permit;
(5) when deleting with the rm command, confirm that the command options used are entirely correct. When deleting in bulk with wildcards, first verify the wildcard usage with the ls command to avoid mistakes;
VII. Key business application strategy:
When a database or application server fails and cannot be restored within a short time, the following emergency measures are recommended:
(1) summarize and explain the impact of the system failure on the application business and how normal operation will be restored;
(2) set the recovery order of the application business according to factors such as urgency, priority level, service windows, and service quality.
CN201811302525.5A 2018-11-02 2018-11-02 A kind of sales service system O&M emergency disposal and quick recovery method Pending CN109634777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811302525.5A CN109634777A (en) 2018-11-02 2018-11-02 A kind of sales service system O&M emergency disposal and quick recovery method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811302525.5A CN109634777A (en) 2018-11-02 2018-11-02 A kind of sales service system O&M emergency disposal and quick recovery method

Publications (1)

Publication Number Publication Date
CN109634777A true CN109634777A (en) 2019-04-16

Family

ID=66067183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811302525.5A Pending CN109634777A (en) 2018-11-02 2018-11-02 A kind of sales service system O&M emergency disposal and quick recovery method

Country Status (1)

Country Link
CN (1) CN109634777A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347175A (en) * 2020-11-11 2021-02-09 欧冶云商股份有限公司 Cross-database remote measurement self-healing method and system

Similar Documents

Publication Publication Date Title
US11175982B2 (en) Remote monitoring and error correcting within a data storage system
KR101856543B1 (en) Failure prediction system based on artificial intelligence
US8863224B2 (en) System and method of managing data protection resources
US8126848B2 (en) Automated method for identifying and repairing logical data discrepancies between database replicas in a database cluster
JP5148607B2 (en) Automation of standard operating procedures in database management
US9411969B2 (en) System and method of assessing data protection status of data protection resources
CN105099783B (en) A kind of method and system for realizing operation system alarm emergency disposal automation
US20080155091A1 (en) Remote monitoring in a computer network
CN105955662A (en) Method and system for expansion of K-DB data table space
CN113515499A (en) Database service method and system
EP3202091B1 (en) Operation of data network
CN108809729A (en) The fault handling method and device that CTDB is serviced in a kind of distributed system
CN109634777A (en) A kind of sales service system O&M emergency disposal and quick recovery method
US8056052B2 (en) Populating service requests
CN105550094B (en) A kind of high-availability system state automatic monitoring method
JP2017211722A (en) Application support program, application support device and application support method
JP3992029B2 (en) Object management method
Mukherjee et al. Challenges of DB2 restore in a distributed systems environment and engineered solutions
CN106850305A (en) A kind of IT operation management method and device
TWI690810B (en) Database management system and database management method
CN115033649A (en) Fault processing method, device, equipment and storage medium based on report development
CN116483672A (en) Oracle database performance index acquisition method and device
CN107016101A (en) Data managing method, apparatus and system
CN116701423A (en) Method, device, equipment and storage medium for updating operation logic library
CN117851122A (en) Disaster recovery backup and recovery system of power information system in cloud environment

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190416