CN109634777A - An O&M emergency handling and rapid recovery method for a sales service system
- Publication number: CN109634777A
- Application number: CN201811302525.5A
- Authority: CN (China)
- Prior art keywords: database, data, file, application, lock
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1458—Management of the backup or restore process
- G06F11/1469—Backup restoration techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides an O&M emergency handling and rapid recovery method for a sales service system, characterized in that it covers the handling of the following situations: one, data center site disaster; two, data loss discovered; three, high CPU usage on the database host; four, slow application system performance; five, database archive files filling the storage space; six, the space usage of a mount point on a Unix system exceeding the warning threshold; seven, key business application strategy. The method standardizes the handling of emergencies so that faults are investigated and resolved quickly, in an orderly way, and efficiently. It minimizes downtime, ensures the safe and stable operation of the marketing basic data platform system, and raises the level of operation and maintenance.
Description
Technical field
The present invention relates to the technical field of power marketing, and specifically to an O&M emergency handling and rapid recovery method for a sales service system.
Background art
With the construction and continuous improvement of the large marketing system of State Grid Liaoning Electric Power Co., Ltd., the marketing analysis and aid decision-making system must be upgraded and changed in step with the marketing system in order to meet reporting requirements and integrate resources effectively. While traditional reporting business is continually strengthened, reports in the marketing analysis and aid decision-making system are updated and added. Providing informatization support for the large marketing system and ensuring the stable operation of reports in the company's aid decision-making system requires substantial manpower and reliable technical support, so that each of the company's annual indicators can meet its target.
Because of the particular nature of the business of the marketing basic data platform, maintenance of the platform system is a key technical difficulty, and existing handling methods have many shortcomings. An emergency response mechanism requires organization, technology, and process to work together so that system events can be handled in time.
On the organizational side, the power company needs to form an emergency response organization for the system. The roles in this organization mainly include a leading group, the finder of a system event, the decision maker for the handling steps, and the executor of technical actions. On the technical side, the emergency response organization needs to register the system's information assets, establish emergency technical plans for the system, and carry out emergency drills regularly.
Summary of the invention
The technical problem to be solved by the present invention is to provide an O&M emergency handling and rapid recovery method for a sales service system. It mainly concerns power marketing business and applies to emergencies in which State Grid Corporation of China's marketing basic data platform system cannot be used normally because of network failures, server performance failures, isolation device failures, and similar causes.
The technical solution adopted is as follows:
An O&M emergency handling and rapid recovery method for a sales service system, covering the handling of the following situations:
One, data center site disaster:
Phenomenon: water, fire, earthquake, and the like paralyze the physical facilities of the site; this requires the support and protection of a disaster recovery system. The production database is seriously unavailable (phenomenon, analysis, and handling are described together here);
Phenomenon and analysis:
Phenomenon: the business system stops entirely.
Analysis: on connecting to the two database hosts, the operating systems of both are found to be unavailable, or an operating system check finds that the disk array devices are all inaccessible.
Handling:
The severity of the fault must be analyzed and judged further before deciding whether to enable the query database generated by data replication as the production database.
If it is decided to use the query database, the following key steps are required:
1. Stop the database replication software process on the query database;
2. Stop the business application;
3. Change all of the business application's connection configuration for the original production database so that it accesses the query database;
4. Restart the application;
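The four switchover steps can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the patented implementation: the `AppConfig` type, the `fail_over` function, and the host names used with it are hypothetical, and the replication software is stopped through a caller-supplied callback because the text does not name the product.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical view of one business application's database settings."""
    name: str
    db_host: str
    running: bool = True


def fail_over(apps: List[AppConfig], production_host: str, query_host: str,
              stop_replication: Callable[[str], None]) -> List[AppConfig]:
    """Return the application configs after switching to the query database."""
    # Step 1: stop the replication software process on the query database.
    stop_replication(query_host)
    # Step 2: stop the business applications.
    stopped = [replace(app, running=False) for app in apps]
    # Step 3: repoint every connection that targeted the old production database.
    repointed = [
        replace(app, db_host=query_host) if app.db_host == production_host else app
        for app in stopped
    ]
    # Step 4: restart the applications.
    return [replace(app, running=True) for app in repointed]
```

Applications that never pointed at the failed production database are left unchanged, which keeps the switchover limited to the connections named in step 3.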
Two, data loss discovered:
Phenomenon and analysis:
Data loss is discovered in one of two ways:
Business personnel find data loss during operation, with two possibilities: 1. loss of a small number of rows; 2. loss of a large amount of data or even a whole table;
Loss caused by a host, database, or storage system failure.
In either case, manual intervention is required to complete recovery.
(1) Handling the loss of a small amount of data:
Application designers and developers check the data inconsistencies. The specific loss situations cannot be enumerated in this document. In general, analyze the severity and investigate whether the loss can be made up through business operations; if it can, make it up through business operations; otherwise use the database recovery method;
(2) Handling by database recovery:
Whether the loss is small or large, if the data cannot be restored through business operations, business processes, or similar means, it must be restored by database recovery;
Database recovery means using the established backup system and completing the restore with the backup software, mainly by point-in-time recovery: the data is restored to a moment before the data loss, and updates made after that moment must be re-entered manually from faithful records or supplemented with data exported from a logical backup.
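The point-in-time idea can be illustrated with a small Python sketch. It only plans the recovery; the function names and the `(timestamp, description)` record layout are assumptions made for illustration, and the actual restore is still performed by the backup software.

```python
from datetime import datetime
from typing import List, Tuple

Update = Tuple[datetime, str]  # (time of the update, short description)


def choose_backup(backup_times: List[datetime], restore_point: datetime) -> datetime:
    """Pick the most recent backup taken at or before the restore point."""
    candidates = [t for t in backup_times if t <= restore_point]
    if not candidates:
        raise ValueError("no backup taken at or before the restore point")
    return max(candidates)


def updates_to_reapply(updates: List[Update], restore_point: datetime) -> List[Update]:
    """Updates made after the restore point are not in the restored data,
    so they must be re-entered manually or taken from a logical export."""
    return [u for u in updates if u[0] > restore_point]
```

Choosing the restore point just before the loss keeps the list returned by `updates_to_reapply` as short as possible.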
Three, high CPU usage on the database host:
Phenomenon and analysis:
Macroscopic phenomenon: application response time is long, and customers report that the system is slow;
Microscopic phenomenon: running the top -h command on 10.231.XX.XX or 10.231.XX.XX shows that the average USER CPU utilization is high, as follows:
cpu states: (avg)
Detailed behavior: several oracle processes may each show CPU utilization close to 99%, or most CPUs show high utilization;
Handling:
1. Use top -h to identify the high-CPU process and obtain its process number (pid);
2. Use an SQL tool statement to find the corresponding session and the SQL it is executing;
3. Log in to the database server as administrator and kill this session;
Locks held in the database for a long time (lock waits);
Emergency steps:
1. Check the lock situation with the SQL tool;
2. Observe which session started the locking;
3. Record the current SQL of that session;
4. If the lock wait has lasted more than 10 minutes and the number of locks is small, the process can be killed;
5. If the number of locks is large and shows signs of continuing to grow, the lock should also be killed immediately;
6. Send the captured SQL to the technical expert group for analysis;
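The session lookup and the lock rules above can be sketched in Python. The text does not give the SQL statements, so the queries below are an assumption based on the standard Oracle views `v$session` and `v$process` and the `ALTER SYSTEM KILL SESSION` statement; the function names and the `few_threshold` value that separates "few" from "many" locks are hypothetical.

```python
# Assumed lookup: map an operating-system pid (from top) to an Oracle session.
SESSION_BY_PID_SQL = (
    "SELECT s.sid, s.serial#, s.sql_id "
    "FROM v$session s JOIN v$process p ON p.addr = s.paddr "
    "WHERE p.spid = :pid"
)


def kill_session_sql(sid: int, serial: int) -> str:
    """Build the statement an administrator would run to kill one session."""
    return f"ALTER SYSTEM KILL SESSION '{int(sid)},{int(serial)}'"


def lock_action(wait_minutes: float, lock_count: int, growing: bool,
                few_threshold: int = 5) -> str:
    """Apply emergency steps 4 and 5; few_threshold is an illustrative value."""
    if lock_count > few_threshold and growing:
        return "kill_immediately"   # many locks and still increasing
    if wait_minutes > 10 and lock_count <= few_threshold:
        return "kill"               # long wait, only a few locks
    return "observe"                # keep recording the SQL for the expert group
```

Casting `sid` and `serial` to `int` before formatting keeps arbitrary text out of the generated statement.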
Four, slow application system performance:
Handling:
1. Check whether there are locks on the database; if so, handle them by the corresponding technical method;
2. Check whether the database host CPU is high. If it is, find the corresponding SQL and handle the session in the way described for high database CPU;
3. Check whether the application server is abnormal (high CPU, abnormal garbage collection, wait queue > 0); if so, handle it by the method for high application server CPU;
Follow-up:
Perform a Statspack performance analysis of the database;
Analyze the exported thread results to locate possible problems in the application, and feed the results back to the technical expert group, which analyzes and remedies the defect;
Five, database archive files filling the storage space:
There are two cases. In the first, the database archive files have already filled the storage space. This must be handled immediately: log in to the server directly, move the archive files elsewhere, and then delete the corresponding archive files; if there is not enough time, delete archive files directly (take care not to delete all archives; keep at least the two most recent archive files);
In the second, it is estimated that the archive files will fill the storage space before the next database backup. In this case it is best to log in to the backup server and run a backup manually; if it is judged that there is not enough time, use the handling method for the first case;
The procedure for running a backup manually must be prepared in advance;
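The rule of keeping at least the two most recent archive files can be sketched as a selection function. This is a minimal Python sketch; the function name and the `(path, modification_time)` input layout are assumptions for illustration, and the function only selects files, it does not delete anything.

```python
from typing import Iterable, List, Tuple


def archives_to_delete(archives: Iterable[Tuple[str, float]],
                       keep_newest: int = 2) -> List[str]:
    """Return archive paths that may be removed, oldest first.

    archives holds (path, modification_time) pairs. At least the two most
    recent files are always kept, as the procedure requires.
    """
    if keep_newest < 2:
        raise ValueError("keep at least the two most recent archive files")
    ordered = sorted(archives, key=lambda item: item[1])
    return [path for path, _mtime in ordered[:-keep_newest]]
```

With two or fewer archive files the function returns an empty list, so an emergency cleanup cannot remove every archive.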
Six, handling the space usage of a mount point on a Unix system exceeding the warning threshold:
Phenomenon and analysis:
The system monitoring process reports that the space usage of a file system (mount point) on the Unix system exceeds the warning value;
Possible causes: a. files produced by manual backups; b. a large number of log files produced by application errors; c. heap files produced by operating system errors;
Handling:
1. Use the du -s * command to check the size of each directory and file under the mount point;
2. Enter the directories that take up the most space and determine the size of their directories and files in the same way;
3. Determine which cause applies. If it is a, contact the relevant personnel to determine whether the files can be deleted; if it is b or c, contact the system administrators or the relevant vendor to confirm;
4. After confirming that the files can be deleted, if time and conditions allow, first back up the files to be deleted locally;
5. When deleting with the rm command, confirm that the command options used are completely correct. When deleting in bulk with a wildcard, first use the ls command to verify that the wildcard matches as intended, to avoid mistakes;
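The size check and the "list before you delete" rule can be mirrored in a short Python sketch. It is an illustration under assumptions, not a replacement for `du`, `ls`, and `rm`: the function names are hypothetical, sizes are summed in bytes rather than in the disk blocks that `du` reports, and deletion happens only after a caller-supplied confirmation has seen the exact list of matches.

```python
import glob
import os
from typing import Callable, List, Tuple


def directory_sizes(mount_point: str) -> List[Tuple[str, int]]:
    """Size in bytes of each top-level entry, largest first (like du -s *)."""
    sizes = {}
    for entry in os.scandir(mount_point):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    try:
                        total += os.lstat(os.path.join(root, name)).st_size
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
        else:
            total = entry.stat(follow_symlinks=False).st_size
        sizes[entry.name] = total
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)


def preview_and_delete(pattern: str,
                       confirm: Callable[[List[str]], bool]) -> List[str]:
    """Expand the wildcard first, show the matches, delete only if confirmed."""
    matches = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    if matches and confirm(matches):
        for path in matches:
            os.remove(path)
        return matches
    return []
```

Passing the expanded list to `confirm` plays the role of running `ls` with the same wildcard before `rm`.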
Seven, key business application strategy:
When a database or application server failure occurs and cannot be recovered in a short time, the following emergency measures are recommended:
1. Summarize and explain the impact of the system failure on application business and how normal operation will be restored;
2. Determine the recovery order of application business according to pressure and priority level, for example service windows and premium services.
The advantages of the present invention are as follows:
The method standardizes the handling of emergencies so that faults are investigated and resolved quickly, in an orderly way, and efficiently. It minimizes downtime, ensures the safe and stable operation of the marketing basic data platform system, and raises the level of operation and maintenance.
Specific embodiment
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Embodiment:
An O&M emergency handling and rapid recovery method for a sales service system, covering the handling of the following situations:
One, data center site disaster:
Phenomenon: water, fire, earthquake, and the like paralyze the physical facilities of the site; this requires the support and protection of a disaster recovery system. The production database is seriously unavailable (phenomenon, analysis, and handling are described together here);
Phenomenon and analysis:
Phenomenon: the business system stops entirely.
Analysis: on connecting to the two database hosts, the operating systems of both are found to be unavailable, or an operating system check finds that the disk array devices are all inaccessible.
Handling:
The severity of the fault must be analyzed and judged further before deciding whether to enable the query database generated by data replication as the production database.
If it is decided to use the query database, the following key steps are required:
1. Stop the database replication software process on the query database;
2. Stop the business application;
3. Change all of the business application's connection configuration for the original production database so that it accesses the query database;
4. Restart the application;
Two, data loss discovered:
Phenomenon and analysis:
Data loss is discovered in one of two ways:
Business personnel find data loss during operation, with two possibilities: 1. loss of a small number of rows; 2. loss of a large amount of data or even a whole table;
Loss caused by a host, database, or storage system failure.
In either case, manual intervention is required to complete recovery.
(1) Handling the loss of a small amount of data:
Application designers and developers check the data inconsistencies. The specific loss situations cannot be enumerated in this document. In general, analyze the severity and investigate whether the loss can be made up through business operations; if it can, make it up through business operations; otherwise use the database recovery method;
(2) Handling by database recovery:
Whether the loss is small or large, if the data cannot be restored through business operations, business processes, or similar means, it must be restored by database recovery;
Database recovery means using the established backup system and completing the restore with the backup software, mainly by point-in-time recovery: the data is restored to a moment before the data loss, and updates made after that moment must be re-entered manually from faithful records or supplemented with data exported from a logical backup.
Three, high CPU usage on the database host:
Phenomenon and analysis:
Macroscopic phenomenon: application response time is long, and customers report that the system is slow;
Microscopic phenomenon: running the top -h command on 10.231.XX.XX or 10.231.XX.XX shows that the average USER CPU utilization is high, as follows:
cpu states: (avg)
Detailed behavior: several oracle processes may each show CPU utilization close to 99%, or most CPUs show high utilization;
Handling:
1. Use top -h to identify the high-CPU process and obtain its process number (pid);
2. Use an SQL tool statement to find the corresponding session and the SQL it is executing;
3. Log in to the database server as administrator and kill this session;
Locks held in the database for a long time (lock waits);
Emergency steps:
1. Check the lock situation with the SQL tool;
2. Observe which session started the locking;
3. Record the current SQL of that session;
4. If the lock wait has lasted more than 10 minutes and the number of locks is small, the process can be killed;
5. If the number of locks is large and shows signs of continuing to grow, the lock should also be killed immediately;
6. Send the captured SQL to the technical expert group for analysis;
Four, slow application system performance:
Handling:
1. Check whether there are locks on the database; if so, handle them by the corresponding technical method;
2. Check whether the database host CPU is high. If it is, find the corresponding SQL and handle the session in the way described for high database CPU;
3. Check whether the application server is abnormal (high CPU, abnormal garbage collection, wait queue > 0); if so, handle it by the method for high application server CPU;
Follow-up:
Perform a Statspack performance analysis of the database;
Analyze the exported thread results to locate possible problems in the application, and feed the results back to the technical expert group, which analyzes and remedies the defect;
Five, database archive files filling the storage space:
There are two cases. In the first, the database archive files have already filled the storage space. This must be handled immediately: log in to the server directly, move the archive files elsewhere, and then delete the corresponding archive files; if there is not enough time, delete archive files directly (take care not to delete all archives; keep at least the two most recent archive files);
In the second, it is estimated that the archive files will fill the storage space before the next database backup. In this case it is best to log in to the backup server and run a backup manually; if it is judged that there is not enough time, use the handling method for the first case;
The procedure for running a backup manually must be prepared in advance;
Six, handling the space usage of a mount point on a Unix system exceeding the warning threshold:
Phenomenon and analysis:
The system monitoring process reports that the space usage of a file system (mount point) on the Unix system exceeds the warning value;
Possible causes: a. files produced by manual backups; b. a large number of log files produced by application errors; c. heap files produced by operating system errors;
Handling:
1. Use the du -s * command to check the size of each directory and file under the mount point;
2. Enter the directories that take up the most space and determine the size of their directories and files in the same way;
3. Determine which cause applies. If it is a, contact the relevant personnel to determine whether the files can be deleted; if it is b or c, contact the system administrators or the relevant vendor to confirm;
4. After confirming that the files can be deleted, if time and conditions allow, first back up the files to be deleted locally;
5. When deleting with the rm command, confirm that the command options used are completely correct. When deleting in bulk with a wildcard, first use the ls command to verify that the wildcard matches as intended, to avoid mistakes;
Seven, key business application strategy:
When a database or application server failure occurs and cannot be recovered in a short time, the following emergency measures are recommended:
1. Summarize and explain the impact of the system failure on application business and how normal operation will be restored;
2. Determine the recovery order of application business according to pressure and priority level, for example service windows and premium services.
Although the present invention is disclosed above by way of a preferred embodiment, this is not intended to limit the scope of its implementation. Any person of ordinary skill in the art may make minor improvements without departing from the scope of the invention; all equivalent improvements made according to the present invention are covered by its scope.
Claims (1)
1. An O&M emergency handling and rapid recovery method for a sales service system, characterized in that it covers the handling of the following situations:
One, data center site disaster:
The severity of the fault must be analyzed and judged further before deciding whether to enable the query database generated by data replication as the production database;
If it is decided to use the query database, the following key steps are required:
(1) Stop the database replication software process on the query database;
(2) Stop the business application;
(3) Change all of the business application's connection configuration for the original production database so that it accesses the query database;
(4) Restart the application;
Two, data loss discovered:
(1) Handling the loss of a small amount of data:
Application designers and developers check the data inconsistencies. The specific loss situations cannot be enumerated in this document. In general, analyze the severity and investigate whether the loss can be made up through business operations; if it can, make it up through business operations; otherwise use the database recovery method;
(2) Handling by database recovery:
Whether the loss is small or large, if the data cannot be restored through business operations, business processes, or similar means, it must be restored by database recovery;
Three, high CPU usage on the database host:
(1) Use top -h to identify the high-CPU process and obtain its process number (pid);
(2) Use an SQL tool statement to find the corresponding session and the SQL it is executing;
(3) Log in to the database server as administrator and kill this session;
(4) Locks held in the database for a long time (lock waits);
Emergency steps:
(1) Check the lock situation with the SQL tool;
(2) Observe which session started the locking;
(3) Record the current SQL of that session;
(4) If the lock wait has lasted more than 10 minutes and the number of locks is small, the process can be killed;
(5) If the number of locks is large and shows signs of continuing to grow, the lock should also be killed immediately;
(6) Send the captured SQL to the technical expert group for analysis;
Four, slow application system performance:
(1) Check whether there are locks on the database; if so, handle them by the corresponding technical method;
(2) Check whether the database host CPU is high. If it is, find the corresponding SQL and handle the session in the way described for high database CPU;
(3) Check whether the application server is abnormal (high CPU, abnormal garbage collection, wait queue > 0); if so, handle it by the method for high application server CPU;
(4) Follow-up: perform a Statspack performance analysis of the database; analyze the exported thread results to locate possible problems in the application, and feed the results back to the technical expert group, which analyzes and remedies the defect;
Five, database archive files filling the storage space:
There are two cases. In the first, the database archive files have already filled the storage space. This must be handled immediately: log in to the server directly, move the archive files elsewhere, and then delete the corresponding archive files; if there is not enough time, delete archive files directly;
In the second, it is estimated that the archive files will fill the storage space before the next database backup. In this case it is best to log in to the backup server and run a backup manually; if it is judged that there is not enough time, use the handling method for the first case;
The procedure for running a backup manually must be prepared in advance;
Six, handling the space usage of a mount point on a Unix system exceeding the warning threshold:
(1) Use the du -s * command to check the size of each directory and file under the mount point;
(2) Enter the directories that take up the most space and determine the size of their directories and files in the same way;
(3) Determine which cause applies. If it is a, contact the relevant personnel to determine whether the files can be deleted; if it is b or c, contact the system administrators or the relevant vendor to confirm;
(4) After confirming that the files can be deleted, if time and conditions allow, first back up the files to be deleted locally;
(5) When deleting with the rm command, confirm that the command options used are completely correct. When deleting in bulk with a wildcard, first use the ls command to verify that the wildcard matches as intended, to avoid mistakes;
Seven, key business application strategy:
When a database or application server failure occurs and cannot be recovered in a short time, the following emergency measures are recommended:
(1) Summarize and explain the impact of the system failure on application business and how normal operation will be restored;
(2) Determine the recovery order of application business according to pressure and priority level, for example service windows and premium services.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811302525.5A CN109634777A (en) | 2018-11-02 | 2018-11-02 | A kind of sales service system O&M emergency disposal and quick recovery method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811302525.5A CN109634777A (en) | 2018-11-02 | 2018-11-02 | A kind of sales service system O&M emergency disposal and quick recovery method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109634777A true CN109634777A (en) | 2019-04-16 |
Family
ID=66067183
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811302525.5A Pending CN109634777A (en) | 2018-11-02 | 2018-11-02 | A kind of sales service system O&M emergency disposal and quick recovery method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109634777A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112347175A (en) * | 2020-11-11 | 2021-02-09 | 欧冶云商股份有限公司 | Cross-database remote measurement self-healing method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11175982B2 (en) | Remote monitoring and error correcting within a data storage system | |
KR101856543B1 (en) | Failure prediction system based on artificial intelligence | |
US8863224B2 (en) | System and method of managing data protection resources | |
US8126848B2 (en) | Automated method for identifying and repairing logical data discrepancies between database replicas in a database cluster | |
JP5148607B2 (en) | Automation of standard operating procedures in database management | |
US9411969B2 (en) | System and method of assessing data protection status of data protection resources | |
CN105099783B (en) | A kind of method and system for realizing operation system alarm emergency disposal automation | |
US20080155091A1 (en) | Remote monitoring in a computer network | |
CN105955662A (en) | Method and system for expansion of K-DB data table space | |
CN113515499A (en) | Database service method and system | |
EP3202091B1 (en) | Operation of data network | |
CN108809729A (en) | The fault handling method and device that CTDB is serviced in a kind of distributed system | |
CN109634777A (en) | A kind of sales service system O&M emergency disposal and quick recovery method | |
US8056052B2 (en) | Populating service requests | |
CN105550094B (en) | A kind of high-availability system state automatic monitoring method | |
JP2017211722A (en) | Application support program, application support device and application support method | |
JP3992029B2 (en) | Object management method | |
Mukherjee et al. | Challenges of DB2 restore in a distributed systems environment and engineered solutions | |
CN106850305A (en) | A kind of IT operation management method and device | |
TWI690810B (en) | Database management system and database management method | |
CN115033649A (en) | Fault processing method, device, equipment and storage medium based on report development | |
CN116483672A (en) | Oracle database performance index acquisition method and device | |
CN107016101A (en) | Data managing method, apparatus and system | |
CN116701423A (en) | Method, device, equipment and storage medium for updating operation logic library | |
CN117851122A (en) | Disaster recovery backup and recovery system of power information system in cloud environment |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190416 |