CN113986594A - Method, system, storage medium and server for real-time database fault recovery - Google Patents


Info

Publication number
CN113986594A
CN113986594A
Authority
CN
China
Prior art keywords
service process
service
shared memory
real
data
Prior art date
Legal status
Pending
Application number
CN202111264932.3A
Other languages
Chinese (zh)
Inventor
何清
王毅
王奕飞
谢贝贝
何新
Current Assignee
Xian Thermal Power Research Institute Co Ltd
Xian TPRI Power Station Information Technology Co Ltd
Original Assignee
Xian Thermal Power Research Institute Co Ltd
Xian TPRI Power Station Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xian Thermal Power Research Institute Co Ltd, Xian TPRI Power Station Information Technology Co Ltd filed Critical Xian Thermal Power Research Institute Co Ltd
Priority to CN202111264932.3A
Publication of CN113986594A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757 Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022 Mechanisms to release resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Retry When Errors Occur (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Hardware Redundancy (AREA)

Abstract

A method, a system, a storage medium and a server for real-time database fault recovery are provided. The method comprises the following steps: starting a watchdog service process of a real-time database and initializing a shared memory; starting and initializing the other service processes of the real-time database, which comprise a communication service process, a basic service process, a snapshot service process and a historical service process; the other service processes apply to the watchdog service process for the shared memory that holds hot-standby data, and read and write data in it; the other service processes periodically update their running state to the shared memory; if the running state of a service process has not been updated within the timeout period, that service process is restarted; and the restarted service process retrieves data from the shared-memory hosting process for field recovery. The invention achieves fast service restart and data recovery, and greatly improves the service stability and data security of the real-time database.

Description

Method, system, storage medium and server for real-time database fault recovery
Technical Field
The invention belongs to the technical field of real-time database development, and particularly relates to a method, a system, a storage medium and a server for real-time database fault recovery.
Background
The implementation basis of every country's manufacturing innovation strategy is the collection and feature analysis of industrial big data and the environment this builds for future manufacturing systems. The real-time database is a core service of industrial big data and the foundation of Industry 4.0.
Power generation enterprises need to collect massive real-time production data from industrial control systems such as DCS and auxiliary control systems and store it in a real-time database. The safety requirements on the data stored in the database are very high, and losing stored data because of software defects is not acceptable.
To improve the storage performance for massive real-time production data, a real-time database generally adopts a data caching strategy. As a result, data that has nominally been stored may not yet be truly archived; if the service process crashes at that moment, the data is permanently lost. Moreover, when a service of the real-time database crashes and is restarted, it may need to load a large amount of basic data from disk for initialization; the slow startup prevents the service from recovering quickly and causes a long service interruption.
Disclosure of Invention
The invention aims to provide a method, a system, a storage medium and a server for real-time database fault recovery that solve the problems of cached data loss and slow startup initialization after a service process of the real-time database crashes.
To achieve this aim, the invention adopts the following technical solutions:
in a first aspect, a method for real-time database failure recovery is provided, which includes the following steps:
starting a watchdog service process of a real-time database, and initializing a shared memory;
starting and initializing other service processes of a real-time database, wherein the other service processes of the real-time database comprise a communication service process, a basic service process, a snapshot service process and a historical service process;
the other service processes apply to the watchdog service process for the shared memory that holds hot-standby data, and read and write data in it;
the other service processes periodically update their process running state to the shared memory;
if the running state of a service process has not been updated within the timeout period, that service process is restarted;
and the restarted service process retrieves data from the shared-memory hosting process for field recovery.
As a preferred scheme of the method of the invention, the watchdog service process, on the one hand, monitors the other service processes and automatically restarts the corresponding service process if an abnormality or crash is found; on the other hand, it manages all the shared memory that holds hot-standby data, so that the shared memory is not reclaimed when another service process exits abnormally;
the watchdog service process can itself be monitored in the reverse direction and is restarted if an abnormality or crash is found.
As a preferred scheme of the method of the present invention, the communication service process forwards each API call to the corresponding service process for processing, and returns the response message to the API caller.
As a preferred scheme of the method, the basic service process loads the measuring point table and stores it in shared memory as a Hash table for the other service processes to query.
As a preferred scheme of the method of the present invention, the snapshot service process implements caching and compression of snapshot data. At least one snapshot of each measuring point is cached in the shared memory of the snapshot service process, and if the measuring point supports compression, the other data involved in the compression calculation is also cached there. When snapshot data of a measuring point is written, the original snapshot is directly overwritten if the point is compressed; if the point is not compressed, the snapshot is pushed to the history service process.
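The following is a minimal sketch of that write rule; the structure names (Snapshot, SnapshotSlot) and the push_to_history callback are illustrative assumptions, not taken from the patent:

    // Sketch of the snapshot write path described above (all names assumed):
    // an uncompressed point pushes every sample to the history service, while a
    // compressed point simply overwrites its cached snapshot in shared memory.
    struct Snapshot {
        long long timestamp_ms;
        double    value;
        int       quality;
    };

    struct SnapshotSlot {
        bool     compression_enabled;  // does this measuring point support compression?
        Snapshot cached;               // snapshot kept in the snapshot-service shared memory
    };

    void write_snapshot(SnapshotSlot& slot, const Snapshot& incoming,
                        void (*push_to_history)(const Snapshot&)) {
        if (!slot.compression_enabled) {
            push_to_history(incoming);  // uncompressed point: every sample becomes history
        }
        slot.cached = incoming;         // in both cases the cached snapshot is replaced
    }

In the patent the slot resides in the snapshot data cache shared memory block set, so the cached value survives a crash of the snapshot service process.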
As a preferred scheme of the method, the history service process archives and queries historical data. A historical data cache of one data page is kept for each measuring point and stored in the shared memory of the history service process. Snapshot data pushed by the snapshot service process is first written into this historical data cache; when the cache of a measuring point is full, it is archived into an archive file and then cleared so that it can continue to receive further historical data.
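A minimal sketch of this page-based history cache follows; the sample layout, the page capacity and the use of a plain stdio archive file are assumptions for illustration only:

    // Per-point history cache as described above (structure assumed): samples
    // fill a fixed-size page kept in shared memory; when the page is full it is
    // archived to a file and the cache is cleared to keep receiving data.
    #include <cstddef>
    #include <cstdio>

    struct HistorySample { long long timestamp_ms; double value; };

    constexpr std::size_t kPageCapacity = 1024;  // assumed page size in samples

    struct HistoryPage {
        std::size_t   count = 0;
        HistorySample samples[kPageCapacity];
    };

    void append_history(HistoryPage& page, const HistorySample& s, std::FILE* archive) {
        page.samples[page.count++] = s;
        if (page.count == kPageCapacity) {
            std::fwrite(page.samples, sizeof(HistorySample), page.count, archive);
            std::fflush(archive);        // page archived into the archive file
            page.count = 0;              // cache cleared, ready for more history data
        }
    }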
As a preferred scheme of the method of the present invention, the shared memory blocks are named, and shared memory opened with the same name refers to the same shared memory block;
when multiple service processes hold the same shared memory block, the block is not reclaimed as long as at least one service process has not released it, and the other service processes can reacquire access to the block.
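The behaviour described above matches what named shared memory provides on common platforms. The sketch below shows one way to attach such a block using POSIX shared memory; the block name, size and API choice are assumptions of the sketch rather than part of the patent:

    // The watchdog creates a named block and never unlinks it, so the block is
    // not reclaimed when a business process crashes, and the restarted process
    // can re-attach the same data simply by opening the same name.
    // (May require linking with -lrt on older Linux systems.)
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>

    void* attach_block(const char* name, std::size_t size, bool create) {
        int fd = shm_open(name, O_RDWR | (create ? O_CREAT : 0), 0660);
        if (fd < 0) return nullptr;
        if (create && ftruncate(fd, static_cast<off_t>(size)) != 0) { close(fd); return nullptr; }
        void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                        // the mapping stays valid after the fd is closed
        return p == MAP_FAILED ? nullptr : p;
    }

    // Watchdog:          attach_block("/rtdb_service_status.dat", 4096, true);
    // Business process:  attach_block("/rtdb_service_status.dat", 4096, false);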
In a second aspect, a system for real-time database failure recovery is provided, including:
the watchdog service process starting module is used for starting a watchdog service process of the real-time database and initializing the shared memory;
the other service process starting module is used for starting and initializing other service processes of the real-time database, wherein the other service processes of the real-time database comprise a communication service process, a basic service process, a snapshot service process and a historical service process;
the shared memory application module is used for the other service processes to apply to the watchdog service process for the shared memory that holds hot-standby data and to read and write data in it;
the process state updating module is used for the other service processes to periodically update their running state to the shared memory;
the process restarting module is used for restarting a service process if its running state has not been updated within the timeout period;
and the field recovery module is used for the restarted service process to retrieve data from the shared-memory hosting process for field recovery.
In a third aspect, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the method for real-time database failure recovery.
In a fourth aspect, a server is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the method for real-time database failure recovery when executing the computer program.
Compared with the prior art, the first aspect of the invention has at least the following beneficial effects:
the real-time database service process stores data needing hot standby in a shared memory, and monitors the running states of all other service processes used for service processing in the real-time database through the watchdog service process, when a certain service process crashes, the watchdog service process restarts the crashed service process, and the data is quickly recovered through the hot standby shared memory, so that the risk of data loss caused by crash of the service process is avoided, and meanwhile, the initialization speed of the restarting service process can be accelerated to quickly recover the service. According to the invention, the watchdog service process and the business service process jointly hold the shared memory, so that the quick restart of the business service process during abnormal operation or breakdown is realized, and the operating memory data before restart can be recovered, thereby realizing quick service restart and data recovery, and greatly improving the service stability and data security of the real-time database.
It is understood that the beneficial effects of the second to fourth aspects can be seen from the description of the first aspect, and are not described herein again.
Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed for the embodiments or for the description of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a diagram illustrating the operational relationship between a watchdog service process and other service processes according to the present invention;
FIG. 2 is a flow chart of a method for real-time database failure recovery in accordance with the present invention;
FIG. 3 is a diagram illustrating a shared memory holding structure according to the present invention;
FIG. 4 is a diagram illustrating a naming scheme of a shared memory according to the present invention;
FIG. 5 is a flow chart of monitoring a service process according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the detailed description and the accompanying drawings. The invention is capable of other and different embodiments, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the features in the following embodiments and examples may be combined with each other as long as they do not conflict.
The invention provides a method for real-time database fault recovery. The real-time database is divided into different service processes according to function, including a watchdog service process, a communication service process, a basic service process, a snapshot service process and a history service process; cross-process communication and coordinated operation among these processes are realized through inter-process communication (IPC).
As shown in fig. 2, in one embodiment, the method of the present invention comprises the following steps:
step S11: starting the watchdog service process of the real-time database and initializing the shared memory blocks;
step S12: sequentially starting and initializing the other service processes of the real-time database;
step S13: each real-time database service process applies to the watchdog service process for the shared memory that holds hot-standby data and reads and writes data in it;
step S14: each real-time database service process periodically writes its running state into the service-process running-state area of the shared memory;
step S15: if the running state of a service process has not been reported within the timeout period, restarting the abnormal service process; otherwise, returning to step S14;
step S16: the restarted service process retrieves data from the shared-memory hosting process for fast field recovery (restoring the working state from before the crash), and the flow then returns to step S14.
FIG. 3 shows how the service processes of the real-time database hold the shared memory. The watchdog service process holds all shared memory blocks, which ensures that none of them is reclaimed by the system after another service process crashes. The communication service process holds the service running state shared memory block S21, used to update the running state of its own process. The basic service process holds the service running state shared memory block S21 for updating its own running state, and also holds the allocated counter shared memory block S22 and the measuring point data cache shared memory block set S23, which stores the attribute information of all measuring points. The snapshot service process holds the service running state shared memory block S21 for updating its own running state, and also holds the allocated counter shared memory block S22 and the snapshot data cache shared memory block set S24, which stores snapshot cache data. The history service process holds the service running state shared memory block S21 for updating its own running state, and also holds the allocated counter shared memory block S22 and the historical data cache shared memory block set S25, which stores historical cache data.
As shown in FIG. 4, the shared memory blocks held by the service processes of the real-time database are named according to a fixed rule, which lets the different service processes quickly obtain the shared memory blocks they need. The service running state shared memory block S31 and the allocated counter shared memory block S32 each exist only once globally, so they are given fixed names: rtdb_service_status.dat and rtdb_shared_counter.dat. The three shared memory block sets, namely the measuring point table shared memory block set S33, the snapshot data cache shared memory block set S34 and the historical data cache shared memory block set S35, each consist of multiple shared memory blocks of fixed size. For each set, the block names are composed of a prefix and a sequence number, so that once a service process knows the value of the shared memory block allocation counter, it can construct all the block names of that set, such as rtdb_snapshot_00000001.dat, rtdb_snapshot_00000002.dat, rtdb_snapshot_00000003.dat, and so on.
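A small helper illustrating this naming rule; the function and variable names are assumptions of the sketch:

    // Enumerate the block names of one shared memory set from its prefix and the
    // value of the allocation counter (e.g. prefix "rtdb_snapshot_").
    #include <cstdio>
    #include <string>
    #include <vector>

    std::vector<std::string> block_names(const std::string& prefix, unsigned allocated) {
        std::vector<std::string> names;
        for (unsigned i = 1; i <= allocated; ++i) {
            char buf[64];
            std::snprintf(buf, sizeof(buf), "%s%08u.dat", prefix.c_str(), i);
            names.emplace_back(buf);      // e.g. rtdb_snapshot_00000001.dat
        }
        return names;
    }

For example, block_names("rtdb_snapshot_", 3) yields rtdb_snapshot_00000001.dat, rtdb_snapshot_00000002.dat and rtdb_snapshot_00000003.dat.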
In step S11, the goal when starting the watchdog service process of the real-time database is to ensure that it is the first service process of the real-time database to start. After the watchdog service process starts, it first initializes the two global shared memory blocks, the service running state shared memory block S21 and the allocated counter shared memory block S22, named rtdb_service_status.dat and rtdb_shared_counter.dat (S31 and S32 in FIG. 4). The service running state shared memory block contains five data units, used to store the running states of the watchdog service process, the communication service process, the basic service process, the snapshot service process and the history service process; the running state information includes the running-state update timestamp of each service process and the working states of its important sub-modules and worker threads. The watchdog service process judges from this information whether the other service processes are running normally. The allocated counter shared memory block records the number of shared memory blocks already allocated to each of the three block sets, namely the measuring point table shared memory block set S23, the snapshot data cache shared memory block set S24 and the historical data cache shared memory block set S25. The required capacity of these three sets differs with the number of measuring points and cannot be allocated in full at once, so blocks are allocated in units of a fixed size, and additional blocks are requested whenever the allocated shared memory becomes insufficient.
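One possible in-memory layout of the two global blocks is sketched below; the field and type names are assumed for illustration and are not prescribed by the patent:

    // Layout of rtdb_service_status.dat (one running-state unit per service
    // process) and rtdb_shared_counter.dat (blocks already allocated per set).
    #include <cstdint>

    enum ServiceId { kWatchdog = 0, kComm, kBasic, kSnapshot, kHistory, kServiceCount };

    struct ServiceStatus {            // one of the five units in rtdb_service_status.dat
        int64_t  heartbeat_ms;        // running-state update timestamp
        uint32_t submodule_state;     // working state of important sub-modules
        uint32_t thread_state;        // working state of worker threads
    };

    struct ServiceStatusBlock {       // service running state shared memory block S21
        ServiceStatus status[kServiceCount];
    };

    struct AllocatedCounterBlock {    // allocated counter shared memory block S22
        uint32_t point_table_blocks;  // blocks allocated to the measuring point table set S23
        uint32_t snapshot_blocks;     // blocks allocated to the snapshot data cache set S24
        uint32_t history_blocks;      // blocks allocated to the historical data cache set S25
    };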
Referring to FIG. 1, in step S12, after the watchdog service process has started, if it detects that no other service process is running, it starts the basic service process S3, the history service process S5, the snapshot service process S4 and the communication service process S2 in sequence. In step S13, after each service process has started, it loads the corresponding data for initialization, specifically as follows:
the basic service process S3 loads the attribute information of the measure point from the measure point table file and stores the measure point attribute information in a shared memory block set applied from the watchdog service process in a Hash table form, on one hand, the snapshot service process S4 and the history service process S5 can quickly query the measure point attribute information across processes, on the other hand, the measure point table shared memory basic service process S3 and the watchdog service process S1 share, and after the basic service process S3 is crashed and restarted, the measure point data in the shared memory set can be directly obtained without secondary loading from the measure point table file again, so that the service recovery speed of the basic service process S3 is improved.
The history service process S5 allocates, from the shared memory block set applied for from the watchdog service process S1, a historical data cache block for each measuring point and uses it to cache the historical data of that point. When the history service process S5 crashes and restarts, the historical cache data of the measuring points in the shared memory set can be obtained directly, so the historical cache data from before the crash is recovered and no data is lost.
The snapshot service process S4 allocates, from the shared memory block set applied for from the watchdog service process S1, a snapshot cache data block for each measuring point and uses it to cache snapshot data. When the snapshot service process S4 crashes and restarts, the snapshot cache data of the measuring points in the shared memory set can be obtained directly, so the snapshot cache data from before the crash is recovered and no data is lost.
The communication service process S2 is initialized after all the other service processes have been initialized, and the startup of the whole real-time database is then complete.
In step S14, each service process periodically updates its running state to the service running state shared memory block S21; the running state information includes the running-state update timestamp of the service process and the working states of its important sub-modules and threads. When a process runs abnormally or crashes and exits, its running state stops being updated, so the watchdog service process can detect that the corresponding service process is running abnormally and perform service recovery.
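A minimal heartbeat loop for step S14 might look as follows, reusing the ServiceStatusBlock and ServiceId layout sketched under step S11; the update period and the now_ms() helper are assumptions of the sketch:

    // Each service process periodically stamps its own unit in the shared block;
    // a stale stamp is what the watchdog interprets as an abnormal or crashed process.
    #include <chrono>
    #include <cstdint>
    #include <thread>

    int64_t now_ms() {
        using namespace std::chrono;
        return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
    }

    void heartbeat_loop(ServiceStatusBlock* shared, ServiceId self, int period_ms) {
        for (;;) {
            shared->status[self].heartbeat_ms = now_ms();   // refresh own running state
            std::this_thread::sleep_for(std::chrono::milliseconds(period_ms));
        }
    }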
As shown in FIG. 5, the watchdog service process checks the running state of the other service processes, and when a service process is found to be abnormal or crashed, a fast service-process restart and data recovery procedure is applied (a minimal code sketch of this monitoring loop follows the steps below). The procedure comprises the following steps:
step S41: the watchdog service process checks whether the target service process exists; if the process does not exist, jump to step S43, otherwise continue with step S42;
step S42: the watchdog service process checks the state update flag of the target service process in the service running state shared memory block S21; if the flag has not been updated within the timeout period, the service process exists but is running abnormally and must be forced to exit;
step S43: the watchdog service process restarts the target service process; after starting, the target service process first acquires the service running state shared memory block S21 and the allocated counter shared memory block S22, both of which are already held by the watchdog service process;
step S44: if the target service process is the basic service process, it acquires the measuring point table shared memory block set S23, completes the fast loading of the measuring point table shared memory and resumes service; otherwise, continue with the next step;
step S45: if the target service process is the snapshot service process, it acquires the snapshot data cache shared memory block set S24, completes the fast loading of the snapshot data cache shared memory and resumes service; otherwise, continue with the next step;
step S46: if the target service process is the history service process, it acquires the historical data cache shared memory block set S25, completes the fast loading of the historical data cache shared memory and resumes service; otherwise, continue with the next step;
step S47: if the target service process is the communication service process, it initializes the TCP network server and resumes service;
step S48: the fast recovery of the service process and its data is complete; jump back to step S41 and continue with the next round of service process monitoring.
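The sketch below condenses steps S41 to S48 into one monitoring loop; the process-control helpers (process_alive, force_kill, spawn) and the timeout value are assumptions, and ServiceStatusBlock, ServiceId and now_ms() are taken from the earlier sketches:

    // Watchdog monitoring loop over the business service processes (S41-S48).
    #include <chrono>
    #include <cstdint>
    #include <thread>

    bool process_alive(ServiceId id);   // assumed helper, e.g. kill(pid, 0) on the recorded pid
    void force_kill(ServiceId id);      // assumed helper: forcibly exit a hung process
    void spawn(ServiceId id);           // assumed helper: start the service executable again

    void monitor_loop(ServiceStatusBlock* shared, int64_t timeout_ms) {
        for (;;) {
            for (int id = kComm; id < kServiceCount; ++id) {          // watchdog checks the others
                ServiceId target = static_cast<ServiceId>(id);
                bool alive = process_alive(target);                   // S41
                bool stale = now_ms() - shared->status[id].heartbeat_ms > timeout_ms;
                if (alive && !stale) continue;                        // healthy, nothing to do
                if (alive && stale)  force_kill(target);              // S42: exists but abnormal
                spawn(target);                                        // S43: restart the process
                // S44-S47: on startup the restarted process re-attaches the running-state and
                // counter blocks, then its own cache block set, and resumes service.
            }
            std::this_thread::sleep_for(std::chrono::seconds(1));     // S48: next monitoring round
        }
    }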
If the watchdog service process itself crashes, steps S15 and S16 cannot be executed and the real-time database loses the corresponding functions. To avoid this situation, the method allows the business service processes to check the running state of the watchdog service process in the reverse direction; if the watchdog service process is found to run abnormally or to have crashed, the communication service process restarts it. After the watchdog service process restarts, it first acquires the service running state shared memory block S21 and the allocated counter shared memory block S22, then uses the counter shared memory block S22 to acquire the shared memory sets among the measuring point data cache shared memory block set S23, the snapshot data cache shared memory block set S24 and the historical data cache shared memory block set S25, and resumes service, which ensures that every process of the whole real-time database remains under monitoring.
In another embodiment, there is also provided a system for real-time database failure recovery, comprising:
the watchdog service process starting module is used for starting a watchdog service process of the real-time database and initializing the shared memory;
the other service process starting module is used for starting and initializing other service processes of the real-time database, wherein the other service processes of the real-time database comprise a communication service process, a basic service process, a snapshot service process and a historical service process;
the shared memory application module is used for the other service processes to apply to the watchdog service process for the shared memory that holds hot-standby data and to read and write data in it;
the process state updating module is used for the other service processes to periodically update their running state to the shared memory;
the process restarting module is used for restarting a service process if its running state has not been updated within the timeout period;
and the field recovery module is used for the restarted service process to retrieve data from the shared-memory hosting process for field recovery.
In another embodiment, a computer-readable storage medium is also provided, storing a computer program that, when executed by a processor, implements the method for real-time database failure recovery.
In another embodiment, a server is also provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the method for real-time database failure recovery when executing the computer program.
Illustratively, the computer program may be partitioned into one or more modules/units, which are stored in a computer-readable storage medium and executed by the processor to perform the steps of the method for real-time database failure recovery of the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing certain functions, which are used to describe the execution of the computer program in the server.
The server can be a notebook computer, a desktop computer, a cloud server and other computing devices. The server may include, but is not limited to, a processor, a memory. Those skilled in the art will appreciate that the server may also include more or fewer components, or some components in combination, or different components, e.g., the server may also include input output devices, network access devices, buses, etc.
The Processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The storage may be an internal storage unit of the server, such as a hard disk or the memory of the server. The memory may also be an external storage device of the server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash memory card (Flash Card) provided on the server. Further, the memory may include both an internal storage unit of the server and an external storage device. The memory is used to store the computer readable instructions and the other programs and data needed by the server, and may also be used to temporarily store data that has been output or is to be output.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the method embodiment, and specific reference may be made to the part of the method embodiment, which is not described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for real-time database failure recovery, comprising the steps of:
starting a watchdog service process of a real-time database, and initializing a shared memory;
starting and initializing other service processes of a real-time database, wherein the other service processes of the real-time database comprise a communication service process, a basic service process, a snapshot service process and a historical service process;
the other service processes apply to the watchdog service process for the shared memory that holds hot-standby data, and read and write data in it;
the other service processes periodically update their process running state to the shared memory;
if the running state of a service process has not been updated within the timeout period, that service process is restarted;
and the restarted service process retrieves data from the shared-memory hosting process for field recovery.
2. The method for real-time database failure recovery according to claim 1, wherein: on the one hand, the watchdog service process monitors the other service processes and automatically restarts the corresponding service process if an abnormality or crash is found; on the other hand, it manages all the shared memory that holds hot-standby data, so that the shared memory is not reclaimed when another service process exits abnormally;
the watchdog service process can itself be monitored in the reverse direction and is restarted if an abnormality or crash is found.
3. The method for real-time database failure recovery according to claim 1, wherein: the communication service process forwards each message to the corresponding service process for processing according to the API function called, and returns the response message to the API caller.
4. The method for real-time database failure recovery according to claim 1, wherein: the basic service process loads the measuring point table and stores it in memory in the form of a shared-memory Hash table for the other service processes to query.
5. The method for real-time database failure recovery according to claim 1, wherein: the snapshot service process implements caching and compression of snapshot data; at least one snapshot of each measuring point is cached in the shared memory of the snapshot service process, and if the measuring point supports compression, the other data involved in the compression calculation is also cached there; when snapshot data of a measuring point is written, the original snapshot is directly overwritten if the point is compressed, and the snapshot is pushed to the history service process if the point is not compressed.
6. The method for real-time database failure recovery according to claim 1, wherein: the history service process archives and queries historical data, and a historical data cache of one data page is kept for each measuring point and stored in the shared memory of the history service process; snapshot data pushed by the snapshot service process is first written into the historical data cache, and when the cache of a measuring point is full, it is archived into an archive file and then cleared so that it can continue to receive further historical data.
7. The method for real-time database failure recovery according to claim 1, wherein: the shared memory blocks are named, and shared memory opened with the same name refers to the same shared memory block;
when multiple service processes hold the same shared memory block, the block is not reclaimed as long as at least one service process has not released it, and the other service processes can reacquire access to the block.
8. A system for real-time database failure recovery, comprising:
the watchdog service process starting module is used for starting a watchdog service process of the real-time database and initializing the shared memory;
the other service process starting module is used for starting and initializing other service processes of the real-time database, wherein the other service processes of the real-time database comprise a communication service process, a basic service process, a snapshot service process and a historical service process;
the shared memory application module is used for the other service processes to apply to the watchdog service process for the shared memory that holds hot-standby data and to read and write data in it;
the process state updating module is used for the other service processes to periodically update their running state to the shared memory;
the process restarting module is used for restarting a service process if its running state has not been updated within the timeout period;
and the field recovery module is used for the restarted service process to retrieve data from the shared-memory hosting process for field recovery.
9. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements a method for real-time database failure recovery as claimed in any one of claims 1 to 7.
10. A server comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor when executing the computer program implements a method for real-time database failure recovery as claimed in any one of claims 1 to 7.
CN202111264932.3A 2021-10-28 2021-10-28 Method, system, storage medium and server for real-time database fault recovery Pending CN113986594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111264932.3A CN113986594A (en) 2021-10-28 2021-10-28 Method, system, storage medium and server for real-time database fault recovery

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111264932.3A CN113986594A (en) 2021-10-28 2021-10-28 Method, system, storage medium and server for real-time database fault recovery

Publications (1)

Publication Number Publication Date
CN113986594A true CN113986594A (en) 2022-01-28

Family

ID=79743648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111264932.3A Pending CN113986594A (en) 2021-10-28 2021-10-28 Method, system, storage medium and server for real-time database fault recovery

Country Status (1)

Country Link
CN (1) CN113986594A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114443340A (en) * 2022-01-29 2022-05-06 亿咖通(湖北)技术有限公司 Abnormal process processing method and device and server
CN115862208A (en) * 2022-11-30 2023-03-28 广州广电运通智能科技有限公司 Service processing method, equipment and storage medium for rail transit gate software
CN116055285A (en) * 2023-03-27 2023-05-02 西安热工研究院有限公司 Process management method and system of industrial control system
CN116055285B (en) * 2023-03-27 2023-06-16 西安热工研究院有限公司 Process management method and system of industrial control system

Similar Documents

Publication Publication Date Title
CN113986594A (en) Method, system, storage medium and server for real-time database fault recovery
CN107133234B (en) Method, device and system for updating cache data
CN113849339B (en) Method, device and storage medium for restoring running state of application program
RU2653254C1 (en) Method, node and system for managing data for database cluster
CN111125040B (en) Method, device and storage medium for managing redo log
CN111177143B (en) Key value data storage method and device, storage medium and electronic equipment
CN110196759B (en) Distributed transaction processing method and device, storage medium and electronic device
CN110019063B (en) Method for computing node data disaster recovery playback, terminal device and storage medium
CN111176584A (en) Data processing method and device based on hybrid memory
CN109726211B (en) Distributed time sequence database
CN111309548A (en) Timeout monitoring method and device and computer readable storage medium
WO2023197904A1 (en) Data processing method and apparatus, computer device and storage medium
CN117131014A (en) Database migration method, device, equipment and storage medium
CN111694806A (en) Transaction log caching method, device, equipment and storage medium
WO2023115935A1 (en) Data processing method, and related apparatus and device
CN115268767A (en) Data processing method and device
US9471409B2 (en) Processing of PDSE extended sharing violations among sysplexes with a shared DASD
CN114003612A (en) Processing method and processing system for abnormal conditions of database
CN113448758A (en) Task processing method and device and terminal equipment
US10866756B2 (en) Control device and computer readable recording medium storing control program
CN109857523B (en) Method and device for realizing high availability of database
CN112131433B (en) Interval counting query method and device
EP4123470A1 (en) Data access method and apparatus
CN115016740B (en) Data recovery method and device, electronic equipment and storage medium
US11334450B1 (en) Backup method and backup system for virtual machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination