CN115718674A - Data disaster recovery method and device

Info

Publication number
CN115718674A
Authority
CN
China
Prior art keywords
data
node
evaluation
recovered
priority list
Legal status
Pending
Application number
CN202211378302.3A
Other languages
Chinese (zh)
Inventors
Tong Xuan (童轩)
Li Yang (李杨)
Niu Haitao (牛海涛)
Current Assignee
Xi'an Tianhe Defense Technology Co., Ltd.
Original Assignee
Xi'an Tianhe Defense Technology Co., Ltd.
Application filed by Xi'an Tianhe Defense Technology Co., Ltd.
Priority application: CN202211378302.3A
Publication of CN115718674A (legal status: pending)

Abstract

The application is applicable to the technical field of data storage and provides a data disaster recovery method and apparatus. The method comprises the following steps: determining the fault type and the range of data to be recovered in a distributed cluster system; acquiring the state parameters of a plurality of normal nodes in the distributed cluster system, together with the service sensitivity characteristics and data block access characteristics of the data to be recovered; performing a data preprocessing operation; determining a data recovery priority list with a service sensitivity evaluation model and a node priority list with a node state evaluation model; determining a fault handling strategy from the data recovery priority list and the node priority list; and executing the fault handling strategy. On the basis of dynamically updating the data recovery priority through the service sensitivity evaluation model, the method adds a real-time evaluation algorithm for heterogeneous big data nodes; compared with the prior art, it improves data disaster recovery efficiency while reducing the fault perception of the service system.

Description

Data disaster recovery method and device
Technical Field
The present application relates to the field of data storage technologies, and in particular to a data disaster recovery method and apparatus.
Background
In recent years, with the explosive development of the big data era, big data distributed clusters have played an increasingly important role as the storage foundation of cloud computing. The file system of a distributed cluster usually uses the Hadoop Distributed File System (HDFS) to perform batch and streaming processing on massive data. Meanwhile, node faults occur easily in a distributed cluster because of unreliable network connections and bandwidth limitations. In this context, developing data disaster recovery technology is necessary. At present, data disaster recovery technology can check each node through failure detection and implement the backup and recovery of data in a big data distributed cluster system by recovering the data on a failed node. However, the prior art considers only limited factors, and the data disaster recovery capability of the cluster system remains low.
Disclosure of Invention
In view of this, embodiments of the present application provide a data disaster recovery method and apparatus, which can solve the problems that faults are highly perceptible to the service system and that data disaster recovery efficiency is low.
In a first aspect, an embodiment of the present application provides a method for recovering from a data disaster, where the method is applied to a distributed cluster system, where the distributed cluster system includes multiple nodes, and the method includes:
when determining that a node in the distributed cluster system fails, determining a failure type and a range of data to be recovered;
when the fault type is a data node fault type, acquiring a first evaluation data set and a second evaluation data set, wherein the first evaluation data set comprises service sensitivity characteristics and/or data block access characteristics of the data to be recovered, and the second evaluation data set comprises state parameters of nodes except for the fault data node in the plurality of nodes;
preprocessing the first evaluation data set to obtain a preprocessed first evaluation data set, and preprocessing the second evaluation data set to obtain a preprocessed second evaluation data set;
determining a data recovery priority list according to the preprocessed first evaluation data set and a service sensitivity evaluation model, wherein the service sensitivity evaluation model is used for outputting the priority of the data to be recovered according to the preprocessed first evaluation data set, and the data recovery priority list is used for representing the priority order of the data to be recovered;
determining a node priority list according to the preprocessed second evaluation data set and a node state evaluation model, wherein the node state evaluation model is used for evaluating the real-time state of each node according to the preprocessed second evaluation data set and outputting the node priority list, the node priority list comprises at least one node, the at least one node is a node of the plurality of nodes, the state of the node meets a preset condition, and the node priority list is used for representing the priority order of the at least one node;
determining a fault handling strategy according to the data recovery priority list and the node priority list, wherein the fault handling strategy is as follows: according to the data recovery priority list, sequentially sending data to be recovered to the at least one node;
and executing the fault handling strategy.
In the embodiment of the application, after the fault type of a node is determined to be a data node fault and the range of data to be recovered is determined, the service sensitivity characteristics and/or data block access characteristics of the data to be recovered and the state parameters of the nodes other than the faulty data node are obtained. After the obtained data are preprocessed, the service sensitivity evaluation model is called to dynamically update the recovery priority of the data blocks, and the node state evaluation model is called to evaluate the real-time state of the nodes, so that the data blocks most urgently in need of recovery are sent to the nodes in the best state to execute the recovery operation. Compared with the prior art, the embodiment of the application considers both the service sensitivity and the differences among cluster nodes, can fully exploit the advantages of the cluster, and reduces the fault perception of the service system while improving the recovery efficiency of the data blocks.
In one possible implementation, the service sensitivity characteristics of the data to be recovered include one or more of the following characteristics: the priority characteristic of the business system to which the data block belongs, the priority characteristic of the function module to which the data block belongs, the business attribute characteristic of the data block and the timeliness characteristic of the data block.
In a possible implementation manner, the business sensitivity evaluation model is configured to output a priority of data to be recovered according to the preprocessed first evaluation data set, and includes:
the service sensitivity evaluation model is used for outputting the priority of the data to be recovered according to the heat value characteristics of the data to be recovered and the service sensitivity characteristics of the data to be recovered;
and the heat value characteristic of the data to be recovered is obtained by analyzing the access characteristic of the data block based on a neural network algorithm.
In one possible implementation, the data block access characteristic is determined according to one or more of the following parameters: file name, file operation type, file operation time and file operation authority.
That is to say, the service sensitivity evaluation model is introduced to analyze and mine the data so that the data with higher service sensitivity that the service system urgently needs are recovered first, which reduces the user's perception of the fault.
In a possible implementation manner, the node state evaluation model is configured to evaluate a real-time state of each node according to the preprocessed second evaluation data set, and output the node priority list, and includes:
the node state evaluation model is used for determining the node priority list according to the state parameters of all the nodes and the historical state evaluation values of all the nodes;
wherein the status parameters include one or more of the following parameters: CPU utilization, memory utilization, disk I/O utilization, network bandwidth utilization, command response time, command queue length, and disk utilization.
Updating the real-time state evaluation value of each node based on the node state evaluation model helps determine at least one node that can execute the data recovery operation and rank that at least one node by priority, thereby ensuring that the node in the best state handles the data with high service sensitivity that urgently needs recovery. In this way, the advantages of the cluster can be fully exploited, and the data disaster recovery efficiency is improved.
In one possible implementation, the executing the fault handling policy includes:
and sequentially sending the data to be recovered to the at least one node for processing according to the priority order of the data to be recovered and the priority order of the at least one node.
In a possible implementation manner, after sequentially sending the data to be restored to the at least one node for processing according to the priority order of the data to be restored and the priority order of the at least one node, the method further includes:
and if the data to be recovered still has residual data, sending the residual data to a node which finishes data processing in the at least one node, wherein the residual data refers to data which is not processed by the at least one node in the data to be recovered.
In one possible implementation, the preprocessing operation includes one or more of the following:
categorical feature processing, null-value processing, and data normalization, where the categorical feature processing converts non-numerical data into numerical data.
In a second aspect, an embodiment of the present application provides a data disaster recovery apparatus, configured to execute the method in the first aspect or any possible implementation manner of the first aspect. In particular, the apparatus comprises means (or elements) for performing the method of the first aspect described above or any possible implementation manner of the first aspect.
In a third aspect, an embodiment of the present application provides a distributed cluster system, which includes a plurality of nodes. The plurality of nodes mainly comprise two types, namely name nodes and data nodes. Optionally, the plurality of nodes may further include an auxiliary name node. The distributed cluster system is configured to execute the data disaster recovery method described in any implementation manner of the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when running on an apparatus for data disaster recovery, causes the apparatus for data disaster recovery to perform the method of the first aspect or any possible implementation manner of the first aspect.
Compared with the prior art, the embodiments of the present application have the following advantages: they design a data disaster recovery method for a distributed cluster system that utilizes node resources to the greatest extent through a real-time evaluation algorithm for heterogeneous big data nodes, thereby effectively improving data disaster recovery efficiency. Meanwhile, the embodiments dynamically update the data recovery priority based on the service sensitivity evaluation model, which reduces the fault perception of the service system; recovering data according to the real-time evaluation of the nodes and the data recovery priority improves the robustness of the big data distributed cluster system and of the service system while improving the service quality of the platform.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic diagram of an application scenario of a data disaster recovery method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a data disaster recovery method according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of an evaluation model provided by an embodiment of the application;
fig. 4 is a flowchart illustrating a method for determining a data recovery priority list according to an embodiment of the present application;
fig. 5 is a flowchart illustrating a method for determining a node priority list according to an embodiment of the present application;
FIG. 6 is a flow chart illustrating a data preprocessing process provided by an embodiment of the present application;
fig. 7 is a flowchart of an example of a data disaster recovery method according to an embodiment of the present application;
fig. 8 is a block diagram of a data disaster recovery apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a data disaster recovery apparatus according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used only for distinguishing between descriptions and are not to be understood as indicating or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The embodiment of the application is applied to a distributed cluster system. The distributed cluster system comprises a plurality of nodes, which mainly comprise two types: name nodes (NameNode) and data nodes (DataNode). Generally, the plurality of nodes may include one name node and a plurality of data nodes.
Optionally, the plurality of nodes may further include an auxiliary name node, which may back up the data of the name node to avoid data loss caused by a crash of the name node.
In one possible implementation, the distributed cluster system may be an HDFS system. The HDFS architecture adopts a central control node: when data needs to be read or written, the name node storing the metadata is accessed first to obtain the storage information of the actual data, and then the data nodes storing the actual data are read or written.
To facilitate understanding, prior to the description of the embodiments of the present application, a brief description of some terms or concepts related to the embodiments of the present application will be provided.
Node: a node can store data and implement indexing and searching. Each node has a unique name as its identity. Multiple nodes can form a cluster.
The name node (or management node) is mainly used for storing the metadata information of a distributed cluster system (such as an HDFS system), maintaining the namespace of the file system, responding to read-write requests of clients, and sending instructions to the data nodes. The name node includes an image file (FsImage) that saves the directory tree of the file system, and an operation log (EditLog) that records modifications to the directory tree.
Data nodes are software nodes running on the individual machines of a distributed cluster system (e.g., an HDFS system) and are typically organized in racks. The data nodes mainly store the actual data blocks and perform read/write operations on those blocks.
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments. It is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments obtained by a person skilled in the art based on these embodiments without creative effort shall fall within the protection scope of the present application.
In a possible application scenario, the method for recovering the data disaster tolerance provided by the embodiment of the present application is applicable to an application scenario for a heterogeneous big data distributed cluster system.
The following is described as an example in connection with the specific scenario in fig. 1.
Fig. 1 is a schematic view of an application scenario of the data disaster recovery method provided in an embodiment of the present application, and as shown in fig. 1, the application scenario for a heterogeneous big data distributed cluster system includes a client, a name node, and N data nodes. Optionally, the heterogeneous big data distributed cluster system may further include an auxiliary name node.
The client is used for segmenting files to be written into the HDFS, interacting with the name node and the data nodes, and managing and accessing the HDFS. Illustratively, the client types include the HDFS command line (HDFS shell commands), the Java API, the C API libhdfs, and other ways of accessing HDFS. The Java API accesses the HDFS using Java code. libhdfs is a C API for HDFS, pre-compiled in Hadoop releases, which supports C-language clients that operate on the HDFS file system.
The data node responds to read-write requests of the client and to the name node's commands to create, delete, and replicate data blocks; that is, the data node creates, deletes, and replicates the data blocks stored on it. The data blocks are stored as files on the data nodes; a data block comprises the data itself and metadata, the metadata including the length of the data, a checksum, and a timestamp of the block data. The data may be stored on different storage media in the data node according to the heat value of the data. For example, frequently accessed data is stored on a storage medium with higher access performance (e.g., memory or an SSD) to improve its read/write performance, while rarely accessed data is saved on an archival storage medium to reduce its storage cost.
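As an illustration of this heat-based placement (not part of the claimed method), the following sketch maps a heat value to one of the standard HDFS storage policies and applies it with the hdfs storagepolicies command; the thresholds and the path are assumptions made for the example.

import subprocess

def apply_storage_policy(hdfs_path: str, heat_value: float) -> None:
    # Map a data heat value to an HDFS storage policy (thresholds are illustrative).
    if heat_value > 0.8:      # frequently accessed: keep on SSD
        policy = "ALL_SSD"
    elif heat_value > 0.3:    # moderately accessed: default disk storage
        policy = "HOT"
    else:                     # rarely accessed: archival storage
        policy = "COLD"
    subprocess.run(
        ["hdfs", "storagepolicies", "-setStoragePolicy",
         "-path", hdfs_path, "-policy", policy],
        check=True,
    )

apply_storage_policy("/data/example/part-00000", heat_value=0.91)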
The above is intended as a simple explanation of the functions of the terms referred to in fig. 1. It should be noted that the names of the terms shown in fig. 1 are merely exemplary, and the embodiments of the present application are not limited thereto. In fact, the terms may be given other names as long as the corresponding functions are provided.
Furthermore, for convenience of description, hereinafter, the heterogeneous big data distributed cluster system is simply referred to as a distributed cluster system.
In the embodiment of the present application, when a node failure occurs in the distributed cluster system, data needs to be backed up and restored, so as to improve the robustness of the cluster and the service system. In view of this, the present application provides a data disaster recovery method, which can achieve the purpose of improving the data disaster recovery efficiency and reducing the fault perception of the service system.
The following describes a procedure of performing data disaster recovery according to an embodiment of the present application with reference to fig. 2 to 7.
Fig. 2 is a schematic flow chart of a data disaster recovery method according to an embodiment of the present application. It should be understood that the method of fig. 2 may be applied to the application scenario shown in fig. 1 by way of example and not limitation. As shown in fig. 2, the method comprises the steps of:
step S110: and when determining that the node in the distributed cluster system has a fault, determining the fault type and the range of the data to be recovered.
Optionally, the fault types include a name node fault and a data node fault. The embodiments of the present application do not limit the manner of determining a data node fault.
Optionally, as a possible implementation manner, whether the data node fails may be determined by monitoring whether the data node sends heartbeat information to the name node within a preset time period.
In other words, from the perspective of the name node, if the name node does not receive the heartbeat information beyond the preset time threshold, it is determined that the fault type is a data node fault.
Illustratively, the data node failure is caused by one or more of:
(1) The data blocks stored in the data node lack a copy;
(2) A data block is repeatedly replicated due to a data block checksum error;
(3) The name node cannot receive the heartbeat information of the data node due to network failure or data node failure.
In order to guarantee the safety of the data stored on the data nodes, the data nodes are monitored through a heartbeat mechanism, which works as follows: after the HDFS cluster starts, each data node sends registration information to the name node in a heartbeat packet and then periodically sends heartbeat information to the name node, which monitors for failures. Generally, when the name node has not received heartbeat information from a data node for more than ten minutes, that data node is considered invalid or failed.
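A minimal sketch of this timeout check from the name node's perspective follows; the ten-minute threshold comes from the description above, while the registry structure and function names are assumptions made for illustration.

import time

HEARTBEAT_TIMEOUT_S = 10 * 60  # roughly ten minutes, per the description above

# Hypothetical registry: data node name -> timestamp of its last heartbeat
last_heartbeat: dict[str, float] = {}

def on_heartbeat(node_name: str) -> None:
    # Called on the name node whenever a data node's heartbeat arrives.
    last_heartbeat[node_name] = time.time()

def find_failed_data_nodes() -> list[str]:
    # Data nodes whose heartbeat has not arrived within the timeout are treated as failed.
    now = time.time()
    return [node for node, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT_S]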
Optionally, as a possible implementation manner, if the name node is in the security mode for a long time, it is determined that the failure type is the name node failure.
In other words, if the name node is in the security mode for a long time and cannot provide the read-write function to the outside, that is, the name node is in a state of being unable to respond to the read-write request of the client for a long time, it is determined that the fault type is the name node fault.
Illustratively, the name node failure is a name node cold backup failure.
It can be understood that when the auxiliary name node is deployed in the distributed cluster system, the name node failure means that the name node itself fails or the auxiliary name node fails.
It is understood that before step S110, whether a node in the distributed cluster system fails may be monitored. The embodiment of the present application does not limit the specific manner in which the monitoring node fails. For example, a fault detection module or a node state detection module may be arranged in the distributed cluster system to monitor whether a node fault occurs.
Optionally, the range of the data to be recovered refers to the data affected by the failed node that need to be recovered; it may also be called the range of data blocks to be recovered.
Illustratively, the range of data to be recovered includes data damaged due to a failed node, and data affected directly or indirectly due to the failed node.
Optionally, as a possible implementation manner, the range of the data to be recovered is determined through a default API of the Hadoop components.
In other words, after the fault type is determined to be a data node fault, the list of data blocks stored on the faulty data node is determined through a default Hadoop API.
Illustratively, in some embodiments of the present application, node failures of the distributed nodes are monitored, with early warning, by an open-source big data cluster monitoring platform. Specifically, the open-source big data monitoring platform can monitor node faults, chiefly data node faults and name node faults.
The open-source big data monitoring platform (WGCLOUD) is out-of-the-box software that monitors and runs automatically after decompression; WGCLOUD supports running on Linux, Windows, Unix, and macOS systems, as well as ARM, Android, and other systems.
The open-source big data monitoring platform monitors name nodes and data nodes in the distributed cluster system. And when the open source big data monitoring platform detects that the node fails, judging the node failure type according to the monitored failure type.
When a node abnormality is monitored, the fault type of the node and the range of the data to be recovered are determined. If the fault type is a name node fault, the name node directly exits the safe mode. If the fault type is a data node fault, the range of the data to be recovered is acquired, step S120 is executed, and the cluster data disaster recovery process starts immediately. It should be understood that the above manner of determining a name node fault is only exemplary; other manners may be used, for example, determining whether the start-up time of the name node exceeds a time threshold. In this regard, the embodiments of the present application are not limited thereto.
For example, in some embodiments of the present application, a big data management platform may also be used to monitor, with early warning, node faults of the distributed nodes. Specifically, the big data management platform can monitor node faults, chiefly data node faults and name node faults.
The big data management platform is a Web-based tool that supports the provisioning, management, and monitoring of Apache Hadoop clusters and currently supports most Hadoop components, including the distributed storage system (HDFS), the distributed computing system (MapReduce), the data warehouse tool (Hive), the large-scale data analysis platform (Apache Pig), the distributed open-source database (HBase), the distributed application coordination service (ZooKeeper), the data migration tool (Sqoop), the Hadoop table storage management tool (HCatalog), and others.
Step S120: and when the fault type is a data node fault type, acquiring a first evaluation data set and a second evaluation data set, wherein the first evaluation data set comprises the service sensitivity characteristics and/or the data block access characteristics of the data to be recovered, and the second evaluation data set comprises the state parameters of nodes except the fault data node in the plurality of nodes.
It should be understood that the state parameter of the node other than the failed data node in the plurality of nodes refers to the state parameter of the normal data node other than the failed data node in the distributed cluster system.
In some embodiments of the present application, when the fault type is determined to be a data node fault, the data of the first evaluation data set and of the second evaluation data set in the distributed cluster system may be acquired through the open-source big data monitoring platform. For example, assuming the open-source big data monitoring platform monitors the distributed cluster system, it can monitor various host indexes in real time, and the first evaluation data set and the second evaluation data set can be obtained from one or more indexes of each node monitored on the platform.
Illustratively, the first evaluation data set covers one or more of the following indicators: resources on the server such as process applications, files, ports, logs, data blocks, and data tables.
Illustratively, the second evaluation data set includes one or more of the following indicators: CPU utilization, CPU temperature, memory utilization, disk capacity, disk IO, hard disk SMART health status, system load, number of connections, network card traffic, hardware system, and the like.
It should be understood that the one or more indicators included in the first evaluation data set and the second evaluation data set are only described as examples of indicators monitored by the open-source big data monitoring platform, and the embodiments of the present application are not limited thereto.
It should be understood that the above description is only exemplary with respect to the first evaluation data set, and that the first evaluation data set may have other naming schemes. In this regard, the embodiments of the present application are not limited thereto.
It should be understood that the above description is only exemplary with respect to the second evaluation data set, which may also have other nomenclature. In this regard, the embodiments of the present application are not limited thereto.
Step S130: and preprocessing the second evaluation data set to obtain a preprocessed second evaluation data set.
During data acquisition, differing consideration factors, human interference, or faulty software elements often lead to incomplete acquisition, noise, and inconsistent data types, so the acquired data contain null values and duplicates, include isolated or erroneous data points, and cannot be compared because features differ in scale. Therefore, before data analysis, the acquired data of the plurality of nodes need to be preprocessed to obtain preprocessed data and to prepare for subsequent analysis.
Step S140: and determining a data recovery priority list according to the preprocessed first evaluation data set and a service sensitivity evaluation model, wherein the service sensitivity evaluation model is used for outputting the priority of the data to be recovered according to the preprocessed first evaluation data set, and the data recovery priority list is used for representing the priority order of the data to be recovered.
In the embodiment of the application, the data block access characteristics and the service sensitivity characteristics in the preprocessed first evaluation data set are input into the service sensitivity evaluation model, the service sensitivity evaluation model performs prediction based on a service sensitivity comprehensive evaluation algorithm, and the data recovery priority list is output.
Optionally, the data block access characteristics include one or more of the following: file name, file operation type, file operation time and file operation authority. And after the access characteristics of the data blocks are determined, the access characteristics of the data blocks are acquired and stored in real time.
Optionally, the traffic sensitivity characteristics of the data to be recovered include one or more of the following characteristics: the priority feature of the business system to which the data block belongs, the priority feature of the functional module to which the data block belongs, the business attribute feature of the data block, and the timeliness feature of the data block.
Illustratively, the priority characteristic of the service system to which a data block belongs distinguishes, for example, service systems with high access frequency, relatively high access frequency, and low access frequency, and service systems that are almost never accessed. A service system with high access frequency has a higher priority than one that is almost never accessed. The higher the priority of the service system to which a data block belongs, the higher the priority value assigned to that data when the data recovery priority is comprehensively evaluated.
Illustratively, the priority characteristic of the functional module to which a data block belongs distinguishes, for example, functional modules with high use frequency, functional modules with low use frequency, and functional modules that are almost never used. Likewise, a frequently used functional module has a higher priority than one that is almost never used. The higher the priority of the functional module to which a data block belongs, the higher the priority value assigned to that data when the data recovery priority is comprehensively evaluated.
Illustratively, the business attribute characteristics of the data blocks include process data and result data.
Illustratively, the timeliness characteristic of the data block includes data generation timeliness, typically years, quarters, months, days, hours, and the like.
After receiving the preprocessed first evaluation data set, the service sensitivity evaluation model first obtains the data block access characteristics and the service sensitivity characteristics of the data from that set, then analyzes and mines these characteristics, and finally outputs a data recovery priority list, which represents the priority order of the data to be recovered.
The upper part of fig. 3 shows a schematic block diagram of a traffic sensitivity assessment model. As shown in fig. 3, the service sensitivity evaluation model may obtain a data recovery priority list by comprehensively evaluating the service sensitivity according to the priority characteristics of the service system to which the data block belongs, the priority characteristics of the function module to which the data block belongs, the service attribute characteristics of the data block, the timeliness characteristics of the data block, and the data heat value. Wherein the data heat value is obtained by calculating data block access characteristics through a neural network algorithm.
It should be understood that the above description is only exemplary, and the business sensitivity evaluation model may have other naming modes, for example, a business sensitivity comprehensive evaluation model. In this regard, the embodiments of the present application are not limited thereto.
It should be understood that the above description is only exemplary of the data recovery priority list, and the data recovery priority list may have other naming modes, for example, a data disaster recovery priority list. In this regard, the embodiments of the present application are not limited thereto.
In the embodiment of the application, after the service sensitivity evaluation model calculates the heat value of the data to be recovered from the preprocessed data, the data recovery priority list is determined on the basis of service sensitivity. That is to say, by preferentially recovering the data with high service sensitivity that the service system urgently needs, the user's perception of the fault can be reduced.
Step S150: and determining a node priority list according to the preprocessed second evaluation data set and a node state evaluation model, wherein the node state evaluation model is used for evaluating the real-time state of each node according to the preprocessed second evaluation data set and outputting the node priority list, the node priority list comprises at least one node, the at least one node is a node of the plurality of nodes, the state of the node meets a preset condition, and the node priority list is used for representing the priority order of the at least one node.
Optionally, as an implementation manner, a node satisfying the preset condition is a node whose state evaluation value is greater than a preset threshold, where the state evaluation value is obtained by combining the real-time state evaluation value calculated by the node state evaluation model with the node's historical state evaluation value; such a node can be used for recovering or processing data. In other words, the preset condition is that the state evaluation value is greater than the preset threshold.
It is to be understood that the above examples related to the preset conditions are only exemplary descriptions, and the embodiments of the present application are not limited thereto. In fact, the preset condition may be defined in other manners according to different requirements. The embodiment of the present application is not limited to a specific manner for defining the preset condition.
And the node state evaluation model is used for evaluating the real-time state of each node according to the preprocessed second evaluation data set and outputting the node priority list, wherein the node priority list comprises at least one node meeting preset conditions.
The lower part of fig. 3 shows a schematic block diagram of the node state evaluation model: a real-time state evaluation value is determined for each node from the node state parameters by a state evaluation algorithm, and this real-time evaluation value is then comprehensively evaluated together with the historical state evaluation value to obtain the node priority list. It should be understood that the above description is only exemplary, and the node state evaluation model may also have other names, for example, a big data distributed node comprehensive evaluation model. In this regard, the embodiments of the present application are not limited thereto.
It should be understood that the above description is only exemplary, and the node priority list may have other naming modes, such as a state-optimal node list. In this regard, the embodiments of the present application are not limited thereto.
Updating the real-time state evaluation value of each node based on the node state evaluation model helps determine the nodes that can execute the data recovery operation and rank their priorities, ensuring that the node in the best state receives the data with high service sensitivity that the service system urgently needs to recover. In this way, the advantages of the cluster can be fully exploited, and the data disaster recovery efficiency is improved.
Step S160: determining a fault handling strategy according to the data recovery priority list and the node priority list, wherein the fault handling strategy is as follows: and sequentially sending the data to be recovered to the at least one node according to the data recovery priority list.
It should be understood that the above description of the failure handling policy is only an example, and the failure handling policy may have other naming modes, for example, a data disaster recovery backup policy. In this regard, the embodiments of the present application are not limited thereto.
Step S170: and executing the fault handling strategy.
Optionally, as an implementation manner, the data to be restored is sequentially sent to the at least one node for processing according to the priority order of the data to be restored and the priority order of the at least one node.
Illustratively, suppose there are 3 pieces of data to be recovered, with priorities 0, 1, and 2, and the at least one node comprises node A, node B, and node C, with priorities 0, 1, and 2, respectively. According to the priority order of the data to be recovered, the data with priority 2 is first sent to node C for processing, the data with priority 1 is then sent to node B, and the data with priority 0 is sent last to node A.
Optionally, as an implementation manner, after the data to be recovered is sequentially sent to the at least one node for processing according to the priority order of the data to be recovered, if remaining data still exists in the data to be recovered, the remaining data is sent to a node that has completed data processing in the at least one node, where the remaining data is data that is not processed by the at least one node in the data to be recovered.
Stated another way, the at least one node in the node priority list sometimes cannot process all the data to be recovered in the data recovery priority list at once; that is, the data to be recovered cannot be placed in one-to-one correspondence with the at least one node. In that case, the data most urgently needing recovery is sent to the nodes first to execute the recovery operation, and the remaining data is then sent, in priority order, to the nodes that have finished their data processing. This improves the data disaster recovery efficiency. A sketch of this dispatch logic is given below.
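The sketch assumes both lists are already sorted with the highest priority first, and that send_to_node and wait_for_any_completion are hypothetical placeholders for the actual transfer mechanism.

def dispatch(data_priority_list, node_priority_list,
             send_to_node, wait_for_any_completion):
    # Send data blocks to nodes in priority order; route leftovers to whichever
    # node finishes first. Both lists are assumed sorted best-first.
    pending = list(data_priority_list)

    # First round: the most urgent block goes to the best-ranked node, and so on.
    for node in node_priority_list:
        if not pending:
            return
        send_to_node(node, pending.pop(0))

    # Remaining data: hand each leftover block to the next node that frees up.
    while pending:
        free_node = wait_for_any_completion()
        send_to_node(free_node, pending.pop(0))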
Optionally, as an implementation manner, the flow of the recovery operation includes:
the first step is as follows: all data block states are checked and listed.
The embodiment of the present application does not specifically limit the specific manner of checking and listing the states of all data blocks.
In one possible implementation, checking and listing all data block states may be accomplished by:
hdfs fsck <file_absolute_path> -files
It should be understood that the above description is made only by way of example and that the embodiments of the present application are not limited thereto.
The second step: printing the location information and rack information of the data blocks.
It should be understood that the embodiment of the present application does not specifically limit the manner of printing the location and rack information of the data blocks.
In one possible implementation, printing the location and rack information of the data blocks may be accomplished by:
hdfs fsck <file_absolute_path> -files -blocks -locations
hdfs fsck <file_absolute_path> -files -blocks -locations -racks
It should be understood that the above description is made only by way of example and that the embodiments of the present application are not limited thereto.
The third step: and exiting the name node security mode.
It should be understood that the embodiment of the present application does not specifically limit the manner of exiting the name node safe mode.
In one possible implementation, exiting the name node safe mode may be implemented by running, on the name node:
hdfs dfsadmin -safemode leave
It should be understood that the above description is made only by way of example and that the embodiments of the present application are not limited thereto.
The fourth step: and setting the storage position of the data block copy.
It should be understood that the embodiment of the present application does not specifically limit the manner of setting the storage locations of the data block copies.
In one possible implementation, the nodes that store the copies can be set through the chooseTargetInOrder method of the default block placement policy.
It should be understood that the above four steps are only described as a flow example of the recovery operation, and the embodiments of the present application are not limited thereto.
In the embodiment of the application, the nodes in the distributed cluster system are monitored, and when a node is determined to have failed, the fault type and the range of the data to be recovered are determined. When the fault type is a data node fault, the state parameters of the plurality of nodes, the service sensitivity characteristics of the data to be recovered, and the data block access characteristics are acquired; a data recovery priority list is then determined with the service sensitivity evaluation model, a node priority list is determined with the node state evaluation model, and a fault handling strategy is determined from the two lists and executed. Compared with the prior art, the data recovery priority is dynamically updated based on the service sensitivity evaluation model, and the recovery operation is driven by the real-time evaluation of the nodes together with the data recovery priority, so the data blocks most urgently needing recovery are sent to the nodes in the best state, which improves the recovery efficiency of the data blocks while reducing the fault perception of the service system.
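Putting steps S110 to S170 together, a high-level orchestration might look like the sketch below; every helper named here is a placeholder for the corresponding step described above, not an actual API of the embodiment.

def run_disaster_recovery(cluster, sensitivity_model, node_state_model,
                          detect_fault, collect_first_set, collect_second_set,
                          preprocess, build_policy):
    # S110: determine the fault type and the range of data to be recovered
    fault_type, data_to_recover = detect_fault(cluster)
    if fault_type == "name_node":
        cluster.name_node.exit_safe_mode()
        return

    # S120: gather the two evaluation data sets for a data node fault
    first_set = collect_first_set(data_to_recover)   # sensitivity + access features
    second_set = collect_second_set(cluster)         # CPU, memory, disk I/O, ...

    # S130: preprocessing
    first_set, second_set = preprocess(first_set), preprocess(second_set)

    # S140 / S150: the two evaluation models produce the two priority lists
    data_priority = sensitivity_model.rank(first_set)
    node_priority = node_state_model.rank(second_set)

    # S160 / S170: build and execute the fault handling strategy
    build_policy(data_priority, node_priority).execute()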
To facilitate understanding by those skilled in the art, the determination method of the data recovery priority list is described below in conjunction with the specific example in fig. 4. It should be understood that the method in fig. 4 is a refinement of step S140 in the data disaster recovery method shown in fig. 2 by way of example and not limitation, and the embodiment of the present application is not intended to be limited to the specific illustrative scenario.
Optionally, as a possible embodiment, the determining the data recovery priority list includes the following steps:
step S141: and determining the heat value of the data to be recovered according to the neural network algorithm and the data block access characteristics in the preprocessed first evaluation data set.
By way of example and not limitation, an LSTM neural network algorithm may be employed in the embodiments of the present application to calculate the heat value of the data to be recovered: the LSTM algorithm calculates the heat value corresponding to the data to be recovered from the input data block access characteristics. The LSTM algorithm is good at processing long sequences and capturing long-range dependencies, and it can pass information processed at the current moment on to the next moment. The LSTM algorithm can therefore predict the heat value of data over a future period based on the currently calculated heat value and the heat values of historical data, which solves the problem that an HDFS heterogeneous storage strategy cannot be formulated because the access heat of a file is not known in advance.
In one possible implementation, the data block access characteristic is determined according to one or more of the following parameters: file name, file operation type, file operation time and file operation authority.
In another possible implementation manner, the heat value of the data to be recovered can also be obtained through a recurrent neural network (RNN) algorithm. The embodiment of the present application does not specifically limit the manner of determining the heat value of the data to be recovered.
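As a sketch of the LSTM-based heat value prediction (the window length, feature count, and layer sizes are assumptions; the model consumes the numeric data block access features described above and predicts the next heat value):

import numpy as np
import tensorflow as tf

WINDOW, N_FEATURES = 24, 4  # assumed: 24 time steps of 4 access features each

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(WINDOW, N_FEATURES)),
    tf.keras.layers.Dense(1),  # predicted heat value of the data block
])
model.compile(optimizer="adam", loss="mse")

# Stand-in training data: sequences of access features and the next heat value
X = np.random.rand(100, WINDOW, N_FEATURES).astype("float32")
y = np.random.rand(100, 1).astype("float32")
model.fit(X, y, epochs=5, verbose=0)

heat = model.predict(X[:1])  # heat value predicted from one block's recent history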
Step S142: and determining the recovery priority of each data to be recovered according to the service sensitivity characteristic in the preprocessed first evaluation data set and the heat value of the data to be recovered. For example, in the embodiment of the present application, service sensitivity characteristics of data to be recovered and a heat value of the data to be recovered are comprehensively evaluated based on a multiple linear regression model, so as to determine a recovery priority of each data to be recovered.
Or, in another possible implementation manner, a fuzzy comprehensive evaluation method, a multi-factor comprehensive evaluation method, or the like may be adopted to perform service sensitivity comprehensive evaluation on the service sensitivity characteristic of the data to be recovered and the heat value of the data to be recovered.
It should be understood that, the above is described by taking a multiple linear regression model as an example, and other models or algorithms may also be used to perform comprehensive business sensitivity evaluation on the business sensitivity characteristics of the data to be recovered and the heat value of the data to be recovered, which is not limited in this embodiment of the application.
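A minimal sketch of the multiple-linear-regression evaluation follows, assuming the preprocessed sensitivity features plus the heat value form the regression inputs and that historical priority labels are available for fitting; the feature columns and values are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed columns: system priority, module priority, business attribute,
# timeliness, heat value -- all numeric after preprocessing.
X_train = np.array([[3, 2, 1, 0.9, 0.8],
                    [1, 1, 0, 0.2, 0.1],
                    [2, 3, 1, 0.5, 0.6]])
y_train = np.array([0.95, 0.15, 0.60])  # historical recovery-priority labels

reg = LinearRegression().fit(X_train, y_train)

blocks = np.array([[3, 3, 1, 0.8, 0.9],
                   [1, 2, 0, 0.3, 0.2]])
scores = reg.predict(blocks)

# Data recovery priority list: block indices sorted by score, most urgent first
priority_list = np.argsort(-scores)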
Step S143: and obtaining the data recovery priority list according to the recovery priority of the data to be recovered.
Illustratively, the data needing recovery in the node are arranged by recovery urgency: data urgently needing recovery are marked with a larger priority value, and data that need not be recovered urgently, or that can be recovered later, are marked with a smaller value.
In summary, after receiving the preprocessed data, the service sensitivity evaluation model first obtains the access characteristics and the service sensitivity characteristics of the data to be recovered from the preprocessed first evaluation data set, then analyzes and mines them, and finally outputs a data recovery priority list representing the priority order of the data to be recovered. Once the data recovery priority list is obtained, the data needing recovery can be sent to the nodes in priority order; in particular, the data most urgently needing recovery can be sent preferentially to the node in the best state to perform the recovery operation, which improves the data disaster recovery efficiency.
To facilitate understanding by those skilled in the art, the determination method of the node priority list is described below in conjunction with the specific example in fig. 5. It should be understood that the method in fig. 5 is a refinement to S150 in the data disaster recovery method shown in fig. 2 by way of example and not limitation, and the embodiments of the present application are not intended to be limited to the specific illustrative scenario.
In one possible implementation, the determining the node priority list includes the following steps:
step S151: and calculating the real-time state evaluation value of the node according to the preprocessed second evaluation data set.
Illustratively, in one possible implementation, the pre-processed second evaluation dataset includes state parameters that affect the real-time performance of the node. And calculating the real-time state evaluation value of the node according to the state parameters.
In particular, the status parameters include one or more of the following parameters: CPU utilization, memory utilization, disk I/O utilization, network bandwidth utilization, command response time, command queue length, and disk utilization. And after the state parameters influencing the real-time performance of the nodes are determined, the state parameters of the factors are stored in real time.
By way of example and not limitation, the state evaluation value of each node may be obtained based on a Prophet model in the embodiment of the present application. The Prophet model is a time series prediction model and can be used for fitting a time series with a nonlinear trend and predicting the trend of the time series with missing values and abnormal values. According to the input state parameters of the nodes, the Prophet model can calculate the real-time state evaluation value of the nodes.
It is understood that the above description is only given by taking the Prophet model as an example, and other models can be used to evaluate the state parameters of multiple nodes. This is not particularly limited by the embodiments of the present application.
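A sketch of the Prophet-based evaluation for a single node follows; it fits the node's recent state-parameter history (here reduced to one composite load series, an assumption) and derives a state evaluation value from the forecast. The package and column names follow the open-source prophet library.

import pandas as pd
from prophet import Prophet

# Assumed input: per-minute composite load of one node (0 = idle, 1 = saturated)
history = pd.DataFrame({
    "ds": pd.date_range("2022-11-01", periods=120, freq="min"),
    "y":  [0.4] * 60 + [0.6] * 60,  # stand-in load series
})

m = Prophet()
m.fit(history)

future = m.make_future_dataframe(periods=10, freq="min")
forecast = m.predict(future)

# Assumed scoring: 1 minus the predicted near-term load; a higher value means
# the node is in better shape, and only nodes above the preset threshold
# would enter the node priority list.
state_value = 1.0 - forecast["yhat"].tail(10).mean()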
Step S152: and obtaining the priority of the at least one node according to the calculated real-time state evaluation value of the node and the historical state evaluation value of the node.
It should be understood that, after the real-time state evaluation value and the historical state evaluation value of a node are comprehensively evaluated, the node is assigned a priority provided that the preset condition is satisfied, that is, its evaluation value is greater than the preset threshold; only such nodes enter the node priority list and perform recovery operations on the data.
Step S153: and determining a node priority list according to the priority of the node.
Specifically, in the node priority list, the nodes are arranged in descending order of priority, and nodes with greater priority receive data first to perform recovery operations. This can further improve the data disaster recovery efficiency.
The real-time states of the nodes in the distributed cluster are evaluated based on the node state evaluation model, so that a node priority list usable for recovering or processing data is obtained while fully accounting for the differences among the heterogeneous nodes in the cluster. After a node in the node priority list receives the data to be recovered, the data can be recovered quickly, which further improves the data disaster recovery rate.
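For illustration, a minimal sketch of S152 and S153 follows. It assumes each node already carries a real-time evaluation value and a historical evaluation value in [0, 1]; the weighting factor and the threshold below are illustrative assumptions, not values fixed by the embodiment.

```python
# A minimal sketch: combine real-time and historical evaluation values,
# keep only nodes above the preset threshold (S152), and sort the survivors
# in descending order of priority (S153).
def build_node_priority_list(nodes, alpha=0.7, threshold=0.5):
    """nodes: iterable of dicts like {"name": ..., "realtime": ..., "history": ...}"""
    scored = []
    for node in nodes:
        value = alpha * node["realtime"] + (1 - alpha) * node["history"]
        if value > threshold:  # preset condition: evaluation value > threshold
            scored.append((node["name"], value))
    scored.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scored]
```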
To facilitate understanding by those skilled in the art, the method of data pre-processing is described below in conjunction with the specific example in FIG. 6. It should be understood that the method in fig. 6 is a refinement to S130 in the data disaster recovery method shown in fig. 2 by way of example and not limitation, and the embodiments of the present application are not intended to be limited to the specific illustrative scenarios.
For step S130, the embodiment of the present application does not specifically limit the specific operation of the preprocessing.
Optionally, in one implementation, the data is preprocessed by null processing, data normalization (normalization), and class-type feature processing. Wherein the pre-processing operation comprises the steps of:
step S131: and performing class type feature processing on non-numerical data in the data of the first evaluation data set to obtain data subjected to class type feature processing, wherein the class type feature processing is used for converting the non-numerical data into numerical data.
The embodiment of the present application does not specifically limit the specific manner of processing the category-type features.
In one possible implementation, the Categorical type may be used to perform class-type feature processing on the data in the first evaluation data set. Categorical is a data type in the pandas library for Python that maps values to a limited set of discrete categories; it enables efficient data storage and convenient processing of non-numerical data while keeping a meaningful name for each value.
Illustratively, the class-type features contained in the first evaluation data set are first screened out using Python, and then converted into numerical features using the pandas Categorical type.
It should be understood that the above describes class-type feature processing only by taking the pandas Categorical type as an example, and the embodiments of the present application are not limited thereto. In fact, other means may be used instead of Categorical.
Optionally, in an implementation, the first evaluation data set may also be subjected to class-type feature processing by any one of the following ways: serial number encoding, one-hot encoding, binary encoding.
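As a concrete illustration of the Categorical approach above, the following sketch converts the non-numerical columns of a pandas DataFrame into integer codes; the toy column names are hypothetical, and the choice of `cat.codes` over serial number, one-hot, or binary encoding is just one of the options listed.

```python
# A minimal sketch, assuming the first evaluation data set is a pandas
# DataFrame whose class-type features are stored as string (object) columns.
import pandas as pd

def encode_class_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Screen out the non-numerical (class-type) columns.
    for col in df.select_dtypes(include=["object"]).columns:
        # Convert to pandas' Categorical type, then to numerical codes.
        df[col] = df[col].astype("category").cat.codes
    return df

# Usage on a toy data set with one class-type feature.
raw = pd.DataFrame({"business_system": ["billing", "search", "billing"],
                    "access_count": [120, 45, 87]})
print(encode_class_features(raw))
```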
Step S132: and performing null processing on the numerical data to obtain null-processed data, wherein the numerical data comprises original numerical data in the first evaluation data set and numerical data subjected to the category-type feature processing in the step S131.
In other words, suppose the first evaluation data set contains three features A, B, and C (each represented by a field), where feature A is non-numerical and features B and C are numerical. Feature A is first converted from a non-numerical feature into a numerical feature, and then all three features A, B, and C undergo null processing to obtain the null-processed data.
The embodiment of the present application does not specifically limit the specific manner of null value processing.
In a possible implementation, data rows containing null values may be screened out using Python, and the null values of the data set may be interpolated by the Lagrange interpolation method, whose interpolation function (in barycentric form) is:

$$L(x)=\frac{\sum_{i=1}^{n}\frac{\omega_i}{x-x_i}\,y_i}{\sum_{i=1}^{n}\frac{\omega_i}{x-x_i}}$$

where

$$\omega_i=\prod_{j=1,\ j\neq i}^{n}\frac{1}{x_i-x_j}$$

where n represents the total number of rows of the data set, x represents the location of the null value, $x_i$ and $x_j$ represent the values (or positions) of the independent variable, $\omega_i$ represents the barycentric weight, and $y_i$ represents the value at the independent variable (or position) $x_i$.
It should be understood that the foregoing describes null value processing only by taking the Lagrange interpolation method as an example, and the embodiments of the present application are not limited thereto. In fact, other approaches may be used instead of Lagrange interpolation.
Optionally, in an implementation, the null values of the data set may also be interpolated by the Newton interpolation method, the K-nearest-neighbor (KNN) interpolation method, or modified versions thereof. In this regard, the embodiments of the present application are not particularly limited.
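A minimal sketch of the Lagrange-based null filling follows, using SciPy's `scipy.interpolate.lagrange`. Restricting each fit to the k known points nearest the null is an assumption of this sketch (a full-degree Lagrange polynomial oscillates badly on long series), not a requirement of the embodiment.

```python
# A minimal sketch, assuming a single numerical pandas Series with nulls.
import numpy as np
import pandas as pd
from scipy.interpolate import lagrange

def fill_nulls_lagrange(series: pd.Series, k: int = 4) -> pd.Series:
    values = series.to_numpy(dtype=float)
    known = np.flatnonzero(~np.isnan(values))      # positions of known values
    for pos in np.flatnonzero(np.isnan(values)):   # positions of nulls
        # Fit on the k known points closest to the null position.
        nearest = known[np.argsort(np.abs(known - pos))[:k]]
        poly = lagrange(nearest, values[nearest])  # returns a numpy poly1d
        values[pos] = poly(pos)
    return pd.Series(values, index=series.index)

# Usage: the null at position 2 is filled from its neighbors.
print(fill_nulls_lagrange(pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])))
```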
Step S133: and carrying out data normalization processing on the data after null processing to obtain the preprocessed data.
The embodiment of the present application does not specifically limit the specific manner of data normalization processing.
In one possible implementation, the null-processed data may be normalized using z-score normalization in Python, which is computed as follows:

$$z=\frac{x-\mu}{\sigma}$$

where $\mu$ represents the mean of the null-processed data, $\sigma$ represents the standard deviation of the null-processed data, x represents a null-processed value, that is, a value before normalization, and z represents the normalized result.
It should be understood that the above describes data normalization only by taking z-score normalization as an example, and the embodiments of the present application are not limited thereto. In fact, other means may be employed instead of z-score normalization.
Optionally, in an implementation, the null-processed data may also be normalized by min-max normalization, a sigmoid function, or the like. In this regard, the embodiments of the present application are not particularly limited.
Performing data normalization on the null-processed data avoids scale (dimension) effects among the features of the data samples and mitigates overfitting.
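For completeness, a minimal z-score sketch follows, using scikit-learn's StandardScaler, which computes z = (x - μ)/σ column by column; using scikit-learn rather than hand-written code is merely one convenient choice.

```python
# A minimal sketch, assuming the null-processed data is a numerical DataFrame.
import pandas as pd
from sklearn.preprocessing import StandardScaler

def zscore_normalize(df: pd.DataFrame) -> pd.DataFrame:
    scaler = StandardScaler()  # per-column mean 0, standard deviation 1
    return pd.DataFrame(scaler.fit_transform(df),
                        columns=df.columns, index=df.index)
```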
It should be understood that the preprocessing of the first evaluation data set in fig. 6 is only intended to help those skilled in the art understand S130 in the data disaster recovery method shown in fig. 2; the preprocessing operation is not limited to the first evaluation data set. Likewise, the preprocessing operation in fig. 6 is also applicable to the second evaluation data set in the embodiment of the present application. For brevity, the details are not repeated.
To facilitate understanding for those skilled in the art, the data disaster recovery method provided in the present application is described below with reference to a specific example in fig. 7. It should be understood that the example in fig. 7 is only for facilitating understanding of the data disaster recovery method provided in the embodiment of the present application by those skilled in the art, and the embodiment of the present application is not limited to the illustrated specific scenario. It will be apparent to those skilled in the art that various equivalent modifications or variations are possible in light of the example shown in fig. 7, and such modifications or variations are intended to be included within the scope of the embodiments of the present application.
Fig. 7 is a flowchart of an example of a data disaster recovery method according to an embodiment of the present application. It is to be understood that the method of fig. 7 may be applied to the application scenario shown in fig. 1 by way of example and not limitation. The related terms or explanations referred to in fig. 7 may refer to the foregoing description, and are not repeated below. As shown in fig. 7, the method specifically includes the following steps:
S710: node fault early warning.
Exemplarily, after a user configures an open-source big data monitoring platform to monitor the heterogeneous big data distributed cluster system, failures of nodes in the distributed cluster system can be detected. When a node failure is detected, S720 is performed.
S720: judge whether the fault type is a data node fault.
S730: and when the fault type is the data node fault type, acquiring a first evaluation data set and a second evaluation data set.
S740: preprocessing the first evaluation data set to obtain a preprocessed first evaluation data set, and preprocessing the second evaluation data set to obtain a preprocessed second evaluation data set.
Regarding the data preprocessing process, reference may be made to the description in step S130 and fig. 6, and for brevity, the description is not repeated here.
S750: and determining a data recovery priority list according to the service sensitivity evaluation model and the preprocessed first evaluation data set, and determining a node recovery priority list according to the node state evaluation model and the preprocessed second evaluation data set.
For the process of determining the data recovery priority list, reference may be made to the description in step S140 and fig. 4, and for brevity, the description is not repeated here.
For the process of determining the node restoration priority list, reference may be made to the foregoing description in step S150 and fig. 5, and details are not repeated here for brevity.
S760: and determining a fault handling strategy according to the data recovery priority list and the node recovery priority list.
S770: and executing the fault handling strategy.
If the fault type is not the data node fault type, that is, the fault is a name node fault, the data corresponding to the node is not acquired and a fault handling policy is formulated directly. The fault handling policy for a name node fault is simply to make the name node exit safe mode; this policy is not identical to the fault handling policy in S760 described above.
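By way of illustration only, the sketch below issues HDFS's standard safe-mode command from Python. It assumes a deployment where the `hdfs` CLI is on the PATH; this is just one possible way to realize the name node policy described above.

```python
# A minimal sketch: on HDFS, "hdfs dfsadmin -safemode leave" asks the name
# node to exit safe mode.
import subprocess

def leave_safe_mode() -> bool:
    result = subprocess.run(["hdfs", "dfsadmin", "-safemode", "leave"],
                            capture_output=True, text=True)
    return result.returncode == 0
```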
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 8 shows a structural block diagram of a data disaster recovery device provided in the embodiment of the present application, and for convenience of description, only the parts related to the embodiment of the present application are shown.
Referring to fig. 8, the data disaster recovery apparatus 800 is applied to a distributed cluster system, where the distributed cluster system includes a plurality of nodes, and the apparatus 800 includes: a determination module 810, an acquisition module 820, a preprocessing module 830, and an execution module 840.
In some possible implementations, the determining module 810 is configured to determine a fault type and a range of data to be recovered when it is determined that a node in the distributed cluster system fails.
In some possible implementations, the obtaining module 820 is configured to obtain, when the fault type is a data node fault type, a first evaluation data set and a second evaluation data set, where the first evaluation data set includes a traffic sensitivity characteristic and/or a data block access characteristic of the data to be recovered, and the second evaluation data set includes status parameters of nodes other than a faulty data node in the plurality of nodes;
in some possible implementations, the preprocessing module 830 is configured to preprocess the first evaluation data set acquired by the acquiring module 820 to obtain a preprocessed first evaluation data set, and preprocess the second evaluation data set to obtain a preprocessed second evaluation data set;
optionally, the pre-processing operation comprises:
the method comprises the steps of class type feature processing, null value processing and data normalization processing, wherein the class type feature processing is used for converting non-numerical data into numerical data.
In some possible implementations, the determining module 810 is further configured to determine a data recovery priority list according to the preprocessed first evaluation data set and a service sensitivity evaluation model, where the service sensitivity evaluation model is configured to output a priority of data to be recovered according to the preprocessed first evaluation data set, and the data recovery priority list is configured to represent a priority order of the data to be recovered;
optionally, the traffic sensitivity characteristics of the data to be recovered include one or more of the following characteristics: the priority feature of the business system to which the data block belongs, the priority feature of the functional module to which the data block belongs, the business attribute feature of the data block, and the timeliness feature of the data block.
Optionally, the service sensitivity evaluation model is configured to output a priority of data to be recovered according to the preprocessed first evaluation data set, and includes:
the service sensitivity evaluation model is used for outputting the priority of the data to be recovered according to the heat value characteristics of the data to be recovered and the service sensitivity characteristics of the data blocks;
and the heat value characteristic of the data to be recovered is obtained by analyzing the access characteristic of the data block based on a neural network algorithm.
Optionally, the data block access characteristic is determined according to one or more of the following parameters: file name, file operation type, file operation time and file operation authority.
In some possible implementations, the determining module 810 is further configured to determine a node priority list according to the preprocessed second evaluation data set and a node state evaluation model, where the node state evaluation model is configured to evaluate a real-time state of each node according to the preprocessed second evaluation data set and output the node priority list, the node priority list includes at least one node, the at least one node is a node of the plurality of nodes whose state meets a preset condition, and the node priority list is used to characterize a priority order of the at least one node;
optionally, the node state evaluation model is configured to evaluate a real-time state of each node according to the preprocessed second evaluation data set, and output the node priority list, and includes:
the node state evaluation model is used for determining the node priority list according to the state parameters of all the nodes and the historical state evaluation values of all the nodes;
wherein the status parameters include one or more of the following parameters: CPU utilization, memory utilization, disk I/O utilization, network bandwidth utilization, command response time, command queue length, and disk utilization.
In some possible implementations, the determining module 810 is further configured to determine a failure handling policy according to the data recovery priority list and the node priority list, where the failure handling policy is: and sequentially sending the data to be recovered to the at least one node according to the data recovery priority list.
In some possible implementations, the execution module 840 is further configured to execute the failure handling policy.
Optionally, the executing the fault handling policy includes:
and sequentially sending the data to be recovered to the at least one node for processing according to the priority order of the data to be recovered and the priority order of the at least one node.
Optionally, after sequentially sending the data to be restored to the at least one node for processing according to the priority order of the data to be restored and the priority order of the at least one node, the method further includes:
and if the data to be recovered still has residual data, sending the residual data to a node which is finished data processing in the at least one node, wherein the residual data refers to data which is not processed by the at least one node in the data to be recovered.
It should be understood that a data disaster recovery device according to the embodiment of the present application may correspond to the foregoing method embodiment, and the foregoing and other management operations and/or functions of each module in the data disaster recovery device are respectively for implementing corresponding steps of the method in the foregoing method embodiment, so that beneficial effects in the foregoing method embodiment may also be implemented, and for the sake of brevity, details are not described here.
Fig. 9 is a schematic structural diagram of a data disaster recovery apparatus according to an embodiment of the present application. As shown in fig. 9, the apparatus 900 includes: at least one processor 90 (only one is shown in fig. 9), a memory 91, and a computer program 92 stored in the memory 91 and executable on the at least one processor 90. When executing the computer program 92, the processor 90 implements the steps in any of the above-mentioned data disaster recovery method embodiments (such as the methods in fig. 2 to fig. 7).
The apparatus 900 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The data disaster recovery apparatus may include, but is not limited to, the processor 90 and the memory 91. Those skilled in the art will appreciate that fig. 9 is only an example of the apparatus 900 and does not constitute a limitation on the apparatus 900; it may include more or fewer components than shown, combine some components, or arrange components differently, and may further include, for example, an input/output device and a network access device.
The processor 90 may be a Central Processing Unit (CPU); the processor 90 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 91 may in some embodiments be an internal storage unit of the apparatus 900, for example, a hard disk or memory of the apparatus 900 for data disaster recovery. In other embodiments, the memory 91 may be an external storage device of the apparatus 900, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the apparatus 900. Further, the memory 91 may include both an internal storage unit and an external storage device of the apparatus 900. The memory 91 is used to store an operating system, application programs, a Boot Loader, data, and other programs, such as the program code of the computer program. The memory 91 may also be used to temporarily store data that has been output or is to be output.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. For the specific working processes of the units and modules in the system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described herein again.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product, which when running on a mobile terminal, enables the mobile terminal to implement the steps in the above method embodiments when executed.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above may be implemented by instructing relevant hardware through a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments described above may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
In the above embodiments, the description of each embodiment has its own emphasis, and reference may be made to the related description of other embodiments for parts that are not described or recited in any embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A data disaster recovery method is applied to a distributed cluster system, wherein the distributed cluster system comprises a plurality of nodes, and the method comprises the following steps:
when determining that a node in the distributed cluster system fails, determining a failure type and a range of data to be recovered;
when the fault type is a data node fault type, acquiring a first evaluation data set and a second evaluation data set, wherein the first evaluation data set comprises the service sensitivity characteristic and/or the data block access characteristic of the data to be recovered, and the second evaluation data set comprises the state parameters of nodes except for the fault data node in the plurality of nodes;
preprocessing the first evaluation data set to obtain a preprocessed first evaluation data set, and preprocessing the second evaluation data set to obtain a preprocessed second evaluation data set;
determining a data recovery priority list according to the preprocessed first evaluation data set and a service sensitivity evaluation model, wherein the service sensitivity evaluation model is used for outputting the priority of the data to be recovered according to the preprocessed first evaluation data set, and the data recovery priority list is used for representing the priority order of the data to be recovered;
determining a node priority list according to the preprocessed second evaluation data set and a node state evaluation model, wherein the node state evaluation model is used for evaluating the real-time state of each node according to the preprocessed second evaluation data set and outputting the node priority list, the node priority list comprises at least one node, the at least one node is a node of the plurality of nodes, the state of the node meets a preset condition, and the node priority list is used for representing the priority order of the at least one node;
determining a fault handling strategy according to the data recovery priority list and the node priority list, wherein the fault handling strategy is as follows: according to the data recovery priority list, sequentially sending data to be recovered to the at least one node;
and executing the fault handling strategy.
2. The method of claim 1, wherein the traffic sensitivity characteristics of the data to be recovered comprise one or more of the following characteristics: the priority feature of the business system to which the data block belongs, the priority feature of the functional module to which the data block belongs, the business attribute feature of the data block, and the timeliness feature of the data block.
3. The method according to claim 1 or 2, wherein the service sensitivity evaluation model is used for outputting the priority of the data to be recovered according to the preprocessed first evaluation data set, and comprises:
the service sensitivity evaluation model is used for outputting the priority of the data to be recovered according to the heat value characteristics of the data to be recovered and the service sensitivity characteristics of the data to be recovered;
and the heat value characteristic of the data to be recovered is obtained by analyzing the access characteristic of the data block based on a neural network algorithm.
4. The method of claim 3, wherein the data block access characteristic is determined according to one or more of the following parameters: file name, file operation type, file operation time and file operation authority.
5. The method of claim 1, wherein the node state evaluation model is configured to evaluate a real-time state of each node according to the preprocessed second evaluation data set and output the node priority list, and comprises:
the node state evaluation model is used for determining the node priority list according to the state parameters of all the nodes and the historical state evaluation values of all the nodes;
wherein the status parameters include one or more of the following parameters: CPU utilization, memory utilization, disk I/O utilization, network bandwidth utilization, command response time, command queue length, and disk utilization.
6. The method of claim 1, wherein the executing the fault handling policy comprises:
and sequentially sending the data to be recovered to the at least one node for processing according to the priority order of the data to be recovered and the priority order of the at least one node.
7. The method of claim 6, wherein after the data to be recovered is sequentially sent to the at least one node for processing according to the priority order of the data to be recovered and the priority order of the at least one node, the method further comprises:
and if the data to be recovered still has residual data, sending the residual data to a node which is finished data processing in the at least one node, wherein the residual data refers to data which is not processed by the at least one node in the data to be recovered.
8. The method of claim 1, wherein the pre-processing operation comprises one or more of:
the method comprises the steps of class type feature processing, null value processing and data normalization processing, wherein the class type feature processing is used for converting non-numerical data into numerical data.
9. A data disaster recovery apparatus, wherein the apparatus is applied to a distributed cluster system, the distributed cluster system includes a plurality of nodes, and the apparatus includes:
the determining module is used for determining the fault type and the range of the data to be recovered when the node in the distributed cluster system is determined to have a fault;
an obtaining module, configured to obtain a first evaluation data set and a second evaluation data set when the fault type is a data node fault type, where the first evaluation data set includes a service sensitivity characteristic and/or a data block access characteristic of the data to be recovered, and the second evaluation data set includes state parameters of nodes other than a faulty data node in the plurality of nodes;
the preprocessing module is used for preprocessing the first evaluation data set to obtain a preprocessed first evaluation data set and preprocessing the second evaluation data set to obtain a preprocessed second evaluation data set;
the determining module is further configured to determine a data recovery priority list according to the preprocessed first evaluation data set and a service sensitivity evaluation model, where the service sensitivity evaluation model is configured to output a priority of data to be recovered according to the preprocessed first evaluation data set, and the data recovery priority list is configured to represent a priority order of the data to be recovered;
the determining module is further configured to determine a node priority list according to the preprocessed second evaluation data set and a node state evaluation model, where the node state evaluation model is configured to evaluate a real-time state of each node according to the preprocessed second evaluation data set and output the node priority list, the node priority list includes at least one node, the at least one node is a node whose state satisfies a preset condition among the multiple nodes, and the node priority list is used to represent a priority order of the at least one node;
the determining module is further configured to determine a fault handling policy according to the data recovery priority list and the node priority list, where the fault handling policy is: sequentially sending the data to be recovered to the at least one node according to the data recovery priority list;
and the execution module is used for executing the fault processing strategy.
10. An apparatus for disaster recovery of data, comprising a processor and a memory, the processor and the memory being coupled, the memory being configured to store a computer program which, when executed by the processor, causes the apparatus to perform the method of any of claims 1 to 8.
CN202211378302.3A 2022-11-04 2022-11-04 Data disaster tolerance recovery method and device Pending CN115718674A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211378302.3A CN115718674A (en) 2022-11-04 2022-11-04 Data disaster tolerance recovery method and device

Publications (1)

Publication Number Publication Date
CN115718674A 2023-02-28

Family

ID=85254887

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112311A (en) * 2023-10-18 2023-11-24 苏州元脑智能科技有限公司 I/O driven data recovery method, system and device
CN117112311B (en) * 2023-10-18 2024-02-06 苏州元脑智能科技有限公司 I/O driven data recovery method, system and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination