CN116483603A - Fault processing method and device of distributed system, storage medium and electronic equipment - Google Patents

Fault processing method and device of distributed system, storage medium and electronic equipment

Info

Publication number
CN116483603A
Authority
CN
China
Prior art keywords
data
target
application
fault
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310423582.3A
Other languages
Chinese (zh)
Inventor
孙才婵
郑海青
唐月标
黄镜澄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310423582.3A
Publication of CN116483603A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a fault processing method and device for a distributed system, a storage medium, and electronic equipment, and relates to the field of distributed systems. The method comprises: acquiring first alarm information and analyzing it to obtain a target association relationship, wherein the target association relationship is formed by defining unified metadata for a plurality of distributed systems with different functions; generating fault location information corresponding to the first alarm information according to the target association relationship, wherein the fault location information comprises at least position information of the failed application in the distributed system; and processing the failed application according to the position information. The invention solves the prior-art technical problem of low fault location efficiency when faults are investigated through a distributed monitoring system, which is caused by the inconsistent basic data used by distributed systems with different functions.

Description

Fault processing method and device of distributed system, storage medium and electronic equipment
Technical Field
The present invention relates to the field of distributed systems, and in particular, to a method and apparatus for fault handling in a distributed system, a storage medium, and an electronic device.
Background
A distributed system typically involves several different system architectures that support the running and invocation of distributed services, such as an infrastructure cloud (IaaS) system, an application platform cloud (PaaS) system, a distributed calling system, a distributed batch system, a distributed storage system, and a distributed monitoring system.
At present, each of these systems is deployed and operated independently, and different systems use different basic definition data. As a result, when the distributed monitoring operation and maintenance system monitors distributed applications, it cannot monitor the different systems synchronously in real time. In particular, when an application anomaly is investigated through the distributed monitoring system, the application node has to be located step by step from information such as IP addresses. The location process is complex and time-consuming, so fault location efficiency is low.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a fault processing method, a device, a storage medium and electronic equipment of a distributed system, which at least solve the technical problem of low fault positioning efficiency when a distributed monitoring system is used for troubleshooting a fault problem due to inconsistent basic data used by the distributed systems with different functions in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a fault handling method of a distributed system, including: acquiring first alarm information, and analyzing the first alarm information to obtain a target association relationship, wherein the target association relationship is generated by defining metadata of a plurality of distributed systems with different functions; generating fault positioning information corresponding to the first alarm information according to the target association relation, wherein the fault positioning information at least comprises position information of an application with a fault in the distributed system; and processing the application with the fault according to the position information.
Further, generating fault location information corresponding to the first alarm information according to the target association relation, including: determining target data of the failed application from a plurality of distributed systems with different functions according to the target association relation; and generating fault positioning information corresponding to the first alarm information according to the target data.
Further, before the first alarm information is acquired, the method further includes: determining a target data structure according to the data type of first data of an application in the distributed system, wherein the first data represents basic information of the application; according to the target data structure, metadata definition is carried out on the first data in the target component, and target metadata is generated, wherein the target metadata are used for describing attribute information of the application; and transmitting the target metadata to a plurality of distributed systems with different functions through the target component to generate a target association relation.
Further, before metadata definition is performed on the first data in the target component according to the target data structure, the method further includes: in the process of deploying the PaaS platform associated with the distributed system, the data value of the first data is injected into a first environment variable of the PaaS platform; in the process of deploying the IaaS platform associated with the distributed system, the data value of the first data is injected into a second environment variable of the IaaS platform.
Further, according to the target data structure, metadata definition is performed on the first data in the target component, and target metadata is generated, including: reading, by the target component, the data value in the first environment variable and the data value in the second environment variable; and assigning the data parameters in the target data structure according to the data values in the first environment variable and the data values in the second environment variable to generate target metadata.
Further, the target data structure includes a first data structure for defining application node information of the application, a second data structure for defining container deployment information of the application, and a third data structure for defining physical location information of the application.
Further, the data parameters in the first data structure include at least one of: the application name, the cluster name, and the service node name; the data parameters in the second data structure include at least one of: the container ID and the container name; and the data parameters in the third data structure include at least one of: the area to which the application belongs, the application IP, and the campus to which the application belongs.
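As a rough illustration of the three data structures and the parameters listed above, the following Python sketch models them as dataclasses. The class names, field names, and sample values are hypothetical and chosen only for illustration; the source does not prescribe any particular representation.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AppNodeInfo:
    """First data structure: application node information (names assumed)."""
    app_name: str
    cluster_name: str
    service_node_name: str


@dataclass(frozen=True)
class ContainerInfo:
    """Second data structure: container deployment information (names assumed)."""
    container_id: str
    container_name: str


@dataclass(frozen=True)
class PhysicalLocation:
    """Third data structure: physical location information (names assumed)."""
    region: str
    app_ip: str
    campus: str


@dataclass(frozen=True)
class TargetMetadata:
    """Hypothetical container grouping the three structures for one application."""
    node: AppNodeInfo
    container: ContainerInfo
    location: PhysicalLocation

    def key(self):
        # A shared key that every system could use to refer to the same node.
        return (self.node.app_name, self.node.cluster_name,
                self.node.service_node_name)


# Sample values are invented for the example.
meta = TargetMetadata(
    node=AppNodeInfo("app-a", "cluster-b", "node-c"),
    container=ContainerInfo("c-001", "app-a-0"),
    location=PhysicalLocation("region-1", "10.0.0.12", "campus-1"),
)
```

Calling `asdict(meta)` yields a nested dictionary, which is one plausible form in which the same metadata could be handed to each system.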
According to another aspect of the embodiment of the present invention, there is also provided a fault handling apparatus of a distributed system, including: the acquisition module is used for acquiring the first alarm information and analyzing the first alarm information to obtain a target association relationship, wherein the target association relationship is generated by defining metadata of a plurality of distributed systems with different functions; the generation module is used for generating fault positioning information corresponding to the first alarm information according to the target association relation, wherein the fault positioning information at least comprises the position information of the application with the fault in the distributed system; and the processing module is used for processing the failed application according to the position information.
According to another aspect of the embodiments of the present invention, there is also provided a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described fault handling method of a distributed system when run.
According to another aspect of an embodiment of the present invention, there is also provided an electronic device including one or more processors; and a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to perform the fault handling method of the distributed system described above.
In the embodiment of the invention, rapid fault location is achieved by unifying the metadata of the distributed systems: first alarm information is acquired and analyzed to obtain a target association relationship; fault location information corresponding to the first alarm information is then generated according to the target association relationship; and the failed application is processed according to the position information. The target association relationship is generated by defining metadata of a plurality of distributed systems with different functions, and the fault location information comprises at least position information of the failed application in the distributed system.
In this process, acquiring the first alarm information provides a data basis for fault analysis, and analyzing it yields the target association relationship, from which the fault location information corresponding to the first alarm information can be generated. The failed application in the distributed system is thus located promptly, which improves fault location efficiency, allows the current fault to be resolved in time, and improves the reliability of the system.
Therefore, by unifying the metadata of the distributed systems, the technical solution of the invention allows operation and maintenance location analysis to be carried out quickly once an application running in any one of the distributed systems has a problem. This achieves the technical effect of improving fault location efficiency and solves the prior-art technical problem of low fault location efficiency caused by the inconsistent basic data used by distributed systems with different functions.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of an alternative fault handling method for a distributed system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the metadata of an alternative unified distributed system according to an embodiment of the invention;
FIG. 3 is a flow chart of an alternative parameter setting of metadata in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative metadata data structure in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of data parameters of an alternative application-related data structure, according to an embodiment of the invention;
FIG. 6 is a schematic diagram of data parameters of an alternative container-related data structure, in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of data parameters of an alternative physical deployment-related data structure, in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of an alternative distributed system fault handling apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that can be obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, the related information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present invention are information and data authorized by the user or sufficiently authorized by each party. For example, an interface is provided between the system and the relevant user or institution, before acquiring the relevant information, the system needs to send an acquisition request to the user or institution through the interface, and acquire the relevant information after receiving the consent information fed back by the user or institution.
Example 1
In accordance with an embodiment of the present invention, there is provided an embodiment of a fault handling method for a distributed system, it being noted that the steps shown in the flowcharts of the figures may be performed in a computer system, such as a set of computer executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different than what is shown or described herein.
FIG. 1 is a flow chart of an alternative fault handling method for a distributed system according to an embodiment of the present invention. As shown in FIG. 1, the method comprises the following steps:
Step S101, first alarm information is obtained, and the first alarm information is analyzed to obtain a target association relationship, wherein the target association relationship is generated by defining metadata of a plurality of distributed systems with different functions.
In the above step, the first alarm information may be obtained by an application system, a processor, an electronic device, or the like. Optionally, the first alarm information is acquired through the distributed monitoring system, and it may be information generated when an application in the distributed system behaves abnormally. Optionally, the target association relationship may be the data association relationship of an application across a plurality of distributed systems with different functions. For example, the data association relationship of application A is generated from data of application A such as its host, IP address, virtual machine, service analysis, exception analysis, and call volume analysis. Optionally, the plurality of distributed systems with different functions includes at least a distributed calling system, a distributed monitoring system, a distributed batch system, a distributed storage system, and a distributed log platform.
Step S102, generating fault positioning information corresponding to the first alarm information according to the target association relation, wherein the fault positioning information at least comprises position information of an application with a fault in the distributed system.
Specifically, fault location information corresponding to the first alarm information may be generated according to the target association relationship. For example, assume that application node C in cluster B of application A fails. From data such as the service analysis, exception analysis, and call volume analysis in the data association relationship of application A, the fault can be quickly located to application node C, and information such as the cause of the fault at application node C can be quickly obtained from the same data association relationship.
Step S103, processing the application with the fault according to the position information.
Specifically, according to the position information, the failed application can be processed, so that the current failure is solved in time, the time cost is saved, and the reliability of the system can be improved.
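Steps S101 to S103 can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the `Alarm` class, the `ASSOCIATIONS` table, the function names, the sample records, and the "restart container" action are all hypothetical, since the source does not specify how the association relationship is stored or how a failed application is processed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm record carrying the unified metadata fields."""
    app_name: str
    cluster_name: str
    service_node_name: str
    message: str


# Hypothetical target association relationship: one unified metadata key
# maps to the records that each distributed system holds for that node.
ASSOCIATIONS = {
    ("app-a", "cluster-b", "node-c"): {
        "monitoring": {"anomaly": "timeout rate above threshold"},
        "paas": {"container_id": "c-001", "container_name": "app-a-0"},
        "iaas": {"region": "region-1", "app_ip": "10.0.0.12",
                 "campus": "campus-1"},
    },
}


def parse_alarm(alarm, associations):
    """Step S101: analyze the alarm to obtain its association relationship."""
    key = (alarm.app_name, alarm.cluster_name, alarm.service_node_name)
    return associations.get(key, {})


def locate_fault(alarm, association):
    """Step S102: build fault location information from the association."""
    location = {
        "app_name": alarm.app_name,
        "cluster_name": alarm.cluster_name,
        "service_node_name": alarm.service_node_name,
    }
    # Merge what each system knows about this node into one record.
    for record in association.values():
        location.update(record)
    return location


def handle_fault(location):
    """Step S103: pick an action for the failed application (assumed policy)."""
    if "container_id" in location:
        ip = location.get("app_ip", "unknown address")
        return f"restart container {location['container_id']} at {ip}"
    return "escalate to operator"
```

Because every system's record sits under the same key, the lookup in `parse_alarm` replaces the step-by-step search by IP address that the Background describes.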
Based on the scheme defined in steps S101 to S103 above, in the embodiment of the present invention rapid fault location is achieved by unifying the metadata of the distributed systems: first alarm information is acquired and analyzed to obtain a target association relationship; fault location information corresponding to the first alarm information is then generated according to the target association relationship; and the failed application is processed according to the position information. The target association relationship is generated by defining metadata of a plurality of distributed systems with different functions, and the fault location information comprises at least position information of the failed application in the distributed system.
It is easy to see that, in the above process, acquiring the first alarm information provides a data basis for fault analysis, and analyzing it yields the target association relationship, from which the fault location information corresponding to the first alarm information can be generated. The failed application in the distributed system is thus located promptly, which improves fault location efficiency, allows the current fault to be resolved in time, and improves the reliability of the system.
Therefore, by unifying the metadata of the distributed systems, the technical solution of the invention allows operation and maintenance location analysis to be carried out quickly once an application running in any one of the distributed systems has a problem. This achieves the technical effect of improving fault location efficiency and solves the prior-art technical problem of low fault location efficiency caused by the inconsistent basic data used by distributed systems with different functions.
In an alternative embodiment, generating fault location information corresponding to the first alarm information according to the target association relationship includes: determining target data of the failed application from a plurality of distributed systems with different functions according to the target association relation; and generating fault positioning information corresponding to the first alarm information according to the target data.
Specifically, to generate the fault location information corresponding to the first alarm information according to the target association relationship, the target data of the failed application is first determined from the plurality of distributed systems with different functions according to the target association relationship, and the fault location information corresponding to the first alarm information is then generated from the target data. Optionally, the target data may be position data, operation data, and the like of the application.
It should be noted that the metadata definition provides effective data support for the distributed monitoring operation and maintenance system. Because all the distributed systems use the same metadata definition, they share a consistent data association relationship, so that once an application running in any one of the distributed systems has a problem, operation and maintenance location analysis can be performed quickly and fault location efficiency is improved.
In an alternative embodiment, before the first alarm information is acquired, the method further includes: determining a target data structure according to the data type of first data of an application in the distributed system, wherein the first data characterizes basic information of the application; performing metadata definition on the first data in the target component according to the target data structure to generate target metadata, wherein the target metadata is used for describing attribute information of the application; and transmitting the target metadata to the plurality of distributed systems with different functions through the target component to generate the target association relationship.
Optionally, the first data may be basic data of an application, the data type may be application-related, container-related, or physical-deployment-related, and the target component may be Commons (a software toolkit). In particular, the target data structure may be determined according to the data type of the basic data of the application in the distributed system, e.g. an application-related data structure, a container-related data structure, or a physical-deployment-related data structure.
Specifically, metadata definition is performed on the basic data in Commons according to the target data structure, so as to generate target metadata. FIG. 2 is a schematic diagram of the metadata of an optional unified distributed system according to an embodiment of the present invention. As shown in FIG. 2, performing metadata definition on the basic data in Commons yields a data structure entity component (i.e. Commons) that carries the metadata definition and can generate the target metadata. The target metadata is then transmitted through Commons to the plurality of distributed systems with different functions (the distributed calling system, distributed monitoring system, distributed batch system, distributed storage system, and distributed log platform) to generate the target association relationship.
It should be noted that unifying the metadata in the above process provides effective data support for the distributed monitoring operation and maintenance system, so that once an application running in any one of the distributed systems has a problem, operation and maintenance location analysis can be performed quickly and fault location efficiency is improved.
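The idea of one component handing the same metadata to every system can be sketched as below. This is a minimal Python illustration, not the actual Commons toolkit: the `Commons` class, its `register` and `publish` methods, the `associate` helper, and the dictionary-based system stores are all hypothetical stand-ins.

```python
class Commons:
    """Hypothetical stand-in for the shared component that holds the metadata."""

    def __init__(self, metadata):
        self._metadata = dict(metadata)
        self._systems = {}

    def register(self, system_name, store):
        """Attach a system; `store` is a dict standing in for that system."""
        self._systems[system_name] = store

    def publish(self):
        """Write the same metadata record into every registered system."""
        m = self._metadata
        key = (m["app_name"], m["cluster_name"], m["service_node_name"])
        for store in self._systems.values():
            store[key] = dict(m)
        return key


def associate(key, stores):
    """List the systems that hold a record under the shared key."""
    return sorted(name for name, store in stores.items() if key in store)


# Invented sample data: three of the systems named in the text.
stores = {"calling": {}, "monitoring": {}, "log": {}}
commons = Commons({"app_name": "app-a", "cluster_name": "cluster-b",
                   "service_node_name": "node-c", "app_ip": "10.0.0.12"})
for name, store in stores.items():
    commons.register(name, store)
shared_key = commons.publish()
```

After `publish()`, every store contains an identical record under `shared_key`, which is what makes a cross-system lookup possible in this sketch.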
In an alternative embodiment, before metadata definition is performed on the first data in the target component according to the target data structure and the target metadata is generated, the data value of the first data is injected into the first environment variable of the PaaS platform in the process of deploying the PaaS platform associated with the distributed system; in the process of deploying the IaaS platform associated with the distributed system, the data value of the first data is injected into a second environment variable of the IaaS platform.
Specifically, as shown in fig. 2, before metadata definition is performed on first data in a target component according to a target data structure and target metadata is generated, in the process of deploying a PaaS platform associated with a distributed system, referring to a format defined by the metadata, a data value of the first data (i.e., base data) is injected into an environment variable (i.e., a first environment variable) of the PaaS platform; in deploying the IaaS platform associated with the distributed system, the data values of the base data are injected into the environment variables (i.e., the second environment variables) of the IaaS platform with reference to the format defined by the metadata.
Specifically, PaaS stands for Platform as a Service, a business model in which a platform is provided as a service. Whereas SaaS (Software as a Service) provides a program as a service over a network, PaaS provides a server platform or a development environment as a service.
Specifically, IaaS stands for Infrastructure as a Service, a service mode in which IT infrastructure is provided externally as a service over a network and is charged according to the user's actual usage or occupancy of resources. In this service mode, users generally do not build hardware facilities such as data centers themselves, but rent computing infrastructure services, including servers, storage, and networks, from an IaaS provider over the Internet.
It should be noted that injecting the data value of the first data into the first environment variable of the PaaS platform while deploying the PaaS platform associated with the distributed system, and into the second environment variable of the IaaS platform while deploying the IaaS platform associated with the distributed system, provides a data basis for assigning the data parameters in the target data structure through the target component. Consequently, once an application running in any one of the distributed systems has a problem, operation and maintenance location analysis can be performed quickly, improving fault location efficiency.
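The deployment-time injection described above might look roughly like the following Python sketch. The `META_` prefix, the variable names, the `to_env` helper, and the split of fields between the PaaS and IaaS platforms are assumptions made for the example; the source only states that the values are injected in the format defined by the metadata.

```python
# Assumed split: which base-data fields each platform injects.
PAAS_FIELDS = ("app_name", "cluster_name", "service_node_name",
               "container_id", "container_name")
IAAS_FIELDS = ("region", "app_ip", "campus")


def to_env(base_data, fields, prefix="META_"):
    """Render selected base-data fields as environment variable pairs."""
    return {
        prefix + name.upper(): str(base_data[name])
        for name in fields
        if name in base_data
    }


# Invented sample base data for one application instance.
base_data = {
    "app_name": "app-a",
    "cluster_name": "cluster-b",
    "service_node_name": "node-c",
    "container_id": "c-001",
    "container_name": "app-a-0",
    "region": "region-1",
    "app_ip": "10.0.0.12",
    "campus": "campus-1",
}

paas_env = to_env(base_data, PAAS_FIELDS)  # "first environment variable" values
iaas_env = to_env(base_data, IAAS_FIELDS)  # "second environment variable" values
```

A deployment script could then pass `paas_env` and `iaas_env` to whatever mechanism each platform uses to set environment variables.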
In an alternative embodiment, metadata definition is performed on the first data in the target component according to the target data structure, and generating the target metadata includes: reading, by the target component, the data value in the first environment variable and the data value in the second environment variable; and assigning the data parameters in the target data structure according to the data values in the first environment variable and the data values in the second environment variable to generate target metadata.
Specifically, after the PaaS platform and the IaaS platform complete data injection, the basic data in the environment variables is read through the Commons component; that is, the data values in the first environment variable and the data values in the second environment variable are read. The data parameters in the target data structure are then assigned according to these data values, thereby generating the target metadata.
Fig. 3 is a flowchart of optional metadata parameter setting according to an embodiment of the present invention. As shown in Fig. 3, when the Commons component reads and sets values for the metadata, the parameters are read in the order start parameter → environment variable → default value, where the start parameter may be user-defined parameter data, the environment variables are the first environment variable and the second environment variable, and the default value is a default set for a required parameter. The sources are read in this order, and as soon as a value is read it is adopted (that is, assigned). If no value is read from any source and the field is required, an error is reported and the process ends.
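The read order described for Fig. 3 can be sketched as follows. This is a minimal illustration, assuming each source is a simple key-value mapping; the function name `resolve_param`, the exception `MissingRequiredParameter`, and the parameter names are hypothetical and do not come from the source.

```python
from typing import Mapping, Optional


class MissingRequiredParameter(Exception):
    """Raised when a required field has no value in any source."""


def resolve_param(name: str,
                  start_params: Mapping[str, str],
                  env: Mapping[str, str],
                  defaults: Mapping[str, str],
                  required: bool = True) -> Optional[str]:
    """Read one parameter in the order start parameter -> environment
    variable -> default value, adopting the first value found."""
    for source in (start_params, env, defaults):
        value = source.get(name)
        if value is not None:
            return value  # adopt (assign) the first value that is read
    if required:
        # Nothing was read and the field is required: report an error.
        raise MissingRequiredParameter(name)
    return None


# Made-up sources for illustration.
start_params = {"APP_NAME": "from-start"}
env = {"APP_NAME": "from-env", "CLUSTER_NAME": "from-env"}
defaults = {"LOGICAL_UNIT": "default-unit"}

app_name = resolve_param("APP_NAME", start_params, env, defaults)
cluster_name = resolve_param("CLUSTER_NAME", start_params, env, defaults)
logical_unit = resolve_param("LOGICAL_UNIT", start_params, env, defaults)
```

Because `env` is only required to be a mapping, the process environment (`os.environ`) could be passed in directly; that choice is an assumption of this sketch rather than something the source states.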
It should be noted that assigning values to the data parameters in the target data structure according to the data values in the first environment variable and the data values in the second environment variable populates the metadata. After the assignment, the basic data can be output uniformly to each distributed system, so that every distributed system uses the same metadata and the metadata of the distributed systems is unified. As a result, when a problem occurs in an application running in any distributed system, operation and maintenance positioning analysis can be performed quickly, and fault positioning efficiency is improved.
In an alternative embodiment, the target data structure comprises a first data structure for defining application node information of the application, a second data structure for defining container deployment information of the application, and a third data structure for defining physical location information of the application.
Specifically, the first data structure is an application-related data structure, the second data structure is a container-related data structure, and the third data structure is a physical-deployment-related data structure. The metadata divides the basic data into three major categories (application-related, container-related, and physical-deployment-related), and these three categories of data clearly define information such as the running nodes and deployment areas in the distributed system. Fig. 4 is a schematic diagram of an optional metadata data structure according to an embodiment of the present invention. Structure 1 is the application-related data structure, which defines the application node information of an application and includes basic information such as the application name, cluster, logical unit, and node name; with this information, the application can be located at the logical level. Structure 2 is the container-related data structure, which defines the information of application nodes deployed in containers (that is, the container deployment information of the application), including the cluster in which the container is deployed, the container template name, the container ID, the POD name, the PaaS template name, the container name, and the like; with this information, the position of a node running on the PaaS system can be located. Structure 3 is the physical-deployment-related data structure, which defines the physical location information of the application deployment, including the area, region, park, IP address, physical unit, and the like; with this information, the physical deployment location at which the application node runs can be located.
In an alternative embodiment, the data parameters in the first data structure include at least one of: the application name, the cluster name, and the service node name; the data parameters in the second data structure include at least one of: the container ID and the container name; and the data parameters in the third data structure include at least one of: the area to which the application belongs, the application IP, and the park to which the application belongs.
Specifically, the data parameters in the first data structure include at least one of the application name, the cluster name, and the service node name, and may further include a logical unit. Fig. 5 is a schematic diagram of data parameters of an optional application-related data structure according to an embodiment of the present invention. As shown in Fig. 5, structure 1 (the application-related data structure) is subdivided into 4 data parameters: data 101 is the application name (the abbreviated name of the application), which uniquely identifies an application and is a key attribute in the metadata; data 102 is the cluster name, where nodes in the same cluster are peer nodes and a cluster is the next-level subdivision of the application that independently implements a specific function; data 103 is the application service node name, e.g., the name of a component node for a group of services; and data 104 is the logical unit, which distinguishes the logical location at which a unitized node is deployed.
Specifically, the data parameters in the second data structure include at least one of the container ID and the container name, and may further include a template name and a POD name. Fig. 6 is a schematic diagram of data parameters of an optional container-related data structure according to an embodiment of the present invention. As shown in Fig. 6, structure 2 (the container-related data structure) is subdivided into 4 data parameters: data 201 is the template name, e.g., the PaaS template name; data 202 is the POD name, which consists of the application name, the cluster name, the park to which the application belongs, and a random number, e.g., ftas-acco-batch-wgq-6f8f978ccc-26xqm; data 203 is the container ID, which uniquely distinguishes containers; and data 204 is the container name, i.e., the actual name of the container.
Specifically, the data parameters in the third data structure include at least one of the area to which the application belongs, the application IP, and the park to which the application belongs, and may further include a physical unit. Fig. 7 is a schematic diagram of data parameters of an optional physical-deployment-related data structure according to an embodiment of the present invention. As shown in Fig. 7, structure 3 (the physical-deployment-related data structure) is subdivided into 5 data parameters: data 301 is the area to which the application belongs, which distinguishes area information; data 302 is the region in which the application is deployed, e.g., Beijing, Shanghai, etc.; data 303 is the park in which the application is deployed, e.g., park a, park b, park c, etc.; data 304 is the application IP; and data 305 is the physical unit, which distinguishes the physical location at which a unitized node is deployed.
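The three structures and their data parameters can be pictured as plain record types. The Python sketch below is one possible rendering, not the embodiment's actual data model: the class names, field names, the choice to make only the application name mandatory, and the `flatten` helper are assumptions added for illustration. The comments map each field to the data numbers used in Figs. 5 to 7.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ApplicationInfo:                        # structure 1
    app_name: str                             # data 101 (key attribute)
    cluster_name: Optional[str] = None        # data 102
    service_node_name: Optional[str] = None   # data 103
    logical_unit: Optional[str] = None        # data 104


@dataclass(frozen=True)
class ContainerInfo:                          # structure 2
    template_name: Optional[str] = None       # data 201
    pod_name: Optional[str] = None            # data 202
    container_id: Optional[str] = None        # data 203
    container_name: Optional[str] = None      # data 204


@dataclass(frozen=True)
class PhysicalInfo:                           # structure 3
    area: Optional[str] = None                # data 301
    region: Optional[str] = None              # data 302
    park: Optional[str] = None                # data 303
    ip: Optional[str] = None                  # data 304
    physical_unit: Optional[str] = None       # data 305


@dataclass(frozen=True)
class TargetMetadata:
    application: ApplicationInfo
    container: ContainerInfo
    physical: PhysicalInfo

    def flatten(self) -> Dict[str, str]:
        """Return the populated fields as 'group.field' keys (hypothetical
        output format for passing the same metadata to every system)."""
        flat: Dict[str, str] = {}
        for group, fields in asdict(self).items():
            for field_name, value in fields.items():
                if value is not None:
                    flat[f"{group}.{field_name}"] = value
        return flat


# Made-up values for illustration.
metadata = TargetMetadata(
    application=ApplicationInfo("demo-app", cluster_name="batch"),
    container=ContainerInfo(container_id="c-1"),
    physical=PhysicalInfo(park="park-a", ip="10.0.0.8"),
)
flat = metadata.flatten()
```

Keeping the three groups as separate types mirrors the logical, container, and physical levels at which the text says an application can be located.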
It should be noted that, in this embodiment, fast fault positioning is achieved by unifying the metadata of the distributed systems. This solves the problem that faults are difficult to locate during monitoring and troubleshooting because the basic data is inconsistent between distributed systems, and it makes monitoring more convenient.
Therefore, with the technical solution of the present invention, unifying the metadata of the distributed systems makes it possible to perform operation and maintenance positioning analysis quickly when a problem occurs in an application running in any one of the distributed systems. This achieves the technical effect of improving fault positioning efficiency and solves the technical problem in the prior art of low fault positioning efficiency caused by inconsistent basic data used by distributed systems with different functions.
Example 2
According to an embodiment of the present invention, there is provided an embodiment of a fault handling apparatus of a distributed system, where fig. 8 is a schematic diagram of an alternative fault handling apparatus of a distributed system according to an embodiment of the present invention, as shown in fig. 8, and the apparatus includes: the obtaining module 801 is configured to obtain first alarm information, and analyze the first alarm information to obtain a target association relationship, where the target association relationship is generated by defining metadata of a plurality of distributed systems with different functions; a generating module 802, configured to generate fault location information corresponding to the first alarm information according to the target association relationship, where the fault location information at least includes location information of an application that has a fault in the distributed system; and the processing module 803 is used for processing the failed application according to the position information.
It should be noted that the above-mentioned obtaining module 801, generating module 802, and processing module 803 correspond to steps S101 to S103 in the above-mentioned embodiment. The examples and application scenarios implemented by the three modules are the same as those of the corresponding steps, but are not limited to what is disclosed in embodiment 1.
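The flow handled by the three modules can be sketched in Python as below. This is a loose illustration under several assumptions, not the apparatus itself: each distributed system is modeled as a list of records tagged with the same metadata keys, the target association relationship is modeled as the key-value pairs shared by the alarm and those records, and the names `analyze_alarm`, `generate_fault_location`, `process_fault`, the field names, and the "restart" action are all hypothetical. The source does not specify how the failed application is processed.

```python
from typing import Dict, List, Mapping

Record = Mapping[str, str]

# Hypothetical metadata keys that link records across systems.
ASSOCIATION_KEYS = ("app_name", "node_name")
# Hypothetical fields that describe where the application runs.
LOCATION_FIELDS = ("pod_name", "container_id", "park", "ip")


def analyze_alarm(alarm: Record) -> Dict[str, str]:
    """Obtaining module (sketch): extract the association from the alarm."""
    return {k: alarm[k] for k in ASSOCIATION_KEYS if k in alarm}


def generate_fault_location(association: Mapping[str, str],
                            systems: Mapping[str, List[Record]]) -> Dict[str, str]:
    """Generating module (sketch): collect location fields from matching
    records in every system."""
    location: Dict[str, str] = {}
    if not association:
        return location  # nothing to match on
    for records in systems.values():
        for record in records:
            if all(record.get(k) == v for k, v in association.items()):
                for field_name in LOCATION_FIELDS:
                    if field_name in record:
                        location.setdefault(field_name, record[field_name])
    return location


def process_fault(location: Mapping[str, str]) -> Dict[str, str]:
    """Processing module (sketch): return a made-up action description."""
    return {"action": "restart", "target": location.get("pod_name", "unknown")}


# Made-up records from two systems that share the same metadata.
systems = {
    "logging": [
        {"app_name": "demo-app", "node_name": "node-1", "pod_name": "demo-pod-1"},
        {"app_name": "other-app", "node_name": "node-9", "pod_name": "other-pod"},
    ],
    "monitoring": [
        {"app_name": "demo-app", "node_name": "node-1", "park": "park-a", "ip": "10.0.0.8"},
    ],
}
alarm = {"app_name": "demo-app", "node_name": "node-1", "message": "timeout"}

association = analyze_alarm(alarm)
location = generate_fault_location(association, systems)
action = process_fault(location)
```

The point of the sketch is that the join across systems only works because every system carries the same metadata keys, which is the property the unified metadata is meant to provide.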
Optionally, the generating module includes: the first determining module is used for determining target data of the failed application from a plurality of distributed systems with different functions according to the target association relation; the first generation module is used for generating fault positioning information corresponding to the first alarm information according to the target data.
Optionally, the fault handling device of the distributed system further comprises: the second determining module is used for determining a target data structure according to the data type of first data of the application in the distributed system before the first alarm information is acquired, wherein the first data represents basic information of the application; the second generation module is used for defining metadata of the first data in the target component according to the target data structure to generate target metadata, wherein the target metadata are used for describing attribute information of the application; and the third generation module is used for transmitting the target metadata to a plurality of distributed systems with different functions through the target component to generate a target association relation.
Optionally, the fault handling device of the distributed system further comprises: the first processing module, which is used for injecting the data value of the first data into a first environment variable of the PaaS platform in the process of deploying the PaaS platform associated with the distributed system, before metadata definition is performed on the first data in the target component according to the target data structure to generate target metadata; and the second processing module, which is used for injecting the data value of the first data into a second environment variable of the IaaS platform in the process of deploying the IaaS platform associated with the distributed system.
Optionally, the second generating module includes: the reading module is used for reading the data value in the first environment variable and the data value in the second environment variable through the target component; and the fourth generation module is used for assigning values to the data parameters in the target data structure according to the data values in the first environment variable and the data values in the second environment variable to generate target metadata.
Optionally, the target data structure includes a first data structure, a second data structure, and a third data structure, wherein the first data structure is used for defining application node information of the application, the second data structure is used for defining container deployment information of the application, and the third data structure is used for defining physical location information of the application.
Optionally, the data parameters in the first data structure include at least one of: the application name, the cluster name, and the service node name; the data parameters in the second data structure include at least one of: the container ID and the container name; and the data parameters in the third data structure include at least one of: the area to which the application belongs, the application IP, and the park to which the application belongs.
Example 3
According to another aspect of the embodiments of the present invention, there is also provided a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to execute the above-described fault handling method of the distributed system when running.
Example 4
According to another aspect of an embodiment of the present invention, there is also provided an electronic device. Fig. 9 is a schematic diagram of an optional electronic device according to an embodiment of the present invention. As shown in Fig. 9, the electronic device includes one or more processors and a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the fault handling method of the distributed system described above. When executing the program, the processor implements the following steps: acquiring first alarm information, and analyzing the first alarm information to obtain a target association relationship, wherein the target association relationship is generated by defining metadata of a plurality of distributed systems with different functions; generating fault positioning information corresponding to the first alarm information according to the target association relationship, wherein the fault positioning information at least comprises position information of an application with a fault in the distributed system; and processing the application with the fault according to the position information.
Optionally, generating fault location information corresponding to the first alarm information according to the target association relationship includes: determining target data of the failed application from a plurality of distributed systems with different functions according to the target association relation; and generating fault positioning information corresponding to the first alarm information according to the target data.
Optionally, before acquiring the first alarm information, determining a target data structure according to a data type of first data of an application in the distributed system, wherein the first data represents basic information of the application; according to the target data structure, metadata definition is carried out on the first data in the target component, and target metadata is generated, wherein the target metadata are used for describing attribute information of the application; and transmitting the target metadata to a plurality of distributed systems with different functions through the target component to generate a target association relation.
Optionally, before metadata definition is performed on the first data in the target component according to the target data structure to generate target metadata, in the process of deploying the PaaS platform associated with the distributed system, injecting the data value of the first data into the first environment variable of the PaaS platform; in the process of deploying the IaaS platform associated with the distributed system, the data value of the first data is injected into a second environment variable of the IaaS platform.
Optionally, according to the target data structure, metadata definition is performed on the first data in the target component, and generating target metadata includes: reading, by the target component, the data value in the first environment variable and the data value in the second environment variable; and assigning the data parameters in the target data structure according to the data values in the first environment variable and the data values in the second environment variable to generate target metadata.
Optionally, the target data structure includes a first data structure, a second data structure, and a third data structure, wherein the first data structure is used for defining application node information of the application, the second data structure is used for defining container deployment information of the application, and the third data structure is used for defining physical location information of the application.
Optionally, the data parameters in the first data structure include at least one of: the application name, the cluster name, and the service node name; the data parameters in the second data structure include at least one of: the container ID and the container name; and the data parameters in the third data structure include at least one of: the area to which the application belongs, the application IP, and the park to which the application belongs.
The device herein may be a server, PC, PAD, cell phone, etc.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The above-described apparatus embodiments are merely exemplary. For example, the division of the units may be a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between parts may be made through some interfaces, units or modules, and may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims (10)

1. A method for fault handling in a distributed system, comprising:
acquiring first alarm information, and analyzing the first alarm information to obtain a target association relationship, wherein the target association relationship is generated by defining metadata of a plurality of distributed systems with different functions;
generating fault positioning information corresponding to the first alarm information according to the target association relation, wherein the fault positioning information at least comprises position information of an application with a fault in the distributed system;
and processing the application with the fault according to the position information.
2. The method of claim 1, wherein generating fault location information corresponding to the first alarm information based on the target association relationship comprises:
determining target data of the failed application from the plurality of distributed systems with different functions according to the target association relationship;
and generating fault positioning information corresponding to the first alarm information according to the target data.
3. The method of claim 1, wherein prior to acquiring the first alert information, the method further comprises:
determining a target data structure according to the data type of first data of an application in the distributed system, wherein the first data represents basic information of the application;
according to the target data structure, metadata definition is carried out on the first data in a target component, and target metadata is generated, wherein the target metadata are used for describing attribute information of the application;
and transmitting the target metadata to the plurality of distributed systems with different functions through the target component to generate the target association relation.
4. A method according to claim 3, wherein prior to metadata defining the first data in a target component in accordance with the target data structure, generating target metadata, the method further comprises:
in the process of deploying the PaaS platform associated with the distributed system, the data value of the first data is injected into a first environment variable of the PaaS platform;
and in the process of deploying the IaaS platform associated with the distributed system, the data value of the first data is injected into a second environment variable of the IaaS platform.
5. The method of claim 4, wherein metadata defining the first data in a target component according to the target data structure, generating target metadata, comprises:
reading, by the target component, the data value in the first environment variable and the data value in the second environment variable;
and assigning the data parameters in the target data structure according to the data values in the first environment variable and the data values in the second environment variable to generate the target metadata.
6. A method according to claim 3, wherein the target data structure comprises a first data structure for defining application node information of the application, a second data structure for defining container deployment information of the application, and a third data structure for defining physical location information of the application.
7. The method of claim 6, wherein the data parameters in the first data structure comprise at least one of: an application name, a cluster name, a service node name, the data parameters in the second data structure comprising at least one of: the container ID, the container name, the data parameters in the third data structure include at least one of: the area to which the application belongs, the application IP, and the park to which the application belongs.
8. A fault handling apparatus for a distributed system, comprising:
the acquisition module is used for acquiring first alarm information, analyzing the first alarm information and obtaining a target association relation, wherein the target association relation is generated by defining metadata of a plurality of distributed systems with different functions;
the generation module is used for generating fault positioning information corresponding to the first alarm information according to the target association relation, wherein the fault positioning information at least comprises position information of an application with a fault in the distributed system;
and the processing module is used for processing the failed application according to the position information.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, wherein the computer program is arranged to execute the fault handling method of the distributed system according to any of the claims 1 to 7 at run-time.
10. An electronic device, the electronic device comprising one or more processors; a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement a method for running a program, wherein the program is configured to perform the fault handling method of the distributed system of any of claims 1 to 7 when run.
CN202310423582.3A 2023-04-19 2023-04-19 Fault processing method and device of distributed system, storage medium and electronic equipment Pending CN116483603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310423582.3A CN116483603A (en) 2023-04-19 2023-04-19 Fault processing method and device of distributed system, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116483603A true CN116483603A (en) 2023-07-25

Family

ID=87218889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310423582.3A Pending CN116483603A (en) 2023-04-19 2023-04-19 Fault processing method and device of distributed system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116483603A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination