CN113722050B - Application diagnosis assistance method, computing device, and machine-readable storage medium


Info

Publication number
CN113722050B
CN113722050B
Authority
CN
China
Prior art keywords
resource
information
container
diagnosis
diagnostic
Legal status
Active
Application number
CN202111287067.4A
Other languages
Chinese (zh)
Other versions
CN113722050A
Inventor
崔杰奇
Current Assignee
Alibaba China Co Ltd
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba China Co Ltd
Alibaba Cloud Computing Ltd
Application filed by Alibaba China Co Ltd and Alibaba Cloud Computing Ltd
Priority to CN202111287067.4A
Publication of CN113722050A
Application granted
Publication of CN113722050B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/455: Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533: Hypervisors; Virtual machine monitors
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 2009/45591: Monitoring or debugging support
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061: Partitioning or combining of resources
    • G06F 9/5077: Logical partitioning of resources; Management or configuration of virtualized resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An application diagnosis assistance method, a computing device, and a machine-readable storage medium are disclosed for assisting the diagnosis of applications based on containers and the Kubernetes platform. A resource tree of the resources used by the application is generated based on a resource tree definition configured for the application. During the application change, the diagnostic information of the resource corresponding to each resource node, that is, information helpful for diagnosis, is obtained based on the resource tree. The acquired diagnostic information of the resources corresponding to the resource nodes is then stored as the diagnostic information of the application. A multi-level diagnosis assistance scheme is thus provided for applications based on containers and the Kubernetes platform: information at every level from the user process up to Kubernetes can be collected during an application change, improving the efficiency with which users locate problems.

Description

Application diagnosis assistance method, computing device, and machine-readable storage medium
Technical Field
The present disclosure relates to cloud computing technologies, and in particular to an application diagnosis assistance method, a computing device, and a machine-readable storage medium for assisting the diagnosis of applications based on containers and the Kubernetes platform.
Background
With the rapid development of internet technology, cloud computing is becoming an important development direction.
Services involved in cloud computing include three levels of infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS).
Platform as a service (PaaS) is an important component of cloud computing, and provides an operation platform and solution services. In a typical hierarchy of cloud computing, the PaaS layer is interposed between software as a service (SaaS) and infrastructure as a service (IaaS).
The PaaS platform is a cloud computing platform that simplifies and automates life-cycle management of the entire application program, including development, deployment, and operation; it reduces IT infrastructure expenditure and lowers the cost and time of application development, operation, and maintenance.
Kubernetes (abbreviated "K8s") is a container cluster management system: an open-source platform that automates the operation of Linux containers and can be used to run and manage many containers.
When a user uses a cloud-native application scheduling platform such as PaaS/K8s, applications need to be created or changed. For example, resources such as stateless loads (Deployment), Services (Service), and Volumes (Volume) are deployed or updated on the K8s cluster, and containers are then created on the machine nodes managed by K8s to satisfy these resources.
Unlike a program running locally, an application running in K8s is difficult for a user to diagnose when an error occurs, because the layers at which problems may arise are distributed across the process, operating system (OS), container, K8s platform, and so on.
Existing K8s resource observation tools, such as Kubewatch, can listen to the status of specified K8s resources and send a notification when a resource event occurs. Kubewatch focuses on changes to resources, but it is not aware of an application's resource structure or state, and it cannot actively record the resource state when the application is changed.
In the prior art, the resource information of the various resources on a platform is scanned to obtain the status of each resource. This collection of resource condition information serves the platform service provider, letting the provider conveniently know the condition of the resources it owns or controls, which in turn serve a large number of different applications. For a user's individual application, it remains difficult to analyze the cause of an error or abnormality and to locate the problem, that is, to diagnose it.
Another conventional K8s diagnostic tool, Kubectl debug, debugs by copying the original Pod and adding a container or changing its command; however, it cannot cover diagnosis before and after container startup, and its debugging means are limited to the tools carried by the original container image.
Therefore, there is still a need for a multi-level diagnosis assistance scheme for applications based on containers and the K8s platform (which may be referred to as "cloud-native applications"), to facilitate collecting information from the user process up to every level of K8s during application changes and to improve the efficiency with which users locate problems.
Disclosure of Invention
One technical problem to be solved by the present disclosure is to provide a multi-level diagnosis assistance scheme for applications based on containers and the K8s platform, which can collect information from the process level up to the K8s level during application changes, thereby improving the efficiency of locating problems afterwards.
According to a first aspect of the present disclosure, there is provided an application diagnosis assistance method for assisting the diagnosis of an application based on containers and the K8s platform, the method including: generating a resource tree of the resources used by the application based on a resource tree definition configured for the application, the resource tree including a plurality of resource nodes organized in tree form, the plurality of resource nodes respectively corresponding to a plurality of resources used by the application; during the application change, acquiring diagnostic information of the resource corresponding to each resource node based on the resource tree, wherein the diagnostic information is information helpful for diagnosis; and storing the acquired diagnostic information of the resource corresponding to each resource node.
Optionally, the step of acquiring the diagnostic information of the resource corresponding to each resource node based on the resource tree during the application change includes: in response to the creation or update of the root resource, traversing the resource tree corresponding to the application at least once before the creation or update of the application is completed, so as to acquire the diagnostic information of the resource corresponding to each resource node on the resource tree.
Optionally, the step of traversing the resource tree corresponding to the application includes: putting the root resource node into a queue; taking a resource node from the queue and acquiring the diagnostic information of the resource corresponding to it; and, if the resource node has child resource nodes, placing the child resource nodes in the queue.
Optionally, the diagnostic information of the resource comprises at least one of: meta information of the resource; configuration information of the resource; status information of the resource; event information for describing various change information of the resource; description information of the resource.
Optionally, after each traversal ends, the step of storing the acquired diagnostic information of the resource corresponding to each resource node is performed: all diagnostic information corresponding to resource nodes newly appearing in the current round of traversal is stored; for resource nodes that existed in the previous round of traversal and still exist in this round, the new meta information, configuration, state, and/or description information is saved, and the new event information is aggregated with the old event information to reflect the changes of the resource over the entire change period; and for resource nodes that existed in the previous round of traversal but have disappeared in this round, information about their disappearance is saved.
Optionally, the method may further include: in response to detecting an abnormal workload resource from the diagnostic information of the resources, querying the resource information of the abnormal workload resource; copying the container group related to the abnormal workload resource to obtain a copied container group; modifying the copied container group to obtain a diagnostic container group convenient for a container group diagnoser to collect diagnostic information from; and performing, by the container group diagnoser, diagnostic information collection on the containers in the diagnostic container group, so as to obtain the diagnostic information of the containers.
Optionally, the step of modifying the copied container group includes: adding a pre-run pause program and/or a post-run pause program on the basis of the start command of the containers in the copied container group; and adding, to the containers of the copied container group, auxiliary programs associated with the pre-run pause program and/or the post-run pause program, thereby obtaining a diagnostic container group for diagnosis.
Optionally, the step of collecting diagnostic information for the containers in the diagnostic container group by the container group diagnoser includes: performing at least one of pre-run diagnostic information collection, run tracking diagnostic information collection, and post-run diagnostic information collection on the containers in the diagnostic container group.
Optionally, the container group diagnoser performs diagnostic information collection on the container by calling diagnostic plug-ins, the diagnostic plug-ins including at least one of: a system call plug-in, used to acquire all system calls and events generated during container execution and observe the cause of the application change failure from the operating system layer; a log collection plug-in, used to acquire the logs output by the container's execution, so as to avoid log loss after the container is deleted; an execution tracking plug-in, used to track the execution of user-mode programs and observe the cause of failure from the application layer; a resource collection plug-in, used to collect information on at least one of CPU, memory, network, IO, and files during container startup, to assist in locating problems; a core dump plug-in, used to acquire the program's on-site information at the time of an exception, based on the core dump generated when the program exits abnormally, in order to locate the fault; and a custom plug-in, used to execute a user-defined container diagnosis function.
Optionally, the container group diagnoser sets a pre-run hook at a time point before the container runs and registers, on the pre-run hook, the diagnostic plug-ins that need to be called and executed before the container runs. After a container in the diagnostic container group starts, it executes the pre-run pause program and enters a pre-run pause state; the diagnostic plug-ins registered on the pre-run hook are called to collect first diagnostic information, and when all diagnostic plug-ins registered on the pre-run hook have been called and executed, the container group diagnoser terminates the pre-run pause program and the original container start command is executed automatically. And/or, during container operation, the container group diagnoser calls the execution tracking plug-in to start a background program that tracks the container's execution, so as to collect second diagnostic information. And/or, the container group diagnoser sets a post-run hook at a time point after the container runs and registers, on the post-run hook, the diagnostic plug-ins that need to be called and executed after the container runs. When all containers in the diagnostic container group finish running, or when the running time exceeds the assisted-diagnosis time and the containers are closed by the container group diagnoser, the post-run pause program is executed and a post-run pause state is entered; the diagnostic plug-ins registered on the post-run hook are called to collect the information left behind by the process, obtaining third diagnostic information, and when all diagnostic plug-ins registered on the post-run hook have been called and executed, the container group diagnoser stops the post-run pause program and ends the container's diagnostic information collection process.
Optionally, the diagnostic container group and the container group diagnoser are deployed on the same machine node; and/or the diagnostic container group and the container group diagnoser are deployed on the node where the abnormal load was originally deployed; and/or a node having sufficient resources to carry the diagnostic container group and the container group diagnoser is selected, and they are deployed on the selected node.
According to a second aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described in the first aspect above.
According to a third aspect of the present disclosure, there is provided a computer program product comprising executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as described in the first aspect above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as described in the first aspect above.
In this way, a multi-level diagnosis assistance scheme is provided for applications based on containers and the K8s platform; information from the user process up to every level of K8s can be collected during an application change, improving the efficiency with which users locate problems.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 is an overall architecture diagram of a system for assisting in diagnosis of container and K8s platform based applications as they change according to the present disclosure.
FIG. 2 is a schematic flow diagram of a method of assisting in diagnosis of container and K8s platform based applications according to the present disclosure.
FIG. 3 is a schematic architecture diagram of a K8s resource for aiding diagnosis according to the present disclosure.
Fig. 4 schematically illustrates an exemplary application resource tree.
FIG. 5 illustrates collated diagnostic information that can be collected for various resources by accessing the API server.
FIG. 6 is an architecture diagram of Pod diagnosis performed on a workload in which an anomaly has occurred.
Fig. 7 is a schematic flow chart of the Pod diagnostic process.
FIG. 8 is a schematic diagram of a computing device that can be used to implement the application diagnosis assistance method described above according to an embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The present disclosure proposes an application diagnosis assistance method for assisting the diagnosis of applications based on containers and the K8s platform. Through assisted diagnosis, corresponding diagnostic information helpful for diagnosis is acquired. Users can then perform diagnostic analysis based on the acquired diagnostic information and, when an abnormality or problem occurs, conveniently and quickly analyze and locate it.
In some embodiments, "aiding diagnosis" in the present disclosure may also be understood as "diagnostic information collection". An application diagnosis worker, such as a user, a K8s platform service provider, or another authorized third party, may conveniently perform diagnostic analysis, such as analyzing and locating existing anomalies or problems, based on the diagnostic information obtained using the application diagnosis assistance method according to the present disclosure.
That is, according to the present disclosure, collecting diagnostic information can assist the diagnostic work of application diagnosis workers such as users, improving the efficiency and accuracy of their diagnostic analysis (e.g., problem location).
Hereinafter, "K8s resource diagnosis controller", "Pod diagnosis controller", "Pod diagnoser", and the like may also be understood as "K8s resource auxiliary diagnosis controller", "Pod auxiliary diagnosis controller", and "Pod auxiliary diagnoser", respectively; they belong to the diagnosis assistance control tools or diagnosis assistance tools.
First, the overall architecture of a system according to the present disclosure for assisting the diagnosis of applications based on containers and the K8s platform during their changes is briefly described with reference to fig. 1.
FIG. 1 is an overall architecture diagram of a system for assisting in diagnosis of container and K8s platform based applications as they change according to the present disclosure.
As shown in fig. 1, the overall system architecture includes a Kubernetes (K8s) cluster and a storage system.
The Kubernetes cluster is the platform that runs the applications, the diagnosis controllers, and the diagnosers. The application diagnosis assistance scheme of the present disclosure can be implemented entirely on Kubernetes.
The storage system provides storage for the application diagnostic information, including a database, log storage, file storage, and the like.
A Kubernetes cluster may include a control plane and several machine nodes.
The control plane includes the API Server and the diagnosis controllers in the K8s cluster that are responsible for assisting diagnosis during application changes.
The API Server is the server-side component of Kubernetes, through which data within a Kubernetes cluster can be accessed and manipulated.
The API server maintains the various native and custom resource information on the cluster, such as application configuration (ApplicationConfiguration), image build (ImageBuilder), log collection (LogCollector), stateless load (Deployment), Service (Service), container group (Pod), and so on. This resource information holds the information from the user's creation or change of the application, and the diagnosis controllers can collect diagnostic information for the application based on it.
The container group (Pod) is the Kubernetes container sandbox; it can run several containers and is the actual workload unit in the cluster.
A stateless load (Deployment) is an encapsulation of Pods in Kubernetes; it can specify a number of Pods and ensure that that number of Pods keeps running.
As shown in fig. 1, the control plane of the present application includes two diagnosis controllers: a K8s resource diagnosis controller and a Pod diagnosis controller.
An application deployed on the PaaS platform may involve multiple K8s resources, and these K8s resources can be organized into a resource tree. The K8s resource diagnosis controller can track the application's resource information during a change and periodically query the information of the whole resource tree; when the application reaches its final state, the resource information is persisted to a database in the storage system.
The Pod diagnosis controller creates a diagnostic task (Job) according to the abnormal load information and releases the resources occupied by the diagnostic task after the diagnosis ends. Here, Job is an encapsulation of Pods in Kubernetes that runs a given number of Pods to a successful exit state.
The machine nodes are the machines in the K8s cluster that actually run containers; they are responsible for running the application containers and the diagnoser containers.
A Pod diagnoser runs in the diagnoser container; it is responsible for duplicating an abnormal Pod to run it again and for collecting the diagnostic information of the containers in the abnormal Pod, persisting the diagnostic information once collected.
The application diagnosis assistance scheme of the present disclosure is briefly described below in conjunction with fig. 1.
First, at step a1, application K8s resource information is acquired.
The K8s resource diagnosis controller can obtain, from the API server, the resource information of all resources related to the application and collate the information that is helpful for the user's diagnosis.
At step a2, application resource tree information is generated.
The present disclosure proposes that the acquired application resource information may be organized into a resource tree. The resource tree may include a plurality of resource nodes organized in a tree. The plurality of resource nodes respectively correspond to a plurality of resources used by the application.
Each resource (node) may have a corresponding child resource (node) and parent resource (node). The root resource (node) has no parent resource (node). The end resource (node) has no child resources (nodes).
The controller can track resource information of the application during application change, periodically query information of the whole resource tree, and persistently store the resource information to a database for subsequent query. This will be described in detail below.
For general resources, the information related to the resource is easy to analyze. When an abnormality occurs, the problem can be located by directly analyzing the diagnostic information collated from the resource information.
For workloads, the situation is more complex. When a workload is abnormal, further analysis and processing can be performed and further diagnostic information collected, so that better diagnostic analysis can be done later and problems can be located more precisely.
Thus, at step a3, an abnormal workload may be submitted to the Pod diagnostic controller.
Specifically, if the K8s resource diagnosis controller detects from the resource information a workload in an abnormal state, such as a Pod (container group), Deployment (stateless load), or ReplicaSet (replica set), the information about this workload is submitted to the Pod diagnosis controller.
Thus, at step a4, the Pod diagnostic controller obtains abnormal workload information from the K8s resource diagnostic controller.
The Pod diagnosis controller may query the API server for the specific information of the abnormal workload according to the abnormal workload information, so as to later generate a workload used for diagnosis (e.g., a diagnostic container group) in the subsequent diagnostic information collection for the abnormal workload.
At step a5, the Pod diagnosis controller controls a diagnostic task (Job), which is used to carry out the collection of diagnostic information.
The Pod diagnostic controller creates a diagnostic task on the machine node and controls the operation of the diagnostic task.
In addition, after the diagnostic task ends, the Pod diagnosis controller also cleans up the diagnostic task and the container group (Pod) used for collecting diagnostic information.
At step a6, diagnostic information collection is performed on the container.
The Pod diagnoser on the machine node replicates the abnormal Pod and invades the replicated Pod, adding life-cycle tracking, in order to perform diagnostic information collection for the containers in it.
Then, at step a7, container diagnostic information is stored.
Here, after the diagnosis is completed, the Pod diagnoser may store the containers' diagnostic information in the storage system as logs or files to provide to the user.
The overall architecture and general flow of the diagnostic assistance scheme during application change according to the present disclosure is briefly described above with reference to fig. 1.
As shown in fig. 1, the overall architecture of the application diagnosis assistance scheme according to the present disclosure mainly involves two aspects of diagnosis:
(1) K8s resource diagnostic information collection by the K8s resource diagnosis controller;
(2) Pod diagnostic information collection by the Pod diagnosis controller and the Pod diagnoser.
As described above, aspect (2) is proposed so that, when the abnormal K8s resource detected on the basis of the K8s resource diagnostic information collection of aspect (1) is a workload resource such as a Pod (container group), Deployment (stateless load), or ReplicaSet (replica set), further diagnostic information collection can be performed for the abnormal workload.
These two aspects are described in detail below, respectively.
I. K8s resource diagnostic information collection.
A method according to the present disclosure for assisting the diagnosis of applications based on containers and the K8s platform is first briefly described with reference to fig. 2.
FIG. 2 is a schematic flow diagram of a method of assisting in diagnosis of container and K8s platform based applications according to the present disclosure.
As shown in fig. 2, first, in step S210, a resource tree of resources used by an application is generated based on a resource tree definition configured for the application.
The resource tree may include a plurality of resource nodes organized in a tree. The resource nodes respectively correspond to a plurality of resources used by the application. In the resource tree, child resource nodes are subordinate to parent resource nodes.
Thus, in step S220, during the application change, the diagnostic information of the resource corresponding to each node can be acquired based on the resource tree.
It should be understood that "diagnostic information" refers to information that facilitates diagnosis. For each kind of resource, the corresponding information helpful for diagnosis can be collected, and application diagnosis workers (such as users) can find abnormalities and locate problems based on this diagnostic information. An example of the diagnostic information of a resource is described below with reference to fig. 5. It should be understood that diagnostic information is not limited to this example; the diagnostic information to be collected can be set for each kind of resource and each diagnostic task.
Then, in step S230, the diagnostic information of the resource corresponding to each resource node acquired in step S220 may be stored as the diagnostic information of the application.
The diagnostic information of the resources may be stored, for example, in the form of a list. The diagnostic information of an application may be stored as a file or a table, for example.
An exemplary method according to the present disclosure for assisting the diagnosis of the K8s resources of applications based on containers and the K8s platform is described in further detail below with reference to FIG. 3.
FIG. 3 is a schematic architecture diagram of a K8s resource for aiding diagnosis according to the present disclosure.
As shown in fig. 3, in step b1, the K8s resource diagnosis controller obtains the resource tree definition, parses it, and obtains the application-related resource information.
When the K8s resource diagnosis controller starts, a definition of the application-related K8s resource tree needs to be provided. The K8s resource diagnosis controller can then query the application's resource status during an application change according to the configured resource tree.
The application resource tree of the present disclosure may represent any resource structure and type. By specifying the definition of the resource tree, the controller can dynamically parse the structure of the tree.
An application resource tree according to the present disclosure is described in detail below with reference to fig. 4.
The structure of the resource tree is not constrained by the controller, and the resource tree can be updated according to actual conditions.
Fig. 4 schematically illustrates an exemplary application resource tree.
As shown in FIG. 4, each resource (or "resource node") may have zero to many child resource nodes.
In fig. 4, ApplicationConfiguration (application configuration) is the root resource node and represents the application. The other child resource nodes are each responsible for maintaining one function of the application.
For example, as shown in fig. 4, the child resource nodes of ApplicationConfiguration may be Rollout (rollout), ServiceTrait (SLB/service), ImageBuilder (image build), LogCollector (log collection), DynamicLabel (dynamic label), AutoScaling (elastic scaling), and so on.
A child resource node of Rollout may be a Deployment. A child resource node of a Deployment (stateless load) may be a ReplicaSet (replica set). A child resource node of a ReplicaSet may be a Pod (container group).
A child resource node of ServiceTrait (SLB/Service) may be a Service.
A child resource node of LogCollector (log collection) may be an AliyunLogConfig (SLS configuration).
A child resource node of AutoScaling (elastic scaling) may be a ScaledObject (KEDA elastic resource), and a child resource node of the ScaledObject may be an HPA (horizontal pod autoscaler).
The resource tree may be defined using YAML (a data serialization language).
If a resource node has child resource nodes, it can be defined as a YAML object. If there are multiple child resource nodes, they can be represented by a YAML array; if there is only a single child resource, it can be represented by a YAML object.
If a resource node has no child resource nodes, a string representation may be used.
Following these rules, by adding and removing YAML configuration, resource trees meeting arbitrary composite requirements can be defined.
The YAML definition for the exemplary resource tree shown in FIG. 4 may be as follows:
resourceTree:
ApplicationConfiguration:
- Rollout:
Deployment:
Replicaset: Pod
- ServiceTrait: Service
- ImageBuilder
- LogCollector: AliyunLogConfig
- DynamicLabel
- AutoScaling:
ScaledObject: HPA
It should be understood that only one exemplary YAML definition is presented here.
Thus, at startup, the K8s resource diagnosis controller reads the YAML configuration of the resource tree and recursively parses it into the corresponding K8s resource tree. In this way, in step S210, a resource tree of the resources used by the application is generated based on the resource tree definition configured for the application.
A resource node holds a particular type of K8s resource, the child resource nodes subordinate to it, and the parent resource node it is subordinate to. The resource tree can be traversed downward through the child resource nodes of the current resource node, and which node the current resource node is a child of can be judged through its parent resource node.
The resource node may represent any K8s resource type. Moreover, the resource tree can be arbitrarily defined.
The definition of a resource node may be as follows:
type ResourceNode struct {
    GVK      schema.GroupVersionKind // the K8s resource type of this node (schema is from k8s.io/apimachinery/pkg/runtime/schema)
    Parent   *ResourceNode           // the parent resource node; nil for the root resource
    Children []*ResourceNode         // subordinate child resource nodes; empty for leaf resources
}
It should be understood that only one exemplary resource node definition is presented here.
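As a non-authoritative illustration, the recursive parsing of the YAML configuration into such nodes might look roughly like the following Go sketch. The helper gvkFor (mapping a resource name to its GroupVersionKind) and the use of generically unmarshalled map[string]interface{} values are assumptions, not the patent's actual implementation:
// Sketch: recursively convert a generically-unmarshalled YAML value into a
// ResourceNode tree. A mapping yields child nodes, an array yields several
// children, and a bare string yields a single leaf child.
func parseNode(name string, value interface{}, parent *ResourceNode) *ResourceNode {
    node := &ResourceNode{GVK: gvkFor(name), Parent: parent} // gvkFor is a hypothetical helper
    addChild := func(n string, v interface{}) {
        node.Children = append(node.Children, parseNode(n, v, node))
    }
    switch v := value.(type) {
    case string: // "Replicaset: Pod" -- a single child written as a string
        addChild(v, nil)
    case map[string]interface{}: // a single child object
        for n, val := range v {
            addChild(n, val)
        }
    case []interface{}: // several children written as a YAML array
        for _, item := range v {
            switch it := item.(type) {
            case string: // "- ImageBuilder"
                addChild(it, nil)
            case map[string]interface{}: // "- Rollout: ..."
                for n, val := range it {
                    addChild(n, val)
                }
            }
        }
    }
    return node
}
It should be understood that this is only an illustrative sketch of one possible parsing scheme.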
The application tracking scheme according to the present disclosure is further described below with continued reference to fig. 3: how, during an application change, the change process of the application is tracked so that the diagnostic information of the resource corresponding to each node can be collected and collated based on the resource tree, i.e., step S220 of fig. 2.
There may be multiple applications in the cluster undergoing changes. For each application, the K8s resource diagnosis controller needs to track the changes of its resource conditions individually.
As shown in fig. 3, the present disclosure can concurrently track multiple (even all) applications under change (application A, application B) at the same time.
In step b2 of fig. 3, a change of a root resource is observed, and in response to the change, the application corresponding to the root resource is tracked.
A change in the root resource indicates that the application may have changed.
At this point, it can be determined whether the change is a resource creation, a resource update, or a resource deletion.
When the change to the root resource is a resource deletion, the application corresponding to the root resource no longer exists and no errors can occur any more, so it need not be diagnosed further.
When the change to the root resource is a resource creation or resource update, the change is the creation or update of an application, and diagnostic work during the change is required.
Therefore, in response to the creation or update of the root resource, the resource tree corresponding to the application may be traversed at least once before the creation or update of the application is completed, so as to obtain the diagnostic information of the resource corresponding to each resource node on the resource tree.
For example, an application corresponding to a newly created or updated root resource may be placed in an application queue to be tracked.
The K8s resource diagnosis controller may retrieve the changing applications to be tracked from this queue.
For example, it may first be checked whether the application is already being tracked. If so, it can be ignored; if not, a coroutine may be started to begin loop-tracking the application.
Each application under change has a corresponding coroutine tracking it. The controller records the coroutines that are tracking, cancels the tracking of an application when the application reaches its final state, and re-tracks it when the application is changed again.
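A minimal Go sketch of this per-application tracking loop follows. The names tracked, rootFor, isFinalState, and traverseResourceTree, as well as the 30-second traversal interval, are illustrative assumptions:
var (
    mu      sync.Mutex
    tracked = map[string]context.CancelFunc{} // applications currently being tracked
)

// startTracking launches one goroutine ("coroutine") per changing application.
func startTracking(appKey string) {
    mu.Lock()
    if _, ok := tracked[appKey]; ok {
        mu.Unlock()
        return // already tracked: ignore
    }
    ctx, cancel := context.WithCancel(context.Background())
    tracked[appKey] = cancel
    mu.Unlock()

    go func() {
        ticker := time.NewTicker(30 * time.Second) // periodic traversal (interval assumed)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                traverseResourceTree(rootFor(appKey)) // see the traversal sketch below
                if isFinalState(appKey) {
                    mu.Lock()
                    delete(tracked, appKey) // cancel tracking at the final state;
                    mu.Unlock()             // a later change re-tracks the application
                    cancel()
                    return
                }
            }
        }
    }()
}
It should be understood that this is only one illustrative way to organize the tracking coroutines.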
Application-related resource information is stored in the API server. Accordingly, in step b3 of fig. 3, the controller may obtain resource information from the API server.
In step b4 of fig. 3, during the application change process, the resource tree corresponding to the application is traversed at least once before the application reaches its final state (i.e., the change is complete and the application no longer changes), that is, before the creation or update of the application is finished. For example, the resource tree may be traversed periodically, at fixed intervals.
For example, a breadth-first traversal scheme may be used. For each application under change, the K8s resource diagnosis controller searches from the root resource of the resource tree using breadth-first traversal; for each resource node it queries the API server for the status of the corresponding resource (its "resource information") and collates it into a form convenient for diagnosis, obtaining the diagnostic information of the corresponding resource. For example, the information necessary for diagnosis may be extracted from the redundant resource information and arranged in a predetermined form or format as the diagnostic information of the corresponding resource.
An exemplary scheme for traversing the application resource tree is described in detail below.
First, the root resource node may be placed in a queue.
Then, each time a resource node is taken from the queue, the diagnostic information of the resource corresponding to that node is acquired. For example, the resource information of the corresponding resource may be obtained from the API server and collated to obtain the diagnostic information.
Then, if the resource node has child resource nodes, the child resource nodes are added to the queue.
This loops until the queue is empty, i.e., all resource nodes have been traversed.
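The loop just described is an ordinary breadth-first traversal. A minimal Go sketch, assuming collectDiagnostics is the (hypothetical) step that queries the API server and collates the result:
// traverseResourceTree walks the application's resource tree breadth-first,
// collecting diagnostic information for the resource behind each node.
func traverseResourceTree(root *ResourceNode) {
    queue := []*ResourceNode{root} // put the root resource node into the queue
    for len(queue) > 0 {
        node := queue[0]
        queue = queue[1:]
        collectDiagnostics(node)                // query the API server and collate
        queue = append(queue, node.Children...) // enqueue any child resource nodes
    }
}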
FIG. 5 illustrates collated diagnostic information that can be collected for various resources by accessing the API server.
As shown in fig. 5, for a resource such as a Deployment (stateless load), the collated diagnostic information collected for the resource may include at least one of:
(1) meta information (metadata) of the resource, for example the resource name, labels, and creation time;
(2) configuration information (spec) of the resource, such as the expected configuration of the resource;
(3) status information (status) of the resource, which indicates the resource's current state and whether it has reached its final state;
(4) event information, recording resource-related events, including info-level and warning-level events;
(5) description information (describeInfo) of the resource.
Wherein the meta information, the configuration information and the state information of the resource belong to the basic information of the resource.
The event information of a resource describes the resource's various changes. Events in K8s survive for one hour, so only the events of the last hour are acquired each time.
The description information of a resource may be consistent with the information obtained using kubectl describe, which is in a form more convenient for users to read. Kubectl is the command-line client of Kubernetes.
The K8s resource diagnosis controller collates the diagnostic information collected for each resource. This information can help users quickly judge whether a resource is healthy and locate existing problems.
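One plausible shape for such a collated per-resource record is sketched below in Go, using illustrative field names (the patent does not prescribe a concrete schema); metav1 is k8s.io/apimachinery/pkg/apis/meta/v1 and corev1 is k8s.io/api/core/v1:
// ResourceDiagnostics is one collated diagnostic record per resource node.
type ResourceDiagnostics struct {
    Meta         metav1.ObjectMeta `json:"metadata"`     // name, labels, creation time
    Spec         interface{}       `json:"spec"`         // expected configuration
    Status       interface{}       `json:"status"`       // current state / final-state flag
    Events       []corev1.Event    `json:"events"`       // info- and warning-level events
    DescribeInfo string            `json:"describeInfo"` // kubectl-describe-style text
}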
Then, in step b5 of fig. 3, when the application change reaches its final state (the change is complete and nothing further changes), the K8s resource diagnosis controller may store the latest (post-change, final-state) application diagnostic information related to each resource.
This makes it possible to update the diagnostic information relating to each resource.
The acquired diagnostic information of the resources corresponding to the resource nodes can be stored after each traversal of the resource tree. For example, the obtained diagnostic information may be serialized into JSON format and stored, for example, in a Kubernetes ConfigMap.
If this is the first traversal, the stored item can be created directly. Otherwise, the previously recorded content is updated. One or more of the following operations may be performed on update (see the sketch after this list):
(1) all diagnostic information of the resources corresponding to resource nodes newly appearing in this round of traversal is stored;
(2) for resource nodes that existed in the previous round of traversal and still exist in this round, the new meta information, configuration, state, and/or description information is saved, and the new event information is aggregated with the old event information to reflect the changes of the resource over the entire change period;
(3) for resource nodes that existed in the previous round of traversal but have disappeared in this round, information about their disappearance is saved.
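A minimal sketch of the create-or-update persistence into a ConfigMap, assuming client-go is used and leaving the event-aggregation rules (1)-(3) above to an assumed merge step performed before the call:
// saveDiagnostics serializes the collated diagnostics to JSON and stores them
// in a ConfigMap: updated on later traversals, created on the first one.
func saveDiagnostics(ctx context.Context, cs kubernetes.Interface,
    ns, app string, diags map[string]ResourceDiagnostics) error {
    data, err := json.Marshal(diags)
    if err != nil {
        return err
    }
    cm := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{Name: app + "-diagnostics", Namespace: ns},
        Data:       map[string]string{"resources": string(data)},
    }
    if _, err = cs.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{}); err == nil {
        return nil
    }
    if apierrors.IsNotFound(err) { // first traversal: create the stored item directly
        _, err = cs.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
    }
    return err
}
The ConfigMap name and the single "resources" key are assumptions for illustration only.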
This completes the description of the scheme according to the present disclosure for assisting the diagnosis of K8s resources during changes, for applications based on containers and the K8s platform.
For resources in the resource tree shown in fig. 4 such as Service (service), LogCollector (log collection), ImageBuilder (image build), and DynamicLabel (dynamic label), collecting and collating the diagnostic information from the API server is often sufficient for resource diagnosis and problem location.
For a stateless load (Deployment), that is, a workload, the diagnostic information of the corresponding resource can establish that the workload is abnormal, but further analysis may be required to obtain more specific information.
Pod diagnostic information collection according to the present disclosure is described in detail below with reference to fig. 6.
II. Pod diagnostic information collection.
Pod diagnostic information collection is performed mainly by two parts: the Pod diagnosis controller and the Pod diagnoser of fig. 1.
First, the Pod diagnosis controller is briefly described. Its work has the following aspects.
First, abnormal workload information is obtained.
The Pod diagnosis controller obtains the information of a workload abnormality from the K8s resource diagnosis controller (step a3 in fig. 1), creates a Pod diagnoser to perform diagnostic information collection (or "assisted diagnosis") on the workload (step a4 in fig. 1), and maintains the state of the Pod diagnoser (step a5 in fig. 1).
After receiving the information of the abnormal workload, the Pod diagnosis controller queries the API server for the abnormal workload. For example, the name and namespace of the abnormal workload and the node it resides on may be queried from the API server for use in diagnosis.
Second, a Pod diagnoser is created.
Based on the acquired information, the Pod diagnosis controller may create a Pod diagnoser.
The Pod diagnoser may be deployed on a machine node that has sufficient resources to carry the container group diagnostic task. This avoids the situation where assisted diagnosis cannot be performed due to insufficient resources.
Alternatively, the Pod diagnoser may be deployed on the machine node where the abnormal load was originally deployed. In this way, the time spent pulling the image can be avoided.
Both of these issues can be avoided if the node originally deployed with the abnormal load happens to have sufficient resources to carry the container group diagnostic task.
Deployment can be done in the form of a Kubernetes task (Job), which requires specifying the node to deploy on.
The Pod diagnoser and the diagnosed Pod (the diagnostic container group, or diagnostic Pod) can be on the same machine node, using the root namespace so that the Pod diagnoser can access the processes of the diagnosed Pod; container process permissions are also configured so that the diagnostic container can monitor and control the container processes.
Third, the Pod diagnoser's state is maintained.
The Pod diagnosis controller may also be used to maintain the Pod diagnoser's state.
The Pod diagnosis controller checks the state of the diagnoser Job. When the diagnoser Job reaches the success state (diagnosis successful), the Pod diagnosis controller may delete the diagnoser Job and any remaining diagnostic Pod. When the diagnoser Job reaches the failure state (diagnosis failed), the information of this failed Pod diagnosis may be recorded.
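A sketch of this maintenance step using client-go follows; the function and cleanup names are illustrative, and recordFailedDiagnosis is an assumed helper:
// checkDiagnoserJob inspects the diagnoser Job: on success it deletes the Job
// (and, via propagation, any remaining diagnostic Pod); on failure it records it.
func checkDiagnoserJob(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
    job, err := cs.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return err
    }
    switch {
    case job.Status.Succeeded > 0: // diagnosis successful
        policy := metav1.DeletePropagationBackground
        return cs.BatchV1().Jobs(ns).Delete(ctx, name,
            metav1.DeleteOptions{PropagationPolicy: &policy})
    case job.Status.Failed > 0: // diagnosis failed
        recordFailedDiagnosis(ns, name)
    }
    return nil
}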
Hereinafter, the Pod diagnostic assistance scheme according to the present disclosure is described in further detail with reference to fig. 6 and 7.
FIG. 6 is an architecture diagram of Pod diagnosis performed on a workload in which an anomaly has occurred.
Fig. 7 is a schematic flow chart of the Pod diagnostic process.
The Pod diagnoser first creates a diagnostic Pod dedicated to diagnosis by invading a copy of the abnormal Pod (c2.1 in fig. 6); diagnostic information collection is then performed on the diagnostic Pod by the Pod diagnoser.
First, in step S710, in response to an abnormal workload resource being detected from the diagnostic information of the resources, the Pod diagnoser queries the resource information of the abnormal workload resource.
The Pod diagnoser needs to obtain the definition of the abnormal Pod, including resource information such as configuration information, Pod information, and container information, so that it can copy a Pod for diagnosis in step S720, modify the copy in step S730 to obtain a diagnostic Pod convenient for the Pod diagnoser's diagnostic information collection, and thereby initiate the diagnostic information collection for the abnormal Pod in step S740.
These pieces of resource information are explained below.
c1.1 Configuration information
Configuration information includes, for example, workload information, node information, container runtime, diagnostic plug-ins, and the like.
(1) Workload:
The type and name of the target workload are specified, covering Pod-defining types such as Deployment and Job as well as Pod itself. With the type and name of the workload, the specific resource can be located.
(2) Node information:
The node designated for diagnostic information collection; the diagnostic tool and the abnormal workload may run on the same node.
(3) Container runtime:
Container runtimes include containerd and Docker, whose APIs differ, so using the diagnostic tool requires specifying the particular runtime for querying container information. Virtualization technologies such as Docker provide the virtualized environment for running.
(4) Diagnostic plug-ins:
The plug-ins that the diagnosis needs to enable are specified; some plug-ins are enabled according to the scenario. The diagnostic plug-ins are further described below.
c1.2 Pod information
The definition of the abnormal Pod is obtained and can be used to replicate a new Pod for intrusion and diagnosis. The replication of the abnormal Pod is described in more detail later.
c1.3 Container information
Based on the container information, the start command of the container image can be acquired; it is used to wrap the original start command and add the pause processes. The process ID information of the container is also obtained and provided to the diagnostic plug-ins for tracking the specified container process. The wrapping of the start command, the provision of the diagnostic plug-ins, and so on are described in more detail later.
Then, at step S720, the Pod involved in the abnormal workload resource is copied, resulting in a copied Pod.
Here, the image specified in the abnormal Pod may be pulled in order to duplicate the abnormal Pod.
The image contains a default start command; when no start command is specified in the Pod definition, the image's own command is used.
Then, in step S730, i.e., c2.2 in fig. 6, the copied Pod is modified to obtain a diagnostic Pod convenient for diagnostic information collection by the Pod diagnoser.
The modification of the copied Pod may include the following two aspects.
One aspect is wrapping the start command, i.e., adding a pre-run pause program and/or a post-run pause program around the original start command of the containers of the copied Pod.
The wrapping logic may be as follows: if the original start command of the application is cmd, the wrapped command is bash -c "pause; cmd; pause", i.e., a pause program is added before and after the original command. This uses Bash's ability to run programs in sequence; a specific example follows:
# 1. Original application start command
/bin/bash /home/admin/start.sh
# 2. Start command after wrapping
bash -c "pause; /bin/bash /home/admin/start.sh; pause"
It should be understood that this only illustrates one command wrapping scheme.
The other aspect is adding auxiliary programs, i.e., the programs associated with the pre-run pause program and/or the post-run pause program, to the containers of the copied Pod.
For example, as described above, the wrapped start command requires a pause program and a Bash script interpreter.
These programs are not present in the containers of the abnormal Pod, so they need to be introduced by means of a Kubernetes Volume. The Pod diagnoser adds an auxiliary-program Volume to the Pod definition.
Bash is a command-line interpreter that can execute programs. A Volume is a storage resource in Kubernetes that can be shared by all containers within a Pod.
In this way, based on the definition of the abnormal Pod, after the copied Pod is invaded, an additional Pod dedicated to diagnosis is created, which may be called the "diagnostic Pod".
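A sketch of this intrusion in Go with k8s.io/api/core/v1 types follows. The /diag mount path, the host-path helper Volume, and the single-container assumption are all illustrative, not the patent's concrete layout:
// buildDiagnosticPod copies the abnormal Pod, wraps its start command with
// pre-run/post-run pauses, and mounts a Volume carrying the pause program
// and a Bash interpreter that the original image does not contain.
func buildDiagnosticPod(abnormal *corev1.Pod, origCmd string) *corev1.Pod {
    pod := abnormal.DeepCopy()
    pod.ObjectMeta = metav1.ObjectMeta{
        Name:      abnormal.Name + "-diagnosis",
        Namespace: abnormal.Namespace,
    }
    // Wrap the original start command with a pause before and after it.
    c := &pod.Spec.Containers[0]
    c.Command = []string{"/diag/bash", "-c", "/diag/pause; " + origCmd + "; /diag/pause"}
    // Introduce the auxiliary programs via a Volume (here a host path, assumed).
    pod.Spec.Volumes = append(pod.Spec.Volumes, corev1.Volume{
        Name: "diag-helpers",
        VolumeSource: corev1.VolumeSource{
            HostPath: &corev1.HostPathVolumeSource{Path: "/opt/diag-helpers"},
        },
    })
    c.VolumeMounts = append(c.VolumeMounts, corev1.VolumeMount{
        Name: "diag-helpers", MountPath: "/diag",
    })
    return pod
}
It should be understood that this is only an illustrative sketch of the Pod intrusion.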
Then, at step S740, the Pod diagnoser performs diagnostic information collection on the containers in the diagnostic Pod to obtain the containers' diagnostic information.
The Pod diagnoser performs at least one of pre-run diagnostic information collection, run tracking diagnostic information collection, and post-run diagnostic information collection on the containers in the diagnostic Pod.
By performing all three, diagnostic information can be collected over a container's full life cycle.
Full life-cycle diagnosis of the container (c3 in fig. 6) is described further below.
The wrapped start command allows a pause before and after the container runs. The Pod diagnoser uses these pauses to provide hook points at the pre-run and post-run timing points, and to provide information such as the container's name, image, and process. The diagnostic plug-ins define hook functions at these hook points and, using the container information provided by the Pod diagnosis controller, complete diagnostic information collection over the full life cycle.
Pre-run diagnostic information collection, run tracking diagnostic information collection, and post-run diagnostic information collection are each described below.
Operation c3.1, pre-run diagnostic information collection.
The Pod diagnoser sets a pre-run hook at the time point before the container runs and registers on it the diagnostic plug-ins that need to be called and executed before the container runs.
After the containers in the diagnostic container group start, the pre-run pause program is executed first, entering the pre-run pause state.
The diagnostic plug-ins registered on the pre-run hook can then be invoked to collect the first diagnostic information (which may also be called "pre-run diagnostic information").
A diagnostic plug-in may register the operations it needs before running onto the pre-run hook; for example, the system call plug-in dynamically attaches to the container process at this time.
When all diagnostic plug-ins registered on the pre-run hook have been called and executed, the Pod diagnoser terminates the pre-run pause process, and Bash automatically executes the original application start command.
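The hook mechanism can be sketched as follows in Go; the types and the SIGTERM-based termination of the pause process are assumptions about one way to realize the behavior described above:
// A Hook is a list of plug-in functions bound to one timing point
// (pre-run or post-run) of the container life cycle.
type HookFunc func(containerPID int) error

type Hook struct{ fns []HookFunc }

// Register adds a diagnostic plug-in function to the hook.
func (h *Hook) Register(fn HookFunc) { h.fns = append(h.fns, fn) }

// RunAll invokes every registered plug-in against the paused container.
func (h *Hook) RunAll(containerPID int) {
    for _, fn := range h.fns {
        if err := fn(containerPID); err != nil {
            log.Printf("diagnostic plug-in failed: %v", err)
        }
    }
}

// Once all pre-run plug-ins have executed, the diagnoser terminates the
// pause process so that Bash continues with the original start command.
func endPause(pausePID int) error {
    return syscall.Kill(pausePID, syscall.SIGTERM)
}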
Operation c3.2, run tracking diagnostic information collection.
During the container's runtime, the Pod diagnoser calls the execution tracking plug-in, which may start a background program that tracks the application's execution in order to collect the second diagnostic information (which may also be called "running diagnostic information").
Operation c3.3, post-run diagnostic information collection.
The Pod diagnoser sets a post-run hook at the time point after the container runs and registers, on that hook, the diagnostic plug-ins that need to be called and executed after the container runs.
The Pod diagnoser monitors the running application/container throughout. When the predetermined auxiliary diagnosis time is exceeded, the diagnoser shuts down the application/container process and enters the post-run diagnostic phase; likewise, when all applications/containers finish, it automatically enters the post-run diagnostic information collection phase.
Thus, when all containers in the diagnostic Pod have finished running, or have been closed by the container group diagnoser because the running time exceeded the auxiliary diagnosis time, the Pod diagnoser executes the post-run pause program and enters the post-run pause state.
At this stage, the Pod diagnoser runs the post-run hook, calls the diagnostic plug-ins registered on it, and collects the information left behind by the process to acquire third diagnostic information (which may also be referred to as "post-run diagnostic information").
Once all diagnostic plug-ins registered on the post-run hook have been called and executed, the Pod diagnoser stops the post-run pause process and ends the whole diagnostic procedure.
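A condensed, diagnoser-side sketch of this hook mechanism follows. It assumes the in-container pause scripts block until a sentinel file appears; the class, paths, and hook names are illustrative assumptions, not the disclosed design.

```python
# Sketch: pre-run/post-run hook registry with pause release via sentinel files.
import pathlib
from typing import Callable

HookFn = Callable[[dict], None]  # receives container info (name, image, pid, ...)


class PodDiagnoser:
    def __init__(self) -> None:
        self.pre_run_hook: list[HookFn] = []
        self.post_run_hook: list[HookFn] = []

    def register(self, when: str, plugin: HookFn) -> None:
        # Plug-ins register on the pre-run or post-run hook.
        (self.pre_run_hook if when == "pre" else self.post_run_hook).append(plugin)

    def _release(self, sentinel: str) -> None:
        # Creating the sentinel file unblocks the in-container pause script.
        pathlib.Path(sentinel).touch()

    def on_pre_run_pause(self, container_info: dict) -> None:
        for plugin in self.pre_run_hook:       # e.g. attach the syscall tracer
            plugin(container_info)
        self._release("/diag/pre-done")        # Bash then runs the original command

    def on_post_run_pause(self, container_info: dict) -> None:
        for plugin in self.post_run_hook:      # e.g. collect core dumps, logs
            plugin(container_info)
        self._release("/diag/post-done")       # ends the diagnostic run
```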
Thus, the diagnostic information of the aforementioned container may include the first diagnostic information (pre-run diagnostic information), the second diagnostic information (running diagnostic information), and the third diagnostic information (post-run diagnostic information). The container diagnostic information so obtained may be stored as part of the diagnostic information of the aforementioned application.
As mentioned above, the Pod diagnoser may perform diagnostic information collection on a container by calling diagnostic plug-ins.
Here, the Pod diagnoser defines its diagnostic functions in the form of plug-ins and may be configured to enable specific plug-ins; each plug-in is responsible for one diagnostic function and collects diagnostic information for the corresponding aspect.
In fig. 6, the following six diagnostic plug-ins (c4) are shown as an example.
1. System-call plug-in: acquires all system calls and events occurring during the container's execution, to observe the cause of an application change failure from the operating system (OS) layer.
2. Log-collection plug-in: acquires the logs output by the container's execution, avoiding log loss after the container is deleted.
3. Execution-tracking plug-in: tracks the execution of user-mode programs, e.g., using eBPF, to observe the cause of failure from the application layer (a minimal tracing sketch follows this list). eBPF is a kind of sandboxed program that runs in the operating system kernel, making the kernel programmable without changing its source code.
4. Resource-collection plug-in: collects CPU, memory, network, IO, and file information during container startup to assist in locating problems.
5. Core dump (Coredump) plug-in: obtains, based on the core dump generated when a program exits abnormally, the scene information of the program in the abnormal state; such a dump is produced when a binary program (or, for example, a Java program) exits abnormally, and the preserved information captures the program's state at the time of the exception and can be used to locate the fault.
6. Custom plug-in: executes a user-defined Pod diagnostic function; a user may define any Pod diagnostic plug-in, and the Pod diagnoser of the present disclosure supports such horizontal plug-in extension.
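As the forward reference in item 3 indicates, the following is a minimal execution-tracing sketch in the spirit of the execution-tracking plug-in, written against the BCC eBPF toolkit (assumed to be available on the node). It merely prints each execve() on the host; a real plug-in would filter by the container's cgroup or PID namespace and persist the events.

```python
# Sketch: trace execve() system calls with BCC/eBPF.
from bcc import BPF

PROG = r"""
int trace_exec(struct pt_regs *ctx) {
    char comm[16];
    bpf_get_current_comm(&comm, sizeof(comm));
    bpf_trace_printk("exec by %s\n", comm);
    return 0;
}
"""

b = BPF(text=PROG)
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_exec")
print("Tracing execve()... Ctrl-C to end.")
b.trace_print()
```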
When the Pod diagnoser completes the diagnosis of the Pod, it persists the application diagnostic information obtained by the plug-ins into log or file storage (c5 in fig. 6) for the user to query for further diagnosis.
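As one possible shape for this persistence step, the following sketch writes each container's collected information as a JSON file under a per-application directory; the paths and layout are illustrative assumptions, not the disclosed storage format.

```python
# Sketch: persist collected diagnostic information to file storage.
import json
import pathlib
import time


def persist(app: str, container: str, info: dict,
            root: str = "/var/log/diagnosis") -> pathlib.Path:
    out_dir = pathlib.Path(root) / app
    out_dir.mkdir(parents=True, exist_ok=True)
    out = out_dir / f"{container}-{int(time.time())}.json"
    out.write_text(json.dumps(info, indent=2, default=str))
    return out
```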
Up to this point, the application diagnosis assistance method according to the present disclosure has been described in detail.
In the embodiments of the present disclosure, all K8s resources involved in an application are defined using a resource tree, and any tree structure is supported, so that diagnostic information of all resources related to the application during a change can be acquired. The application's resource tree and state can be sensed, and the diagnostic information of all relevant resources during the change is collected and stored.
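The traversal this implies (spelled out in claim 3 below) is an ordinary breadth-first walk over the tree; a minimal sketch, with illustrative types, follows.

```python
# Sketch: breadth-first traversal of an application's resource tree,
# collecting diagnostic information at each resource node.
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ResourceNode:
    kind: str                  # e.g. "Deployment", "ReplicaSet", "Pod"
    name: str
    children: list["ResourceNode"] = field(default_factory=list)


def traverse(root: ResourceNode, collect: Callable[[ResourceNode], None]) -> None:
    queue = deque([root])
    while queue:
        node = queue.popleft()
        collect(node)              # meta/config/state/event info per node
        queue.extend(node.children)

# Usage: traverse(app_root, lambda n: print(n.kind, n.name))
```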
By adopting the resource tree, the application diagnosis assistance method of the disclosed embodiments supports collection of diagnostic information at multiple layers, such as the process, operating system, container, and K8s layers, and obtains unified diagnostic information of the application across those layers.
Additionally, in a further embodiment, by modifying the container's start command, the container's start and exit can be paused, enabling complete diagnostic information collection over the container's full life cycle. In particular, diagnostic information can be collected before the container starts and after it exits, so that a quickly exiting container does not take the scene of the failure with it.
In addition, the diagnosis assistance means for the container is not limited to the tools carried by the image; a plug-in mode may also be adopted, so that diagnostic components can be freely extended and selected plug-ins can be enabled.
FIG. 8 is a schematic diagram of a computing device that can be used to implement the application diagnosis assistance method described above according to an embodiment of the invention.
Referring to fig. 8, computing device 800 includes memory 810 and processor 820.
The processor 820 may be a multi-core processor or may include multiple processors. In some embodiments, processor 820 may include a general-purpose host processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, processor 820 may be implemented using custom circuitry, such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The memory 810 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions needed by the processor 820 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be a non-volatile device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage. In other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random-access memory. The system memory may store the instructions and data that some or all of the processors need at runtime. In addition, the memory 810 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) as well as magnetic and/or optical disks. In some embodiments, memory 810 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, a mini SD card, or a Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 810 has stored thereon executable code that, when processed by the processor 820, may cause the processor 820 to perform the application diagnostic assistance methods described above.
The application diagnosis assistance method according to the present invention has been described above in detail with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the steps defined in the above-described method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. An application diagnosis assistance method for assisting diagnosis of an application based on containers and the Kubernetes platform, comprising the following steps:
generating a resource tree of the resources used by an application based on a resource tree definition configured for the application, the resource tree including a plurality of resource nodes organized in tree form, the plurality of resource nodes respectively corresponding to a plurality of resources used by the application, the resources being Kubernetes resources;
during a change of the application, acquiring diagnostic information of the resource corresponding to each resource node based on the resource tree, wherein the diagnostic information is information helpful for diagnosis; and
storing the acquired diagnostic information of the resource corresponding to each resource node as the diagnostic information of the application.
2. The method of claim 1, wherein the step of acquiring diagnostic information of the resource corresponding to each resource node based on the resource tree during the application change comprises:
in response to creation or update of a root resource, traversing the resource tree corresponding to the application at least once before the creation or update of the application is finished, so as to acquire the diagnostic information of the resource corresponding to each resource node on the resource tree.
3. The method of claim 2, wherein traversing the resource tree corresponding to the application comprises:
putting the root resource node into a queue;
taking a resource node from the queue and acquiring the diagnostic information of the resource corresponding to that resource node; and
if the resource node has child resource nodes, putting the child resource nodes into the queue.
4. The method of claim 2, wherein the diagnostic information of the resource comprises at least one of:
meta information of the resource;
configuration information of the resource;
status information of the resource;
event information describing the various changes of the resource;
description information of the resource.
5. The method of claim 4, wherein the step of storing the acquired diagnostic information of the resource corresponding to each resource node is performed after each round of traversal is completed, wherein:
for resource nodes newly appearing in the current round of traversal, all of their corresponding diagnostic information is stored;
for resource nodes that existed in the previous round of traversal and still exist in the current round, new meta, configuration, state, and/or description information is saved, and new event information is aggregated with the old event information to reflect the changes of the resource over the whole change period; and
for resource nodes that existed in the previous round of traversal but have disappeared in the current round, information about their disappearance is saved.
6. The method of claim 1, further comprising:
in response to detecting an abnormal workload resource from the diagnostic information of a resource, querying the resource information of the abnormal workload resource;
copying the container group related to the abnormal workload resource to obtain a copied container group;
modifying the copied container group to obtain a diagnostic container group that facilitates diagnostic information collection by a container group diagnoser; and
performing, by the container group diagnoser, diagnostic information collection on the containers in the diagnostic container group to obtain the diagnostic information of the containers.
7. The method of claim 6, wherein modifying the copied container group comprises:
adding a pre-run pause program and/or a post-run pause program on the basis of the start command of the containers in the copied container group; and
adding auxiliary programs related to the pre-run pause program and/or the post-run pause program into the containers of the copied container group, thereby obtaining the diagnostic container group for diagnosis.
8. The method of claim 7, wherein the step of performing, by the container group diagnoser, diagnostic information collection on the containers in the diagnostic container group comprises:
performing at least one of pre-run diagnostic information collection, run-trace diagnostic information collection, and post-run diagnostic information collection on the containers in the diagnostic container group.
9. The method of claim 8, wherein the container group diagnoser performs diagnostic information collection on a container by invoking a diagnostic plug-in comprising at least one of:
a system-call plug-in, for acquiring all system calls and events generated during the execution of the container, so as to observe the cause of an application change failure from the operating system layer;
a log-collection plug-in, for acquiring the logs output by the execution of the container, so as to avoid log loss after deletion of the container;
an execution-tracking plug-in, for tracking the execution of user-mode programs, so as to observe the cause of the failure from the application layer;
a resource-collection plug-in, for collecting information on at least one of CPU, memory, network, IO, and files during container startup, so as to assist in locating problems;
a core dump plug-in, for obtaining, based on the core dump generated when a program exits abnormally, the scene information of the program in the abnormal state, so as to locate the fault; and
a custom plug-in, for executing a user-defined container diagnostic function.
10. The method of claim 9, wherein:
the container group diagnoser sets a pre-run hook at the time point before the container runs, registers the diagnostic plug-ins that need to be called and executed before the container runs onto the pre-run hook, executes the pre-run pause program after the containers in the diagnostic container group start so that they enter the pre-run pause state, calls the diagnostic plug-ins registered on the pre-run hook to collect first diagnostic information, and, after all diagnostic plug-ins registered on the pre-run hook have been called and executed, stops the pre-run pause program so that the original container start command is automatically executed; and/or
during the running of the container, the container group diagnoser calls the execution-tracking plug-in to start a background program that tracks the execution of the container, so as to collect second diagnostic information; and/or
the container group diagnoser sets a post-run hook at the time point after the container runs, and registers the diagnostic plug-ins that need to be called and executed after the container runs onto the post-run hook; when all containers in the diagnostic container group have finished running, or have been closed by the container group diagnoser because the running time exceeded the auxiliary diagnosis time, the post-run pause program is executed and the post-run pause state is entered, the diagnostic plug-ins registered on the post-run hook are called, and the information left behind by the process is collected to obtain third diagnostic information; after all diagnostic plug-ins registered on the post-run hook have been called and executed, the container group diagnoser stops the post-run pause program and the diagnostic information collection process of the container ends.
11. The method of claim 6, wherein:
the diagnostic container group and the container group diagnoser are deployed on the same machine node; and/or
the diagnostic container group and the container group diagnoser are deployed on the node where the abnormal workload was originally deployed; and/or
a node having sufficient resources to carry the diagnostic container group and the container group diagnoser is selected, and the diagnostic container group and the container group diagnoser are deployed on the selected node.
12. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1 to 11.
13. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-11.