CN117370128A

CN117370128A - Cloud monitoring and analyzing method and system

Info

Publication number: CN117370128A
Application number: CN202211172568.2A
Authority: CN
Inventors: 张晓磊; 王昊玄
Original assignee: Huawei Cloud Computing Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2022-06-30
Filing date: 2022-09-26
Publication date: 2024-01-09

Abstract

A cloud monitoring and analysis method and system, relating to the field of cloud technology. The method may include: acquiring configuration information input or selected by a tenant in a cloud management platform, where the configuration information represents the tenant's monitoring and analysis requirements for a database, and the database stores service data of the tenant's service system; and analyzing, according to the configuration information, a target Structured Query Language (SQL) statement to be optimized to obtain an analysis report, where the target SQL statement is a statement that was previously executed against the database. The method helps to quickly identify the cause of a load abnormal event in the database, so that normal operation of the service can be restored quickly.

Description

Cloud monitoring and analyzing method and system

The present application claims priority to Chinese patent application No. 202210771078.8, filed with the Intellectual Property Office of the People's Republic of China on June 30, 2022 and entitled "a service providing method based on cloud technology and cloud management platform", the entire contents of which are incorporated herein by reference.

Technical Field

The application relates to the field of cloud technology, in particular to a cloud monitoring and analyzing method and system.

Background

As clouds continue to grow in scale, the business volume and business complexity they carry keep increasing, and so do the demands on the databases that support those businesses. These demands concern not only performance but also the operation and maintenance capabilities that accompany the database. In particular, when the database has a performance bottleneck and the service architecture cannot be adjusted quickly (for example, by splitting databases and tables), it is very important, once the database load becomes high, to quickly identify the source of the concurrency or the business scenario responsible and to execute a coping strategy in time, in order to keep the business running stably and to limit the impact on the database service.

Existing monitoring and analysis means are relatively simple and suffer from problems such as a disconnect between the monitoring function and the analysis function and insufficiently comprehensive analysis dimensions. An end-to-end monitoring and analysis system is therefore needed to quickly identify the cause of a load abnormal event in the database and quickly restore normal operation of the service.

Disclosure of Invention

The embodiments of the present application provide a cloud monitoring and analysis method and system, which help to quickly identify the cause of a load abnormal event in a database so that normal operation of a service can be restored quickly.

In a first aspect, embodiments of the present application provide a cloud monitoring and analysis method, which may be implemented by a cloud monitoring and analysis system. The method may include: acquiring configuration information input or selected by a tenant in a cloud management platform, where the configuration information represents the tenant's monitoring and analysis requirements for a database, and the database stores service data of the tenant's service system; and analyzing, according to the configuration information, a target Structured Query Language (SQL) statement to be optimized to obtain an analysis report, where the target SQL statement is a statement that was previously executed against the database.

By this method, the cloud monitoring and analysis system offers cloud monitoring and analysis of the database as a new form of service, so that the cause of a load abnormal event in the database can be identified quickly and normal operation of the service can be restored quickly.

With reference to the first aspect, in one possible design, the method further includes: receiving the target SQL statement input or selected by the tenant in the cloud management platform.

By this method, the cloud monitoring and analysis system allows the tenant to enter custom target SQL statements to be optimized, which helps to broaden the application scenarios of the cloud monitoring and analysis system.

With reference to the first aspect, in one possible design, analyzing the target SQL statement to be optimized according to the configuration information includes: when a load abnormal event occurs in the database, analyzing the target SQL statement to be optimized according to the configuration information.

By this method, the cloud monitoring and analysis system can provide a cloud monitoring service for the database to detect load abnormal events of the database.

With reference to the first aspect, in one possible design, the configuration information indicates at least one monitoring indicator associated with the load abnormal event and an indicator threshold corresponding to each monitoring indicator, and the method further includes: acquiring indicator data of the at least one monitoring indicator; and when the indicator data is greater than or equal to the corresponding indicator threshold, determining that the load abnormal event has occurred in the database.

By this method, the cloud monitoring and analysis system can receive monitoring indicators that the tenant enters or selects, collect and check data based on those monitoring indicators, and determine whether a load abnormal event has occurred.

With reference to the first aspect, in one possible design, the monitoring indicator includes a system indicator and/or a service indicator. The system indicator is at least one of the following for a computing device to which a service accessing the database belongs: CPU load, memory load, disk read/write (IO) load, network packet loss rate, or network delay. The service indicator includes at least one of the following: total number of database connections, number of active database connections, table bloat rate, master/standby synchronization rate, or number of slow SQL statements per minute.

It should be understood that the monitoring indicators listed here are merely examples and are not limiting. In other embodiments, the monitoring indicators may be adjusted according to the service system or application requirements, which is not described in detail here.
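The threshold check described above can be illustrated with a minimal Python sketch. The class name `IndicatorThreshold`, the function `breached_indicators`, the indicator keys, and the numeric values are all hypothetical and do not come from the application. The sketch also assumes that a single breached indicator is enough to declare a load abnormal event, which the text does not state explicitly.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class IndicatorThreshold:
    """One tenant-configured monitoring indicator and its threshold."""
    name: str         # hypothetical indicator key, e.g. "cpu_load"
    threshold: float  # value at or above which the indicator is breached


def breached_indicators(samples: dict[str, float],
                        config: list[IndicatorThreshold]) -> list[str]:
    """Return the indicators whose sampled value is >= the threshold."""
    breached = []
    for item in config:
        value = samples.get(item.name)  # indicators without a sample are skipped
        if value is not None and value >= item.threshold:
            breached.append(item.name)
    return breached


# Hypothetical configuration and one round of collected indicator data.
config = [
    IndicatorThreshold("cpu_load", 0.85),
    IndicatorThreshold("slow_sql_per_minute", 20),
]
samples = {"cpu_load": 0.91, "slow_sql_per_minute": 3}

breached = breached_indicators(samples, config)
load_abnormal_event = bool(breached)  # assumption: any single breach counts
print(breached, load_abnormal_event)
```

The comparison uses `>=` to match the "greater than or equal to" wording of the design.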

With reference to the first aspect, in one possible design, the configuration information indicates at least one log analysis dimension of the database, and the method further includes: performing log analysis on a target log of the database in the at least one log analysis dimension; and determining the target SQL statement to be optimized according to the analysis result of the target log in the at least one log analysis dimension.

With reference to the first aspect, in one possible design, the at least one log analysis dimension includes: the execution time of a single SQL statement, the total execution time of a single category of SQL statements, the proportion of SQL executions by service, the proportion of SQL connections by host, the SQL latency distribution, or the queries per second (QPS) of SQL statements.
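As a rough illustration of several of these dimensions, the following Python sketch aggregates parsed log entries per SQL category and computes shares by service and by host. The `LogEntry` fields, the function names, and the rule of picking the category with the largest total execution time are assumptions made for this example; the application does not prescribe a selection rule.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    category: str       # normalized SQL text, standing in for "SQL category"
    service: str        # service that issued the statement
    host: str           # host the connection came from
    duration_ms: float  # execution time of this single statement


def analyze_by_category(entries: list[LogEntry], window_s: float) -> dict[str, dict]:
    """Per category: count, total time, slowest single execution, and QPS."""
    stats: dict[str, dict] = {}
    for e in entries:
        s = stats.setdefault(e.category, {"count": 0, "total_ms": 0.0, "max_ms": 0.0})
        s["count"] += 1
        s["total_ms"] += e.duration_ms
        s["max_ms"] = max(s["max_ms"], e.duration_ms)
    for s in stats.values():
        s["qps"] = s["count"] / window_s
    return stats


def share_by(entries: list[LogEntry], attribute: str) -> dict[str, float]:
    """Proportion of executions per value of `attribute` ("service" or "host")."""
    if not entries:
        return {}
    counts = Counter(getattr(e, attribute) for e in entries)
    return {key: n / len(entries) for key, n in counts.items()}


def pick_target(stats: dict[str, dict]) -> str:
    """Assumed rule: the category with the largest total execution time."""
    return max(stats, key=lambda c: stats[c]["total_ms"])


ORDERS = "SELECT * FROM orders WHERE user_id = ?"
entries = [
    LogEntry(ORDERS, "order", "10.0.0.1", 120.0),
    LogEntry(ORDERS, "order", "10.0.0.2", 180.0),
    LogEntry("UPDATE stock SET qty = ? WHERE id = ?", "stock", "10.0.0.1", 250.0),
    LogEntry("SELECT 1", "health", "10.0.0.1", 1.0),
]
stats = analyze_by_category(entries, window_s=2.0)
target = pick_target(stats)
print(target, share_by(entries, "host"))
```

In this sample the `UPDATE` is the slowest single statement, but the `orders` query has the larger total execution time, which shows why the single-statement and per-category dimensions can point to different candidates.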

With reference to the first aspect, in one possible design, analyzing the target SQL statement to be optimized according to the configuration information to obtain an analysis report includes: restoring a backup package of the database at a replay database node; and executing the target SQL statement at the replay database node to obtain the analysis report.
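The idea of restoring a backup on a separate database node (the fork_db of the terminology section) and executing the target SQL statement there can be sketched in Python. SQLite from the standard library is used purely as a stand-in for the tenant's database, and the function names and the report fields are assumptions of this example rather than the application's implementation.

```python
import sqlite3
import time


def restore_to_replay_node(backup_source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the backup into a fresh in-memory database that acts as the replay node."""
    replay = sqlite3.connect(":memory:")
    backup_source.backup(replay)  # sqlite3 online backup API (Python 3.7+)
    return replay


def run_target_sql(replay: sqlite3.Connection, sql: str, params: tuple = ()) -> dict:
    """Execute the target SQL on the replay node and collect a small report."""
    plan_rows = replay.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    start = time.perf_counter()
    rows = replay.execute(sql, params).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "sql": sql,
        "plan": [row[3] for row in plan_rows],  # 4th column is the plan detail text
        "row_count": len(rows),
        "elapsed_ms": elapsed_ms,
    }


# Stand-in for the backed-up production database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders (user_id, amount) VALUES (?, ?)",
                   [(1, 9.5), (1, 20.0), (2, 7.25)])
source.commit()

replay = restore_to_replay_node(source)
report = run_target_sql(replay, "SELECT * FROM orders WHERE user_id = ?", (1,))
print(report["row_count"], report["plan"])
```

Because the statement runs on the restored copy, changes made during analysis (for example, adding a trial index) do not touch the original database.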

With reference to the first aspect, in one possible design, the configuration information further indicates information describing a load scenario of the database, and executing the target SQL statement at the replay database node to obtain the analysis report includes: constructing the load scenario at the replay database node according to the configuration information, and executing the target SQL statement at the replay database node to obtain the analysis report.

By this method, the cloud monitoring and analysis system can provide the tenant with a custom configuration channel through which the tenant can enter background parameters. The cloud monitoring and analysis system can then construct a load background and execute the SQL statement under that background, which improves the analysis efficiency of the cloud monitoring and analysis system.

With reference to the first aspect, in one possible design, the information describing the load scenario of the database includes at least one of: stress mode name, service name, database name, number of concurrent SQL executions, or the SQL statements executed concurrently.

With reference to the first aspect, in one possible design, the analysis report includes at least one of: an analysis result of at least one log analysis dimension of the database; indicator data of at least one monitoring indicator of the database; the target SQL statement; the execution plan of the target SQL statement; or a rendered graph of the execution plan of the target SQL statement.

With reference to the first aspect, in one possible design, the content of the analysis report is presented in at least one of the following manners: progress bars, percentages, pie charts, lists, line charts, or dashboards.

In a second aspect, embodiments of the present application provide a cloud monitoring and analysis system, including: a cloud management platform, configured to receive configuration information input or selected by a tenant, where the configuration information represents the tenant's monitoring and analysis requirements for a database, and the database stores service data of the tenant's service system; and an analysis device, configured to analyze, according to the configuration information, a target Structured Query Language (SQL) statement to be optimized to obtain an analysis report, where the target SQL statement is a statement that was previously executed against the database.

With reference to the second aspect, in one possible design, the cloud management platform is configured to receive the target SQL statement input or selected by the tenant in the cloud management platform.

With reference to the second aspect, in one possible design, the analysis device is configured to: when a load abnormal event occurs in the database, analyze the target SQL statement to be optimized according to the configuration information.

With reference to the second aspect, in one possible design, the configuration information indicates at least one monitoring indicator associated with the load abnormal event and an indicator threshold corresponding to each monitoring indicator, and the analysis device is further configured to: acquire indicator data of the at least one monitoring indicator; and when the indicator data is greater than or equal to the corresponding indicator threshold, determine that the load abnormal event has occurred in the database.

With reference to the second aspect, in one possible design, the monitoring indicator includes a system indicator and/or a service indicator. The system indicator is at least one of the following for a computing device to which a service accessing the database belongs: CPU load, memory load, disk read/write (IO) load, network packet loss rate, or network delay. The service indicator includes at least one of the following: total number of database connections, number of active database connections, table bloat rate, master/standby synchronization rate, or number of slow SQL statements per minute.

With reference to the second aspect, in one possible design, the configuration information indicates at least one log analysis dimension of the database, and the analysis device is further configured to: perform log analysis on a target log of the database in the at least one log analysis dimension; and determine the target SQL statement to be optimized according to the analysis result of the target log in the at least one log analysis dimension.

With reference to the second aspect, in one possible design, the at least one log analysis dimension includes: the execution time of a single SQL statement, the total execution time of a single category of SQL statements, the proportion of SQL executions by service, the proportion of SQL connections by host, the SQL latency distribution, or the queries per second (QPS) of SQL statements.

With reference to the second aspect, in one possible design, the analysis device is configured to: restore a backup package of the database at a replay database node; and execute the target SQL statement at the replay database node to obtain the analysis report.

With reference to the second aspect, in one possible design, the configuration information further indicates information describing a load scenario of the database, and the analysis device is configured to: construct the load scenario at the replay database node according to the configuration information, and execute the target SQL statement at the replay database node to obtain the analysis report.

With reference to the second aspect, in one possible design, the information describing the load scenario of the database includes at least one of: stress mode name, service name, database name, number of concurrent SQL executions, or the SQL statements executed concurrently.

With reference to the second aspect, in one possible design, the analysis report includes at least one of: an analysis result of at least one log analysis dimension of the database; indicator data of at least one monitoring indicator of the database; the target SQL statement; the execution plan of the target SQL statement; or a rendered graph of the execution plan of the target SQL statement.

With reference to the second aspect, in one possible design, the content of the analysis report is presented in at least one of the following manners: progress bars, percentages, pie charts, lists, line charts, or dashboards.

In a third aspect, embodiments of the present application provide a cluster of computing devices, including at least one computing device, each computing device including a processor and a memory; the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method of the first aspect or any one of the possible designs of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer program product comprising instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of the first aspect or any one of the possible designs of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer readable storage medium comprising computer program instructions which, when executed by a cluster of computing devices, perform the method of the first aspect or any of the possible designs of the first aspect.

On the basis of the implementations provided in the above aspects, the present application may combine them further to provide additional implementations.

Drawings

FIG. 1 is a schematic diagram of a system architecture to which embodiments of the present application are applicable;

FIG. 2 shows a schematic diagram of a management interface for service registration according to an embodiment of the present application;

FIG. 3 shows a schematic diagram of an interface for managing SQL templates according to an embodiment of the application;

FIG. 4 shows a schematic diagram of a management interface for configuring monitoring metrics according to an embodiment of the present application;

FIG. 5 illustrates a schematic diagram of an interface for managing system metric monitoring threshold templates according to an embodiment of the present application;

FIG. 6 shows a schematic diagram of an interface for managing a database according to an embodiment of the present application;

FIG. 7 illustrates a schematic diagram of an interface for managing business metric monitoring threshold templates according to an embodiment of the present application;

FIGS. 8-10 illustrate schematic diagrams of analysis results of at least one log analysis dimension according to an embodiment of the present application;

FIG. 11 illustrates a schematic diagram of an interface for managing a stress test template according to an embodiment of the present application;

FIG. 12 illustrates a schematic diagram of an interface for configuring SQL parameters according to an embodiment of the application;

FIG. 13 shows a flow diagram of a cloud monitoring and analysis method of an embodiment of the present application;

FIG. 14 shows a flow diagram of a cloud monitoring and analysis method of an embodiment of the present application;

FIG. 15 shows a schematic diagram of an output interface of an embodiment of the present application;

FIG. 16 illustrates a schematic diagram of a computing device of an embodiment of the present application;

FIGS. 17-18 illustrate schematic diagrams of clusters of computing devices according to embodiments of the present application.

Detailed Description

In order to facilitate understanding of the technical solutions of the embodiments of the present application, some terms related to the embodiments of the present application will be described first.

1. Database (DB):

A database is a collection of data that is organized according to a data model and stored in secondary storage. Such a data collection has the following characteristics: it avoids duplication as far as possible, serves the applications of a particular organization in an optimal way, has a data structure that is independent of the applications that use it (such as a tenant's service system), and has its data insertion, deletion, modification, and retrieval uniformly managed and controlled.

In the embodiments of the present application, to ensure the stability of the system and the availability of the data, the database may adopt a master/standby architecture that includes a master node and a standby node. The business system can write data to the master node of the database and query data through the master node. Under normal conditions the standby node only backs up data; only when the master node is down does the standby node provide read and write services for the business system.
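The division of work between the master node and the standby node can be sketched as follows in Python. The classes `Node` and `MasterStandbyDatabase` are hypothetical, and the backup is modeled as an immediate synchronous copy, which is a simplification chosen for this example and not something the application specifies.

```python
class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.data: dict = {}


class MasterStandbyDatabase:
    """Reads and writes go to the master; the standby serves only after failover."""

    def __init__(self) -> None:
        self.master = Node("master")
        self.standby = Node("standby")

    def serving_node(self) -> Node:
        if self.master.alive:
            return self.master
        if self.standby.alive:
            return self.standby
        raise RuntimeError("no database node available")

    def write(self, key, value) -> None:
        node = self.serving_node()
        node.data[key] = value
        if node is self.master and self.standby.alive:
            self.standby.data[key] = value  # simplified synchronous backup

    def read(self, key):
        return self.serving_node().data[key]


db = MasterStandbyDatabase()
db.write("order:1", "paid")
served_before = db.serving_node().name

db.master.alive = False            # simulate the master node going down
served_after = db.serving_node().name
value_after_failover = db.read("order:1")
print(served_before, served_after, value_after_failover)
```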

2. Replication database (fork_db):

A fork is a clone of a database. Cloning a database makes it possible to freely test various changes without affecting the original database.

In the embodiments of the present application, fork_db is used to obtain and restore a backup package of the live (production) database for testing.

3. Structured query language (structured query language, SQL):

SQL is a database query and programming language used to access data and to query, update, and manage relational database systems. SQL statements are the means by which a database is operated on.

SQL is a high-level, non-procedural programming language that allows the user to work on high-level data structures. It does not require the user to specify how data is stored or to know the specific storage method, so database systems with completely different underlying structures can use the same structured query language as the interface for data input and management. SQL statements can be nested, which makes the language highly flexible and powerful.

4. Object storage service (Object Storage Service, OBS):

OBS is an object-based mass storage service that provides users with massive, secure, highly reliable, and low-cost data storage capacity.

The basic components of OBS are buckets and objects. A bucket is a container for storing objects in OBS. Each bucket has its own storage class, access permissions, and region, and a user locates a bucket on the Internet through its access domain name. An object is the basic unit of data storage in OBS. An object is the aggregate of a file's data and its related attribute information, and consists of three parts: a key (Key), metadata (Metadata), and data (Data).
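The bucket and object structure can be modeled with a short Python sketch. This is only an in-memory illustration of the key, metadata, and data parts described above; the class names, field names, and sample values are hypothetical and are not the API of any real object storage SDK.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StoredObject:
    key: str                    # Key: name that identifies the object in its bucket
    data: bytes                 # Data: the file content
    metadata: dict[str, str] = field(default_factory=dict)  # Metadata: attributes


@dataclass
class Bucket:
    name: str
    storage_class: str          # each bucket has its own storage class
    region: str                 # and the region it belongs to
    objects: dict[str, StoredObject] = field(default_factory=dict)

    def put(self, obj: StoredObject) -> None:
        self.objects[obj.key] = obj

    def get(self, key: str) -> StoredObject:
        return self.objects[key]


# Hypothetical bucket holding a database backup package.
bucket = Bucket(name="db-backups", storage_class="standard", region="region-1")
bucket.put(StoredObject(
    key="orders_db/backup-0001.tar",
    data=b"...backup bytes...",
    metadata={"content-type": "application/x-tar"},
))
restored = bucket.get("orders_db/backup-0001.tar")
print(restored.key, restored.metadata)
```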

5. Cloud resources:

A cloud management platform provides cloud resources to tenants (users who purchase cloud resources). Cloud resources include cloud services, such as virtual private cloud (Virtual Private Cloud, VPC) network services, gateway services, firewall services, network address translation (Network Address Translation, NAT) services, cloud disks, elastic public IP (EIP), cloud monitoring services, and other cloud services provided by cloud vendors, as well as cloud instances, such as virtual machines, containers, or bare metal servers, which are virtual instances that a cloud vendor provides to tenants in the vendor's data centers. The embodiments of the present application do not limit the product form of cloud resources.

6. Application programming interface gateway (application programming interface gateway, APIG):

An API gateway is a very common pattern in microservice architectures. As the unified entry point of the system, the APIG integrates the individual microservices while remaining client-friendly and hiding the complexity and diversity of the system.

7. Cloud log service (cloud log service, CLS):

CLS provides a one-stop log data solution, offering comprehensive, stable, and reliable log services from log collection and log storage to log content search and statistical analysis. It helps solve log-related problems such as business troubleshooting, metric monitoring, and security auditing, and lowers the barrier to log operation and maintenance.

8. SQL execution plan (EXPLAIN): a description of the process by which an SQL statement is executed in a database.

A user can view the logical execution plan that the optimizer generates for a given SQL statement by using the EXPLAIN command. To analyze a performance problem of an SQL statement, the execution plan of that statement should be examined first to check whether any step of its execution is problematic. Being able to read an execution plan is a prerequisite for SQL optimization, and understanding the operators of an execution plan is key to understanding the output of the EXPLAIN command.
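As a small illustration of reading an execution plan, the Python sketch below uses SQLite's `EXPLAIN QUERY PLAN` from the standard library to compare the plan of the same query before and after an index is created. SQLite is an assumption made only so that the example is self-contained; the EXPLAIN syntax and output of the tenant's actual database engine will differ, and the table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)")


def explain(sql: str) -> list:
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT * FROM orders WHERE user_id = 42"

plan_before = explain(query)   # no index on user_id: the table is scanned
conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
plan_after = explain(query)    # the optimizer can now search via the index

print(plan_before)
print(plan_after)
```

The exact plan text varies between SQLite versions, so the example prints the plans rather than quoting them; the point is that the plan changes from a full scan to an index search.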

The present application is described in detail below with reference to the accompanying drawings and examples.

FIG. 1 shows a schematic diagram of a system architecture to which embodiments of the present application are applicable.

Referring to FIG. 1, a cloud monitoring and analysis system 100 and a tenant's business system 200 may be included in the system architecture.

Wherein, the business system 200 of the tenant can implement its own business by providing at least one service. The cloud monitoring and analysis system 100 may be connected to the tenant's business system 200, and may provide cloud monitoring services for the tenant's business system 200 to monitor and analyze the operation of the tenant's business system 200. In an alternative implementation manner, the service system 200 may be a cloud service system, and at least one service provided by the service system 200 may be a cloud service, which is not limited in the implementation manner of the service system 200 in this embodiment of the present application.

By way of example, the service system of the tenant may include a database, and the cloud monitoring and analysis system 100 may provide a cloud monitoring service for the database of the service system 200 of the tenant, so as to quickly identify an abnormal cause when a load abnormal event occurs in the database by monitoring and analyzing a load condition of the database, so as to quickly restore normal operation of the service. The cloud monitoring and analysis system 100 may include, for example, but is not limited to, the following functional modules: registration device 110, monitoring device 120, log device 130, log analysis processing cluster 140, SQL statement analysis device 150, replay database node 160, and cloud management platform 170. The tenant's business system 200 may include, but is not limited to, the following functional modules: APIG 210, service device 220, database 230, object storage service (OBS) 240, and internal management service 250.

In this embodiment of the present application, the functional modules included in the cloud monitoring and analysis system 100 and in the service system 200 of the tenant may differ depending on the services that the service system 200 actually provides or the cloud monitoring service that the cloud monitoring and analysis system 100 actually provides, which is not limited in this embodiment of the present application. Moreover, each functional module of the cloud monitoring and analysis system 100 and of the service system 200 of the tenant may be implemented by software or by hardware. An implementation of the monitoring device 120 is described next as an example. The implementations of the registration device 110, the log device 130, the log analysis processing cluster 140, the SQL statement analysis device 150, the replay database node 160, the cloud management platform 170, the APIG 210, the service device 220, the database 230, the object storage service 240, and the internal management service 250 may similarly refer to the implementation of the monitoring device 120.

As an example of a module implemented as a software functional unit, the monitoring device 120 may include code that runs on a computing instance. The computing instance may be at least one of a physical host (computing device), a virtual machine, or a container, and there may be one or more such computing instances. For example, the monitoring device 120 may include code running on multiple hosts/virtual machines/containers. The multiple hosts/virtual machines/containers running the code may be distributed in the same region or in different regions. They may also be distributed in the same availability zone (AZ) or in different AZs, where each AZ comprises one data center or multiple geographically close data centers, and a region typically comprises multiple AZs. Likewise, the multiple hosts/virtual machines/containers running the code may be distributed in the same virtual private cloud (VPC) or in multiple VPCs, where a VPC is typically located within one region. For communication between two VPCs in the same region, or cross-region communication between VPCs in different regions, a communication gateway needs to be set up in each VPC, and the VPCs are interconnected through the communication gateways.

As an example of a module implemented as a hardware functional unit, the monitoring device 120 may include at least one computing device, such as a server. Alternatively, the monitoring device 120 may be a device implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The multiple computing devices included in the monitoring device 120 may be distributed in the same region or in different regions, and in the same AZ or in different AZs. Likewise, they may be distributed in the same VPC or in multiple VPCs. The multiple computing devices may be any combination of servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.

For ease of understanding, the functions of the functional modules of the cloud monitoring and analysis system 100 and of the service system 200 of the tenant shown in FIG. 1 are described in detail below with reference to the drawings and embodiments, taking the provision of a cloud monitoring service for the database of the service system as an example.

1. Tenant's business system 200:

In this embodiment of the present application, the service system 200 of the tenant may include, but is not limited to, the following functional modules: APIG 210, service device 220, database 230, object storage service 240, and internal management service 250.

The APIG 210 is the communication interface between the service system 200 of the tenant and the electronic device of the tenant. The electronic device of the tenant may connect, through the APIG 210, to at least one service provided by the service system 200, and the at least one service may cooperate to implement a business.

Service device 220 of tenant's business system 200 may provide at least one service to the tenant, represented, for example, as service 221, service 222, service 223, and so on.

While the at least one service implements the business, it may interact with the master node 231 of the database 230 to store generated service data in the master node 231 or to read data required by the relevant service from the master node 231. The master node 231 and the standby node 232 of the database 230 may communicate with each other to back up the data stored in the master node 231 to the standby node 232. When the master node 231 fails, the standby node 232 can take over the interaction with the at least one service of the service device 220 and provide the data storage service for the at least one service, so as to keep the business running.

The object storage service 240 may be used together with the database 230. The object storage service 240 may store backup packages of the database 230 without capacity limits needing to be considered during use, and offers multiple storage types to choose from, which can meet the requirements of the tenant's various business scenarios.

The internal management service 250 may be connected to the APIG 210 and is used to implement, through the APIG 210, internal management of the tenant's business system 200, including but not limited to recruitment, onboarding, offboarding, personnel management, and IT service management of the tenant enterprise.

It should be noted that FIG. 1 merely illustrates examples of the functional modules of the service system 200 in the embodiment of the present application. In other embodiments, the devices or functional modules used in the service system 200 may be modified according to the specific application scenario or service requirements, which is not described in detail here.

2. Cloud monitoring and analysis system 100:

In this embodiment of the present application, the cloud monitoring and analysis system 100 may be connected to the service system 200 of the tenant and may provide a cloud monitoring service and an analysis service for the service system 200, for example to monitor the operation condition of the service system 200 and to analyze the cause of a load abnormal event of the service system 200, so that normal operation of the service system 200 can be restored quickly.

By way of example, the cloud monitoring and analysis system 100 may include, but is not limited to, the following functional modules: registration device 110, monitoring device 120, log device 130, log analysis processing cluster 140, SQL statement analysis device 150, replication (fork) database node 160, and cloud management platform 170. It should be understood that, in the embodiment of the present application, the console management interface is only one example of a communication interface that the cloud management platform provides for the tenant and does not constitute a limitation; in other embodiments, the communication interface may take various other forms. In addition, the devices of the cloud monitoring and analysis system 100 other than the cloud management platform 170 may be collectively referred to as analysis devices.

(1) Registration device 110:

in this embodiment of the present application, the registration device 110 may be configured to implement a management function for a service (for example, the service 221, the service 222, the service 223, etc.) accessing the database 230 and a management function for a business scenario of the service, and may be configured to collect and record a correspondence between the business scenario of the service and a corresponding database SQL template.

The management functions of the service may include managing information on the services connected to the cloud monitoring and analysis system 100, including, for example, registration, update, and deregistration of service information. The tenant or the service system 200 of the tenant may register a service with the registration device 110, and the information to be provided may include, but is not limited to, the following information of the service: product department, service domain, service name, micro-service name, database accessed, service owner, service team members, etc. When the information of a registered service needs to be updated, the registration device 110 may receive and store the updated service information from the service system 200 of the tenant. When a registered service needs to be deregistered, the registration device 110 may delete the relevant registration information of the service after receiving a deregistration instruction from the service system 200.

The management functions of the service scenarios include managing and maintaining various service scenarios that may be involved in the service, and the correspondence of the various service scenarios to the SQL templates that the service scenarios execute in the database 230, such as one-to-one relationships, one-to-many relationships, and the like.
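The correspondence described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory store keyed by (service, business scenario); the class name `ScenarioRegistry`, its methods, and the `node_specs` table in the sample SQL templates are illustrative only and are not part of the described system.

```python
from collections import defaultdict


class ScenarioRegistry:
    """Hypothetical in-memory store of business scenario -> SQL template bindings."""

    def __init__(self):
        # (service, scenario) -> list of SQL templates.
        # One entry in the list is a one-to-one relationship, several are one-to-many.
        self._bindings = defaultdict(list)

    def bind(self, service, scenario, sql_template):
        """Associate an SQL template with a business scenario (idempotent)."""
        templates = self._bindings[(service, scenario)]
        if sql_template not in templates:
            templates.append(sql_template)

    def unbind(self, service, scenario, sql_template):
        """Remove a binding if it exists."""
        templates = self._bindings.get((service, scenario), [])
        if sql_template in templates:
            templates.remove(sql_template)

    def templates_for(self, service, scenario):
        """Return the SQL templates bound to one business scenario."""
        return list(self._bindings.get((service, scenario), []))

    def scenarios_for(self, sql_template):
        """Reverse lookup: which (service, scenario) pairs execute this template."""
        return [key for key, tpls in self._bindings.items() if sql_template in tpls]


registry = ScenarioRegistry()
registry.bind("ECS", "node specification management",
              "SELECT * FROM node_specs WHERE id = ?")
registry.bind("ECS", "node specification management",
              "SELECT COUNT(*) FROM node_specs")
```

The reverse lookup is the direction that matters during analysis: given an SQL template found in the logs, it returns the business scenarios that may have issued it.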

In an alternative embodiment, the registration device 110 may provide visual registration and management functions for tenants through a cloud management platform. Based on the interface definition of service registration, the tenant may use the tenant's electronic device and the corresponding console management interface to register, with the cloud monitoring and analysis system 100, each service of the service system 200 that implements the business and needs to access the database 230. In this way, when the cloud monitoring and analysis system 100 provides the cloud monitoring service to the service system 200 of the tenant, it can detect a load abnormal event of the database 230, quickly analyze and identify the concurrent source or business scenario, and execute a coping strategy in time, so as to maintain stable operation of the service and reduce the impact on the database service.

For example, as shown in fig. 2, in the console management interface provided by the registration device 110, the tenant may register a service accessing the database 230 with the cloud monitoring and analysis system 100 by binding the business scenarios of the service to be registered with SQL templates, thereby registering the correspondence between the various business scenarios implemented by the service and the SQL templates. The tenant may add an SQL template for a business scenario by clicking an "add" button, or delete the selected SQL template by clicking a "delete" button. When the tenant clicks the "add" button, an interface for adding an SQL template, as shown in fig. 3, may pop up over the console management interface, and the tenant may input or select the corresponding parameters in the attribute configuration items provided by that interface, so as to bind (or associate) the business scenario with the SQL template.

As shown in fig. 3, the attribute configuration items available for tenant configuration may include, but are not limited to: a service attribute, used to indicate the service associated with the business scenario, such as an elastic cloud server (ECS); a micro-service attribute, used to indicate the micro-service associated with the business scenario, such as nova; a business scenario attribute, used to describe the business scenario, such as node specification (flag) management; an interface attribute, used to indicate the interface associated with the business scenario, such as querying node specifications; an interface description attribute, used to describe the purpose of the interface associated with the business scenario, such as querying the specification information of a node; and related SQL operation attributes, used to indicate the SQL template, which may include an SQL attribute, an SQL description attribute, and the like. The tenant may add or remove related SQL operations of the newly added SQL template by clicking the plus sign "+" button or the minus sign "-" button. After configuring the related attributes of the newly added SQL template, the tenant may complete the configuration by clicking a "confirm" button, or abandon it by clicking a "cancel" button. After the tenant confirms the configuration, the registered business scenario and the SQL template associated with it are presented on the console management interface shown in fig. 2. In an alternative embodiment, the tenant may view the registration information of each registered service through the console management interface. If necessary, the tenant may modify the registration information of the business scenario and the correspondence between the business scenario and the SQL template through a "modify" button. The modification interface is similar to the interface shown in fig. 3; reference may be made to the description of fig. 3, and details are not described herein again.

The interface diagrams shown in fig. 2 and fig. 3 use only the business scenario of the ECS and its bound SQL template as an example to illustrate, without limitation, the management of services accessing the database 230, the management of the business scenarios of those services, and the management of the correspondence between business scenarios and SQL templates. In practical applications, the tenant may, according to its own service system or application requirements, register a service accessing the database 230 in the cloud monitoring and analysis system 100 through a management interface similar to that of fig. 2 or fig. 3 and manage the service and its business scenarios. The corresponding attribute configuration items may also be adjusted according to the tenant's own service system or application requirements, which is not described herein again.

(2) Monitoring device 120:

In this embodiment of the present application, the monitoring device 120 is configured to collect, according to preconfigured information, the monitoring information (also referred to as index data) generated when the cloud monitoring service provided by the cloud monitoring and analysis system 100 monitors the service system 200 of the tenant, to display the collected monitoring information, and to provide an automatic analysis capability.

Illustratively, the key metrics that the cloud monitoring service monitors for the tenant's service system 200 may fall into two categories: system metrics and business metrics. Taking the database as an example, the system metrics may include, for example, the CPU load, memory load, disk read/write (IO) load, network packet loss rate, and network delay (ping) of the computing device hosting a service that accesses the database. The business metrics may include the total number of connections to the database, the number of active connections to the database, the table expansion rate, the primary/backup synchronization rate, the number of slow SQL statements per minute, and the like.

In an alternative embodiment, the monitoring device 120 may provide visual configuration and management functions for the tenant through a cloud management platform. Based on the interface definition of the cloud monitoring service, the tenant may configure the cloud resources and monitoring indexes to be monitored through its terminal device and the corresponding console management interface, so that the monitoring device 120 can monitor the operation condition of the tenant's service system (e.g., the database) in real time or periodically. When a load abnormal event (e.g., a load surge) occurs in the database, the monitoring device 120 can then provide an automatic analysis capability that helps to quickly identify the concurrent source or business scenario and execute a coping strategy in time, so as to maintain stable operation of the service and reduce the impact on the database service.

For example, a load anomaly event may be detected by configuring a database with a relevant index monitoring threshold, and when a relevant index associated with the database is detected to be greater than or equal to the configured index threshold, the load anomaly event is deemed to occur.
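The threshold check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `find_breaches` and the metric keys are hypothetical, and the threshold values simply mirror the illustrative values this description gives for a system index monitoring threshold template (CPU 0.9, memory 0.9, IO 0.8, packet loss rate 0.2).

```python
def find_breaches(metrics, thresholds):
    """Return the metrics whose value is greater than or equal to its threshold.

    Metrics without a configured threshold are ignored.
    """
    return {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name]
    }


# Hypothetical threshold template (values are illustrative only).
template_001 = {
    "cpu_load": 0.9,
    "memory_load": 0.9,
    "io_load": 0.8,
    "packet_loss_rate": 0.2,
}

# One hypothetical sample of collected index data.
sample = {
    "cpu_load": 0.95,
    "memory_load": 0.40,
    "io_load": 0.80,
    "packet_loss_rate": 0.01,
}

breaches = find_breaches(sample, template_001)
load_anomaly = bool(breaches)  # a load abnormal event is deemed to occur on any breach
```

Note that the comparison is inclusive (`>=`), following the "greater than or equal to" wording of the paragraph above; here `io_load` equals its threshold and therefore counts as a breach.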

As shown in fig. 4, taking the configuration of a system index monitoring threshold template for a database at a console management interface as an example, the tenant may add a system index monitoring threshold template by clicking an "add" button in the console management interface, or delete a selected system index monitoring threshold template by clicking a "delete" button.

When the tenant clicks "add", the "add system index monitoring threshold template" interface shown in fig. 5 may pop up over the console management interface. Similarly to the configuration interface shown in fig. 3, the tenant may input or select the corresponding parameters in the attribute configuration items provided by the interface shown in fig. 5, so as to configure the system index monitoring threshold template. The attribute configuration items may include, for example, threshold configuration items for key indexes to be monitored, such as CPU, memory, IO, network packet loss rate, and network delay. After the tenant clicks the "confirm" button to complete the configuration, the parameters of the corresponding system index monitoring threshold template are presented on the console management interface shown in fig. 4, such as a template identifier (ID) (e.g., 001), the CPU load threshold (e.g., 0.9), memory load threshold (e.g., 0.9), IO load threshold (e.g., 0.8), and network packet loss rate threshold (e.g., 0.2) of the template, the creator (e.g., Zhang San 00123456), and related operation buttons (e.g., bind, modify, etc.). If necessary, the tenant may modify the parameters of the template by clicking the "modify" button associated with the system index monitoring threshold template. The modification interface is similar to the interface shown in fig. 5; reference may be made to the description of fig. 5, and details are not described herein again.

When the tenant clicks the "bind" button of a specific system index monitoring threshold template, a configuration interface for the object to be associated with that template may pop up over the console management interface, and the tenant may input or select the corresponding parameters in the attribute configuration items provided by the interface, so as to associate the target object to be monitored with the selected system index monitoring threshold template.

Taking the case where the target object to be monitored is a database as an example, in the "associated database information" interface shown in fig. 6, the attribute configuration items may include, for example, the area to which the database belongs, AZ, data center product (performance optimization datacenter, POD), database name, database master node IP, and the like. The tenant may likewise input or select the corresponding parameter in each attribute configuration item. After the tenant clicks the "confirm" button to complete the configuration, the database information associated with the system index monitoring threshold template is presented on the interface shown in fig. 4, such as the area to which the database belongs (e.g., north-beijing four), AZ (e.g., AZ 1), POD (e.g., POD 15), database name (e.g., gaussdb_nova), database master node IP (e.g., 10.77.24.177), and related operation buttons (e.g., bind, unbind, modify). If desired, the tenant may modify the database information by clicking the "modify" button, or change which databases are associated with the template by clicking the "unbind" button or the "bind" button. For example, the "modify" button may be used to modify the bound database information, the "unbind" button may be used to delete database information bound to the template, and the "bind" button may be used to add database information to be bound to the template. It should be understood that this is merely an example and not a limitation. In practical applications, the tenant may adjust the attribute configuration manner or the attribute configuration items according to its own service system or application requirements, which is not described herein.

Similar to configuring a system index monitoring threshold template for a database, the console management interface shown in fig. 4 may also be used to configure a business index monitoring threshold template for the database (not shown in the figure). The configuration interface of the business index monitoring threshold template may be as shown in fig. 7, and the tenant may input or select the corresponding parameters in the attribute configuration items provided by the interface to configure the template. The attribute configuration items may include, for example, the total number of database connections (e.g., 2000), the number of active connections (e.g., 85), the table expansion rate (e.g., 0.3), the primary/backup synchronization rate (e.g., 0.2), and the number of slow SQL statements per minute (e.g., 100). After the tenant clicks the "confirm" button to complete the configuration, the parameters of the corresponding business index monitoring threshold template are presented on the interface shown in fig. 4 (not shown in fig. 4). If desired, the tenant may also modify the parameters of the business index monitoring threshold template by clicking the "modify" button associated with the template at the interface shown in fig. 4. For similar configuration details, refer to the description above in connection with fig. 4 to fig. 6, which is not repeated here.

Based on the interface diagrams shown in fig. 4 to fig. 7, after the tenant completes the functional configuration of the monitoring device 120, the monitoring device 120 can, while the service system 200 of the tenant is running its business, collect the various index data of the object to be monitored (such as a database), display the collected monitoring information, and provide an automatic analysis capability.

The automatic analysis capability means that a threshold may be set for one key index of interest or for a group of such indexes. When the collected value of an index exceeds the preset threshold, a load abnormal event is considered to have occurred on the monitored object. A call to the log analysis processing cluster 140 may then be triggered to compute statistics on the SQL executed within a time range around the moment at which the load abnormal event occurred (the abnormal moment for short), and to output SQL statistical results and an SQL execution plan according to the preconfigured log analysis dimensions, so as to quickly identify the concurrent source or business scenario, execute a coping strategy in time, maintain stable operation of the service, and reduce the impact on the database service. The automatic analysis capability is described in detail below together with the functions of the log analysis processing cluster 140 and is not detailed here.
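Selecting the SQL "in a time range near the abnormal moment" can be sketched as below. The description does not specify how wide the range is, so the symmetric 15-minute margin is purely an assumption, as are the function names and the `time` field of the log records.

```python
from datetime import datetime, timedelta


def analysis_window(anomaly_time, margin=timedelta(minutes=15)):
    """Return (start, end) of a symmetric window around the abnormal moment.

    The 15-minute default margin is an assumed value, not taken from the source.
    """
    return anomaly_time - margin, anomaly_time + margin


def records_in_window(records, start, end):
    """Keep only the log records whose timestamp falls inside [start, end]."""
    return [r for r in records if start <= r["time"] <= end]


# Hypothetical SQL log records.
logs = [
    {"time": datetime(2022, 9, 26, 9, 30), "sql": "SELECT 1"},
    {"time": datetime(2022, 9, 26, 10, 5), "sql": "SELECT 2"},
    {"time": datetime(2022, 9, 26, 10, 40), "sql": "SELECT 3"},
]

start, end = analysis_window(datetime(2022, 9, 26, 10, 0))
selected = records_in_window(logs, start, end)
```

Only the records selected this way would then be handed to the statistics step, which keeps the analysis focused on the SQL that ran while the load was abnormal.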

It should be noted that, in the embodiments of the present application, the above description related to fig. 4 to fig. 7 is merely an exemplary illustration of the system index monitoring threshold template and the business index monitoring threshold template of the database and is not limiting in any way. In other embodiments, the system indexes or the business indexes may be changed according to the specific application scenario or service requirements, which is not limited in the embodiments of the present application. In addition, when a system index monitoring threshold template or a business index monitoring threshold template is configured for a specific service, the execution order of the configuration process is not limited to the above description. For example, the database (or other service) may be specified first, and the system index monitoring threshold template or the business index monitoring threshold template may then be configured for it, which is not described herein again.

(3) Log device 130:

in the embodiment of the present application, the log device 130 is configured to collect log records of a target object (for example, the database 230) to be monitored. The log device 130 may be coupled to the log analysis processing cluster 140 and may provide the collected log records to the log analysis processing cluster 140 so that the log analysis processing cluster 140 performs log analysis.

(4) Log analysis processing cluster 140:

In the embodiment of the present application, the log analysis processing cluster 140 is a cluster of nodes that provides log analysis capabilities. The log analysis processing cluster 140 may include a plurality of nodes, represented for example as node 01, node 02, node 03, and so on. At least one of the plurality of nodes may cooperate, according to application scenarios, service requirements, and the like, to execute the related log analysis steps. The embodiment of the present application does not limit how specific log analysis steps are divided among the executing nodes.

The log analysis processing cluster 140 may be connected to the registration device 110 to obtain the registered service information of the service system 200 from the registration device 110. The log analysis processing cluster 140 may be connected to the monitoring device 120 to learn the information of the object to be monitored from the monitoring device 120. The log analysis processing cluster 140 may be coupled to the log device 130 to obtain the log records to be analyzed from the log device 130. Taking the case where the target object to be monitored is a database as an example, the log analysis processing cluster 140 may determine the range of log records to be analyzed from the following information: area, POD, mode (schema), database master node IP, database, log analysis time range (including start time, end time, etc.), SQL template, the service to which the SQL belongs, the component to which the SQL belongs, service node IP, etc.
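One way to picture this range is as a filter over log records in which every unset field places no constraint. The sketch below is an illustration under assumptions: the `LogScope` class, its field names, and the record keys are hypothetical, and the time range and SQL template fields are omitted for brevity.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class LogScope:
    """Hypothetical description of the range of log records to analyze."""
    region: Optional[str] = None
    pod: Optional[str] = None
    schema: Optional[str] = None
    database: Optional[str] = None
    master_ip: Optional[str] = None
    service: Optional[str] = None
    component: Optional[str] = None

    def matches(self, record: dict) -> bool:
        # A field left as None places no constraint on the record.
        return all(
            wanted is None or record.get(field) == wanted
            for field, wanted in asdict(self).items()
        )


# Hypothetical log records using the example values from this description.
records = [
    {"pod": "POD 15", "database": "gaussdb_nova", "service": "ECS", "component": "nova"},
    {"pod": "POD 15", "database": "gaussdb_nova", "service": "EVS", "component": "cinder"},
    {"pod": "POD 16", "database": "other_db", "service": "ECS", "component": "nova"},
]

scope = LogScope(pod="POD 15", database="gaussdb_nova", service="ECS")
in_scope = [r for r in records if scope.matches(r)]
```

An empty `LogScope()` matches every record, which corresponds to analyzing everything the log device has collected.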

In a specific implementation, in an alternative embodiment, the log analysis processing cluster 140 may provide visual configuration and management functions for tenants through the cloud management platform 170. Based on the interface definition of log analysis, the tenant may configure a log analysis task through the tenant's electronic device and the corresponding console management interface to indicate the range of log records to be analyzed, so that the log analysis processing cluster 140 can automatically obtain the log records in that range from the log device 130 and analyze them to obtain a log analysis result.

In another alternative embodiment, the log analysis processing cluster 140 may be connected to and invoked by other devices to provide automated analysis capabilities for those devices. For example, where the monitoring device 120 invokes the log analysis processing cluster 140 to perform an automated analysis, the monitoring device 120 may invoke the log analysis processing cluster 140 when, by monitoring the relevant indexes of the service device 220 or the database 230, it determines that a load abnormal event has occurred (e.g., the value of one or more system indexes or business indexes exceeds a preset threshold). Through communication interaction with the monitoring device 120, the log analysis processing cluster 140 may learn the index anomaly information associated with the load abnormal event of the service device 220 or the database 230, such as the abnormal index and the abnormal time range. Through communication interaction with the registration device 110, the log analysis processing cluster 140 may learn the business scenarios registered with the cloud monitoring and analysis system 100, the services accessing the database, the SQL templates associated with the business scenarios or services, and the like. Once the index anomaly information of the load abnormal event is known, the log analysis processing cluster 140 may determine the business scenario, service, SQL template, etc. associated with the load abnormal event. Further, the log analysis processing cluster 140 may acquire the log records to be analyzed from the log device 130 based on the determined business scenario, service, SQL template, abnormal time range, etc., and perform automated analysis on the acquired log records to obtain log analysis results.

It should be appreciated that the above is an example illustration of the operation timing of the log analysis processing cluster 140, and not any limitation, and in other embodiments, the log analysis processing cluster 140 may be triggered to perform the log analysis processing function in other manners, which is not described herein.

In an alternative embodiment, log analysis processing cluster 140 may provide visualized configuration and management functions for tenants through cloud management platform 170. The tenant may configure the log analysis processing capabilities of the log analysis processing cluster 140, including but not limited to configuring log analysis dimensions, the number of SQLs to be analyzed, etc., through its terminal device and corresponding console management interfaces based on the interface definition of the log analysis processing function. Cloud management platform 170 may also provide a visual presentation interface for tenants on which log analysis results may be presented.

Illustratively, the at least one log analysis dimension may include: an execution time dimension of a single SQL, a total execution time dimension of a single category of SQL, an SQL execution service proportion dimension, an SQL connection host proportion dimension, an SQL latency distribution dimension, or a queries-per-second (QPS) dimension of SQL. The log analysis dimensions custom-configured by the tenant may be as shown in table 1 below:

TABLE 1

The "view type" column represents the statistics of the various log analysis dimensions, including but not limited to: ranking statistics of single SQL execution time; ranking of total execution time for a category of SQL; ranking of execution frequency for a category of SQL; SQL and service proportion statistics; SQL and host proportion statistics; SQL latency distribution statistics; and QPS statistics of SQL. The "detailed description" column describes the metrics of SQL execution operations that relate to the respective log analysis dimension, including, but not limited to, SQL execution time (including total time, average time, shortest time, longest time, etc.), the business scenarios associated with the SQL, SQL templates, SQL quantity, number of SQL executions, etc. The "presentation type" column indicates how the relevant statistics are presented, including, but not limited to, tables, pie charts, bar charts, line charts, etc., which is not limited in this embodiment of the present application.

It should be understood that, in the embodiment of the present application, the log analysis dimension may be input by the tenant through the visual configuration interface provided by the cloud management platform 170, and the configuration manner is the same as or similar to the configuration manner described above in connection with fig. 2 to fig. 7, and detailed implementation may refer to the related description above, which is not repeated herein. The at least one log analysis dimension may be modified according to changes in the business system or application requirements, and will not be described in detail herein.

For the presentation of log analysis results, taking the ranking statistics of single SQL execution time as an example, as shown in fig. 8, the visual output interface provided by the cloud management platform 170 may present information characterizing the range of the log records being analyzed, such as area (e.g., north-beijing four), POD (e.g., POD 15), mode (e.g., standard schema), database (e.g., gaussdb_nova), database master node IP (e.g., 10.77.24.177), log analysis time range (including start time, end time, the last 30 minutes, etc.), SQL template (not fully shown in the figure), the service to which the SQL belongs (e.g., ECS), the component to which the SQL belongs (e.g., nova), service node IP (e.g., service source IP, not fully shown in the figure), etc. The analysis statistics of the individual SQL executions within this range may be presented in list form, including, for example, SQL ID, SQL template, business scenario, total time consumed, number of SQL executions, shortest time (ms), longest time (ms), average time (ms), etc.
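The per-SQL columns listed above (number of executions, total, shortest, longest, and average time) can be derived from raw log records with a simple group-and-aggregate step. The sketch below assumes hypothetical record fields `sql_template` and `duration_ms`; the real log format is not given in the source.

```python
from collections import defaultdict


def execution_time_stats(records):
    """Aggregate execution time per SQL template, sorted by total time (descending)."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["sql_template"]].append(r["duration_ms"])

    rows = [
        {
            "sql_template": template,
            "executions": len(durations),
            "total_ms": sum(durations),
            "min_ms": min(durations),
            "max_ms": max(durations),
            "avg_ms": sum(durations) / len(durations),
        }
        for template, durations in grouped.items()
    ]
    # The most expensive templates come first, as in a ranking view.
    return sorted(rows, key=lambda row: row["total_ms"], reverse=True)


# Hypothetical log records.
sample_records = [
    {"sql_template": "SELECT * FROM t WHERE id = ?", "duration_ms": 10},
    {"sql_template": "SELECT * FROM t WHERE id = ?", "duration_ms": 30},
    {"sql_template": "UPDATE t SET v = ? WHERE id = ?", "duration_ms": 100},
]

ranking = execution_time_stats(sample_records)
```

Sorting by `total_ms` rather than `avg_ms` surfaces statements that are individually cheap but executed very often, which is frequently what drives a load surge.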

Alternatively, taking the SQL execution service proportion statistics and the SQL connection IP proportion statistics as an example, as shown in fig. 9, the visual output interface provided by the cloud management platform 170 may present information characterizing the range of the log records being analyzed, such as area (e.g., north-beijing four), POD (e.g., POD 15), mode (e.g., standard schema), database (e.g., gaussdb_nova), database master node IP (e.g., 10.77.24.177), log analysis time range (including start time, end time, the last 30 minutes, etc.), SQL template (not shown in the figure), the service to which the SQL belongs (e.g., ECS), the component to which the SQL belongs (e.g., nova), service node IP (e.g., service source IP, not shown in the figure), etc. The SQL execution service proportion statistics and the SQL connection IP proportion statistics within this range may be presented in the form of pie charts. For example, the SQL execution service proportion statistics may include an ECS service proportion of 56%, an EVS service proportion of 20%, a laasdepth service proportion of 14%, and a VPC service proportion of 10%. For another example, the SQL connection IP proportion statistics may include ECS: 26.22.240.32%, EVS: 26.22.240.15%, laasdepth 26.22.240.21%, VPC: 26.22.240.23% and so on.
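The proportion (duty ratio) statistics behind such pie charts amount to counting log records per service or per connecting host and dividing by the total. The following sketch assumes hypothetical record fields `service` and `client_ip`; the function name `share_by` is illustrative.

```python
from collections import Counter


def share_by(records, key):
    """Return the percentage of records per distinct value of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: 100.0 * n / total for value, n in counts.items()}


# Hypothetical log records; the IP addresses are placeholders.
sql_log = [
    {"service": "ECS", "client_ip": "192.0.2.10"},
    {"service": "ECS", "client_ip": "192.0.2.10"},
    {"service": "ECS", "client_ip": "192.0.2.11"},
    {"service": "EVS", "client_ip": "192.0.2.20"},
]

service_share = share_by(sql_log, "service")
host_share = share_by(sql_log, "client_ip")
```

The same function serves both views because only the grouping key differs.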

Alternatively, taking the SQL latency distribution statistics and the QPS statistics of the SQL as an example, as shown in fig. 10, the visual output interface provided by the cloud management platform 170 may present information characterizing the range of the log records being analyzed, such as area (e.g., north-beijing four), POD (e.g., POD 15), mode (e.g., standard schema), database (e.g., gaussdb_nova), database master node IP (e.g., 10.77.24.177), log analysis time range (including start time, end time, the last 30 minutes, etc.), SQL template (not shown in the figure), the service to which the SQL belongs (e.g., ECS), the component to which the SQL belongs (e.g., nova), service node IP (e.g., service source IP, not shown in the figure), etc. The SQL latency distribution statistics and the SQL QPS statistics within this range may be presented in the form of line graphs.
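The two series plotted in such line graphs can be computed as shown in this minimal sketch. It assumes hypothetical record fields `ts` (epoch seconds) and `duration_ms`, and the latency bucket edges of 10, 100, and 1000 ms are arbitrary illustrative values.

```python
from bisect import bisect_right
from collections import Counter


def qps_series(records):
    """Count SQL executions per whole second, ordered by time."""
    per_second = Counter(int(r["ts"]) for r in records)
    return dict(sorted(per_second.items()))


def latency_histogram(records, edges_ms=(10, 100, 1000)):
    """Count records per latency bucket.

    Bucket 0 holds durations below the first edge, bucket i holds
    [edges[i-1], edges[i]), and the last bucket holds everything at or
    above the final edge.
    """
    buckets = [0] * (len(edges_ms) + 1)
    for r in records:
        buckets[bisect_right(edges_ms, r["duration_ms"])] += 1
    return buckets


# Hypothetical log records.
timed_log = [
    {"ts": 100.1, "duration_ms": 5},
    {"ts": 100.9, "duration_ms": 10},
    {"ts": 101.5, "duration_ms": 2000},
]

qps = qps_series(timed_log)
histogram = latency_histogram(timed_log)
```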

The interface diagrams shown in fig. 8 to fig. 10 are merely exemplary illustrations of how log analysis results may be presented and do not constitute any limitation. In other embodiments, the tenant or the cloud monitoring and analysis system 100 may change the log analysis dimensions, the indexes related to each log analysis dimension, and the presentation manners of the different analysis results according to application scenarios or service requirements, which are not described herein.

It should be noted that, in the embodiment of the present application, the log analysis processing cluster 140 may periodically analyze the log records collected by the log device 130 and display the analysis result. Alternatively, the log analysis processing cluster 140 may analyze and present the analysis results for log records specified by the log analysis task upon receiving the log analysis task from the cloud management platform 170. Alternatively, the log analysis processing cluster 140 may automatically trigger analysis of log records and presentation of analysis results in a time range around the abnormal moment when the monitoring apparatus 120 detects a load abnormal event of the service system. The embodiment of the application does not limit the time or the triggering mode of log analysis.

(5) SQL statement analysis device 150:

In the embodiment of the present application, the SQL statement analysis device 150 may provide online SQL analysis and testing capabilities. The SQL statement analysis device 150 may be connected to the log analysis processing cluster 140 and, after the log analysis processing cluster 140 has identified the SQL to be optimized, may obtain the SQL execution plan so as to determine the optimization direction.

In order to improve the efficiency and accuracy of online SQL analysis and testing, the SQL statement analysis device 150 may have an SQL analysis and test function component, a background pressure construction management function component, and a data management function component, as shown in fig. 1. These three components may provide visual configuration and management functions for tenants through the cloud management platform 170. Based on the interface definition of the SQL statement analysis function, the tenant may provide configuration information for the three components through its terminal device and the corresponding console management interfaces. For example, through the cloud management platform 170, a tenant may provide SQL optimization attribute configuration items to the SQL analysis and test function component, input background pressure policies (including pressure test templates) to the background pressure construction management function component, or provide data synchronization commands, data synchronization periods, etc. to the data management function component.

The data management function component is configured to obtain backup data of the live network of the service system 200 from the OBS and restore it to the replication (fork) database node 160 for SQL analysis and testing. By default, the data management function component may perform data synchronization once a week; it may also support manually triggered data synchronization. This is not limited in the embodiment of the present application.
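As a non-limiting illustration, the synchronization policy described above (a weekly default plus manual triggering) can be sketched as a single decision function. The function name and its parameters are hypothetical and are introduced only for this example.

```python
from datetime import datetime, timedelta

# Default period taken from the description: one synchronization per week.
DEFAULT_SYNC_PERIOD = timedelta(weeks=1)


def sync_due(last_sync: datetime,
             now: datetime,
             period: timedelta = DEFAULT_SYNC_PERIOD,
             manual_trigger: bool = False) -> bool:
    """Return True when backup data should be restored to the replication node.

    A manual trigger always wins; otherwise the configured period must have elapsed.
    """
    return manual_trigger or (now - last_sync) >= period
```

A scheduler could call such a function periodically and start the restore only when it returns True.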

The SQL analysis and test function component is responsible for executing the SQL to be analyzed on the replication database node 160, analyzing the execution plan of the SQL statement, and rendering the analysis result of the SQL execution plan. The functions provided by the SQL analysis and test function component may include, for example: (1) testing the time consumed by executing SQL concurrently; (2) analyzing the execution plan of the SQL under background pressure (an optional setting) and rendering the analysis result.
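As a non-limiting illustration of the two functions above, the following sketch times a statement executed concurrently and retrieves its execution plan. It uses a temporary SQLite file as a stand-in for the replication database node; the application does not name a database engine, and the table, data, and function names here are assumptions made for the example.

```python
import os
import sqlite3
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def build_demo_db(path: str) -> None:
    """Create a small table standing in for restored backup data."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders (amount) VALUES (?)",
                     [(i % 100,) for i in range(1000)])
    conn.commit()
    conn.close()


def run_once(path: str, sql: str) -> float:
    """Execute the statement on its own connection and return elapsed seconds."""
    conn = sqlite3.connect(path)
    try:
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        return time.perf_counter() - start
    finally:
        conn.close()


def time_concurrent(path: str, sql: str, concurrency: int) -> List[float]:
    """Run the same statement from `concurrency` threads and collect the timings."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: run_once(path, sql), range(concurrency)))


def explain(path: str, sql: str) -> List[str]:
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN output."""
    conn = sqlite3.connect(path)
    try:
        return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    finally:
        conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "fork.db")
build_demo_db(db_path)
target_sql = "SELECT * FROM orders WHERE amount = 42"
timings = time_concurrent(db_path, target_sql, concurrency=4)
plan = explain(db_path, target_sql)
```

Because the demo table has no index on the filtered column, the plan reports a full scan of the table, which is the kind of finding that points to an optimization direction.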

The background pressure construction management function component is used for managing and constructing various load scenarios of the database, such as concurrent query and concurrent update, and is responsible for managing background pressure modes and injecting background pressure parameters. The dimension information required when setting a background pressure mode may include, but is not limited to, at least one of the following: pressure mode name, service name, database name, number of concurrent SQL executions, or the SQL to be executed concurrently.

For example, where a tenant provides a background pressure policy on the console management interface, when a new background pressure policy is needed, the interface shown in fig. 11 pops up on the console management interface, and the background pressure policy is added by creating a new stress test template. The stress test template may include, for example, the mode name, service name, and database master node IP to which the background pressure policy relates, and the related stress test parameters (including parameters and stress values). After the tenant clicks the "confirm" button, the cloud management platform 170 provides the background pressure policy input or selected by the tenant to the background pressure construction management function component.
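As a non-limiting illustration, a stress test template carrying the fields listed above could be represented and checked as follows. The class name, field names, and validation rules are assumptions made for this sketch; the application does not define a concrete data format.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address
from typing import Dict


@dataclass
class StressTestTemplate:
    mode_name: str
    service_name: str
    db_master_ip: str
    # Stress test parameters mapped to their stress values, for example
    # {"concurrent_query": 50}. The parameter names are hypothetical.
    parameters: Dict[str, int] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError when the template is incomplete or malformed."""
        if not self.mode_name or not self.service_name:
            raise ValueError("mode name and service name are required")
        ip_address(self.db_master_ip)  # raises ValueError on a malformed IP
        for name, value in self.parameters.items():
            if value <= 0:
                raise ValueError(f"stress value for {name} must be positive")


template = StressTestTemplate(
    mode_name="peak_query",
    service_name="order-service",
    db_master_ip="10.0.0.5",
    parameters={"concurrent_query": 50},
)
template.validate()
```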

Taking as an example the case where the tenant inputs SQL optimization parameters in the console management interface: when SQL analysis and testing need to be executed, the interface shown in fig. 12 may pop up on the console management interface. The attribute configuration items of this interface may include, for example, a database name, a database master node IP, a pressure mode option, and an SQL statement input box. The tenant can input or select the corresponding parameter in each attribute configuration item. After the tenant finishes the selection and clicks the "execute" button, the cloud management platform 170 issues the parameters input or selected by the tenant in the interface to the replication database node, so that the SQL can be analyzed and tested online against real data, which improves analysis efficiency.

It should be noted that fig. 11 to fig. 12 are only exemplary illustrations of the functional configuration of the SQL statement analysis device 150 in the embodiment of the present application, and are not limiting. In some embodiments, the console management interface may implement linkage between the SQL statement analysis device 150 and the registration device 110, the monitoring device 120, the log device 130, and the log analysis processing cluster 140, and support automated and customized SQL analysis capabilities, and may automatically perform SQL statement analysis based on relevant information of the registration device 110, the monitoring device 120, the log device 130, and the log analysis processing cluster 140, so as to reduce the operation and maintenance burden of the cloud monitoring and analysis system 100.

For example, at the stage of registering a service scenario or service of the service system, a tenant may configure each functional module of the cloud monitoring and analysis system 100, including, but not limited to, configuring monitoring indicators, monitoring indicator thresholds, log analysis dimensions, SQL templates associated with the service scenario/service, background pressure policies associated with the SQL templates, and so forth. The monitoring device 120 may notify the log analysis processing cluster 140 after detecting a load abnormal event. The log analysis processing cluster 140 may automatically execute a log analysis procedure in response to the load abnormal event to identify the SQL statements that need to be optimized. After the log analysis processing cluster 140 performs log analysis and identifies the SQL statement to be optimized, it can provide that SQL statement to the SQL statement analysis device 150 by calling the related console management interface, and the SQL statement analysis device 150 can automatically execute the SQL statement and obtain the execution result of the SQL statement and the rendered execution plan. The cloud management platform 170 may provide a visual interface to display the execution result of the SQL statement and the rendered execution plan. Implementation details are similar to those described above; reference may be made to the related descriptions of fig. 2-10, which are not repeated here.

(6) The replication database node 160:

In the embodiment of the present application, the replication database node 160 is used for restoring the live-network backup data of the database 230 of the service system 200 for the analysis and testing of SQL statements.

(7) Cloud management platform 170:

In this embodiment, the cloud management platform 170 may provide a customization channel for a tenant by providing a related console management interface, so that the tenant can configure the functions of each functional module of the cloud monitoring and analysis system 100 through the console management interface. The functional modules of the cloud monitoring and analysis system 100 can then cooperate to perform monitoring and automatic analysis according to the monitoring and analysis requirements of the tenant, and feed back the analysis result to the tenant.

As described above with reference to fig. 2 to fig. 12, the console management interface may provide a customization channel for the tenant by providing a visual interface, so that the tenant can perform functional configuration and view related information on the related interface.

It is understood that the console management interface is merely an example of the configuration manner in the embodiments of the present application and is not limiting. In some embodiments, the cloud monitoring and analysis system 100 may also provide a customized configuration channel for tenants based on an API format. For example, the cloud monitoring and analysis system 100 may display the API format on a web page provided over the Internet and annotate the usage of the corresponding fields. After seeing the API format, the tenant inputs the corresponding parameters according to the API format to complete the configuration. The tenant's electronic device may send the API request, filled in with the input parameters according to the template, to the cloud monitoring and analysis system 100 over the Internet, and the cloud monitoring and analysis system 100 reads the parameters corresponding to the different fields in the API to obtain the tenant's requirements for those fields. Therefore, in the embodiment of the present application, the configuration information of the tenant for the cloud monitoring and analysis system 100 may further include API fields and the parameters input by the tenant. Further, the cloud monitoring and analysis system 100 may store the configuration information of the tenant in a corresponding memory, so that, when needed, it can obtain the configuration information from the memory, complete the functional configuration of the relevant functional modules of the cloud monitoring and analysis system 100, and provide the cloud monitoring service for the service system 200 of the tenant.
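As a non-limiting illustration of API-based configuration, the following sketch parses a tenant request, checks its fields, and stores the resulting configuration per tenant. The JSON encoding, the field names, and the in-memory store are all assumptions made for this example; the application does not define the API format.

```python
import json
from typing import Any, Dict

# Hypothetical field names; the actual API fields are not specified.
KNOWN_FIELDS = {"monitoring_indicators", "log_analysis_dimensions", "load_scenario"}

# Stand-in for the memory in which tenant configuration is kept.
config_store: Dict[str, Dict[str, Any]] = {}


def parse_tenant_config(payload: str) -> Dict[str, Any]:
    """Decode a JSON request body and reject fields the API does not define."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("configuration must be a JSON object")
    unknown = set(data) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return data


def store_config(tenant_id: str, payload: str) -> None:
    """Validate the request and keep the configuration for later use."""
    config_store[tenant_id] = parse_tenant_config(payload)


store_config("tenant-a", '{"monitoring_indicators": {"cpu_load": 0.9}}')
```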

A cloud monitoring and analysis method implemented based on the system architecture shown in fig. 1 is described below in conjunction with a method flowchart.

As shown in fig. 13, the cloud monitoring and analysis method may include the steps of:

S1310: the cloud management platform acquires configuration information input or selected by a tenant at the cloud management platform, wherein the configuration information is used for representing monitoring and analysis requirements of the tenant on a database, and the database is used for storing service data of a service system of the tenant.

For example, the configuration information may be used to indicate at least one monitoring indicator associated with the load abnormal event and an indicator threshold corresponding to the monitoring indicator. The monitoring indicator comprises a system indicator and/or a service indicator. The system indicator comprises at least one of the following: CPU load, memory load, disk read/write (IO) load, network packet loss rate, or network delay of a computing device to which a service accessing the database belongs. The service indicator comprises at least one of the following: total number of database connections, number of active database connections, table expansion rate, primary/backup synchronization rate, or number of slow SQL statements per minute.
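As a non-limiting illustration, the comparison of indicator data against the configured thresholds can be sketched as follows, using the "greater than or equal to" condition that the method gives as its example trigger. The indicator keys and values are hypothetical.

```python
from typing import Dict, List


def abnormal_indicators(index_data: Dict[str, float],
                        thresholds: Dict[str, float]) -> List[str]:
    """Return the indicators whose data is greater than or equal to its threshold."""
    return [
        name
        for name, limit in thresholds.items()
        if name in index_data and index_data[name] >= limit
    ]


def load_abnormal_event(index_data: Dict[str, float],
                        thresholds: Dict[str, float]) -> bool:
    """A load abnormal event is reported when any configured indicator trips."""
    return bool(abnormal_indicators(index_data, thresholds))


thresholds = {"cpu_load": 0.9, "slow_sql_per_minute": 20}
index_data = {"cpu_load": 0.95, "slow_sql_per_minute": 3}
tripped = abnormal_indicators(index_data, thresholds)
```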

Alternatively, the configuration information may be used to indicate at least one log analysis dimension of the database. The at least one log analysis dimension may include, for example: the execution time of a single SQL statement, the total execution time of a single category of SQL, the proportion of SQL executions by service, the proportion of SQL connections by host, the SQL latency distribution, or the queries per second (QPS) of SQL.
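As a non-limiting illustration, two of the dimensions above (QPS and latency distribution) can be computed from log records as follows. The input shapes (timestamps in seconds, durations in milliseconds) and the bucket bounds are assumptions made for this sketch.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, List, Sequence


def qps_per_second(timestamps: Sequence[float]) -> Dict[int, int]:
    """Count how many statements started in each whole second."""
    return dict(Counter(int(ts) for ts in timestamps))


def latency_histogram(durations_ms: Sequence[float],
                      bounds: Sequence[float] = (10, 100, 1000)) -> List[int]:
    """Count durations per bucket: below bounds[0], between bounds, and above the last.

    A duration equal to a bound falls into the higher bucket.
    """
    counts = [0] * (len(bounds) + 1)
    for duration in durations_ms:
        counts[bisect_right(bounds, duration)] += 1
    return counts


qps = qps_per_second([0.1, 0.9, 1.2])
histogram = latency_histogram([5, 10, 99, 100, 2500])
```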

Alternatively, the configuration information may be used to indicate information describing the load scenario of the database. The information describing the load scenario of the database comprises, for example, at least one of the following: pressure mode name, service name, database name, number of concurrent SQL executions, or the SQL to be executed concurrently.

When implementing S1310, the cloud management platform may present the interfaces shown in fig. 2-12. For details of how the cloud management platform receives the relevant configuration information input or selected by the tenant, reference may be made to the descriptions given above in conjunction with fig. 2-12, which are not repeated here.

S1320: the analysis device analyzes the target Structured Query Language (SQL) statement to be optimized according to the configuration information to obtain an analysis report, wherein the target SQL statement is a statement for operating the database historically. The analysis device is a back-end device connected to the cloud management platform, and may include all devices of the cloud monitoring and analysis system 100 in fig. 1 except the cloud management platform 170.

Before implementing S1320, in an optional implementation, the analysis device may receive, from the cloud management platform, the target SQL statement input or selected by a tenant at the cloud management platform. In another optional implementation, the analysis device may acquire the target SQL statement according to an internal preset method. For example, when a load abnormal event occurs in the database, the analysis device may perform log analysis on a target log of the database in the at least one log analysis dimension, and determine the target SQL statement to be optimized according to the analysis result of the target log in the at least one log analysis dimension. The embodiment of the application does not limit the manner of acquiring the target SQL statement.
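As a non-limiting illustration of the second implementation, the following sketch ranks SQL categories by total execution time and takes the top entries as the target SQL statements. Ranking by this single dimension, and the record layout (a dictionary with "sql_template" and "duration_ms" keys), are assumptions made for the example; the application leaves the selection rule open.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def total_time_by_template(records: Iterable[Dict]) -> Dict[str, float]:
    """Sum the execution time of every log record per SQL category."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec["sql_template"]] += rec["duration_ms"]
    return dict(totals)


def pick_target_sql(records: Iterable[Dict], top_n: int = 1) -> List[str]:
    """Return the SQL categories with the largest total execution time."""
    totals = total_time_by_template(records)
    return sorted(totals, key=lambda template: totals[template], reverse=True)[:top_n]


log_records = [
    {"sql_template": "SELECT * FROM orders WHERE amount = ?", "duration_ms": 120.0},
    {"sql_template": "SELECT * FROM orders WHERE amount = ?", "duration_ms": 80.0},
    {"sql_template": "UPDATE stock SET qty = ? WHERE id = ?", "duration_ms": 150.0},
]
targets = pick_target_sql(log_records)
```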

For ease of understanding, the following description is provided in connection with a method flow diagram.

Referring to fig. 14, the cloud monitoring and analyzing method may include the steps of:

S1401 (optional step): the terminal device of the tenant may send configuration information to the monitoring device through the cloud management platform, and the configuration information may be used for configuring at least one monitoring indicator associated with the load abnormal event and an indicator threshold corresponding to the monitoring indicator. The monitoring indicator may, for example, comprise a system indicator and/or a service indicator. The system indicator comprises at least one of the following: CPU load, memory load, disk read/write (IO) load, network packet loss rate, or network delay of a computing device to which a service accessing the database belongs. The service indicator comprises at least one of the following: total number of database connections, number of active database connections, table expansion rate, primary/backup synchronization rate, or number of slow SQL statements per minute. For details of the configuration, refer to the foregoing related description, which is not repeated here.

S1402 (optional step): the terminal device of the tenant may send configuration information to the log analysis processing cluster through the cloud management platform, and the configuration information may be used, for example, to configure at least one log analysis dimension of the database. The at least one log analysis dimension may include, for example: the execution time of a single SQL statement, the total execution time of a single category of SQL, the proportion of SQL executions by service, the proportion of SQL connections by host, the SQL latency distribution, or the queries per second (QPS) of SQL. For details of the configuration, refer to the foregoing related description, which is not repeated here.

S1403 (optional step): the terminal device of the tenant may send configuration information to the SQL analysis and test function component through the cloud management platform, and the configuration information may be used, for example, to configure the number of SQL statements to be analyzed and/or the load background parameters of the database. The SQL analysis and test function component may issue the load background parameters to the replication database node (or invoke their injection). For details of the configuration, refer to the foregoing related description, which is not repeated here. After the injection of the load background parameters is completed, the replication database node or the SQL analysis and test function component may also feed back load background parameter injection response information to the cloud management platform.

S1404: the monitoring device acquires index data of at least one monitoring index, and determines that the load abnormal event occurs in the database when the index data meets a preset analysis triggering condition (for example, the index data is greater than or equal to a corresponding index threshold).

In another embodiment, the step S1404 may be replaced with: the terminal equipment of the tenant sends indication information to the SQL analysis and test function component through the cloud management platform, wherein the indication information is used for indicating a target SQL statement to be optimized. That is, the on-line analysis process of the SQL may be triggered manually, and the triggering mode is not limited in the embodiment of the application.

S1405: the monitoring device indicates the range of log records to be analyzed and informs the log analysis processing cluster to perform log analysis.

S1406: the log analysis processing cluster can acquire a target log of the database, perform log analysis on the target log of the database from at least one log analysis dimension, and feed back a log analysis result to the monitoring device according to the analysis result of the target log in the at least one log analysis dimension.

S1407: the monitoring device notifies the replication database node to analyze the target SQL statement.

S1408: the replication database node restores the backup data packet of the database of the service system of the tenant and, after executing the target SQL statement, feeds back the SQL analysis result to the monitoring device.

S1409: the monitoring device generates an analysis report according to its own information and the information from the log analysis processing cluster, the SQL analysis and test function component, or the replication database node.

S1410 (optional step): the terminal device of the tenant requests to download the analysis report through the cloud management platform.

S1411 (optional step): the monitoring device feeds back an analysis report to the cloud management platform.

S1412 (optional step): the cloud management platform may output an analysis report.

For example, the output interface of the analysis report may present at least one report in a list as shown in fig. 15, and the list entry of each analysis report may include an identification (e.g., serial number), type, inspection result, report creation time, report start and end time, report progress, trigger mode, and related operations (e.g., view, mail, modify) associated with the analysis report. After selecting a particular analysis report, the tenant may delete it by clicking the "delete" button or download it by clicking the "export" button. The tenant may also view the analysis report, create a new mail for it, or modify it by clicking the "view", "mail", or "modify" button associated with the analysis report. When the tenant views an analysis report, the analysis report may include, for example: an analysis result of at least one log analysis dimension of the database; index data of at least one monitoring indicator of the database; the target SQL statement; the execution plan of the target SQL statement; and a rendered graph of the execution plan of the target SQL statement. The content of the analysis report may be presented in at least one of the following manners: progress bars, percentages, pie charts, lists, line charts, or dashboards. For the relevant output interfaces, refer to the descriptions given above in connection with fig. 8-10, which are not repeated here.
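As a non-limiting illustration, the report contents listed above can be gathered into a single record and serialized for export. The class name, field names, and the JSON export format are assumptions made for this sketch; the application does not define a report file format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class AnalysisReport:
    report_id: int
    trigger_mode: str                      # for example "automatic" or "manual"
    target_sql: str
    execution_plan: List[str] = field(default_factory=list)
    log_analysis: Dict[str, float] = field(default_factory=dict)
    index_data: Dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the report so it can be exported or downloaded."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


report = AnalysisReport(
    report_id=1,
    trigger_mode="automatic",
    target_sql="SELECT * FROM orders WHERE amount = 42",
    execution_plan=["SCAN orders"],
    log_analysis={"total_time_ms": 200.0},
    index_data={"cpu_load": 0.95},
)
exported = report.to_json()
```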

Therefore, with the cloud monitoring and analyzing method, after obtaining the relevant configuration information of the tenant, the cloud monitoring and analysis system can automatically detect a load abnormal event of the database and, when such an event occurs, automatically collect statistics on and analyze the target log of the database according to the preconfigured information. The concurrency source or service scenario can thus be identified quickly and a coping strategy executed in time, which helps maintain stable operation of the service and reduce the impact on the database service.

The application also provides a computing device. As shown in fig. 16, the computing device 1600 includes: bus 1602, processor 1604, memory 1606, and communication interface 1608. The processor 1604, memory 1606, and communication interface 1608 communicate via a bus 1602. The computing device 1600 may be a server or a terminal device. It should be understood that the present application is not limited to the number of processors, memories in computing device 1600.

Bus 1602 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 16, but this does not mean that there is only one bus or one type of bus. Bus 1602 may include a path for transferring information between the components of computing device 1600 (e.g., memory 1606, processor 1604, communication interface 1608).

The processor 1604 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).

The memory 1606 may include volatile memory, such as random access memory (RAM). The memory 1606 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).

The memory 1606 has stored therein executable program code that is executed by the processor 1604 to implement the functions of the devices included in the foregoing service system or the functions of the devices included in the foregoing cloud monitoring and analysis system, respectively, to implement the cloud monitoring and analysis method of the embodiments of the present application. That is, the memory 1606 has instructions stored thereon for performing the cloud monitoring and analysis method.

Communication interface 1608 enables communication between computing device 1600 and other devices or communication networks using transceiver modules such as, but not limited to, network interface cards, transceivers, and the like.

The embodiment of the application also provides a computing device cluster. The cluster of computing devices includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop, notebook, or smart phone.

As shown in fig. 17, the cluster of computing devices includes at least one computing device 1600. The same instructions for performing the cloud monitoring and analysis methods may be stored in memory 1606 in one or more computing devices 1600 in the cluster of computing devices.

In some possible implementations, portions of the instructions for performing the cloud monitoring and analysis methods may also be stored in the memory 1606 of one or more computing devices 1600 in the cluster of computing devices, respectively. In other words, a combination of one or more computing devices 1600 may collectively execute instructions for performing the cloud monitoring and analysis methods.

It should be noted that, the memory 1606 in different computing devices 1600 in the computing device cluster may store different instructions for performing part of the functions of the cloud management platform or the analysis apparatus, respectively. That is, the instructions stored in the memory 1606 in the different computing devices 1600 may implement the functionality of one or more modules in the cloud management platform or analysis apparatus described previously.

In some possible implementations, one or more computing devices in the computing device cluster may be connected through a network. The network may be a wide area network, a local area network, or the like. Fig. 18 shows one possible implementation. As shown in fig. 18, two computing devices 1600A and 1600B are connected through a network; specifically, each computing device connects to the network through its communication interface. In this type of possible implementation, instructions for performing the functions of the analysis device are stored in the memory 1606 of the computing device 1600A. Meanwhile, instructions for performing the functions of the analysis device are stored in the memory 1606 of the computing device 1600B.

The connection manner between the computing devices shown in fig. 18 may take into account that the cloud monitoring and analysis method provided in this application requires storing a large amount of data and performing a large amount of analysis computation, and therefore the functions of the analysis device are assigned to the computing device 1600A.

It should be appreciated that the functionality of computing device 1600A shown in fig. 18 may also be performed by multiple computing devices 1600. Likewise, the functionality of computing device 1600B may also be performed by multiple computing devices 1600.

It should be noted that the memory 1606 in different computing devices 1600 in the computing device cluster may store different instructions for performing part of the functions of the cloud monitoring and analysis system. That is, the instructions stored in the memory 1606 in the different computing devices 1600 may implement the functionality of one or more apparatuses in the cloud monitoring and analysis system.

Embodiments of the present application also provide a computer program product comprising instructions. The computer program product may be software or a program product containing instructions capable of running on a computing device or stored in any useful medium. The computer program product, when run on at least one computing device, causes the at least one computing device to perform a cloud monitoring and analysis method.

Embodiments of the present application also provide a computer-readable storage medium. The computer-readable storage medium may be any usable medium on which a computing device can store data, or a data storage device, such as a data center, that contains one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc. The computer-readable storage medium includes instructions that instruct a computing device to perform the cloud monitoring and analysis method.

Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present application without departing from the scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to encompass such modifications and variations. In the various embodiments of the application, where no special description or logic conflict exists, the terms and/or descriptions between the various embodiments are consistent and may reference each other, and features of different embodiments may be combined to form new embodiments based on their inherent logic relationships.

Claims

1. A cloud monitoring and analysis method, the method comprising:

acquiring configuration information input or selected by a tenant in a cloud management platform, wherein the configuration information is used for representing monitoring and analysis requirements of the tenant on a database, and the database is used for storing service data of a service system of the tenant;

analyzing the target Structured Query Language (SQL) statement to be optimized according to the configuration information to obtain an analysis report, wherein the target SQL statement is a statement for operating the database historically.

2. The method according to claim 1, wherein the method further comprises:

and receiving the target SQL statement input or selected by the tenant in the cloud management platform.

3. The method of claim 1, wherein the analyzing the target structured query language SQL statement to be optimized according to the configuration information comprises:

and when the load abnormal event occurs in the database, analyzing the target SQL statement to be optimized according to the configuration information.

4. A method according to claim 3, wherein the configuration information is used to indicate at least one monitoring indicator associated with the load abnormality event and an indicator threshold corresponding to the monitoring indicator, the method further comprising:

acquiring index data of the at least one monitoring index;

and when the index data is greater than or equal to the corresponding index threshold value, determining that the load abnormal event occurs in the database.

5. The method according to claim 4, wherein the monitoring indicator comprises a system indicator and/or a service indicator, the system indicator comprising at least one of: CPU load, memory load, disk read/write (IO) load, network packet loss rate, or network delay of a computing device to which a service accessing the database belongs; the service indicator comprises at least one of the following: total number of database connections, number of active database connections, table expansion rate, primary/backup synchronization rate, or number of slow SQL statements per minute.

6. The method of any of claims 3-5, wherein the configuration information is used to indicate at least one log analysis dimension of the database, the method further comprising:

performing log analysis on a target log of the database from the at least one log analysis dimension;

and determining a target SQL statement to be optimized according to an analysis result of the target log in the at least one log analysis dimension.

7. The method of claim 6, wherein the at least one log analysis dimension comprises: the execution time of a single SQL statement, the total execution time of a single category of SQL, the proportion of SQL executions by service, the proportion of SQL connections by host, the SQL latency distribution, or the queries per second (QPS) of SQL.

8. The method according to any one of claims 1-7, wherein analyzing the target structured query language SQL statement to be optimized according to the configuration information to obtain an analysis report comprises:

recovering backup data packets of the database at the replication database node;

and executing the target SQL statement at the replication database node to obtain an analysis report.

9. The method of claim 8, wherein the configuration information is further used to indicate information describing a load scenario of the database, and wherein the executing the target SQL statement at the replication database node to obtain an analysis report comprises:

and constructing the load scenario at the replication database node according to the configuration information, and executing the target SQL statement at the replication database node to obtain an analysis report.

10. The method of claim 9, wherein the information describing the load scenario of the database comprises at least one of: a stress test mode name, a service name, a database name, a number of concurrent SQL executions, or the SQL statements executed concurrently.
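
The steps of claims 8 to 10 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: SQLite files stand in for the tenant database and the replica database node, `sqlite3.Connection.backup` stands in for restoring a backup data package, and the `LoadScenario` field names and all sample values are hypothetical.

```python
import os
import sqlite3
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class LoadScenario:
    # Hypothetical field names mirroring the items listed in claim 10.
    mode_name: str
    service_name: str
    database_name: str
    concurrency: int
    sql: str


def restore_to_replica(backup_path: str, replica_path: str) -> None:
    """Restore the backup into a separate replica database file."""
    src = sqlite3.connect(backup_path)
    dst = sqlite3.connect(replica_path)
    try:
        src.backup(dst)  # copies the whole source database into dst
    finally:
        src.close()
        dst.close()


def run_once(replica_path: str, sql: str) -> float:
    """Execute the statement on the replica and return elapsed seconds."""
    conn = sqlite3.connect(replica_path)  # one connection per worker
    try:
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        return time.perf_counter() - start
    finally:
        conn.close()


def replay(replica_path: str, scenario: LoadScenario) -> dict:
    """Run the target SQL with the configured concurrency on the replica."""
    with ThreadPoolExecutor(max_workers=scenario.concurrency) as pool:
        timings = list(pool.map(
            lambda _: run_once(replica_path, scenario.sql),
            range(scenario.concurrency),
        ))
    return {"scenario": scenario.mode_name, "runs": len(timings),
            "max_seconds": max(timings)}


with tempfile.TemporaryDirectory() as tmp:
    backup = os.path.join(tmp, "backup.db")
    replica = os.path.join(tmp, "replica.db")

    # Create a small sample "backup" so the sketch is self-contained.
    conn = sqlite3.connect(backup)
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders (amount) VALUES (?)",
                     [(i * 1.5,) for i in range(100)])
    conn.commit()
    conn.close()

    restore_to_replica(backup, replica)
    scenario = LoadScenario("peak", "order-service", "orders_db", 4,
                            "SELECT COUNT(*) FROM orders WHERE amount > 10")
    report = replay(replica, scenario)
```

Running the statement against a restored copy rather than the live database keeps the replayed load away from the tenant's production traffic, which is the point of the replica node in claim 8.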

11. The method of any one of claims 1-10, wherein the analysis report includes at least one of:

an analysis result of at least one log analysis dimension of the database;

metric data of at least one monitoring metric of the database;

the target SQL statement;

the execution plan of the target SQL statement;

and a rendered graph of the execution plan of the target SQL statement.

12. The method of claim 11, wherein the content of the analysis report is presented in at least one of the following manners: progress bars, percentages, pie charts, lists, line charts, or dashboards.
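
As a hedged illustration of the report contents named in claims 11 and 12, the sketch below collects a target SQL statement, its execution plan, and a list-style rendering of that plan. It assumes SQLite as a stand-in database and uses SQLite's `EXPLAIN QUERY PLAN`; the `AnalysisReport` class, its field names, and the sample table are hypothetical.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class AnalysisReport:
    # Hypothetical container for some of the items listed in claim 11.
    target_sql: str
    execution_plan: list = field(default_factory=list)  # (id, parent, detail)
    rendered_plan: str = ""


def build_report(conn: sqlite3.Connection, sql: str) -> AnalysisReport:
    """Fetch the execution plan and render it as an indented list."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # Each row has four columns; the last one is the human-readable detail.
    plan = [(r[0], r[1], r[3]) for r in rows]
    depth = {}
    lines = []
    for node_id, parent, detail in plan:
        # A node whose parent has not been seen is treated as a root.
        depth[node_id] = depth.get(parent, -1) + 1
        lines.append("  " * depth[node_id] + "- " + detail)
    return AnalysisReport(sql, plan, "\n".join(lines))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("CREATE INDEX idx_customer ON orders (customer_id)")
report = build_report(conn, "SELECT * FROM orders WHERE customer_id = 42")
conn.close()
```

The text rendering corresponds to the "lists" presentation manner; a chart or dashboard front end could consume the same `execution_plan` tuples.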

13. A cloud monitoring and analysis system, comprising:

a cloud management platform, configured to receive configuration information input or selected by a tenant, wherein the configuration information is used for representing monitoring and analysis requirements of the tenant on a database, and the database is used for storing service data of a service system of the tenant;

and an analysis device, configured to analyze a target Structured Query Language (SQL) statement to be optimized according to the configuration information to obtain an analysis report, wherein the target SQL statement is a statement that has historically been executed on the database.

14. The system of claim 13, wherein

the cloud management platform is further configured to receive the target SQL statement input or selected by the tenant in the cloud management platform.

15. The system of claim 13, wherein the analysis device is configured to:

16. The system of claim 15, wherein the configuration information is used to indicate at least one monitoring metric associated with the load anomaly event and a metric threshold corresponding to the monitoring metric, and wherein the analysis device is further configured to:

acquiring metric data of the at least one monitoring metric;

17. The system of claim 16, wherein the monitoring metrics include system metrics and/or business metrics, the system metrics being at least one of: CPU load, memory load, disk read/write (IO) load, network packet loss rate, or network delay of a computing device to which a service accessing the database belongs; and the business metrics being at least one of: total number of database connections, number of active database connections, table bloat rate, primary/standby synchronization rate, or number of slow SQL statements per minute.
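
The relationship between monitoring metrics and their thresholds in claims 16 and 17 can be sketched as below. The rule that a load anomaly is flagged when a sampled value exceeds its threshold is an assumption made for illustration, as are the `MetricRule` class, the metric names, and the sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    # Hypothetical pairing of a monitoring metric with its threshold.
    name: str
    threshold: float
    kind: str  # "system" or "business"


def find_breaches(rules, samples):
    """Return the rules whose sampled value is above the threshold.

    Metrics missing from the samples are skipped rather than flagged.
    """
    return [r for r in rules
            if r.name in samples and samples[r.name] > r.threshold]


rules = [
    MetricRule("cpu_load_percent", 85.0, "system"),
    MetricRule("active_connections", 500.0, "business"),
    MetricRule("slow_sql_per_minute", 20.0, "business"),
]
samples = {"cpu_load_percent": 92.5, "active_connections": 310.0,
           "slow_sql_per_minute": 41.0}

breaches = find_breaches(rules, samples)
load_anomaly = bool(breaches)  # assumed rule: any breach counts as an anomaly
```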

18. The system according to any of claims 15-17, wherein the configuration information is used to indicate at least one log analysis dimension of the database, and the analysis device is further configured to:

performing log analysis on a target log of the database from the at least one log analysis dimension;

and determining the target SQL statement to be optimized according to an analysis result of the target log in the at least one log analysis dimension.

19. The system of claim 18, wherein the at least one log analysis dimension comprises: an execution time dimension of a single SQL statement, a total execution time dimension of a single category of SQL statements, a dimension of the proportion of SQL executions by service, a dimension of the proportion of SQL connections by host, an SQL latency distribution dimension, or a queries-per-second (QPS) dimension of SQL.

20. The system according to any one of claims 13-19, wherein the analysis device is configured to:

restoring a backup data package of the database at a replica database node;

and executing the target SQL statement at the replica database node to obtain the analysis report.

21. The system of claim 20, wherein the configuration information is further used to indicate information describing a load scenario of the database, and wherein the analysis device is configured to:

constructing the load scenario at the replica database node according to the configuration information, and executing the target SQL statement at the replica database node to obtain the analysis report.

22. The system of claim 21, wherein the information describing the load scenario of the database comprises at least one of: a stress test mode name, a service name, a database name, a number of concurrent SQL executions, or the SQL statements executed concurrently.

23. The system of any one of claims 13-22, wherein the analysis report includes at least one of:

an analysis result of at least one log analysis dimension of the database;

metric data of at least one monitoring metric of the database;

the target SQL statement;

the execution plan of the target SQL statement;

and a rendered graph of the execution plan of the target SQL statement.

24. The system of claim 23, wherein the content of the analysis report is presented in at least one of the following manners: progress bars, percentages, pie charts, lists, line charts, or dashboards.

25. A cluster of computing devices, comprising at least one computing device, each computing device comprising a processor and a memory;

the processor of the at least one computing device is configured to execute instructions stored in a memory of the at least one computing device to cause the cluster of computing devices to perform the method of any of claims 1-12.

26. A computer program product containing instructions that, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of any of claims 1-12.

27. A computer-readable storage medium comprising computer program instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of any of claims 1-12.