CN111597099B - Non-invasive simulation method for monitoring running quality of application deployed on cloud platform - Google Patents


Info

Publication number
CN111597099B
Authority
CN
China
Prior art keywords
monitoring
quality
application
cloud platform
cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010424175.0A
Other languages
Chinese (zh)
Other versions
CN111597099A (en)
Inventor
祝乃国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Electronic Port Co ltd
Original Assignee
Shandong Electronic Port Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Electronic Port Co ltd
Priority to CN202010424175.0A
Publication of CN111597099A
Application granted
Publication of CN111597099B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3457 Performance evaluation by simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a non-invasive simulation method for monitoring the running quality of an application deployed on a cloud platform, belonging to the technical field of cloud computing. In essence, the invention monitors the quality of the cloud platform on a per-tenant basis, using the tenant application as the verification sample. It addresses the problem that cloud service providers can offer only a platform-wide SLA and cannot account for the service quality experienced by a specific tenant. It provides a personalized use experience monitoring scheme under the constraint that, without authorization, probes cannot be deployed inside the service instances used by the tenant application.

Description

Non-invasive simulation method for monitoring running quality of application deployed on cloud platform
Technical Field
The invention relates to the technical field of cloud computing, in particular to a non-invasive simulation method for monitoring the running quality of an application deployed on a cloud platform.
Background
Cloud service products built on cloud computing have become the mainstream way of provisioning IT resources. Moving to the cloud and running on the cloud have replaced the traditional modes of use and of operation and maintenance: the service provider supplies and services the cloud resources, and tenants consume them. As cloud services expand, the main tension has shifted from "service usage" to "service experience". Customers demand ever higher quality from applications running on the cloud, and cloud service providers frequently receive complaints about problems such as provisioning delays and degraded performance.
Disclosure of Invention
The technical task of the invention is to address the above deficiencies by providing a non-invasive simulation method for monitoring the running quality of an application deployed on a cloud platform, so that better cloud resource services can be provided to the tenant and a personalized use experience can be guaranteed without authorization and without intruding on the resources used by the tenant.
The technical scheme adopted for solving the technical problems is as follows:
A non-invasive simulation method for monitoring the running quality of an application deployed on a cloud platform comprises: establishing an external virtual machine on the same platform, virtualizing homogeneous external resources, simulating the running environment of the user application, and collecting running quality indexes for application availability, the cloud platform, computing, storage, and the network, thereby monitoring running quality for a specific tenant application.
Because the cloud platform service provider has no operation authority over the customer's resources, the tenant's use experience is simulated by establishing an external simulation environment.
Non-invasive simulation detection of the running quality of tenant applications on the cloud platform is constructed from five aspects: application availability, platform, computing, storage, and network. Taking the application as the object, it reflects the actual running quality of the whole cloud platform, moves operation and maintenance assurance from a coarse platform-wide mode to a per-application mode, and both detects whether an application is available and identifies possible causes when it is not.
Preferably, collecting the running quality indexes of application availability comprises collecting access quality indexes from outside the cloud and collecting access quality indexes from inside the cloud.
For access from outside the cloud, the operation of a user opening the application in a browser is simulated, and commands are used to simulate and test the running quality of the application. The simulation commands support commonly used browsers such as IE (version 8 and above), Google Chrome, and Firefox, and the execution result of a simulation command directly reflects the running quality of the application. The indexes comprise DNS resolution time, TCP connection establishment time, system white-screen time, home page display time, and download speed.
Collection is based on the HTTP protocol combined with confirmation of the return code. The collection command automatically adapts to cases such as 3XX redirection, while 4XX and 5XX are judged to be abnormal return values. When an abnormality occurs, the results of the other links, such as intra-cloud access, computing, storage, and network quality, are used to help determine the cause of the fault.
To reduce abnormal alarms from application monitoring, in particular access delay alarms caused by network quality, the alarm mechanism is optimized. The method treats each monitored application individually: the current network delay and the application's normal access duration are processed by an algorithm to obtain a threshold APP-Time, and an abnormal alarm is issued when the monitored access duration exceeds APP-Time. For the application's home page loading time, the dynamic threshold APP-Time is calculated by a function F(normal access duration of the application, current network delay).
For access from inside the cloud, a simulation detection program is deployed on a network node and the simulation test is run inside the tenant's VLAN gateway, which yields a relatively realistic monitoring result.
Depending on the architecture, the intra-cloud simulated dial test can skip the DNS step and probe the HTTP protocol directly against an intra-cloud address (at least an accessible floating IP address). The indexes comprise connection establishment time, system white-screen time, home page display time, and download speed.
An abnormal dial-test return code or an access timeout is notified by alarm; the algorithm and implementation follow the out-of-cloud part.
Preferably, for collecting the quality indexes of the cloud platform, a collection process in the nail periodically calls the network card metadata interface and the port metadata interface, and the completeness and latency of the information obtained are evaluated to assess the quality of the cloud platform's support capability. The judgment strategy for this index is implemented in the nail (client); abnormal results are uploaded to the wall (server) for unified management and alarming.
Furthermore, an atomic carrier, the nail, is established per host machine of the cloud platform to carry the simulation monitoring functions for computing, storage, and network quality. The monitoring scheme uses a two-level structure: the execution unit is the nail and the management unit is the wall. Nails are created and managed uniformly by the virtualization platform, so that they directly reflect its running conditions, such as running fluctuation and delay of the cloud platform. If the host machine where a nail is located fails, the nail does not need to be evacuated to another host machine.
A monitoring program is installed in the nail; monitoring strategies are managed uniformly by the centralized management system, the wall, which establishes a channel with the nail and exchanges the relevant information.
The nail periodically reports its liveness to the wall through a heartbeat mechanism. When the wall does not receive a nail's report within the specified period, it judges from the number of silent nails whether the problem lies with a host machine or with the cloud platform cluster.
The objects and strategies of connectivity monitoring initiated by the nail are determined and issued by the wall.
The storage IO monitoring modes initiated by the nail, such as sequential read/write, random read/write, and data block size, can be customized, adjusted, and issued according to monitoring requirements.
The point-to-point packet transmission monitoring strategy initiated by the nail over the network is configured and issued by the wall.
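The heartbeat judgment described above can be sketched as follows. This is a minimal illustration, not the patented algorithm: the text says only that the management unit (the wall) decides "according to the number" of clients that stopped reporting, so the `cluster_ratio` cut-off, the function name, and the host names are all assumptions.

```python
def classify_missing_heartbeats(last_seen, now, timeout_s, cluster_ratio=0.5):
    """Judge host vs. cluster trouble from per-host heartbeat timestamps.

    last_seen maps host name -> time of the most recent heartbeat.
    cluster_ratio is an assumed cut-off: when at least this share of
    clients is silent, the cluster rather than one host is suspected.
    """
    missing = sorted(h for h, t in last_seen.items() if now - t > timeout_s)
    if not missing:
        return "ok", missing
    if len(missing) / len(last_seen) >= cluster_ratio:
        return "cluster_problem", missing
    return "host_problem", missing


# Hypothetical sample: one of four clients has been silent for 100 seconds.
now = 1000.0
last_seen = {"host-a": 995.0, "host-b": 990.0, "host-c": 900.0, "host-d": 998.0}
verdict, silent_hosts = classify_missing_heartbeats(last_seen, now, timeout_s=30)
print(verdict, silent_hosts)
```

A ratio is used rather than an absolute count so that the same rule works for small and large clusters; a real deployment might prefer a different rule.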
Preferably, for collecting the computing quality indexes, the host's capacity to support its virtual machines is judged from the perspective of the host machine. The indexes comprise the resource support capability of the CPU and memory, the number of CPU interrupt waits caused by untimely IO, and the queue length of tasks waiting for the CPU.
The nail calls the data interface of the cloud platform's control node to monitor the reachability of the platform as seen from that nail, and performs simulated verification of other access reachability.
Through the nail, the utilization of the host's CPU, memory, and local hard disk and the allocation rates of vCPU and memory are collected and reported, and abnormal information is notified by alarm.
Preferably, for collecting the storage quality indexes, detection programs of two kinds, file-based and database-based, are deployed in the nail to monitor the running quality of the underlying storage. Simulated access is performed from the perspective of a deployed application to judge whether IO affects normal operation.
The file performance indexes are obtained by a program deployed in the nail that repeatedly operates on a file according to the monitoring strategy.
The database performance indexes comprise, in addition to the QPS and TPS of the database, the time taken to query fixed data from a designated database table with SQL statements; this time is recorded and used to verify that the database is currently available and that query speed is normal.
IO quality monitoring between files and the cloud platform's storage resource pool covers performance indexes such as the latency and jitter of sequential and random reads and writes, and IOPS.
IO quality monitoring between the database and the cloud platform's storage resource pool covers performance indexes such as QPS and TPS.
Thresholds can be set for the monitored performance indexes, and abnormal data is notified to the centralized management system, the wall, by alarm.
In addition to the performance indexes, error logs generated while files and databases interact with storage are also sent to the wall as alarms.
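The fixed-data query check can be illustrated with a short sketch. It uses an in-memory SQLite database purely as a stand-in so that the example runs anywhere; the patent does not name a database engine, and the table name `probe`, the query, and the one-second limit are hypothetical.

```python
import sqlite3
import time


def timed_fixed_query(conn, sql, max_seconds):
    """Run a fixed query and return (rows, elapsed_seconds, is_normal)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed = time.perf_counter() - start
    return rows, elapsed, elapsed <= max_seconds


# Stand-in database with a designated probe table holding fixed data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE probe (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO probe (payload) VALUES (?)",
                 [("x" * 64,) for _ in range(1000)])
conn.commit()

rows, elapsed, is_normal = timed_fixed_query(
    conn, "SELECT COUNT(*) FROM probe", max_seconds=1.0)
print(rows[0][0], f"{elapsed:.6f}s", is_normal)
```

A failed or slow query here would be reported as an alarm; an exception raised by `execute` would indicate that the database is unavailable rather than merely slow.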
Preferably, for collecting the network quality indexes, the quality of mutual access between different virtual machines of the same application is tested by simulation over the same physical route that the application actually uses.
Agents deployed in the nail perform network quality detection over protocols such as TCP, HTTP, and ICMP, and tools such as traceroute track the route taken and its quality. This identifies response delays within an application that are caused by network quality. Problems such as packet loss, errors, delay, and unreachable targets between different virtual machines of the same application can be detected through simulation monitoring.
The nail on the same host is configured in transparent mode, and the traffic of the virtual machine to be analyzed is directed to a designated port of the nail. The corresponding packets are captured by specifying the source IP and port or the destination IP and port, combined with auxiliary conditions such as the transport protocol, and are analyzed to find the specific reason for an unsuccessful network connection or the content of a program statement that has been executing for a long time.
Further, new monitoring objects and strategies are added automatically according to the CMDB.
After the cloud service provider provisions the resources, the internal topology of the application is formed automatically from the configured application information. Monitoring points are then formed automatically from the topology, for example the network delay and jitter between web and app and between app and db; thresholds are set individually according to the characteristics of the application, and abnormal data is sent as alarms. The protocol used by the application is simulated to discover transmission quality conditions such as packet loss, jitter, and delay on the end-to-end route.
Packets from source to target are captured as the monitoring point requires, and their contents and return values are examined to help locate possible problems.
Quality problems in the connection between the application server and the database server, and their possible causes, are determined through packet content analysis.
Quality problems in the connection between the application server and the web server, and their possible causes, are determined through packet content analysis.
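As an illustration of forming monitoring points from an application topology, the sketch below expands each link (web to app, app to db) into one point per metric with a per-application threshold. The data layout, the `MonitoringPoint` type, and the threshold values are assumptions made for the example; the patent does not specify how the CMDB represents the topology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringPoint:
    source: str
    target: str
    metric: str
    threshold_ms: float


def build_monitoring_points(links, thresholds_ms):
    """Create one monitoring point for every (link, metric) pair."""
    points = []
    for source, target in links:
        for metric, limit in thresholds_ms.items():
            points.append(MonitoringPoint(source, target, metric, limit))
    return points


def is_abnormal(point, measured_ms):
    """A measurement above the point's threshold would be sent as an alarm."""
    return measured_ms > point.threshold_ms


# Hypothetical three-tier application and per-application thresholds.
links = [("web", "app"), ("app", "db")]
points = build_monitoring_points(links, {"delay": 20.0, "jitter": 5.0})
for p in points:
    print(p)
```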
In essence, the method monitors the quality of the cloud platform on a per-tenant basis, using the tenant application as the verification sample. It addresses the problem that cloud service providers can offer only a platform-wide SLA and cannot account for the service quality experienced by a specific tenant. It provides a personalized use experience monitoring scheme under the constraint that, without authorization, probes cannot be deployed inside the service instances used by the tenant application.
The invention also claims a non-invasive simulation device for monitoring the running quality of the application deployed on the cloud platform, comprising: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to perform the method described above.
The invention also claims a computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the above-described method.
Compared with the prior art, the non-invasive simulation method for monitoring the running quality of the application deployed on the cloud platform has the following beneficial effects:
The method constructs non-invasive simulation monitoring of tenant application running quality on the cloud platform from five aspects: application availability, platform, computing, storage, and network. Taking the application as the object, it reflects the actual running quality of the whole cloud platform and moves operation and maintenance assurance from a coarse platform-wide mode to a per-application mode. It both detects whether an application is available and identifies possible causes when it is not.
It resolves the information asymmetry between the cloud service provider and the tenant that arises from monitoring authority: the tenant does not authorize the service provider, yet requires the service provider to give specific causes and solutions when the application is unavailable.
It gives the service provider a means of priority assurance and personalized support for tenants, and a data basis for offering differentiated SLAs.
It solves the problem that the use experience of tenant applications is "averaged out" by platform statistics.
It solves the problem of determining how deeply a given platform fault affects tenant applications, so that when handling a fault the service provider can tell clearly whether the fault has been eradicated and its impact on tenant applications eliminated.
Drawings
FIG. 1 is a functional architecture diagram of the method provided by one embodiment of the present invention;
FIG. 2 is a schematic illustration of an end-to-end application provided in accordance with one embodiment of the present invention;
FIG. 3 is a schematic diagram of virtual machine network connection according to an embodiment of the present invention;
FIG. 4 is a deployment architecture diagram of the method provided by one embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and the specific examples.
When a cloud service provider delivers a service product (IaaS), it hands the right of use to the customer. The customer has full authority over the resources during the lease period; the service provider can only keep the resources running normally through operation and maintenance and cannot enter the customer's application to monitor its running quality. As a result, the running quality perceived by the service provider and that perceived by the tenant often conflict. For reasons of data security and privacy, the tenant cannot authorize the service provider to monitor the running quality of its application, and without targeted monitoring the service provider can only attend to the running quality of the platform. On the tenant side, developers generally take no further responsibility as long as the application runs. In practice, a poor user experience of a running application is mainly caused by environmental factors such as network, computing, and storage, although poor program quality cannot be ruled out. How the service provider can locate the problem, or prove that the platform is not at fault, therefore conflicts directly with the tenant's refusal to grant the relevant authority.
The cloud service provider has its own monitoring of computing, storage, and network, with corresponding service quality commitments, for example a stated availability above 99.95%. Most of these figures, however, are platform-level statistics that cannot be reflected at the level of a specific tenant's application, so the actual tenant experience is "averaged out": the platform as a whole may achieve 99.95%, but an individual tenant's experience is not necessarily 99.95%, because that tenant may fall precisely within the remaining 0.05%. When providing service, the cloud service provider signs an availability agreement with the tenant; from the service provider's perspective the agreement uses the platform's statistical availability, which does not match the tenant's application availability in dimension.
Both the service provider and the tenant therefore need non-invasive simulation monitoring that produces availability statistics for the tenant's application. For the service provider, it supplies the application availability monitoring referred to in the signed agreement; for the tenant, it enables the service provider to deliver better cloud resource services without being granted authorization.
How to give tenants a better service experience and better service capability, and how to guarantee a personalized use experience and provide personalized service without authorization and without intruding on the tenant's resources, is a task that cloud service providers currently need to solve.
The embodiment of the invention provides a non-invasive simulation method for monitoring the running quality of an application deployed on a cloud platform, which can be realized by the following steps:
A non-invasive simulation method for monitoring the running quality of an application deployed on a cloud platform comprises: establishing an external virtual machine on the same platform, virtualizing homogeneous external resources, simulating the running environment of the user application, and collecting running quality indexes for application availability, the cloud platform, computing, storage, and the network, thereby monitoring running quality for a specific tenant application.
Because the cloud platform service provider has no operation authority over the customer's resources, the tenant's use experience is simulated by establishing an external simulation environment. The method constructs non-invasive simulation detection of tenant application running quality on the cloud platform from five aspects: application availability, platform, computing, storage, and network. Taking the application as the object, it reflects the actual running quality of the whole cloud platform, moves operation and maintenance assurance from a coarse platform-wide mode to a per-application mode, and both detects whether an application is available and identifies possible causes when it is not.
The method provides non-invasive application running quality monitoring from the perspective of external simulation. As shown in the functional architecture of FIG. 1, this embodiment focuses on the data acquisition layer, that is, on how the running quality indexes are collected; the presentation layer and the policy and scheduling layer use conventional methods and are not described.
1. Design and collection of out-of-cloud application access quality indexes
Out-of-cloud access applies to an application that has an external address (such as an Internet address) or a domain name. The operation of a user opening the application in a browser is simulated, and commands are used to simulate and test the running quality of the application. Because developers use different technologies and so have different requirements on the browser kernel, the simulation commands support commonly used browsers such as IE (version 8 and above), Google Chrome, and Firefox. The execution result of a simulation command directly reflects the running quality of the application and mainly comprises the following indexes: DNS resolution time, TCP connection establishment time, system white-screen time, home page display time, and download speed. The end-to-end running quality of application access is plotted from the performance indexes obtained during normal access, as shown for out-of-cloud access quality in FIG. 2.
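Two of the listed indexes, DNS resolution time and TCP connection establishment time, can be measured with the standard library alone, as in the rough sketch below. It is not the patent's simulation command: white-screen time and home page display time need a real browser engine and are not covered, and the function name and timeout are assumptions. The demo connects to a local listener so that it runs without network access.

```python
import socket
import time


def measure_dns_and_tcp(host, port, timeout=3.0):
    """Return (dns_seconds, tcp_connect_seconds) for one probe of host:port."""
    start = time.perf_counter()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    dns_seconds = time.perf_counter() - start

    # Use the first resolved address, as a simple client would.
    family, socktype, proto, _, sockaddr = infos[0]
    start = time.perf_counter()
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)
        tcp_seconds = time.perf_counter() - start
    return dns_seconds, tcp_seconds


# Local listener standing in for the monitored application.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    dns_s, tcp_s = measure_dns_and_tcp("127.0.0.1", server.getsockname()[1])
    print(f"dns={dns_s:.6f}s tcp={tcp_s:.6f}s")
```

For an intra-cloud probe against an IP address the DNS figure is effectively zero, which matches the statement that the DNS step can be skipped there.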
Collection is based on the HTTP protocol combined with confirmation of the return code. The collection command automatically adapts to cases such as 3XX redirection, while 4XX and 5XX are judged to be abnormal return values. When an abnormality occurs, the results of the other links, such as intra-cloud access, computing, storage, and network quality, are used to help determine the cause of the fault.
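The return-code rule can be written down directly. In the sketch below, 3XX is followed rather than alarmed and 4XX/5XX are abnormal, as the text states; the redirect limit and the treatment of a chain that never reaches a final answer are assumptions added for the example.

```python
def classify_http_status(status):
    """Map one HTTP status code to a probe verdict."""
    if 200 <= status < 300:
        return "normal"
    if 300 <= status < 400:
        return "redirect"  # follow the Location header and probe again
    if 400 <= status < 600:
        return "abnormal"
    return "unknown"


def final_verdict(status_chain, max_redirects=5):
    """Judge the sequence of status codes seen while following redirects."""
    for hops, status in enumerate(status_chain):
        verdict = classify_http_status(status)
        if verdict != "redirect":
            return verdict
        if hops + 1 > max_redirects:
            return "abnormal"  # assumed: a redirect loop counts as a fault
    return "abnormal"  # assumed: chain ended on a redirect, no final answer


print(final_verdict([301, 302, 200]))
print(final_verdict([302, 503]))
```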
To reduce abnormal alarms from application monitoring, in particular access delay alarms caused by network quality, the alarm mechanism is optimized. The method treats each monitored application individually: the current network delay and the application's normal access duration are processed by an algorithm to obtain a threshold APP-Time, and an abnormal alarm is issued when the monitored access duration exceeds APP-Time. For the application's home page loading time, the dynamic threshold APP-Time is calculated by a function F(normal access duration of the application, current network delay).
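The patent does not give the form of F. The sketch below shows one plausible shape, purely as an assumption: the normal access duration is scaled by a tolerance factor, and the current network delay is added once per assumed round trip, so that a slow network raises the threshold instead of triggering an alarm. The parameters `tolerance` and `round_trips` are hypothetical.

```python
def app_time_threshold(normal_access_s, network_delay_s,
                       tolerance=1.5, round_trips=4):
    """One assumed form of F(normal access duration, current network delay)."""
    return normal_access_s * tolerance + network_delay_s * round_trips


def should_alarm(observed_s, normal_access_s, network_delay_s):
    """Alarm only when the observed duration exceeds the dynamic APP-Time."""
    return observed_s > app_time_threshold(normal_access_s, network_delay_s)


# Same observed load time, judged under a slow and a fast network.
print(should_alarm(3.1, normal_access_s=2.0, network_delay_s=0.05))
print(should_alarm(3.1, normal_access_s=2.0, network_delay_s=0.01))
```

With these assumed parameters, a 3.1 s page load is tolerated when the network delay is 50 ms but alarmed when it is 10 ms, because in the second case the network cannot explain the slowdown.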
2. Design and collection of intra-cloud application access quality indexes
The cloud platform supports multi-tenant scenarios. To isolate resources between tenants, the network imposes many access policy restrictions, and the use of overlay networks in SDN can make the cloud network more complex still. In practice, network quality inside a tenant (VLAN) is relatively good, while interworking between the management network and the tenant's service network is poor, and in many cases network quality between management and tenant causes false alarms. The OpenStack virtualization platform currently generally uses dedicated network nodes, on which the VLAN gateways of multiple tenants are created by means of network namespaces. In this method, the simulation monitoring program is deployed on the network node and the simulation test is run inside the tenant's VLAN, which yields a relatively realistic monitoring result.
Depending on the architecture, the intra-cloud simulated dial test can skip the DNS step and probe the HTTP protocol directly against an intra-cloud address (at least an accessible floating IP address). The core indexes mainly comprise connection establishment time, system white-screen time, home page display time, and download speed, and the result is presented as the end-to-end intra-cloud access quality shown in FIG. 2.
An abnormal dial-test return code or an access timeout is notified by alarm; the algorithm and implementation follow the out-of-cloud part.
3. Design and collection of cloud platform quality indexes
A typical virtualized cloud platform is divided into two kinds of function: the control plane and the service plane. The control plane manages metadata and business scenarios such as access policies between cloud resources and live migration; the service plane supports the operation of cloud service products according to the configured parameters. The running performance of the service plane is covered in the later sections through the three elements of computing, storage, and network; this section analyzes the support quality of the cloud platform itself.
In the OpenStack virtualization platform, the control plane stores the metadata, against which local configuration is checked for consistency. Processes running on each node periodically fetch information for comparison with the control plane's metadata, or periodically insert information into the metadata tables. The reachability and timeliness of information between nodes and the metadata tables affect important scenarios such as network connections, access policies, and mounted disks. Information is obtained by passing the interface call through a message queue and executing it asynchronously. The response time of the message queue, and whether it is congested, are important indexes of the platform's capacity to support cloud services.
In this embodiment, a collection process in the nail periodically calls the network card metadata interface and the port metadata interface, and the completeness and latency of the information obtained are evaluated to assess the quality of the cloud platform's support capability. The judgment strategy for this index is implemented in the nail (client); abnormal results are uploaded to the wall (server) for unified management and alarming.
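The completeness and latency check performed by the client-side collection process might look like the sketch below. The metadata call is injected as a plain callable so that the example stays independent of any real cloud API; `fake_port_metadata`, the field names, and the latency limit are hypothetical stand-ins, not OpenStack interfaces.

```python
import time


def evaluate_metadata_call(fetch, required_fields, max_latency_s):
    """Call one metadata interface and judge completeness and latency."""
    start = time.perf_counter()
    try:
        record = fetch()
    except Exception as exc:  # an unreachable interface counts as abnormal
        return {"ok": False, "latency_s": time.perf_counter() - start,
                "missing": sorted(required_fields), "error": repr(exc)}
    latency = time.perf_counter() - start
    missing = sorted(f for f in required_fields if record.get(f) in (None, ""))
    ok = not missing and latency <= max_latency_s
    return {"ok": ok, "latency_s": latency, "missing": missing, "error": None}


def fake_port_metadata():
    """Hypothetical response with one field left empty by the platform."""
    return {"id": "port-1", "mac_address": "fa:16:3e:00:00:01", "status": ""}


result = evaluate_metadata_call(
    fake_port_metadata, {"id", "mac_address", "status"}, max_latency_s=0.5)
print(result)
```

Only results with `ok` set to false would need to be uploaded to the server side, which keeps the judgment on the client as the text describes.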
4. Design and acquisition of calculated quality index
The operation basis of the computing virtualization is a time-sharing mechanism of an operating system, namely, CPU resources are provided for different processes to use in a time slice mode. The virtual machine serving as the basic resource of the cloud platform is used for simulating various peripheral equipment and components to support the dispatching operation of the virtual machine operating system through software. The method is designed and calculated, and the virtual running quality judges the supporting capacity of the virtual machine on the virtual machine from the perspective of the host machine, and mainly relates to the following indexes: the resource supporting capability of the CPU and the memory, the interrupt waiting times of the CPU caused by untimely IO, the queuing length of the processing task of the CPU and the like.
The specific calculation is as follows:
Resource supporting capability of the CPU and memory. Collect the amounts of CPU and memory allocated, total_num (the sum of the corresponding resource allocated to virtual machines), and compute the current usage percentage of each (the average usage percentage across the virtual machines). The occupancy of host resources is derived from total_num and the usage percentage; if it exceeds the alarm value, a risk of insufficient supporting capacity is reported.
Number of CPU interrupt waits caused by congested IO interaction. Collecting this count shows how much external data IO is affecting the work of the CPU, which in turn affects the virtual machines running on that CPU.
Queue length of CPU processing tasks. The working state of a CPU has long been measured by CPU load, with little attention paid to how busy the CPU is (its queue length). If no tasks are queued, the CPU is healthy whatever its load, and system operation is not affected; if the queue is long, the CPU is busy whether its load is high or low.
The collection program for this group of indexes is deployed at the management end (wall) and covers all hosts; it provides targeted analysis according to the hosts on which an application's virtual machines are located.
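A small Python sketch can make the compute indexes concrete. The source does not spell out the formula, so the occupancy calculation here (total_num multiplied by the average usage percentage, divided by host capacity) is an assumption, as are the names `HostSample`, `occupancy_ratio`, and `assess`, and the alarm values 0.8 and 4.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    capacity: float       # physical capacity of the host (cores or GiB)
    total_num: float      # amount allocated to virtual machines
    avg_usage_pct: float  # average usage percentage across the virtual machines
    run_queue_len: int    # tasks waiting for the CPU (hypothetical field)


def occupancy_ratio(s: HostSample) -> float:
    """Estimated share of host capacity in use (assumed formula, not from the source)."""
    return s.total_num * s.avg_usage_pct / 100.0 / s.capacity


def assess(s: HostSample, alarm_ratio: float = 0.8, queue_limit: int = 4) -> list:
    """Return the findings for one host; thresholds are illustrative defaults."""
    findings = []
    if occupancy_ratio(s) > alarm_ratio:
        findings.append("insufficient supporting capacity risk")
    if s.run_queue_len > queue_limit:
        # Queue length, not load alone, decides whether the CPU is busy.
        findings.append("CPU busy: tasks are queuing")
    return findings


# 128 vCPUs allocated on a 64-core host, used at 45% on average, short queue.
sample = HostSample(capacity=64, total_num=128, avg_usage_pct=45.0, run_queue_len=1)
print(occupancy_ratio(sample), assess(sample))
```

The sample host is over the occupancy alarm value while its queue is short, so only the capacity finding is raised.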
5. Storage quality index design and acquisition
In a cloud computing environment, storage is generally mounted on virtual machines as an external resource pool. The storage platform has its own monitoring and can offer a high SLA. Localized IO congestion and delay may be negligible from the storage platform's point of view, yet still give tenant applications a poor experience. In addition, some applications are sensitive to IO delay and interruption: a brief problem in the storage layer can leave the upper-layer application unable to connect (the usual remedy is to restart the application or database). By deploying detection programs for files and databases in the nail, the method monitors the running quality of the underlying storage, simulates access as the deployed application would perform it, and judges whether IO is affecting normal operation.
For the two data operation modes common in applications, files and databases, performance indexes are designed as follows:
File performance index. A program deployed in the nail repeatedly operates on a file according to the monitoring policy (block size: 4K/1M; mode: random/sequential) and obtains the related performance indexes: read/write IOPS, IO queue length, and IO response time. These indexes can be obtained with operating system commands such as fio and iostat, or with self-developed test programs.
Database performance index. In addition to collecting the QPS and TPS of the database, SQL statements query the data of a designated database table, and the time taken to query this fixed data is recorded to verify that the database is available and that its query speed is normal. These acquisition programs are all self-developed and deployed in the nail. The database is determined by what the application uses and is typically open source, such as MySQL or PostgreSQL.
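The two probes can be sketched in Python as follows. This is an illustration under stated assumptions, not the self-developed programs the text refers to: `file_io_probe` and `timed_query` are hypothetical names, the probe writes only (reads would be timed the same way), and an in-memory SQLite database stands in for the MySQL or PostgreSQL instance an application would actually use.

```python
import os
import random
import sqlite3
import tempfile
import time


def file_io_probe(path: str, block_size: int = 4096, blocks: int = 64,
                  mode: str = "sequential") -> dict:
    """Write `blocks` blocks of `block_size` bytes and report IOPS and latency."""
    offsets = [i * block_size for i in range(blocks)]
    if mode == "random":
        random.shuffle(offsets)
    payload = os.urandom(block_size)
    latencies = []
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        for off in offsets:
            start = time.perf_counter()
            os.lseek(fd, off, os.SEEK_SET)
            os.write(fd, payload)
            os.fsync(fd)  # push the write down to the storage layer
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    total = sum(latencies)
    return {"iops": blocks / total if total > 0 else float("inf"),
            "avg_latency_s": total / blocks,
            "max_latency_s": max(latencies),
            "bytes_written": os.path.getsize(path)}


def timed_query(conn, sql: str):
    """Run a fixed query and return (row count, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return len(rows), time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    report = file_io_probe(os.path.join(tmp, "probe.bin"), mode="random")

# SQLite is only a stand-in; the table and its contents are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE probe_table (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO probe_table VALUES (?, ?)", [(1, "a"), (2, "b")])
row_count, elapsed = timed_query(conn, "SELECT id, name FROM probe_table")
conn.close()
print(report, row_count, elapsed)
```

A fixed query over known data gives two signals at once: a wrong row count means the database is not serving correctly, and a rising elapsed time means the query speed has degraded.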
6. Network quality index design and acquisition
Resource isolation between tenants is typically enforced with network policies. The network quality index mainly considers the running quality of connections inside an application. As shown in Fig. 3, the three virtual machines of application A are located on host1, host2, and host3, and the access paths among host1-vm-a, host2-vm-a, and host3-vm-a follow the same physical routes as those among host1-nail, host2-nail, and host3-nail. This is because, on host1, both the nail interface and the vm-a interface connect through OVS (the virtual switch) to the physical network card of host1 to interact with the outside. (Traffic between virtual machines on the same host is not affected by the external network, so its quality can be assured and it is outside the monitoring scope.)
Based on the co-routing analysis above, the VMs and nails distributed over host1, host2, and host3 have similar network running quality. Because they share physical routes, the nails can be used to simulate and test the mutual access quality of the different virtual machines of the same application.
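As an illustration of the co-routing idea, the following Python sketch maps an application's VM placement to the nail pairs that would be probed in its place. The function name `nail_probe_pairs` and the `<host>-nail` naming are assumptions that follow the example in the text.

```python
from itertools import combinations


def nail_probe_pairs(vm_to_host: dict) -> list:
    """Map an application's VM placement to the nail pairs that share its routes."""
    hosts = sorted(set(vm_to_host.values()))
    # VMs on the same host are outside the monitoring scope,
    # so only distinct hosts are paired.
    return [(f"{a}-nail", f"{b}-nail") for a, b in combinations(hosts, 2)]


# Placement of application A as in the example of Fig. 3.
app_a = {"host1-vm-a": "host1", "host2-vm-a": "host2", "host3-vm-a": "host3"}
pairs = nail_probe_pairs(app_a)
print(pairs)
```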
Agents deployed in the nail can test network quality over protocols such as TCP, HTTP, and ICMP, and tools such as traceroute can trace the route taken and the quality along it. This identifies response delays inside an application that are caused by network quality. Packet loss, errors, delay, unreachable targets, and similar problems between the virtual machines of the same application can be detected through this simulated monitoring.
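A TCP probe of this kind might look like the Python sketch below. It is only one possible shape for such an agent: the name `tcp_probe` is hypothetical, jitter is taken as the population standard deviation of the connection delays (one simple definition among several), and the demonstration probes a local listener rather than another nail.

```python
import socket
import statistics
import time


def tcp_probe(host: str, port: int, attempts: int = 5, timeout: float = 1.0) -> dict:
    """Open `attempts` TCP connections and summarize failures, delay, and jitter."""
    delays = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                delays.append(time.perf_counter() - start)
        except OSError:
            pass  # refused, timed out, or unreachable: counted as a failure
    return {
        "attempts": attempts,
        "failure_rate": 1 - len(delays) / attempts,
        "avg_delay_s": statistics.mean(delays) if delays else None,
        "jitter_s": statistics.pstdev(delays) if len(delays) > 1 else 0.0,
        "reachable": bool(delays),
    }


# Local demonstration: probe a listener on the loopback interface.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(8)
report = tcp_probe("127.0.0.1", listener.getsockname()[1], attempts=3)
listener.close()
print(report)
```

HTTP and ICMP checks would follow the same pattern with a different transport; ICMP usually needs raw-socket privileges, which is one reason a TCP connect is used for the sketch.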
In actual operation, more than 60% of the problems that affect application operation have network causes. Besides obvious problems such as unreachable targets and severe packet loss, there are deeper causes such as failed connection establishment and mismatched MTU settings. To find these causes, and even program problems visible in the exchanged content, the packets must be opened and their content inspected. The nail on the same host is configured in transparent mode, the traffic of the virtual machine under analysis is directed to a designated port of the nail, and the relevant packets are captured by specifying the source IP and port or the target IP and port, combined with auxiliary conditions such as the transport protocol. Analysis of these packets reveals the specific reason a network connection failed, or the program statement that ran for a long time. Because this operation consumes a large amount of memory and computing resources, it is not enabled under normal conditions; when cause analysis is needed for a problematic application, it is started by issuing a monitoring policy.
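The selection of packets by source or target IP and port, plus protocol, can be expressed as a capture filter. The Python sketch below builds such an expression in pcap-filter syntax (the syntax accepted by tcpdump); the helper `capture_filter` and the example addresses are hypothetical, and the patent does not say which capture tool is used.

```python
import ipaddress


def capture_filter(src_ip=None, src_port=None, dst_ip=None, dst_port=None,
                   protocol=None) -> str:
    """Build a pcap-filter expression from the conditions that were supplied."""
    parts = []
    if protocol:
        parts.append(protocol)  # e.g. "tcp" or "udp"
    if src_ip:
        ipaddress.ip_address(src_ip)  # raises ValueError on a malformed address
        parts.append(f"src host {src_ip}")
    if src_port is not None:
        parts.append(f"src port {src_port}")
    if dst_ip:
        ipaddress.ip_address(dst_ip)
        parts.append(f"dst host {dst_ip}")
    if dst_port is not None:
        parts.append(f"dst port {dst_port}")
    if not parts:
        raise ValueError("at least one condition is required")
    return " and ".join(parts)


# Example: traffic from an application VM to a database VM on port 3306.
expr = capture_filter(src_ip="10.0.0.5", dst_ip="10.0.0.9", dst_port=3306,
                      protocol="tcp")
print(expr)
```

Refusing an empty filter matters here: an unfiltered capture is exactly the memory and compute cost the text says should be avoided in normal operation.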
Structure and deployment specification
Because the method involves data acquisition and management for applications, the platform, compute, storage, and network, the deployment content and structure are organized as shown in Fig. 4.
The acquisition methods used here rely partly on operating system monitoring commands and partly on modified or self-developed programs. The protection sought is directed mainly at non-invasive simulation monitoring and not at specific commands or methods, so specific program implementations are not described.
The method mainly targets the OpenStack virtualization platform and can serve as a running quality monitoring solution in the OpenStack virtualization field.
The embodiment of the invention also provides a non-invasive simulation method for monitoring the running quality of the application deployed on the cloud platform, which can be realized by the following steps:
A non-invasive simulation method for monitoring the running quality of an application deployed on a cloud platform comprises: establishing an external virtual machine on the same platform, virtualizing homogeneous external resources, simulating the running environment of the user application, and collecting running quality indexes for application availability, the cloud platform, compute, storage, and network, thereby monitoring running quality for a tenant's specific application.
Because the cloud platform service provider has no operating authority over client resources, the tenant's experience is simulated by building an external simulation environment. An atomic simulation carrier, the nail, is created per host managed by the cloud platform and carries out simulated detection of compute, storage, and network quality. The monitoring scheme is implemented in two tiers: the execution unit is the nail and the management unit is the wall. The method provides running quality monitoring results for a tenant's specific application, rather than statistical quality results for the cloud resource platform as a whole.
A virtual machine is created on each host managed by the cloud platform to serve as the atomic carrier of the monitoring functions; these virtual machines are called nails.
Nails are uniformly created and managed by the virtualization platform, so that they reflect the running condition of the virtualization platform more directly: nails experience the fluctuations, delays, and similar behavior of the cloud platform;
after the host where a nail is located fails, the nail does not need to be evacuated to another host;
a monitoring program is installed inside each nail, and monitoring policies are managed uniformly by the centralized management system (wall), which establishes a channel to the nail and exchanges the relevant information, including:
1) The nail periodically reports its own liveness to the wall through a heartbeat mechanism; if the wall does not receive a nail's report within the specified period, it judges from the number of missing nails whether the problem lies with a host or with the cloud platform cluster;
2) The connectivity monitoring targets and monitoring policies for probes initiated by the nail are decided and issued by the wall;
3) The storage IO monitoring mode initiated by the nail, such as sequential read/write, random read/write, and data block size, can be customized, adjusted, and issued according to monitoring requirements, so that simulation requirements are well met;
4) The point-to-point network packet transmission monitoring policy is configured and issued by the wall.
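The heartbeat judgment in item 1) can be sketched in Python as follows. The source only says the decision depends on the number of missing nails, so the rule used here (half or more of the nails missing points to a cluster problem) is an assumption, as are the name `judge_missing` and the parameter `cluster_ratio`.

```python
def judge_missing(last_seen: dict, now: float, period_s: float,
                  cluster_ratio: float = 0.5) -> dict:
    """Classify missing heartbeats as a host problem or a cluster problem."""
    missing = sorted(n for n, t in last_seen.items() if now - t > period_s)
    if not missing:
        verdict = "healthy"
    elif len(missing) / len(last_seen) >= cluster_ratio:
        # Many nails silent at once: assumed to indicate the platform cluster.
        verdict = "cloud platform cluster problem"
    else:
        verdict = "host problem"
    return {"missing": missing, "verdict": verdict}


# One silent nail out of four.
single = judge_missing(
    {"host1-nail": 100.0, "host2-nail": 100.0, "host3-nail": 40.0, "host4-nail": 98.0},
    now=105.0, period_s=30.0)

# Two silent nails out of three.
many = judge_missing(
    {"host1-nail": 10.0, "host2-nail": 20.0, "host3-nail": 100.0},
    now=105.0, period_s=30.0)
print(single, many)
```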
Non-invasive simulation is used to monitor the network quality of the tenant's application virtual machines:
1) New monitoring objects and policies are added automatically according to the CMDB, as follows:
after the cloud service provider provisions the resources, the internal topology of the application is formed automatically from recorded information such as the purpose of each resource;
monitoring points are formed automatically from the topology, for example the network delay, jitter, and other qualities between web, app, and db; thresholds are set individually according to the application's characteristics, and abnormal data is sent as alarms;
network quality monitoring covers common protocols such as TCP, HTTP, and ICMP; simulating the protocol the application actually uses reveals the transmission quality on the end-to-end route, such as packet loss, jitter, and delay, more accurately.
2) Packets from source to target are captured as the monitoring point requires, and their content and return values are examined to help locate possible problems. Specifically:
the quality problem and possible causes of the connection between the application server and the database server are determined by analyzing the content of the packets;
the quality problem and possible causes of the connection between the application server and the web server are determined by analyzing the content of the packets.
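How monitoring points might be derived from CMDB data is sketched below in Python. The CMDB layout (a mapping from role to server names), the link list, and the names `MonitorPoint` and `build_points` are all hypothetical; the source does not describe the CMDB schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorPoint:
    source: str
    target: str
    protocol: str
    max_delay_ms: float  # per-link threshold, set according to the application


def build_points(cmdb: dict, links: list, thresholds: dict,
                 default_ms: float = 50.0) -> list:
    """Expand role-level links (web -> app -> db) into server-level monitoring points."""
    points = []
    for src_role, dst_role, protocol in links:
        limit = thresholds.get((src_role, dst_role), default_ms)
        for src in cmdb.get(src_role, []):
            for dst in cmdb.get(dst_role, []):
                points.append(MonitorPoint(src, dst, protocol, limit))
    return points


# Hypothetical CMDB record of one application and its internal topology.
cmdb = {"web": ["web-1"], "app": ["app-1", "app-2"], "db": ["db-1"]}
links = [("web", "app", "HTTP"), ("app", "db", "TCP")]
points = build_points(cmdb, links, thresholds={("app", "db"): 10.0})
for p in points:
    print(p)
```

Keeping the threshold on the role pair rather than on each server pair means a newly provisioned server inherits the right limit as soon as it appears in the CMDB.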
Non-invasive simulation is used to monitor the quality of tenant virtual machines' access to storage:
IO quality monitoring between files and the cloud platform storage resource pool covers performance indexes such as the delay, jitter, and IOPS of sequential and random reads and writes.
IO quality monitoring between the database and the cloud platform storage resource pool covers indexes such as QPS and TPS.
Thresholds can be set for the monitored performance indexes, and abnormal data is reported to the centralized management system (wall) as alarms.
In addition to the performance indexes, error logs generated while files and databases interact with storage are also sent to the centralized management system (wall) as alarms.
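Threshold evaluation for these indexes needs to handle both directions: throughput figures such as IOPS and QPS are abnormal when they fall too low, while delay is abnormal when it rises too high. The Python sketch below shows one way to express that; the function `evaluate`, the metric names, and the limits are illustrative assumptions.

```python
def evaluate(metrics: dict, limits: dict) -> list:
    """Return one alarm per metric that breaches its limit.

    Each limit is ("min", value) for metrics that must stay above the value
    or ("max", value) for metrics that must stay below it.
    """
    alarms = []
    for name, (direction, limit) in limits.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not collected in this cycle
        breached = value > limit if direction == "max" else value < limit
        if breached:
            alarms.append({"metric": name, "value": value,
                           "limit": limit, "direction": direction})
    return alarms


metrics = {"write_iops": 180.0, "io_latency_ms": 35.0, "qps": 900.0}
limits = {"write_iops": ("min", 200.0),
          "io_latency_ms": ("max", 20.0),
          "qps": ("min", 500.0)}
alarms = evaluate(metrics, limits)
print(alarms)
```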
Non-invasive simulation is used to monitor the running stability and quality of tenant virtual machines:
The data interface of the cloud platform control node is called through the nail to monitor platform access reachability from that point, which serves as simulated verification of reachability for other access.
The utilization of the CPU, memory, and local hard disk of the host where the nail is located, and the allocation rate of vCPU and memory, are collected and fed back, and abnormal information is reported as alarms.
By establishing external virtual machines on the same platform, the method provides network quality monitoring over the same routes, storage performance monitoring over the same connections, and compute stability monitoring in the same cluster, covering the three core capabilities (network, compute, and storage) that virtual machine operation requires. A directed packet capture tool is used to monitor connectivity between specified virtual machines, for example the delay between a back-end service virtual machine and a database virtual machine, a pairing common in applications.
Without requiring the tenant to grant access to the application, the method obtains the actual running experience of the application through approximate simulation, reproduces the tenant's experience as closely as possible, and provides a data basis for personalized SLA guarantees. It lets cloud service providers meet tenants' individual running quality requirements at low cost and with little operational impact. By finding problems in advance or promptly, targeted optimization and handling can be carried out before tenants complain, improving running quality and satisfaction.
The embodiment of the invention also provides a non-invasive simulation device for monitoring the running quality of the application deployed on the cloud platform, which comprises: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
the at least one processor is configured to invoke the machine-readable program to execute the non-intrusive simulation method for monitoring the running quality of an application deployed on a cloud platform according to any of the embodiments of the present invention.
The embodiment of the invention also provides a computer readable medium storing computer instructions which, when executed by a processor, cause the processor to execute the non-invasive simulation method for monitoring the running quality of the application deployed on the cloud platform in any embodiment of the invention. Specifically, a system or apparatus may be provided with a storage medium on which software program code realizing the functions of any of the above embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus may be caused to read out and execute the program code stored in the storage medium.
In this case, the program code itself read from the storage medium may realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code form part of the present invention.
Examples of the storage medium for providing the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer over a communication network.
Further, it should be apparent that the functions of any of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform part or all of the actual operations based on the instructions of the program code.
Further, it is understood that the program code read out from the storage medium may be written into a memory provided in an expansion board inserted into a computer or into a memory provided in an expansion unit connected to the computer, and a CPU or the like mounted on the expansion board or the expansion unit may then be caused to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above embodiments.
While the invention has been illustrated and described in detail in the drawings and in the preferred embodiments, the invention is not limited to the disclosed embodiments, and it will be appreciated by those skilled in the art that the technical means of the various embodiments described above may be combined to produce further embodiments of the invention, which are also within the scope of the invention.

Claims (9)

1. The non-invasive simulation method for monitoring the running quality of the application deployed on the cloud platform is characterized in that an external virtual machine is built on the same platform, homogeneous external resources are virtualized, the running environment of the user application is simulated, and running quality indexes of the application availability, the cloud platform, calculation, storage and a network are collected, so that the running quality monitoring based on the tenant specific application is realized;
an atomic carrier, the nail, is established per host of the cloud platform; a monitoring program is installed in the nail, and monitoring policies are managed uniformly by a centralized management system, the wall, which establishes a channel and exchanges the relevant information;
the nail periodically reports its own liveness to the wall according to a heartbeat mechanism; if the wall does not receive the report of a nail within a specified period, it judges from the number of missing nails whether the problem is a host problem or a cloud platform cluster problem;
the connectivity monitoring targets and monitoring policies for probes initiated by the nail are decided and issued by the wall;
the storage IO monitoring mode initiated by the nail and the data block size can be customized, adjusted, and issued according to monitoring requirements;
and the point-to-point network packet transmission monitoring policy is configured and issued by the wall.
2. The non-intrusive simulation method of monitoring the operational quality of an application deployed on a cloud platform as defined in claim 1, wherein the operational quality indicator collection of application availability comprises an acquisition of an application external cloud access quality indicator and an acquisition of an application internal cloud access quality indicator,
for access from outside the cloud, the operation of a user opening the application in a browser is simulated, the running quality of the application is tested by simulation using commands, and the execution result of the simulated commands directly reflects the running quality of the application; the indexes comprise DNS resolution time, TCP connection establishment time, system white-screen time, home page display time, and download speed;
for access from inside the cloud, a simulation detection program is deployed to a network node and the simulation test is performed inside the tenant VLAN gateway; the indexes comprise connection establishment time, system white-screen time, home page display time, and download speed.
3. The non-invasive simulation method for monitoring the running quality of an application deployed on a cloud platform according to claim 1, wherein the acquisition of the quality index of the cloud platform is performed by periodically calling the network card metadata interface and the port metadata interface for calculation through the acquisition process in the nail, and evaluating the integrity and the time delay of the information acquisition to evaluate the quality of the supporting capability of the cloud platform.
4. A non-invasive simulation method for monitoring running quality of an application deployed on a cloud platform according to claim 1 or 3, wherein, for the design and acquisition of the compute quality index, a data interface of a cloud platform control node is called through the nail, platform access reachability from that point is monitored, and this serves as simulated verification of reachability for other access;
and the utilization of the CPU, memory, and local hard disk of the host where the nail is located, and the allocation rate of vCPU and memory, are collected and fed back, and abnormal information is reported as alarms.
5. The non-invasive simulation method for monitoring the running quality of an application deployed on a cloud platform according to claim 1 or 3, wherein the acquisition of storage quality indexes is performed by deploying detection programs for two modes of files and databases in a nail, monitoring the running quality of bottom storage, and performing simulation access from the deployed application to judge whether IO affects normal running;
IO quality monitoring between the file and the cloud platform storage resource pool comprises sequential reading and writing, random reading and writing time delay, jitter and IOPS performance indexes;
IO quality monitoring between the database and the cloud platform storage resource pool comprises QPS and TPS performance indexes.
6. The non-invasive simulation method for monitoring the running quality of an application deployed on a cloud platform according to claim 1, wherein the acquisition of network quality indexes is performed by simulating and testing the mutual access quality of different virtual machines in the same application through existing physical same-route facts.
7. The non-invasive simulation method for monitoring the running quality of an application deployed on a cloud platform according to claim 6, wherein automatically adding new monitoring objects and policies according to the CMDB includes automatically forming an application internal topology according to the set usage information, automatically forming monitoring points according to the topology, and simulating a protocol used by the application to find the transmission quality on an end-to-end route;
and grabbing a data packet from a source to a target according to the need of a monitoring point, judging the content and the return value, and assisting in positioning possible problems.
8. A non-intrusive simulation device for monitoring the quality of operation of an application deployed on a cloud platform, comprising: at least one memory and at least one processor;
the at least one memory for storing a machine readable program;
said at least one processor for invoking said machine readable program to perform the method of any of claims 1 to 7.
9. A computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 7.
CN202010424175.0A 2020-05-19 2020-05-19 Non-invasive simulation method for monitoring running quality of application deployed on cloud platform Active CN111597099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010424175.0A CN111597099B (en) 2020-05-19 2020-05-19 Non-invasive simulation method for monitoring running quality of application deployed on cloud platform

Publications (2)

Publication Number Publication Date
CN111597099A CN111597099A (en) 2020-08-28
CN111597099B true CN111597099B (en) 2023-07-04

Family

ID=72187416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010424175.0A Active CN111597099B (en) 2020-05-19 2020-05-19 Non-invasive simulation method for monitoring running quality of application deployed on cloud platform

Country Status (1)

Country Link
CN (1) CN111597099B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148576B (en) * 2020-09-28 2021-06-08 北京基调网络股份有限公司 Application performance monitoring method and system and storage medium
CN113778780B (en) * 2020-11-27 2024-05-17 北京京东尚科信息技术有限公司 Application stability determining method and device, electronic equipment and storage medium
CN114697319B (en) * 2020-12-30 2023-06-16 华为云计算技术有限公司 Tenant service management method and device for public cloud
CN115914039B (en) * 2022-11-22 2024-05-14 贵州电网有限责任公司 Network performance monitoring device and monitoring method thereof
CN116541261B (en) * 2023-07-06 2023-09-05 成都睿的欧科技有限公司 Resource management method and system based on cloud resource monitoring

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109075991A (en) * 2016-02-26 2018-12-21 诺基亚通信公司 Cloud verifying and test automation
CN109829689A (en) * 2019-01-14 2019-05-31 北京纷扬科技有限责任公司 A kind of cross-enterprise cooperation method and system based on PaaS system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140236935A1 (en) * 2013-02-20 2014-08-21 Thursday Market, Inc. Service Provider Matching

Also Published As

Publication number Publication date
CN111597099A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
CN111597099B (en) Non-invasive simulation method for monitoring running quality of application deployed on cloud platform
US12081419B2 (en) Automatic health check and performance monitoring for applications and protocols using deep packet inspection in a datacenter
Larsson et al. Impact of etcd deployment on Kubernetes, Istio, and application performance
CN110865867B (en) Method, device and system for discovering application topological relation
US20140215077A1 (en) Methods and systems for detecting, locating and remediating a congested resource or flow in a virtual infrastructure
US9311160B2 (en) Elastic cloud networking
US10333816B2 (en) Key network entity detection
US8782215B2 (en) Performance testing in a cloud environment
US20140201642A1 (en) User interface for visualizing resource performance and managing resources in cloud or distributed systems
US20140215058A1 (en) Methods and systems for estimating and analyzing flow activity and path performance data in cloud or distributed systems
US11675682B2 (en) Agent profiler to monitor activities and performance of software agents
US20180176115A1 (en) Vnf information obtaining method, apparatus, and system
US10461990B2 (en) Diagnostic traffic generation for automatic testing and troubleshooting
JP2015092354A (en) Method and system for evaluating resiliency of distributed computing service by inducing latency
WO2012025773A1 (en) Infrastructure model generation system and method
US10423439B1 (en) Automatic determination of a virtual machine's dependencies on storage virtualization
WO2018212928A1 (en) System and method for mapping a connectivity state of a network
CN103152229A (en) Dynamic configuration method for monitoring index item
US20140337471A1 (en) Migration assist system and migration assist method
US20180349166A1 (en) Migrating virtualized computing instances that implement a logical multi-node application
US20230214229A1 (en) Multi-tenant java agent instrumentation system
Shen et al. Network-centric distributed tracing with deepflow: Troubleshooting your microservices in zero code
US20240137278A1 (en) Cloud migration data analysis method using system process information, and system thereof
Aslanpour et al. Auto-scaling of web applications in clouds: A tail latency evaluation
Liu et al. Towards a community cloud storage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230601

Address after: Energy Building 1-2201, No. 10777 Jingshi Road, Jinan Area, China (Shandong) Pilot Free Trade Zone, Jinan City, Shandong Province, 250013

Applicant after: Shandong Electronic Port Co.,Ltd.

Address before: Floor S06, Inspur Science Park, No. 1036, Inspur Road, hi tech Zone, Jinan City, Shandong Province

Applicant before: SHANDONG HUIMAO ELECTRONIC PORT Co.,Ltd.

GR01 Patent grant