US20200295986A1 - Dynamic monitoring on service health signals
- Publication number
- US20200295986A1 (application US16/351,426)
- Authority
- US
- United States
- Prior art keywords
- application
- time series
- cloud
- data
- errors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
- H04L41/0813—Configuration setting characterised by the conditions triggering a change of settings
- H04L41/082—Configuration setting characterised by the conditions triggering a change of settings the condition being updates or upgrades of network functionality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1433—Saving, restoring, recovering or retrying at system level during software upgrading
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
- G06F11/0754—Error or fault detection not based on redundancy by exceeding limits
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1471—Saving, restoring, recovering or retrying involving logging of persistent data for recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/60—Software deployment
- G06F8/65—Updates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/34—Network arrangements or protocols for supporting network services or applications involving the movement of software or configuration parameters
Definitions
- Non-limiting examples of the present disclosure describe systems, methods and devices for automatically identifying software modifications in cloud-based application services and dynamically configuring operational monitors in those environments.
- the monitors may be automatically generated, modified and/or deleted based on analysis of time series information from a telemetry service associated with or integrated in the cloud-based application service.
- the telemetry service may receive operational data, including operation logs, from operations that are executed by users of the cloud-based application service.
- a determination may be made that monitors should be generated, modified and/or deleted by automatically comparing pre-software update time series data to post-software update time series data.
- a dynamic monitor engine may determine appropriate monitoring techniques for each corresponding operation type, baseline operational ranges, and/or thresholds for flagging operations for further review.
- the dynamic monitor engine may apply one or more machine learning models in making these determinations and setting these ranges and thresholds.
- the dynamic monitor engine may apply these models in the context of the processing resources that are available to the cloud-based application service, thereby allocating operational analysis bandwidth for each monitor according to the resources available in the system.
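- As a rough, non-authoritative sketch of the kind of per-operation configuration such an engine might emit, the following Python fragment defines a hypothetical `MonitorConfig` record and a stand-in technique selector; the class name, field names, and the operation-to-technique mapping are illustrative assumptions, not terms used in the disclosure.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MonitorConfig:
    """Hypothetical per-operation monitor configuration emitted by the
    dynamic monitor engine; names and fields are illustrative only."""
    operation: str                 # e.g. "save", "send", "new document"
    technique: str                 # monitoring technique chosen for this operation type
    baseline: Tuple[float, float]  # normalized range, e.g. (90.0, 100.0) percent success
    threshold: float               # value beyond the baseline that triggers flagging
    dynamic_baseline: bool         # True if the baseline varies with time of day, etc.
    series_window_s: int           # how much telemetry each evaluation covers, in seconds

def choose_technique(operation_type: str) -> str:
    """Pick a monitoring technique appropriate to the operation type.
    This static mapping is a stand-in; the engine could equally derive
    the choice from machine learning over historical telemetry."""
    if operation_type in {"save", "send"}:
        return "success_rate"   # percentage of successful executions over time
    if operation_type in {"copy", "paste"}:
        return "latency"        # duration to complete or time out
    return "failure_count"      # total unexpected failures per interval
```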
- FIG. 1 is a schematic diagram illustrating an example distributed computing environment for dynamically configuring monitors for a cloud-based application service.
- FIG. 2 illustrates a basic flow diagram for dynamically configuring monitors for a cloud-based application service.
- FIG. 3 illustrates a graphical display of operation data for a cloud-based application service with a dynamically configured monitor applied to a quality of service metric.
- FIG. 4 illustrates a graphical display of operation data for a cloud-based application service with a dynamically configured monitor applied to a total unexpected failure metric.
- FIG. 5 is an exemplary method for dynamically configuring monitors for a cloud-based application service.
- FIGS. 6 and 7 are simplified diagrams of a mobile computing device with which aspects of the disclosure may be practiced.
- FIG. 8 is a block diagram illustrating example physical components of a computing device with which aspects of the disclosure may be practiced.
- FIG. 9 is a simplified block diagram of a distributed computing system in which aspects of the present disclosure may be practiced.
- Examples of the disclosure provide systems, methods, and devices for dynamically creating monitors for software implemented in cloud-based application services.
- the monitors may provide mechanisms for identifying code regressions in the implemented software and/or other service functionality loss (e.g., server issues, network problems, etc.).
- the code regressions may be included in new software packages, updates, and/or patches, for example.
- the code regressions may be associated with one or more cloud-based applications, such as cloud-based document processing applications, spreadsheet applications, calendar applications, presentation applications, storage applications, video applications, real-time electronic messaging applications, voice messaging applications, video communication applications, and/or email applications.
- a monitor may analyze signals associated with operation failures related to one or more cloud-based applications. For example, when an operation for a cloud-based application fails, and/or an operation for a cloud-based application causes an application crash or malfunction when it is performed, a signal indicating that there was an operation event, or operation failure, may be reported to the monitor and/or a telemetry service associated with the monitor.
- a monitor may receive and analyze data associated with execution of one or more operations of applications hosted by a cloud-based application service. For example, the monitor may analyze data associated with “save” operations, “send” operations, “new document” operations, “copy” operations, “paste” operations, and any other operation that may be performed by cloud-based applications.
- a telemetry service may receive and store information associated with operations that have been executed by applications hosted by a cloud-based service.
- the telemetry service may store information about each executed operation including: a time that each operation was executed; an identity of each operation that was executed; a duration of time that each operation took to complete or time out; whether each operation was successful or unsuccessful; an indication of whether a monitor is receiving data associated with each operation; a name of each monitor that is associated with each operation; a server or server farm that executed each operation, etc.
- the telemetry service may maintain a continuous time series of operational data that includes a count of successfully executed operations and a count of unsuccessfully executed operations. This information can be utilized by monitors to flag operations that may be related to code regressions and/or other issues such as network and/or hardware problems.
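- One plausible (assumed, not prescribed) representation of those per-operation records and of the continuous success/failure time series is sketched below; the `OperationEvent` fields mirror the kinds of information described above, and the five-minute bucketing is only an example interval.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class OperationEvent:
    """One logged operation execution; fields follow the kinds of
    information the telemetry service is described as storing."""
    operation: str          # identity of the executed operation
    timestamp: float        # time the operation was executed (epoch seconds)
    duration_ms: float      # time to complete or time out
    success: bool           # whether the operation succeeded
    monitor: Optional[str]  # name of the monitor receiving this data, if any
    server_farm: str        # server or server farm that executed the operation

def bucket_counts(events: List[OperationEvent],
                  bucket_s: int = 300) -> Dict[int, Dict[str, int]]:
    """Fold raw events into a continuous time series of success/failure
    counts per five-minute bucket; monitors read series like this when
    looking for code regressions or hardware/network problems."""
    series: Dict[int, Dict[str, int]] = defaultdict(lambda: {"ok": 0, "fail": 0})
    for ev in events:
        bucket = int(ev.timestamp // bucket_s) * bucket_s
        series[bucket]["ok" if ev.success else "fail"] += 1
    return dict(series)
```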
- a telemetry analysis and comparison engine associated with the cloud-based application service may compare one or more time series of operational data from a duration of time just prior to the software update, with a time series corresponding to a time after the software update has been implemented into the service.
- the telemetry analysis and comparison engine may inspect the operation logs for each of the time series and compare them with one another to determine which operations were present in the pre-software update time series and which operations are present in the post-software update time series.
- a dynamic monitor engine may generate a new monitor for flagging potential issues associated with those newly added operations.
- the dynamic monitor engine may modify an existing monitor for flagging potential issues associated with those modified operations.
- the dynamic monitor engine may delete each corresponding monitor that was present in the pre-software update time series.
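- A minimal sketch of that pre/post-update comparison follows, assuming each time series' operation logs can be reduced to a mapping from operation name to some per-operation signature (the signature is a hypothetical stand-in for however modification is detected).

```python
from typing import Dict, NamedTuple, Set

class MonitorPlan(NamedTuple):
    create: Set[str]   # new operations needing newly generated monitors
    modify: Set[str]   # modified operations whose monitors need updating
    delete: Set[str]   # removed operations whose monitors can be deleted

def plan_monitor_changes(pre_ops: Dict[str, str],
                         post_ops: Dict[str, str]) -> MonitorPlan:
    """Compare operations logged in the pre-update time series with those
    in the post-update time series. Keys are operation names; values are
    a per-operation signature (e.g. a build hash), used here only as an
    assumed way to notice that an existing operation was modified."""
    pre, post = set(pre_ops), set(post_ops)
    return MonitorPlan(
        create=post - pre,
        modify={op for op in pre & post if pre_ops[op] != post_ops[op]},
        delete=pre - post,
    )

# Mirroring the FIG. 1 example described below: operation3 is removed,
# operation6 is added, and operation9 is modified by the software update.
pre = {f"operation{i}": "v1" for i in (1, 2, 3, 4, 5, 7, 8, 9)}
post = {**{op: sig for op, sig in pre.items() if op != "operation3"},
        "operation9": "v2", "operation6": "v2"}
print(plan_monitor_changes(pre, post))
# MonitorPlan(create={'operation6'}, modify={'operation9'}, delete={'operation3'})
```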
- the dynamic monitor engine may analyze time series operational data associated with the operations that the monitors are being generated and modified for. For example, when a new monitor is being generated for a new operation, the dynamic monitor engine may analyze operational data including operation logs for that new operation for a duration of time. That is, the dynamic monitor may analyze at least one time series for the operation.
- the dynamic monitor may utilize one or more machine learning models to determine a baseline that can be utilized by the monitor. The baseline may relate to a number of successfully or unsuccessfully executed operations over time, a percentage of successfully executed operations over time, and/or a duration of time that each operation in a set amount of time took to complete (i.e., latency).
- the dynamic monitor engine may utilize time series data to identify a normalized range for one or more of those metrics. For example, the dynamic monitor engine may determine that a baseline from 0-1000 unexpected failures per every five minutes should be established for a first operation; that a baseline between 50 milliseconds and 150 milliseconds for completing the operation should be established; and/or that a baseline of between 90-100% successfully executed operation requests should be established for the operation.
- a baseline that is established may be dynamic or static. That is, for dynamic baselines, the baseline may vary based on a variety of factors, including time of day, day of the week, month, and other contextual data. Alternatively, static baselines may remain constant regardless of the contextual data associated with them.
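- To make the dynamic/static distinction concrete, here is a small sketch under the assumption that a mean plus or minus k standard deviations is an acceptable stand-in for however the baseline range is actually learned: a static baseline is one constant range, while a dynamic baseline keys the range on contextual data such as the hour of day.

```python
import statistics
from typing import Dict, List, Sequence, Tuple

def static_baseline(values: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """One constant normalized range: mean +/- k standard deviations over
    the observed metric (the +/- 3-sigma band is an assumption; any
    learned range would do)."""
    mu, sigma = statistics.fmean(values), statistics.pstdev(values)
    return (max(0.0, mu - k * sigma), mu + k * sigma)

def dynamic_baseline(samples: List[Tuple[int, float]],
                     k: float = 3.0) -> Dict[int, Tuple[float, float]]:
    """One range per hour of day, so the acceptable band can shift with
    daily usage patterns; `samples` pairs an hour (0-23) with a metric
    value such as unexpected failures per five minutes."""
    by_hour: Dict[int, List[float]] = {}
    for hour, value in samples:
        by_hour.setdefault(hour, []).append(value)
    return {hour: static_baseline(vals, k) for hour, vals in by_hour.items()}
```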
- the dynamic monitor engine may determine a threshold outside of the baseline for flagging an operation for further analysis and/or review.
- the thresholds may be manually set.
- a threshold may be set for a number of operational failures received per X minutes over a baseline number of operational failures received per X minutes.
- a threshold may be set for a percentage of operational successes per X minutes under the lower bounds of the baseline percentage that an operation must fall below for the monitor to flag that operation.
- a threshold may be set for a latency metric such that a certain duration of time over an upper baseline latency must be met for operations for the monitor to flag that operation.
- an operation may not be flagged unless a time series for that operation exceeds two or more thresholds (e.g., two or more of a latency threshold, an unsuccessful operation threshold, and/or a percentage of successes threshold).
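- A compact illustration of that multi-threshold gate is below; the metric names, comparison directions, and the default of two required breaches are assumptions made for the sketch.

```python
from typing import Mapping

def should_flag(metrics: Mapping[str, float],
                thresholds: Mapping[str, float],
                min_breaches: int = 2) -> bool:
    """Flag an operation only when at least `min_breaches` thresholds are
    crossed in the evaluated time series, e.g. high latency alone is not
    enough but high latency plus a low success percentage is."""
    breaches = 0
    if metrics.get("latency_ms", 0.0) > thresholds.get("latency_ms", float("inf")):
        breaches += 1
    if metrics.get("failures", 0.0) > thresholds.get("failures", float("inf")):
        breaches += 1
    if metrics.get("success_pct", 100.0) < thresholds.get("success_pct", float("-inf")):
        breaches += 1
    return breaches >= min_breaches

# Latency is over its threshold and the success percentage is under its
# threshold, so two breaches are counted and the operation is flagged:
print(should_flag({"latency_ms": 400, "failures": 120, "success_pct": 72},
                  {"latency_ms": 150, "failures": 1000, "success_pct": 85}))  # True
```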
- the systems, methods, and devices described herein provide technical advantages for detecting code regressions and issues that impact functionality in cloud-based application services.
- Memory and processing costs (CPU cycles) associated with accurately detecting code regressions in software builds for large cloud-based application services are reduced by the dynamic and automated nature of the described mechanisms. Additionally, time and human resources that would otherwise be needed to create, modify, and/or delete monitors for such services are reduced significantly.
- the described mechanisms provide a way for cloud-based services to automatically generate, modify and delete operational monitors on the fly such that when new software updates are rolled out to those services, the operations can be monitored for code regressions immediately and without taking up human developer time and resources.
- the dynamic nature of the described mechanisms also provides a way for each monitor to take the processing resources available to the system into account when determining how many time series should be analyzed for each operation, how long each analyzed time series should be, how many operations should and can be monitored based on the constraints of the system, as well as what baselines and thresholds should be applied for flagging each operation for further review (e.g., if fewer resources are available baselines and/or thresholds may be more generous).
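- The resource-aware aspect could look roughly like the following; expressing available bandwidth as time series evaluated per hour and splitting it evenly across monitors is an assumption made only for illustration.

```python
from typing import Dict, List

def allocate_analysis_budget(operations: List[str],
                             series_per_hour: int,
                             min_per_monitor: int = 1) -> Dict[str, int]:
    """Split the analysis bandwidth available to the telemetry service
    (expressed here as time series evaluated per hour) evenly across the
    monitored operations, guaranteeing each monitor a minimum share."""
    if not operations:
        return {}
    share = max(min_per_monitor, series_per_hour // len(operations))
    return {op: share for op in operations}

# With fewer resources each monitor analyzes fewer (or shorter) time series;
# in practice the baselines/thresholds could also be made more generous.
print(allocate_analysis_budget(["save", "send", "copy"], series_per_hour=12))
# {'save': 4, 'send': 4, 'copy': 4}
```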
- FIG. 1 is a schematic diagram illustrating an example distributed computing environment 100 for dynamically configuring monitors for a cloud-based application service.
- Distributed computing environment 100 includes service modification sub-environment 101, network and processing sub-environment 136, and dynamic monitors sub-environment 130.
- Network and processing sub-environment 136 comprises network 140 , via which any of the computing devices in distributed computing environment 100 may communicate with one another; server computing device 138 ; and time series data/event logs data storage 142 .
- a monitoring service that monitors operational data from one or more cloud-based applications may reside on one or more computing devices in network and processing sub-environment 136 , and the monitoring service may receive operational data from a telemetry service.
- In some examples, the monitoring service and the telemetry service may be different services; in other examples, they may be the same service.
- the telemetry service may receive operational data (e.g., operational success counts, operational failure counts, operation latency data, etc.), which can be utilized by one or more monitors of the monitoring service.
- one or more cloud-based applications may report operational errors to the telemetry service and monitors of the monitoring service may determine whether drops in quality of services associated with one or more cloud-based applications correspond to code regressions in new or modified operations included in new software builds, server issues and/or network problems.
- operational successes and errors may be automatically reported to a telemetry database by the cloud-based applications when they occur.
- an opt-in system may exist such that, at least in production environments, users must opt in to allow the operational data to be automatically reported to the telemetry database.
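- A client-side reporting path gated on that opt-in might be sketched as follows; the endpoint URL, payload shape, and function name are placeholders, not part of the disclosure.

```python
import json
import time
import urllib.request

TELEMETRY_ENDPOINT = "https://telemetry.example.invalid/report"  # placeholder URL

def report_operation(operation: str, success: bool, duration_ms: float,
                     user_opted_in: bool, environment: str = "production") -> None:
    """Report an operation outcome to the telemetry database, but only when
    the user has opted in (at least for production environments)."""
    if environment == "production" and not user_opted_in:
        return  # respect the opt-in requirement: nothing is sent
    payload = json.dumps({
        "operation": operation,
        "success": success,
        "duration_ms": duration_ms,
        "timestamp": time.time(),
    }).encode("utf-8")
    request = urllib.request.Request(TELEMETRY_ENDPOINT, data=payload,
                                     headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request, timeout=5)  # fire-and-forget style report
```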
- Service modification sub-environment 101 includes application development sub-environment 128 and application sub-environment 102 .
- Team A 132 is responsible for creating, modifying and/or maintaining software and software updates for application A 104 .
- Team B 134 is responsible for creating, modifying and/or maintaining software and software updates for application B 112 .
- Team C 136 is responsible for creating, modifying and/or maintaining software and software updates for application C 120 .
- each of the development teams has rolled out a software update for its respective application, illustrated as software update 127, which is pushed to those respective applications via the cloud-based application service operating in network and processing sub-environment 136.
- the previous build of application A 104 prior to implementation of software update 127 included operation1 106 , operation2 108 and operation3 110 .
- Operation3 110 is shown with dashed lines around it to illustrate that it has been removed from application A 104 via software update 127 .
- the previous build of application B 112 prior to implementation of software update 127 included operation4 114 and operation5 116 .
- Operation6 118 has been added to application B 112 via implementation of software update 127 .
- Application C 120 prior to implementation of software update 127 included operation7 122 , operation8 124 and operation9 126 .
- Operation9 126 is shown with dotted lines around it to illustrate that it has been modified from a previous version via implementation of software update 127 .
- a telemetry service operating in network and processing sub-environment 136 may store time series data and event logs for operations associated with the cloud-based application service in time series data/event logs data storage 142.
- the telemetry service may receive operational data for a plurality of operations associated with one or more cloud-based applications, and store that information in addition to the time that the information was received and/or that each corresponding operational event occurred.
- One or more monitors associated with the cloud-based application service may analyze operational time series data from the telemetry service to determine whether code regressions exist in software for one or more of the applications hosted by the cloud-based application service.
- old monitors 132 correspond to each monitor that was monitoring telemetry data for the cloud-based application service prior to implementation of software update 127. That is, monitor M1 106* monitored telemetry data for operation1 106 for application A 104; monitor M2 108* monitored telemetry data for operation2 108 for application A 104; monitor M3 110* monitored telemetry data for operation3 110 for application A 104; monitor M4 114* monitored telemetry data for operation4 114 for application B 112; monitor M5 116* monitored telemetry data for operation5 116 for application B 112; monitor M7 122* monitored telemetry data for operation7 122 for application C 120; monitor M8 124* monitored telemetry data for operation8 124 for application C 120; and monitor M9 126* monitored telemetry data for operation9 126 for application C 120.
- a telemetry analysis and comparison engine associated with the telemetry service and/or monitor service may analyze time series data and event logs from operational events that took place in a time series prior to implementation of software update 127 and compare that data with time series data and event logs from operational events that took place after implementation of software update 127 .
- the time series data and/or event logs may include a name of each operation that was executed in relation to the cloud-based application service, an indication of whether each operational event was successful or unsuccessful, a duration of time that each operation took to complete and/or time out, a designation of whether a monitor exists for monitoring each operation, and/or a designation of a specific monitor that is monitoring each operation if such a monitor exists.
- the telemetry analysis and comparison engine may make a determination as to which operations have been deleted, added and/or modified by implementation of software update 127. Additionally, the telemetry analysis and comparison engine may make a determination based on the time series comparison as to which operations in the post-software update time series have existing monitors, which monitors need to be updated because their operations have been modified via implementation of software update 127, which operations in the post-software update time series need new monitors to be created for them because they have been added via implementation of software update 127, and/or whether and which monitors need to be deleted because their operations have been deleted via implementation of software update 127.
- a dynamic monitor engine may modify one or more existing monitors, delete one or more existing monitors and/or generate one or more new monitors. That is, for each operation that was deleted via software update 127 for which a monitor existed, the dynamic monitor engine may delete that monitor; for each new operation that was added via software update 127 , dynamic monitor engine may generate a new monitor; and for each operation that was modified via software update 127 for which a monitor existed, the dynamic monitor engine may modify the corresponding monitor.
- the dynamic monitor engine may apply one or more machine learning models (e.g., Holt-Winters, principal component analysis, etc.) to telemetry datasets to determine one or more baselines for operation success or failure levels and/or counts, and/or baselines for operation execution latency.
- one or more machine learning models may be applied to one or more post-software update time series of operational data to determine a baseline that success, error and/or latency datapoints from that time series fall into.
- the baseline may be dynamic in that it changes based on one or more factors (e.g., time of day, date, month, context, etc.). In other examples, the baseline may be static in that it does not change regardless of the factors that are associated with it.
- a dynamic baseline may be generated for an operation failure count and/or percentage of operational failures where a number of operational failures and/or percentage of operational failures increase during various times of day and/or night.
- a static baseline may be generated for an operation failure count and/or percentage of operational failures where a number and/or percentage of operational failures remain relatively static over the course of a day.
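- Since Holt-Winters is named above as one candidate model, the following is a minimal additive Holt-Winters smoother that could supply the centre of such a baseline band; the smoothing constants and the fixed band width are illustrative assumptions, and a production fit would estimate them from the data.

```python
from typing import List, Tuple

def holt_winters_additive(series: List[float], season_len: int,
                          alpha: float = 0.5, beta: float = 0.1,
                          gamma: float = 0.1) -> List[float]:
    """Minimal additive Holt-Winters smoother returning one-step-ahead
    fitted values, which a monitor could use as the centre of a dynamic
    baseline; at least two full seasons of data are required."""
    assert len(series) >= 2 * season_len
    level = sum(series[:season_len]) / season_len
    trend = (sum(series[season_len:2 * season_len]) -
             sum(series[:season_len])) / season_len ** 2
    seasonal = [series[i] - level for i in range(season_len)]
    fitted = []
    for i, y in enumerate(series):
        s = seasonal[i % season_len]
        fitted.append(level + trend + s)          # forecast before seeing y
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[i % season_len] = gamma * (y - level) + (1 - gamma) * s
    return fitted

def baseline_band(fitted: List[float], width: float) -> List[Tuple[float, float]]:
    """Turn fitted values into (lower, upper) baseline pairs; a fixed width
    is used here for brevity, while a real fit could size it from residuals."""
    return [(f - width, f + width) for f in fitted]
```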
- a threshold value outside of that baseline may be identified for flagging time series data that exceeds that threshold.
- a monitor for an operation may identify a threshold number of operational failures outside of a baseline failure number that a time series must exceed over a threshold duration of time (i.e., “time window”) for the monitor to flag the time series as potentially relating to a code regression or otherwise being potentially problematic.
- a monitor for an operation may identify a threshold percentage of operational failures outside of a baseline failure percentage that a time series must exceed over a threshold duration of time for the monitor to flag the time series as potentially relating to a code regression or otherwise being potentially problematic.
- a monitor for an operation may identify a threshold duration of time outside of a baseline latency duration that operations for a time series must exceed for the monitor to flag the time series as potentially relating to a code regression or otherwise being potentially problematic.
- the threshold values may be automatically determined based on one or more machine learning models and/or the threshold values may be manually selected.
- the threshold values may be dynamic or static. In some examples, the thresholds may be set to zero (i.e., any readings over or under a baseline may result in operation flagging).
- new monitors 134 correspond to the monitors that exist for operations post-incorporation of software update 127 .
- monitors M1 106 *, M2 108 *, M4 114 *, M5 116 *, M7 122 * and M8 124 * remain post-software update 127 because their corresponding operations have not been modified or deleted
- monitor M3 110 * has been deleted (as indicated by the dashed line surrounding it) because its corresponding operation (operation3 110 ) has been deleted via software update 127
- monitor M9 126 * has been modified (as indicated by the dotted line surrounding it) because its corresponding operation (operation9 126 ) has been dynamically modified via software update 127
- monitor M6 118 * has been dynamically generated by the dynamic monitor engine because its corresponding operation (operation6 118 ) has been added to application B 112 via software update 127 .
- FIG. 2 illustrates a basic flow diagram 200 for dynamically configuring monitors for a cloud-based application service.
- Flow diagram 200 includes application development sub-environment 228 and software update 227 .
- Application development sub-environment 228 includes three development teams (team A, team B and team C), which are all software development teams that create and update software for the cloud-based application service.
- the teams have rolled out a new software update (software update 227 ) which is being applied to the cloud-based application service.
- the software update has been applied to one or more cloud-based applications, and time series data for the applications prior to and after the software update has been stored in time series data/event logs database 202 .
- Log datamining element 204 comprises the analysis and comparison of operation logs from one or more pre-software update time series and one or more post-software update time series.
- a comparison engine may identify which operations and corresponding monitors existed in the cloud-based application service and/or telemetry service pre-software update and compare those operations and monitors to operations and monitors that exist in the cloud-based application service and/or telemetry service post-software update.
- Comparison results element 205 includes old operations 206 (i.e., operations that existed pre-software update and/or that still exist from the previous software build post-software update), new operations 208 (i.e., operations that have been added via software update 227 ), and modified operations 210 (i.e., operations that existed pre-software update and that have been modified via incorporation of software update 227 ).
- New monitor generation element 212 illustrates the generation of new monitors that are dynamically created via application of machine learning models to time series data for operations that are new to the cloud-based application service based on implementation of software update 227 .
- New monitor generation element 212 also illustrates the deletion of existing monitors that are no longer useful due to deletion of operations from the cloud-based application service based on their removal via software update 227 .
- new monitor generation element 212 also illustrates the modification of existing monitors to accurately flag code regressions in the corresponding operations that have been modified via software update 227.
- FIG. 3 illustrates a graphical display 302 of operation data for a cloud-based application service with a dynamically configured monitor applied to a quality of service metric.
- the data corresponding to graphical display 302 may be associated with a telemetry service that receives and stores operational event data and operation event logs for the cloud-based application service.
- the telemetry service may keep logs of each operation that has been initiated by a user of a cloud-based application hosted on the service.
- the logs may include an identity of each operation that was initiated, whether each initiated operation was successfully or unsuccessfully executed, a duration of time that each operation took to complete or time out, an identity of a server or server farm that executed each request associated with each operation, an indication of whether a monitor is associated with each initiated operation, and/or an identity of each monitor that is associated with each initiated operation.
- the telemetry service may take the raw operational event data that it collects/receives and generate one or more graphs of the data, such as the graph shown on graphical display 302 .
- the graph in FIG. 3 is a quality of service graph (for the cloud-based application service) that illustrates a percentage of initiated operations of a specific operation type that have been executed successfully (on the Y-axis) over a duration of time (on the X-axis).
- the dynamic monitor engine has automatically generated a new monitor for the specific operation that is represented in the graph. In doing so, the dynamic monitor engine has determined and set a baseline percentage 304 corresponding to a successful operation range for the specific operation that the operation typically falls within.
- This range, which is illustrated by the diagonal-lined rectangle of baseline percentage 304 and spans from 85 percent to 100 percent, is shown as being static for the duration of time included in the graph.
- the baseline may be dynamic or static.
- the illustrated baseline may relate to a day-time normalized range for the specific operation, and the baseline may change for an evening-time normalized range for the specific operation.
- a threshold 308 has been identified and set either by a machine learning mechanism or a manual interaction with the telemetry service.
- Threshold 308 corresponds to a percentage outside of the baseline that the quality of service has to drop to for the telemetry service to flag the specific operation as potentially being an issue (e.g., the specific operation potentially relating to a code regression).
- the threshold 308 has been set to 77.5 percent (i.e., 7.5 percent below the lower bounds of baseline percentage 304 ).
- the monitor for the specific operation would flag the portion 306 of the graph that has fallen below threshold 308 .
- once the specific operation has been flagged due to identification of the portion 306 of the graph falling below threshold 308, that time series data and/or time series data surrounding it may be provided for analysis of the flagged issue.
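- Using the concrete numbers from FIG. 3 (an 85-100 percent baseline and a 77.5 percent threshold), the flagging check reduces to something like the sketch below; the function name and the point format are assumptions.

```python
from typing import List, Tuple

BASELINE_LOW_PCT = 85.0    # lower bound of baseline percentage 304 in FIG. 3
FLAG_THRESHOLD_PCT = 77.5  # threshold 308: 7.5 points below that lower bound

def flag_low_qos(points: List[Tuple[float, float]]) -> List[float]:
    """Return the timestamps of points whose success percentage has dropped
    below the flagging threshold (the flagged portion 306 of the graph)."""
    return [t for t, success_pct in points if success_pct < FLAG_THRESHOLD_PCT]

# A dip to 74% is flagged; 88% sits inside the baseline, and 79% is below
# the baseline but has not yet crossed the threshold, so neither is flagged.
print(flag_low_qos([(300, 88.0), (360, 79.0), (420, 74.0)]))  # [420]
```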
- FIG. 4 illustrates a graphical display 402 of operation data for a cloud-based application service with a dynamically configured monitor applied to a total unexpected failure metric.
- the data corresponding to graphical display 402 may be associated with a telemetry service that receives and stores operational event data and operation event logs for the cloud-based application service.
- the telemetry service may keep logs of each operation that has been initiated by a user of a cloud-based application hosted on the service.
- the logs may include an identity of each operation that was initiated, whether each initiated operation was successfully or unsuccessfully executed, a duration of time that each operation took to complete or time out, an identity of a server or server farm that executed each request associated with each operation, an indication of whether a monitor is associated with each initiated operation, and/or an identity of each monitor that is associated with each initiated operation.
- the telemetry service may take the raw operational event data that it collects/receives and generate one or more graphs of the data, such as the graph shown on graphical display 402 .
- the graph in FIG. 4 is an unexpected failure graph (for the cloud-based application service) that illustrates a number of initiated operations of a specific operation type that have resulted in an unexpected failure (on the Y-axis) over a duration of time (on the X-axis).
- the dynamic monitor engine has automatically generated a new monitor for the specific operation that is represented in the graph. In doing so, the dynamic monitor engine has determined and set a baseline number range 404 of unexpected failures that the specific operation typically falls within.
- This range, which is illustrated by the diagonal-lined rectangle on graphical display 402, has been set from 0 to 1750 unexpected failures for the specific operation.
- This range may have been identified based on application of one or more machine learning models to time series data for the specific operation.
- the range is a static range; however, it should be understood that the baseline may be dynamic or static.
- the illustrated baseline may relate to a day-time normalized range for the specific operation, and the baseline may change for an evening-time normalized range for the specific operation (e.g., more or less users may utilize the specific operation during certain hours compared with certain other hours).
- a threshold 408 has been identified and set either by application of a machine learning model to one or more time series for the specific operation or a manual interaction with the telemetry service.
- Threshold 408 corresponds to a minimum number of total unexpected failures above the baseline that needs to be reached in a time window of a time series for the telemetry service to flag the specific operation as potentially being an issue (e.g., the specific operation potentially relating to a code regression).
- the threshold 408 has been set to 3000 unexpected failures (i.e., 1250 unexpected failures above the upper bound of the baseline).
- the monitor for the specific operation would flag the portion 406 of the graph that is above threshold 408 .
- once the specific operation has been flagged due to identification of the portion 406 of the graph that is above threshold 408, that time series data and/or time series data surrounding it may be provided for analysis of the flagged issue.
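- FIG. 4's count-based monitor can be sketched the same way, with the added (assumed) rule, following the description above, that the 3000-failure threshold must be exceeded for a full time window before the operation is flagged.

```python
from typing import List, Tuple

FAILURE_THRESHOLD = 3000  # threshold 408: 1250 above the baseline's 1750 upper bound

def sustained_breach(points: List[Tuple[int, int]], window_s: int) -> bool:
    """Flag only when unexpected-failure counts stay above the threshold for
    at least `window_s` consecutive seconds, so a single noisy sample does
    not flag the operation on its own."""
    breach_start = None
    for t, failures in points:
        if failures > FAILURE_THRESHOLD:
            breach_start = t if breach_start is None else breach_start
            if t - breach_start >= window_s:
                return True
        else:
            breach_start = None
    return False

# Five-minute samples: the counts stay above 3000 from t=300 through t=1200,
# i.e. for fifteen minutes, which meets the 900-second window and flags it.
samples = [(0, 1200), (300, 3100), (600, 3400), (900, 3600), (1200, 3200)]
print(sustained_breach(samples, window_s=900))  # True
```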
- FIG. 5 is an exemplary method 500 for dynamically configuring monitors for a cloud-based application service.
- the method 500 begins at a start operation and flow moves to operation 502 .
- at operation 502, telemetry data for a plurality of operations for the cloud-based application service is analyzed.
- the analysis may comprise comparing a first time series with a second time series, where the data from the second time series relates to operations that were executed prior in time compared with execution of operations related to the first time series.
- Each analyzed time series may comprise operational event data from operations associated with the cloud-based application service.
- That operational event data may include an identity/name of each operation that was initiated, executed, and/or completed; a time of occurrence that each operation was initiated, executed, and/or completed; an indication of whether a monitor exists for each operation that was initiated, executed, and/or completed; an identity of each monitor that exists for each operation that was initiated, executed, and/or completed; a duration of time that each operation took to complete and/or time out; and/or an identity of a server or server farm that attempted execution of each operation.
- the comparison of the two time series may be automatically initiated based on an indication that a software update has been provided to the cloud-based application service.
- the second time series may correspond to a duration of time prior to incorporation of the software update in the cloud-based application service
- the first time series may correspond to a duration of time after incorporation of the software update in the cloud-based application service.
- at operation 504, one or more operational changes in the cloud-based application service are identified.
- the identification is based on the time series comparison performed at operation 502 .
- the identification is made by determining whether operations have been added, deleted and/or modified based on the software update to the cloud-based application service. For example, when a software update is incorporated into the cloud-based application service, one or more operations that were included in one or more applications hosted by the cloud-based application service prior to the software update may be modified, added, or deleted based on the incorporation of that software update.
- At least one telemetry monitor is dynamically configured based on the one or more operational changes that were identified at operation 504.
- the dynamic configuration may be performed by a dynamic monitor engine and application of one or more machine learning models to time series data as discussed herein.
- the dynamic configuration may comprise selecting an appropriate monitoring technique to detect failure patterns for each new operation associated with the one or more operational changes.
- the dynamically configuring may additionally or alternatively comprise automatically defining a baseline failure rate for each new operation associated with the one or more operation changes and/or defining a threshold failure rate from the baseline failure rate for each new operation associated with the one or more operational changes.
- dynamically configuring the at least one telemetry monitor may include automatically defining a time series window.
- the time series window may comprise operations executed in a duration of time that each telemetry monitor for each new operation associated with the one or more operational changes will monitor.
- the dynamic monitor engine may configure monitors to have static and/or dynamic baselines and/or thresholds.
- a number of the one or more new telemetry monitors for generation may be automatically determined based on available bandwidth for monitoring operations in the cloud-based application service.
- a number of time series and/or duration of each time series that is monitored may be automatically determined based on available bandwidth for monitoring operations in the cloud-based application service.
- the processing resources available to the telemetry service may dictate that only a set number of time series may be analyzed over a set duration of time. Therefore, when the monitors are dynamically created, the number of time series they process over that timeframe may be automatically calculated and built into each corresponding monitor.
- FIGS. 6 and 7 illustrate a mobile computing device 600, for example, a mobile telephone, a smart phone, a wearable computer, a tablet computer, an e-reader, a laptop computer, an AR-compatible computing device, or a VR computing device, with which embodiments of the disclosure may be practiced.
- a mobile computing device 600 for implementing the aspects is illustrated.
- the mobile computing device 600 is a handheld computer having both input elements and output elements.
- the mobile computing device 600 typically includes a display 605 and one or more input buttons 610 that allow the user to enter information into the mobile computing device 600 .
- the display 605 of the mobile computing device 600 may also function as an input device (e.g., a touch screen display).
- an optional side input element 615 allows further user input.
- the side input element 615 may be a rotary switch, a button, or any other type of manual input element.
- mobile computing device 600 may incorporate more or fewer input elements.
- the display 605 may not be a touch screen in some embodiments.
- the mobile computing device 600 is a portable phone system, such as a cellular phone.
- the mobile computing device 600 may also include an optional keypad 635 .
- Optional keypad 635 may be a physical keypad or a “soft” keypad generated on the touch screen display.
- the output elements include the display 605 for showing a graphical user interface (GUI), a visual indicator 620 (e.g., a light emitting diode), and/or an audio transducer 625 (e.g., a speaker).
- the mobile computing device 600 incorporates a vibration transducer for providing the user with tactile feedback.
- the mobile computing device 600 incorporates input and/or output ports, such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., a HDMI port) for sending signals to or receiving signals from an external device.
- FIG. 7 is a block diagram illustrating the architecture of one aspect of a mobile computing device. That is, the mobile computing device 700 can incorporate a system (e.g., an architecture) 702 to implement some aspects.
- the system 702 is implemented as a “smart phone” capable of running one or more applications (e.g., browser, e-mail, calendaring, contact managers, messaging clients, games, and media clients/players).
- the system 702 is integrated as a computing device, such as an integrated personal digital assistant (PDA) and wireless phone.
- One or more application programs 766 may be loaded into the memory 762 and run on or in association with the operating system 764.
- Examples of the application programs include phone dialer programs, e-mail programs, personal information management (PIM) programs, word processing programs, spreadsheet programs, Internet browser programs, messaging programs, and so forth.
- the system 702 also includes a non-volatile storage area 768 within the memory 762 .
- the non-volatile storage area 768 may be used to store persistent information that should not be lost if the system 702 is powered down.
- the application programs 766 may use and store information in the non-volatile storage area 768 , such as e-mail or other messages used by an e-mail application, and the like.
- a synchronization application (not shown) also resides on the system 702 and is programmed to interact with a corresponding synchronization application resident on a host computer to keep the information stored in the non-volatile storage area 768 synchronized with corresponding information stored at the host computer.
- other applications may be loaded into the memory 762 and run on the mobile computing device 700 , including instructions for providing and operating a digital assistant computing platform.
- the system 702 has a power supply 770 , which may be implemented as one or more batteries.
- the power supply 770 might further include an external power source, such as an AC adapter or a powered docking cradle that supplements or recharges the batteries.
- the system 702 may also include a radio interface layer 772 that performs the function of transmitting and receiving radio frequency communications.
- the radio interface layer 772 facilitates wireless connectivity between the system 702 and the “outside world,” via a communications carrier or service provider. Transmissions to and from the radio interface layer 772 are conducted under control of the operating system 764 . In other words, communications received by the radio interface layer 772 may be disseminated to the application programs 766 via the operating system 764 , and vice versa.
- the visual indicator 620 may be used to provide visual notifications, and/or an audio interface 774 may be used for producing audible notifications via the audio transducer 625 .
- the visual indicator 620 is a light emitting diode (LED) and the audio transducer 625 is a speaker.
- the LED may be programmed to remain on indefinitely until the user takes action to indicate the powered-on status of the device.
- the audio interface 774 is used to provide audible signals to and receive audible signals from the user.
- the audio interface 774 may also be coupled to a microphone to receive audible input, such as to facilitate a telephone conversation.
- the microphone may also serve as an audio sensor to facilitate control of notifications, as will be described below.
- the system 702 may further include a video interface 776 that enables an operation of an on-board camera 630 to record still images, video stream, and the like.
- a mobile computing device 700 implementing the system 702 may have additional features or functionality.
- the mobile computing device 700 may also include additional data storage devices (removable and/or non-removable) such as magnetic disks, optical disks, or tape.
- additional storage is illustrated in FIG. 7 by the non-volatile storage area 768 .
- Data/information generated or captured by the mobile computing device 700 and stored via the system 702 may be stored locally on the mobile computing device 700 , as described above, or the data may be stored on any number of storage media that may be accessed by the device via the radio interface layer 772 or via a wired connection between the mobile computing device 700 and a separate computing device associated with the mobile computing device 700 , for example, a server computer in a distributed computing network, such as the Internet.
- data/information may be accessed via the mobile computing device 700 via the radio interface layer 772 or via a distributed computing network.
- data/information may be readily transferred between computing devices for storage and use according to well-known data/information transfer and storage means, including electronic mail and collaborative data/information sharing systems.
- FIG. 8 is a block diagram illustrating physical components (e.g., hardware) of a computing device 800 with which aspects of the disclosure may be practiced.
- the computing device components described below may have computer executable instructions for dynamically configuring operation monitors for a cloud-based application service.
- the computing device 800 may include at least one processing unit 802 and a system memory 804 .
- the system memory 804 may comprise, but is not limited to, volatile storage (e.g., random access memory), non-volatile storage (e.g., read-only memory), flash memory, or any combination of such memories.
- the system memory 804 may include an operating system 805 suitable for running one or more operation monitoring programs.
- the operating system 805 may be suitable for controlling the operation of the computing device 800 .
- embodiments of the disclosure may be practiced in conjunction with a graphics library, other operating systems, or any other application program, and are not limited to any particular application or system.
- This basic configuration is illustrated in FIG. 8 by those components within a dashed line 808 .
- the computing device 800 may have additional features or functionality.
- the computing device 800 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape.
- additional storage is illustrated in FIG. 8 by a removable storage device 809 and a non-removable storage device 810 .
- a number of program modules and data files may be stored in the system memory 804 .
- the program modules 806 (e.g., real-time code defect telemetry application 820) may perform processes including, but not limited to, the aspects described herein.
- telemetry analysis and comparison engine 811 may identify one or more pre-software update time series and compare those time series to one or more post-software update time series to determine whether new operational monitors should be generated, existing operational monitors should be modified, and/or existing operational monitors should be deleted.
- Dynamic monitor engine 813 may perform one or more operations associated with generating new monitors and modifying existing monitors for operations in a cloud-based application service.
- Log analysis engine 815 may perform one or more operations associated with analyzing operational logs from one or more time series and applying that information in machine learning models to create and/or modify monitors, and/or to determine what operational changes have been made via implementation of software updates to a cloud-based application service.
- Monitor allocation engine 817 may perform one or more operations associated with determining the system resources available to a cloud-based application service and/or a telemetry service, and setting operational monitoring criteria for each monitor in the system (e.g., how many time series are monitored for each operation, how long the time series are, etc.).
- embodiments of the disclosure may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors.
- embodiments of the disclosure may be practiced via a system-on-a-chip (SOC) where each or many of the components illustrated in FIG. 8 may be integrated onto a single integrated circuit.
- Such an SOC device may include one or more processing units, graphics units, communications units, system virtualization units and various application functionality all of which are integrated (or “burned”) onto the chip substrate as a single integrated circuit.
- the functionality described herein with respect to the capability of a client to switch protocols may be operated via application-specific logic integrated with other components of the computing device 800 on the single integrated circuit (chip).
- Embodiments of the disclosure may also be practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies.
- embodiments of the disclosure may be practiced within a general purpose computer or in any other circuits or systems.
- the computing device 800 may also have one or more input device(s) 812 such as a keyboard, a mouse, a pen, a sound or voice input device, a touch or swipe input device, etc.
- the output device(s) 814 such as a display, speakers, a printer, etc. may also be included.
- the aforementioned devices are examples and others may be used.
- the computing device 800 may include one or more communication connections 816 allowing communications with other computing devices 850 . Examples of suitable communication connections 816 include, but are not limited to, radio frequency (RF) transmitter, receiver, and/or transceiver circuitry; universal serial bus (USB), parallel, and/or serial ports.
- RF radio frequency
- USB universal serial bus
- Computer readable media may include computer storage media.
- Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, or program modules.
- the system memory 804 , the removable storage device 809 , and the non-removable storage device 810 are all computer storage media examples (e.g., memory storage).
- Computer storage media may include RAM, ROM, electrically erasable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture which can be used to store information and which can be accessed by the computing device 800 . Any such computer storage media may be part of the computing device 800 .
- Computer storage media does not include a carrier wave or other propagated or modulated data signal.
- Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media.
- modulated data signal may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal.
- communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.
- RF radio frequency
- FIG. 9 illustrates one aspect of the architecture of a system for processing data received at a computing system from a remote source, such as a personal/general computer 904 , tablet computing device 906 , or mobile computing device 908 , as described above.
- Content displayed at server device 902 may be stored in different communication channels or other storage types. For example, various documents may be stored using a directory service 922 , a web portal 924 , a mailbox service 926 , an instant messaging store 928 , or a social networking site 930 .
- the program modules 806 may be employed by a client that communicates with server device 902 , and/or the program modules 806 may be employed by server device 902 .
- the server device 902 may provide data to and from a client computing device such as a personal/general computer 904 , a tablet computing device 906 and/or a mobile computing device 908 (e.g., a smart phone) through a network 915 .
- a client computing device such as a personal/general computer 904 , a tablet computing device 906 and/or a mobile computing device 908 (e.g., a smart phone) through a network 915 .
- the computer system described above with respect to FIGS. 6-8 may be embodied in a personal/general computer 904 , a tablet computing device 906 and/or a mobile computing device 908 (e.g., a smart phone). Any of these embodiments of the computing devices may obtain content from the store 916 , in addition to receiving graphical data useable to be either pre-processed at a graphic-originating system, or post-processed at a receiving computing system.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Computer Hardware Design (AREA)
- Environmental & Geological Engineering (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computer Security & Cryptography (AREA)
- Debugging And Monitoring (AREA)
- Stored Programmes (AREA)
Abstract
Description
- As computing has increasingly moved toward the cloud, systems that support large numbers of users and the cloud-based applications that they utilize are constantly being modified. The infrastructure of cloud-based systems requires constant monitoring just to maintain, not to mention to update and add additional features and functionality. As new software builds are continuously added to the infrastructure of cloud-based systems through development rings, and eventually production environments, it becomes increasingly resource intensive and difficult to modify existing monitors and create new ones to keep up with the ever-changing builds.
- It is with respect to this general technical environment that aspects of the present technology disclosed herein have been contemplated. Furthermore, although a general environment has been discussed, it should be understood that the examples described herein should not be limited to the general environment identified in the background.
- This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description section. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Additional aspects, features, and/or advantages of examples will be set forth in part in the description which follows and, in part, will be apparent from the description or may be learned by practice of the disclosure.
- Non-limiting examples of the present disclosure describe systems, methods and devices for automatically identifying software modifications in cloud-based application services and dynamically configuring operational monitors in those environments. The monitors may be automatically generated, modified and/or deleted based on analysis of time series information from a telemetry service associated with or integrated in the cloud-based application service. The telemetry service may receive operational data, including operation logs, from operations that are executed by users of the cloud-based application service. A determination may be made that monitors should be generated, modified and/or deleted by automatically comparing pre-software update time series data to post-software update time series data. When new monitors are dynamically generated and/or modified, a dynamic monitor engine may determine appropriate monitoring techniques for each corresponding operation type, baseline operational ranges, and/or thresholds for flagging operations for further review. The dynamic monitor engine may apply one or more machine learning models in making these determinations and setting these ranges and thresholds. The dynamic monitor engine may apply these models in the context of the processing resources that are available to the cloud-based application service, thereby allocating operational analysis bandwidth for each monitor according to the resources available in the system.
- Non-limiting and non-exhaustive examples are described with reference to the following figures:
-
FIG. 1 is a schematic diagram illustrating an example distributed computing environment for dynamically configuring monitors for a cloud-based application service. -
FIG. 2 illustrates a basic flow diagram for dynamically configuring monitors for a cloud-based application service. -
FIG. 3 illustrates a graphical display of operation data for a cloud-based application service with a dynamically configured monitor applied to a quality of service metric. -
FIG. 4 illustrates a graphical display of operation data for a cloud-based application service with a dynamically configured monitor applied to a total unexpected failure metric. -
FIG. 5 is an exemplary method for dynamically configuring monitors for a cloud-based application service. -
FIGS. 6 and 7 are simplified diagrams of a mobile computing device with which aspects of the disclosure may be practiced. -
FIG. 8 is a block diagram illustrating example physical components of a computing device with which aspects of the disclosure may be practiced. -
FIG. 9 is a simplified block diagram of a distributed computing system in which aspects of the present disclosure may be practiced. - Various embodiments will be described in detail with reference to the drawings, wherein like reference numerals represent like parts and assemblies throughout the several views. Reference to various embodiments does not limit the scope of the claims attached hereto. Additionally, any examples set forth in this specification are not intended to be limiting and merely set forth some of the many possible embodiments for the appended claims.
- The various embodiments and examples described above are provided by way of illustration only and should not be construed to limit the claims attached hereto. Those skilled in the art will readily recognize various modifications and changes that may be made without following the example embodiments and applications illustrated and described herein, and without departing from the true spirit and scope of the claims.
- Examples of the disclosure provide systems, methods, and devices for dynamically creating monitors for software implemented in cloud-based application services. In examples, the monitors may provide mechanisms for identifying code regressions in the implemented software and/or other service functionality loss (e.g., server issues, network problems, etc.). The code regressions may be included in new software packages, updates, and/or patches, for example. In some examples, the code regressions may be associated with one or more cloud-based applications, such as cloud-based document processing applications, spreadsheet applications, calendar applications, presentation applications, storage applications, video applications, real-time electronic messaging applications, voice messaging applications, video communication applications, and/or email applications.
- In some examples, a monitor may analyze signals associated with operation failures related to one or more cloud-based applications. For example, when an operation for a cloud-based application fails, and/or an operation for a cloud-based application causes an application crash or malfunction when it is performed, a signal indicating that there was an operation event, or operation failure, may be reported to the monitor and/or a telemetry service associated with the monitor. A monitor may receive and analyze data associated with execution of one or more operations of applications hosted by a cloud-based application service. For example, the monitor may analyze data associated with “save” operations, “send” operations, “new document” operations, “copy” operations, “paste” operations, and any other operation that may be performed by cloud-based applications.
- A telemetry service may receive and store information associated with operations that have been executed by applications hosted by a cloud-based service. The telemetry service may store information about each executed operation including: a time that each operation was executed; an identity of each operation that was executed; a duration of time that each operation took to complete or time out; whether each operation was successful or unsuccessful; an indication of whether a monitor is receiving data associated with each operation; a name of each monitor that is associated with each operation; a server or server farm that executed each operation, etc. Thus, the telemetry service may maintain a continuous time series of operational data that includes a number value of successfully executed operations, and a number value of unsuccessfully executed operations. This information can be utilized by monitors to flag operations that may be related to code regressions and/or other issues such as network and/or hardware problems.
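- For illustration only, the following minimal Python sketch models the kind of per-operation record and rolled-up time series described above; the class name, field names, and five-minute bucketing are assumptions made for this example and do not appear in the disclosure.

```python
# Illustrative sketch only: OperationEvent and bucket_counts are hypothetical names.
from dataclasses import dataclass
from collections import defaultdict
from datetime import datetime
from typing import Optional


@dataclass
class OperationEvent:
    """One executed operation as a telemetry service might log it."""
    name: str                    # identity of the operation (e.g., "save")
    timestamp: datetime          # time the operation was executed
    duration_ms: float           # time the operation took to complete or time out
    success: bool                # successful vs. unsuccessful execution
    monitored: bool              # whether a monitor receives data for this operation
    monitor_name: Optional[str]  # name of the associated monitor, if any
    server_farm: str             # server or server farm that executed the operation


def bucket_counts(events, interval_minutes=5):
    """Roll raw events up into a time series of success/failure counts per interval."""
    series = defaultdict(lambda: {"success": 0, "failure": 0})
    for e in events:
        # truncate the timestamp to the enclosing interval boundary
        minute = (e.timestamp.minute // interval_minutes) * interval_minutes
        bucket = e.timestamp.replace(minute=minute, second=0, microsecond=0)
        series[bucket]["success" if e.success else "failure"] += 1
    return dict(series)
```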
- According to examples, when a software update is pushed to the cloud-based application service (e.g., one or more applications hosted by the service are modified), a telemetry analysis and comparison engine associated with the cloud-based application service may compare one or more time series of operational data from a duration of time just prior to the software update, with a time series corresponding to a time after the software update has been implemented into the service. The telemetry analysis and comparison engine may inspect the operation logs for each of the time series and compare them with one another to determine which operations were present in the pre-software update time series and which operations are present in the post-software update time series. In this manner, a determination can be made as to which operations in the post-software update time series have been added, modified and/or deleted compared with the pre-software update time series. For any operations that have been added via a software update, a dynamic monitor engine may generate a new monitor for flagging potential issues associated with those newly added operations. For any operations that have been modified via a software update, the dynamic monitor engine may modify an existing monitor for flagging potential issues associated with those modified operations. For any operations that have been deleted via a software update, the dynamic monitor engine may delete each corresponding monitor that was present in the pre-software update time series.
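- A hedged sketch of the pre-update/post-update comparison described above is shown below; representing each operation by a simple signature (e.g., a build version) is an assumption made for the example, since the disclosure describes the comparison in terms of operation logs.

```python
# Hypothetical sketch of the pre/post comparison; the "signature" idea is an assumption.
def diff_operations(pre_ops: dict, post_ops: dict):
    """Compare operation inventories from a pre-update and a post-update time series.

    pre_ops/post_ops map operation name -> a signature (e.g., build or schema version).
    Returns the operations that appear to have been added, deleted, or modified.
    """
    added = set(post_ops) - set(pre_ops)
    deleted = set(pre_ops) - set(post_ops)
    modified = {name for name in set(pre_ops) & set(post_ops)
                if pre_ops[name] != post_ops[name]}
    return added, deleted, modified


# Example: one operation disappears, one appears, one changes between time series.
added, deleted, modified = diff_operations(
    {"operation1": "v1", "operation3": "v1", "operation9": "v1"},
    {"operation1": "v1", "operation6": "v1", "operation9": "v2"},
)
```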
- In dynamically generating new monitors and modifying existing monitors, the dynamic monitor engine may analyze time series operational data associated with the operations that the monitors are being generated and modified for. For example, when a new monitor is being generated for a new operation, the dynamic monitor engine may analyze operational data including operation logs for that new operation for a duration of time. That is, the dynamic monitor may analyze at least one time series for the operation. The dynamic monitor may utilize one or more machine learning models to determine a baseline that can be utilized by the monitor. The baseline may relate to a number of successfully or unsuccessfully executed operations over time, a percentage of successfully executed operations over time, and/or a duration of time that each operation in a set amount of time took to complete (i.e., latency). The dynamic monitor engine may utilize time series data to identify a normalized range for one or more of those metrics. For example, the dynamic monitor engine may determine that a baseline from 0-1000 unexpected failures per every five minutes should be established for a first operation; that a baseline between 50 milliseconds and 150 milliseconds for completing the operation should be established; and/or that a baseline of between 90-100% successfully executed operation requests should be established for the operation. These are simply examples and it should be understood that various machine learning models applied to different datasets may dynamically establish different normalized baselines. Additionally, each baseline that is established may be dynamic or static. That is, for dynamic baselines, the baseline may vary based on a variety of factors, including time of day, day of the week, month, and other contextual data. Alternatively, static baselines may remain constant regardless of the contextual data associated with them.
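- As a rough, non-authoritative illustration of the baseline step described above, the sketch below derives a normalized range from observed time series values using a simple mean-and-standard-deviation rule, with a per-hour variant standing in for a dynamic baseline; the dynamic monitor engine is described as using machine learning models, so this is only a simplified stand-in with illustrative constants.

```python
# Simplified stand-in for the baseline step; k and the hour-of-day grouping are illustrative.
import statistics


def derive_static_baseline(values, k=3.0, floor=0.0):
    """Return (low, high) bounds covering the typical range of a metric."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return max(floor, mean - k * std), mean + k * std


def derive_dynamic_baseline(timestamped_values, k=3.0):
    """One baseline per hour of day, so the normalized range can vary over the day."""
    by_hour = {}
    for ts, value in timestamped_values:
        by_hour.setdefault(ts.hour, []).append(value)
    return {hour: derive_static_baseline(vals, k) for hour, vals in by_hour.items()}
```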
- In some examples, in generating and/or modifying monitors, the dynamic monitor engine may determine a threshold outside of the baseline for flagging an operation for further analysis and/or review. In other examples, the thresholds may be manually set. For example, a threshold may be set for a number of operational failures received per X minutes over a baseline number of operational failures received per X minutes. In another example, a threshold may be set for a percentage of operational successes per X minutes under the lower bounds of the baseline percentage that an operation must fall below for the monitor to flag that operation. In still another example, a threshold may be set for a latency metric such that a certain duration of time over an upper baseline latency must be met for operations for the monitor to flag that operation. In additional examples, an operation may not be flagged unless a time series for that operation exceeds two or more thresholds (e.g., two or more of a latency threshold, an unsuccessful operation threshold, and/or a percentage of successes threshold).
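- The flagging rule described above can be hedged into the following sketch, in which an operation is flagged only when a minimum number of metrics breach their thresholds within the monitored time window; the metric names, baseline bounds, and margins in the usage example are illustrative (they echo the 85-100 percent and 0-1750 failure ranges used later in the FIG. 3 and FIG. 4 examples) and are not prescribed by the disclosure.

```python
# Illustrative flagging sketch; all metric names and numeric values are assumptions.
def should_flag(window, baselines, thresholds, min_breaches=2):
    """window: metric name -> observed value for the current time window.
    baselines: metric name -> (low, high) normal range.
    thresholds: metric name -> extra margin beyond the baseline before flagging."""
    breaches = 0
    for metric, value in window.items():
        low, high = baselines[metric]
        margin = thresholds.get(metric, 0)  # a zero margin flags any excursion past the baseline
        if value > high + margin or value < low - margin:
            breaches += 1
    return breaches >= min_breaches


flagged = should_flag(
    {"failures_per_5min": 3200, "latency_ms": 180, "success_pct": 76.0},
    baselines={"failures_per_5min": (0, 1750), "latency_ms": (50, 150), "success_pct": (85, 100)},
    thresholds={"failures_per_5min": 1250, "latency_ms": 25, "success_pct": 7.5},
)  # True: all three metrics breach their thresholds in this illustrative window
```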
- The systems, methods, and devices described herein provide technical advantages for detecting code regressions and issues that impact functionality in cloud-based application services. Memory and processing costs (CPU cycles) associated with accurately detecting code regressions in software builds for large cloud-based application services are reduced by the dynamic and automated nature of the described mechanisms. Additionally, time and human resources that would otherwise be needed to create, modify, and/or delete monitors for such services are reduced significantly. The described mechanisms provide a way for cloud-based services to automatically generate, modify and delete operational monitors on the fly such that when new software updates are rolled out to those services, the operations can be monitored for code regressions immediately and without taking up human developer time and resources. Even with robust developer pools to draw from, without the currently described mechanisms, it is difficult if not impossible to keep monitors current for continuously changing software in large cloud-based application services. The dynamic nature of the described mechanisms also provides a way for each monitor to take the processing resources available to the system into account when determining how many time series should be analyzed for each operation, how long each analyzed time series should be, how many operations should and can be monitored based on the constraints of the system, as well as what baselines and thresholds should be applied for flagging each operation for further review (e.g., if fewer resources are available baselines and/or thresholds may be more generous).
-
FIG. 1 is a schematic diagram illustrating an example distributed computing environment 100 for dynamically configuring monitors for a cloud-based application service. Distributed computing environment 100 includes service modification sub-environment 101, network and processing sub-environment 136, and dynamic monitors sub-environment 130. Network and processing sub-environment 136 comprises network 140, via which any of the computing devices in distributed computing environment 100 may communicate with one another; server computing device 138; and time series data/event logs data storage 142.
- A monitoring service that monitors operational data from one or more cloud-based applications may reside on one or more computing devices in network and processing sub-environment 136, and the monitoring service may receive operational data from a telemetry service. In some examples, the monitoring service and the telemetry service may be different services. In other examples, the monitoring service and the telemetry service may be the same service. The telemetry service may receive operational data (e.g., operational success counts, operational failure counts, operation latency data, etc.), which can be utilized by one or more monitors of the monitoring service. For example, one or more cloud-based applications may report operational errors to the telemetry service, and monitors of the monitoring service may determine whether drops in quality of service associated with one or more cloud-based applications correspond to code regressions in new or modified operations included in new software builds, server issues and/or network problems. In some examples, operational successes and errors may be automatically reported to a telemetry database by the cloud-based applications when they occur. In other examples, an opt-in system may exist such that, at least in the production environments, users must opt in to allow the operational data to be automatically reported to the telemetry database.
-
Service modification sub-environment 101 includes application development sub-environment 128 and application sub-environment 102. There are three development teams in application development sub-environment 128 (although there could be more or fewer). Team A 132 is responsible for creating, modifying and/or maintaining software and software updates for application A 104. Team B 134 is responsible for creating, modifying and/or maintaining software and software updates for application B 112. Team C 136 is responsible for creating, modifying and/or maintaining software and software updates for application C 120. In this example, each of the development teams has rolled out a software update for its respective application, which is illustrated by software update 127, which is pushed to those respective applications via the cloud-based application service operating in network and processing sub-environment 136.
- In the illustrated example, the previous build of application A 104 prior to implementation of software update 127 included operation1 106, operation2 108 and operation3 110. Operation3 110 is shown with dashed lines around it to illustrate that it has been removed from application A 104 via software update 127. The previous build of application B 112 prior to implementation of software update 127 included operation4 114 and operation5 116. Operation6 118 has been added to application B 112 via implementation of software update 127. Application C 120 prior to implementation of software update 127 included operation7 122, operation8 124 and operation9 126. Operation9 126 is shown with dotted lines around it to illustrate that it has been modified from a previous version via implementation of software update 127.
- A telemetry service operating in network and processing sub-environment may store time series data and event logs for operations associated with the cloud-based application service in time series data/event logs database 136. For example, the telemetry service may receive operational data for a plurality of operations associated with one or more cloud-based applications, and store that information in addition to the time that the information was received and/or that each corresponding operational event occurred. One or more monitors associated with the cloud-based application service may analyze operational time series data from the telemetry service to determine whether code regressions exist in software for one or more of the applications hosted by the cloud-based application service.
- In dynamic monitors sub-environment 130, old monitors 132 correspond to each monitor that was monitoring telemetry data for the cloud-based application service prior to implementation of software update 127. That is, monitor M1 106* monitored telemetry data for operation1 106 for application A 104; monitor M2 108* monitored telemetry data for operation2 108 for application A 104; monitor M3 110* monitored telemetry data for operation3 110 for application A 104; monitor M4 114* monitored telemetry data for operation4 114 for application B 112; monitor M5 116* monitored telemetry data for operation5 116 for application B 112; monitor M7 122* monitored telemetry data for operation7 122 for application C 120; monitor M8 124* monitored telemetry data for operation8 124 for application C; and monitor M9 126* monitored telemetry data for operation9 126 for application C.
- A telemetry analysis and comparison engine associated with the telemetry service and/or monitor service may analyze time series data and event logs from operational events that took place in a time series prior to implementation of software update 127 and compare that data with time series data and event logs from operational events that took place after implementation of software update 127. The time series data and/or event logs may include a name of each operation that was executed in relation to the cloud-based application service, an indication of whether each operational event was successful or unsuccessful, a duration of time that each operation took to complete and/or time out, a designation of whether a monitor exists for monitoring each operation, and/or a designation of a specific monitor that is monitoring each operation if such a monitor exists. In comparing the time series data and event logs from the pre-software update time series and the post-software update time series, the telemetry analysis and comparison engine may make a determination as to which operations have been deleted, added and/or modified by implementation of software update 127. Additionally, the telemetry analysis and comparison engine may make a determination based on the time series comparison as to which operations in the post-software update time series have existing monitors, which operations in the post-software update time series need to be updated based on operations being modified via implementation of software update 127, which operations in the post-software update time series need new monitors to be created for them because they have been added via implementation of software update 127, and/or whether and which monitors need to be deleted because operations have been deleted via implementation of software update 127.
- Once a determination has been made as to which operations have been modified, deleted and/or added via software update 127, a dynamic monitor engine may modify one or more existing monitors, delete one or more existing monitors and/or generate one or more new monitors. That is, for each operation that was deleted via software update 127 for which a monitor existed, the dynamic monitor engine may delete that monitor; for each new operation that was added via software update 127, the dynamic monitor engine may generate a new monitor; and for each operation that was modified via software update 127 for which a monitor existed, the dynamic monitor engine may modify the corresponding monitor.
- In modifying and/or generating new monitors, the dynamic monitor engine may apply one or more machine learning models (e.g., Holt Winters, principal component analysis, etc.) to telemetry datasets to determine one or more baselines for operation success or failure levels and/or counts, and/or baselines for operation execution latency. For example, one or more machine learning models may be applied to one or more post-software update time series of operational data and determine a baseline that success, error and/or latency datapoints from that time series fall into. In some examples, the baseline may be dynamic in that it changes based on one or more factors (e.g., time of day, date, month, context, etc.). In other examples, the baseline may be static in that it does not change regardless of the factors that are associated with it. For example, a dynamic baseline may be generated for an operation failure count and/or percentage of operational failures where a number of operational failures and/or percentage of operational failures increase during various times of day and/or night. Alternatively, a static baseline may be generated for an operation failure count and/or percentage of operational failures where a number and/or percentage of operational failures remain relatively static over the course of a day.
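- Because the paragraph above names Holt Winters as one example model, the following hedged sketch shows how a seasonal Holt-Winters fit (here via the statsmodels library) could produce a time-varying baseline band for a failure-count series; the three-sigma band width and the assumption of 288 five-minute buckets per day are illustrative choices rather than requirements of the disclosure, and the input series must span at least two seasonal cycles for the seasonal fit to be meaningful.

```python
# Hedged stand-in for the Holt Winters option mentioned above; constants are illustrative.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing


def holt_winters_baseline(failure_counts, seasonal_periods=288, horizon=288, k=3.0):
    """Fit an additive seasonal Holt-Winters model to a failure-count time series and
    return a dynamic (time-varying) baseline band for the next `horizon` buckets."""
    observed = np.asarray(failure_counts, dtype=float)
    fit = ExponentialSmoothing(
        observed, trend="add", seasonal="add", seasonal_periods=seasonal_periods
    ).fit()
    residual_std = np.std(observed - fit.fittedvalues)
    expected = fit.forecast(horizon)
    lower = np.maximum(0.0, expected - k * residual_std)  # failure counts cannot be negative
    upper = expected + k * residual_std
    return lower, upper
```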
- Once a baseline has been determined for an operation, a threshold value outside of that baseline may be identified for flagging time series data that exceeds that threshold. For example, a monitor for an operation may identify a threshold number of operational failures outside of a baseline failure number that a time series must exceed over a threshold duration of time (i.e., “time window”) for the monitor to flag the time series as potentially relating to a code regression or otherwise being potentially problematic. In another example, a monitor for an operation may identify a threshold percentage of operational failures outside of a baseline failure percentage that a time series must exceed over a threshold duration of time for the monitor to flag the time series as potentially relating to a code regression or otherwise being potentially problematic. In another example, a monitor for an operation may identify a threshold duration of time outside of a baseline latency duration that operations for a time series must exceed for the monitor to flag the time series as potentially relating to a code regression or otherwise being potentially problematic. The threshold values may be automatically determined based on one or more machine learning models and/or the threshold values may be manually selected. Like the baselines, the threshold values may be dynamic or static. In some examples, the thresholds may be set to zero (i.e., any readings over or under a baseline may result in operation flagging).
- Additional information regarding the baseline and threshold generation and application of the same is provided herein in relation to
FIG. 3 and FIG. 4.
- In dynamic monitors sub-environment 130, new monitors 134 correspond to the monitors that exist for operations post-incorporation of software update 127. Specifically, while monitors M1 106*, M2 108*, M4 114*, M5 116*, M7 122* and M8 124* remain post-software update 127 because their corresponding operations have not been modified or deleted, monitor M3 110* has been deleted (as indicated by the dashed line surrounding it) because its corresponding operation (operation3 110) has been deleted via software update 127, monitor M9 126* has been modified (as indicated by the dotted line surrounding it) because its corresponding operation (operation9 126) has been dynamically modified via software update 127, and monitor M6 118* has been dynamically generated by the dynamic monitor engine because its corresponding operation (operation6 118) has been added to application B 112 via software update 127. -
FIG. 2 illustrates a basic flow diagram 200 for dynamically configuring monitors for a cloud-based application service. Flow diagram 200 includes application development sub-environment 228 and software update 227. Application sub-environment 228 includes three development teams (team A, team B and team C), which are all software development teams that create and update software for the cloud-based application service. In this example, the teams have rolled out a new software update (software update 227) which is being applied to the cloud-based application service. The software update has been applied to one or more cloud-based applications, and time series data for the applications prior to and after the software update has been stored in time series data/event logs database 202.
- Log datamining element 204 comprises the analysis and comparison of operation logs from one or more pre-software update time series and one or more post-software update time series. A comparison engine may identify which operations and corresponding monitors existed in the cloud-based application service and/or telemetry service pre-software update and compare those operations and monitors to operations and monitors that exist in the cloud-based application service and/or telemetry service post-software update.
- The result of the log datamining is illustrated by comparison results element 205. Comparison results element 205 includes old operations 206 (i.e., operations that existed pre-software update and/or that still exist from the previous software build post-software update), new operations 208 (i.e., operations that have been added via software update 227), and modified operations 210 (i.e., operations that existed pre-software update and that have been modified via incorporation of software update 227).
- New monitor generation element 212 illustrates the generation of new monitors that are dynamically created via application of machine learning models to time series data for operations that are new to the cloud-based application service based on implementation of software update 227. New monitor generation element 212 also illustrates the deletion of existing monitors that are no longer useful due to deletion of operations from the cloud-based application service based on their removal via software update 227. Additionally, new monitor generation element 212 illustrates the modification of existing monitors that are modified to accurately flag code regressions based on the corresponding operations that have been modified via software update 227. -
FIG. 3 illustrates a graphical display 302 of operation data for a cloud-based application service with a dynamically configured monitor applied to a quality of service metric. The data corresponding to graphical display 302 may be associated with a telemetry service that receives and stores operational event data and operation event logs for the cloud-based application service. The telemetry service may keep logs of each operation that has been initiated by a user of a cloud-based application hosted on the service. The logs may include an identity of each operation that was initiated, whether each initiated operation was successfully or unsuccessfully executed, a duration of time that each operation took to complete or time out, an identity of a server or server farm that executed each request associated with each operation, an indication of whether a monitor is associated with each initiated operation, and/or an identity of each monitor that is associated with each initiated operation.
- The telemetry service may take the raw operational event data that it collects/receives and generate one or more graphs of the data, such as the graph shown on graphical display 302. The graph in FIG. 3 is a quality of service graph (for the cloud-based application service) that illustrates a percentage of initiated operations of a specific operation type that have been executed successfully (on the Y-axis) over a duration of time (on the X-axis). In this example, the dynamic monitor engine has automatically generated a new monitor for the specific operation that is represented in the graph. In doing so, the dynamic monitor engine has determined and set a baseline percentage 304 corresponding to a successful operation range for the specific operation that the operation typically falls within. This range, which is illustrated by the diagonal lined rectangle of the baseline percentage 304 from 85 percent to 100 percent, is shown as being static for the duration of time included in the graph. However, it should be understood that the baseline may be dynamic or static. For example, the illustrated baseline may relate to a day-time normalized range for the specific operation, and the baseline may change for an evening-time normalized range for the specific operation.
- In this example, a threshold 308 has been identified and set either by a machine learning mechanism or a manual interaction with the telemetry service. Threshold 308 corresponds to a percentage outside of the baseline that the quality of service has to drop to for the telemetry service to flag the specific operation as potentially being an issue (e.g., the specific operation potentially relating to a code regression). In this example, the threshold 308 has been set to 77.5 percent (i.e., 7.5 percent below the lower bounds of baseline percentage 304). Thus, in this example, the monitor for the specific operation would flag the portion 306 of the graph that has fallen below threshold 308. Although the specific operation has been flagged due to identification of the portion 306 of the graph falling below threshold 308, that time series data and/or time series data surrounding that time series data may be provided for analysis of the flagged issue. -
FIG. 4 illustrates a graphical display 402 of operation data for a cloud-based application service with a dynamically configured monitor applied to a total unexpected failure metric. The data corresponding to graphical display 402 may be associated with a telemetry service that receives and stores operational event data and operation event logs for the cloud-based application service. The telemetry service may keep logs of each operation that has been initiated by a user of a cloud-based application hosted on the service. The logs may include an identity of each operation that was initiated, whether each initiated operation was successfully or unsuccessfully executed, a duration of time that each operation took to complete or time out, an identity of a server or server farm that executed each request associated with each operation, an indication of whether a monitor is associated with each initiated operation, and/or an identity of each monitor that is associated with each initiated operation.
- The telemetry service may take the raw operational event data that it collects/receives and generate one or more graphs of the data, such as the graph shown on graphical display 402. The graph in FIG. 4 is an unexpected failure graph (for the cloud-based application service) that illustrates a number of initiated operations of a specific operation type that have resulted in an unexpected failure (on the Y-axis) over a duration of time (on the X-axis). In this example, the dynamic monitor engine has automatically generated a new monitor for the specific operation that is represented in the graph. In doing so, the dynamic monitor engine has determined and set a baseline number range 404 of unexpected failures for the specific operation. This spread of numbers corresponds to a range of unexpected failures for the specific operation that the specific operation typically falls within. This range, which is illustrated by the diagonal lined rectangle on graphical display 402, has been set to 0 unexpected failures for the specific operation to 1750 unexpected failures for the specific operation. This range may have been identified based on application of one or more machine learning models to time series data for the specific operation. In this example, the range is a static range; however, it should be understood that the baseline may be dynamic or static. For example, the illustrated baseline may relate to a day-time normalized range for the specific operation, and the baseline may change for an evening-time normalized range for the specific operation (e.g., more or fewer users may utilize the specific operation during certain hours compared with certain other hours).
- In this example, a threshold 408 has been identified and set either by application of a machine learning model to one or more time series for the specific operation or a manual interaction with the telemetry service. Threshold 408 corresponds to a minimum number of total unexpected failures above the baseline that needs to be reached in a time window of a time series for the telemetry service to flag the specific operation as potentially being an issue (e.g., the specific operation potentially relating to a code regression). In this example, the threshold 408 has been set to 3000 unexpected failures (i.e., 1250 unexpected errors more than the upper bounds of the baseline). Thus, in this example, the monitor for the specific operation would flag the portion 406 of the graph that is above threshold 408. Although the specific operation has been flagged due to identification of the portion 406 of the graph that is above threshold 408, that time series data and/or time series data surrounding that time series data may be provided for analysis of the flagged issue. -
FIG. 5 is an exemplary method 500 for dynamically configuring monitors for a cloud-based application service. The method 500 begins at a start operation and flow moves to operation 502.
- At operation 502, telemetry data for a plurality of operations for the cloud-based application service is analyzed. The analysis may comprise comparing a first time series with a second time series, where the data from the second time series relates to operations that were executed prior in time compared with execution of operations related to the first time series. Each analyzed time series may comprise operational event data from operations associated with the cloud-based application service. That operational event data may include an identity/name of each operation that was initiated, executed, and/or completed; a time of occurrence that each operation was initiated, executed, and/or completed; an indication of whether a monitor exists for each operation that was initiated, executed, and/or completed; an identity of each monitor that exists for each operation that was initiated, executed, and/or completed; a duration of time that each operation took to complete and/or time out; and/or an identity of a server or server farm that attempted execution of each operation. In some examples, the comparison of the two time series may be automatically initiated based on an indication that a software update has been provided to the cloud-based application service. The second time series may correspond to a duration of time prior to incorporation of the software update in the cloud-based application service, and the first time series may correspond to a duration of time after incorporation of the software update in the cloud-based application service.
- From operation 502, flow continues to operation 504 where one or more operational changes in the cloud-based application service are identified. The identification is based on the time series comparison performed at operation 502. The identification is made by determining whether operations have been added, deleted and/or modified based on the software update to the cloud-based application service. For example, when a software update is incorporated into the cloud-based application service, one or more operations that were included in one or more applications hosted by the cloud-based application service prior to the software update may be modified, added, or deleted based on the incorporation of that software update.
- From operation 504, flow continues to operation 506 where at least one telemetry monitor is dynamically configured based on the one or more operational changes that were identified at operation 502. The dynamic configuration may be performed by a dynamic monitor engine and application of one or more machine learning models to time series data as discussed herein. The dynamic configuration may comprise selecting an appropriate monitoring technique to detect failure patterns for each new operation associated with the one or more operational changes. Dynamically configuring may additionally or alternatively comprise automatically defining a baseline failure rate for each new operation associated with the one or more operational changes and/or defining a threshold failure rate from the baseline failure rate for each new operation associated with the one or more operational changes. In some examples, dynamically configuring the at least one telemetry monitor may include automatically defining a time series window. The time series window may comprise operations executed in a duration of time that each telemetry monitor for each new operation associated with the one or more operational changes will monitor. The dynamic monitor engine may configure monitors to have static and/or dynamic baselines and/or thresholds.
- In some examples, a number of the one or more new telemetry monitors for generation may be automatically determined based on available bandwidth for monitoring operations in the cloud-based application service. In additional examples, a number of time series and/or duration of each time series that is monitored may be automatically determined based on available bandwidth for monitoring operations in the cloud-based application service. For example, the processing resources available to the telemetry service may dictate that only a set number of time series may be analyzed over a set duration of time. Therefore, when the monitors are dynamically created, the number of time series they process over that timeframe may be automatically calculated and built into each corresponding monitor.
- From operation 506, flow moves to an end operation and the method 500 ends. -
FIGS. 6 and 7 illustrate a mobile computing device 600, for example, a mobile telephone, a smart phone, wearable computer, a tablet computer, an e-reader, a laptop computer, AR compatible computing device, or a VR computing device, with which embodiments of the disclosure may be practiced. With reference to FIG. 6, one aspect of a mobile computing device 600 for implementing the aspects is illustrated. In a basic configuration, the mobile computing device 600 is a handheld computer having both input elements and output elements. The mobile computing device 600 typically includes a display 605 and one or more input buttons 610 that allow the user to enter information into the mobile computing device 600. The display 605 of the mobile computing device 600 may also function as an input device (e.g., a touch screen display). If included, an optional side input element 615 allows further user input. The side input element 615 may be a rotary switch, a button, or any other type of manual input element. In alternative aspects, mobile computing device 600 may incorporate more or fewer input elements. For example, the display 605 may not be a touch screen in some embodiments. In yet another alternative embodiment, the mobile computing device 600 is a portable phone system, such as a cellular phone. The mobile computing device 600 may also include an optional keypad 635. Optional keypad 635 may be a physical keypad or a "soft" keypad generated on the touch screen display. In various embodiments, the output elements include the display 605 for showing a graphical user interface (GUI), a visual indicator 620 (e.g., a light emitting diode), and/or an audio transducer 625 (e.g., a speaker). In some aspects, the mobile computing device 600 incorporates a vibration transducer for providing the user with tactile feedback. In yet another aspect, the mobile computing device 600 incorporates input and/or output ports, such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., a HDMI port) for sending signals to or receiving signals from an external device.
- FIG. 7 is a block diagram illustrating the architecture of one aspect of a mobile computing device. That is, the mobile computing device 700 can incorporate a system (e.g., an architecture) 702 to implement some aspects. In one embodiment, the system 702 is implemented as a "smart phone" capable of running one or more applications (e.g., browser, e-mail, calendaring, contact managers, messaging clients, games, and media clients/players). In some aspects, the system 702 is integrated as a computing device, such as an integrated personal digital assistant (PDA) and wireless phone.
- One or more application programs 766 may be loaded into the memory 762 and run on or in association with the operating system 764. Examples of the application programs include phone dialer programs, e-mail programs, personal information management (PIM) programs, word processing programs, spreadsheet programs, Internet browser programs, messaging programs, and so forth. The system 702 also includes a non-volatile storage area 768 within the memory 762. The non-volatile storage area 768 may be used to store persistent information that should not be lost if the system 702 is powered down. The application programs 766 may use and store information in the non-volatile storage area 768, such as e-mail or other messages used by an e-mail application, and the like. A synchronization application (not shown) also resides on the system 702 and is programmed to interact with a corresponding synchronization application resident on a host computer to keep the information stored in the non-volatile storage area 768 synchronized with corresponding information stored at the host computer. As should be appreciated, other applications may be loaded into the memory 762 and run on the mobile computing device 700, including instructions for providing and operating a digital assistant computing platform.
- The system 702 has a power supply 770, which may be implemented as one or more batteries. The power supply 770 might further include an external power source, such as an AC adapter or a powered docking cradle that supplements or recharges the batteries.
- The system 702 may also include a radio interface layer 772 that performs the function of transmitting and receiving radio frequency communications. The radio interface layer 772 facilitates wireless connectivity between the system 702 and the "outside world," via a communications carrier or service provider. Transmissions to and from the radio interface layer 772 are conducted under control of the operating system 764. In other words, communications received by the radio interface layer 772 may be disseminated to the application programs 766 via the operating system 764, and vice versa.
- The visual indicator 620 may be used to provide visual notifications, and/or an audio interface 774 may be used for producing audible notifications via the audio transducer 625. In the illustrated embodiment, the visual indicator 620 is a light emitting diode (LED) and the audio transducer 625 is a speaker. These devices may be directly coupled to the power supply 770 so that when activated, they remain on for a duration dictated by the notification mechanism even though the processor 760 and other components might shut down for conserving battery power. The LED may be programmed to remain on indefinitely until the user takes action to indicate the powered-on status of the device. The audio interface 774 is used to provide audible signals to and receive audible signals from the user. For example, in addition to being coupled to the audio transducer 625, the audio interface 774 may also be coupled to a microphone to receive audible input, such as to facilitate a telephone conversation. In accordance with embodiments of the present disclosure, the microphone may also serve as an audio sensor to facilitate control of notifications, as will be described below. The system 702 may further include a video interface 776 that enables an operation of an on-board camera 630 to record still images, video stream, and the like.
- A mobile computing device 700 implementing the system 702 may have additional features or functionality. For example, the mobile computing device 700 may also include additional data storage devices (removable and/or non-removable) such as, magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 7 by the non-volatile storage area 768.
- Data/information generated or captured by the mobile computing device 700 and stored via the system 702 may be stored locally on the mobile computing device 700, as described above, or the data may be stored on any number of storage media that may be accessed by the device via the radio interface layer 772 or via a wired connection between the mobile computing device 700 and a separate computing device associated with the mobile computing device 700, for example, a server computer in a distributed computing network, such as the Internet. As should be appreciated such data/information may be accessed via the mobile computing device 700 via the radio interface layer 772 or via a distributed computing network. Similarly, such data/information may be readily transferred between computing devices for storage and use according to well-known data/information transfer and storage means, including electronic mail and collaborative data/information sharing systems. -
FIG. 8 is a block diagram illustrating physical components (e.g., hardware) of a computing device 800 with which aspects of the disclosure may be practiced. The computing device components described below may have computer executable instructions for dynamically configuring operation monitors for a cloud-based application service. In a basic configuration, the computing device 800 may include at least one processing unit 802 and a system memory 804. Depending on the configuration and type of computing device, the system memory 804 may comprise, but is not limited to, volatile storage (e.g., random access memory), non-volatile storage (e.g., read-only memory), flash memory, or any combination of such memories. The system memory 804 may include an operating system 805 suitable for running one or more operation monitoring programs. The operating system 805, for example, may be suitable for controlling the operation of the computing device 800. Furthermore, embodiments of the disclosure may be practiced in conjunction with a graphics library, other operating systems, or any other application program and is not limited to any particular application or system. This basic configuration is illustrated in FIG. 8 by those components within a dashed line 808. The computing device 800 may have additional features or functionality. For example, the computing device 800 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 8 by a removable storage device 809 and a non-removable storage device 810.
- As stated above, a number of program modules and data files may be stored in the system memory 804. While executing on the processing unit 802, the program modules 806 (e.g., real-time code defect telemetry application 820) may perform processes including, but not limited to, the aspects, as described herein. According to examples, telemetry analysis and comparison engine 811 may identify one or more pre-software update time series and compare those time series to one or more post-software update time series to determine whether new operational monitors should be generated, existing operational monitors should be modified, and/or existing operational monitors should be deleted. Dynamic monitor engine 813 may perform one or more operations associated with generating new monitors and modifying existing monitors for operations in a cloud-based application service. Log analysis engine 815 may perform one or more operations associated with analyzing operational logs from one or more time series and applying that information in machine learning models to create and/or modify monitors, and/or to determine what operational changes have been made via implementation of software updates to a cloud-based application service. Monitor allocation engine 817 may perform one or more operations associated with determining the system resources available to a cloud-based application service and/or a telemetry service, and setting operational monitoring criteria for each monitor in the system (e.g., how many time series are monitored for each operation, how long the time series are, etc.).
FIG. 8 may be integrated onto a single integrated circuit. Such an SOC device may include one or more processing units, graphics units, communications units, system virtualization units and various application functionality all of which are integrated (or “burned”) onto the chip substrate as a single integrated circuit. When operating via an SOC, the functionality, described herein, with respect to the capability of client to switch protocols may be operated via application-specific logic integrated with other components of thecomputing device 800 on the single integrated circuit (chip). Embodiments of the disclosure may also be practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies. In addition, embodiments of the disclosure may be practiced within a general purpose computer or in any other circuits or systems. - The
- The computing device 800 may also have one or more input device(s) 812 such as a keyboard, a mouse, a pen, a sound or voice input device, a touch or swipe input device, etc. The output device(s) 814 such as a display, speakers, a printer, etc. may also be included. The aforementioned devices are examples and others may be used. The computing device 800 may include one or more communication connections 816 allowing communications with other computing devices 850. Examples of suitable communication connections 816 include, but are not limited to, radio frequency (RF) transmitter, receiver, and/or transceiver circuitry; universal serial bus (USB), parallel, and/or serial ports. - The term computer readable media as used herein may include computer storage media. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, or program modules. The
system memory 804, the removable storage device 809, and the non-removable storage device 810 are all computer storage media examples (e.g., memory storage). Computer storage media may include RAM, ROM, electrically erasable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture which can be used to store information and which can be accessed by the computing device 800. Any such computer storage media may be part of the computing device 800. Computer storage media does not include a carrier wave or other propagated or modulated data signal. - Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.
- FIG. 9 illustrates one aspect of the architecture of a system for processing data received at a computing system from a remote source, such as a personal/general computer 904, tablet computing device 906, or mobile computing device 908, as described above. Content displayed at server device 902 may be stored in different communication channels or other storage types. For example, various documents may be stored using a directory service 922, a web portal 924, a mailbox service 926, an instant messaging store 928, or a social networking site 930. The program modules 806 may be employed by a client that communicates with server device 902, and/or the program modules 806 may be employed by server device 902. The server device 902 may provide data to and from a client computing device such as a personal/general computer 904, a tablet computing device 906 and/or a mobile computing device 908 (e.g., a smart phone) through a network 915. By way of example, the computer system described above with respect to FIGS. 6-8 may be embodied in a personal/general computer 904, a tablet computing device 906 and/or a mobile computing device 908 (e.g., a smart phone). Any of these embodiments of the computing devices may obtain content from the store 916, in addition to receiving graphical data useable to be either pre-processed at a graphic-originating system, or post-processed at a receiving computing system. - Aspects of the present disclosure, for example, are described above with reference to block diagrams and/or operational illustrations of methods, systems, and computer program products according to aspects of the disclosure. The functions/acts noted in the blocks may occur out of the order as shown in any flowchart. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
- The description and illustration of one or more aspects provided in this application are not intended to limit or restrict the scope of the disclosure as claimed in any way. The aspects, examples, and details provided in this application are considered sufficient to convey possession and enable others to make and use the best mode of the claimed disclosure. The claimed disclosure should not be construed as being limited to any aspect, example, or detail provided in this application. Regardless of whether shown and described in combination or separately, the various features (both structural and methodological) are intended to be selectively included or omitted to produce an embodiment with a particular set of features. Having been provided with the description and illustration of the present disclosure, one skilled in the art may envision variations, modifications, and alternate aspects falling within the spirit of the broader aspects of the general inventive concept embodied in this application that do not depart from the broader scope of the claimed disclosure.
- The various embodiments described above are provided by way of illustration only and should not be construed to limit the claims attached hereto. Those skilled in the art will readily recognize various modifications and changes that may be made without following the example embodiments and applications illustrated and described herein, and without departing from the true spirit and scope of the following claims.
Claims (24)
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/351,426 US10785105B1 (en) | 2019-03-12 | 2019-03-12 | Dynamic monitoring on service health signals |
EP20714092.2A EP3938909B1 (en) | 2019-03-12 | 2020-03-05 | Dynamic monitoring on cloud based application service |
CN202080019950.2A CN113557499A (en) | 2019-03-12 | 2020-03-05 | Dynamically monitoring cloud-based application services |
PCT/US2020/021056 WO2020185477A1 (en) | 2019-03-12 | 2020-03-05 | Dynamic monitoring on cloud based application service |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/351,426 US10785105B1 (en) | 2019-03-12 | 2019-03-12 | Dynamic monitoring on service health signals |
Publications (2)
Publication Number | Publication Date |
---|---|
US20200295986A1 (en) | 2020-09-17 |
US10785105B1 US10785105B1 (en) | 2020-09-22 |
Family
ID=69960767
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/351,426 Active US10785105B1 (en) | 2019-03-12 | 2019-03-12 | Dynamic monitoring on service health signals |
Country Status (4)
Country | Link |
---|---|
US (1) | US10785105B1 (en) |
EP (1) | EP3938909B1 (en) |
CN (1) | CN113557499A (en) |
WO (1) | WO2020185477A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11782764B2 (en) | 2021-07-07 | 2023-10-10 | International Business Machines Corporation | Differentiated workload telemetry |
US11870663B1 (en) * | 2022-08-03 | 2024-01-09 | Tableau Software, LLC | Automated regression investigator |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7558985B2 (en) | 2006-02-13 | 2009-07-07 | Sun Microsystems, Inc. | High-efficiency time-series archival system for telemetry signals |
US7714702B2 (en) | 2007-06-04 | 2010-05-11 | The Boeing Company | Health monitoring system for preventing a hazardous condition |
US8429467B2 (en) | 2007-10-19 | 2013-04-23 | Oracle International Corporation | User-triggered diagnostic data gathering |
US7962797B2 (en) | 2009-03-20 | 2011-06-14 | Microsoft Corporation | Automated health model generation and refinement |
WO2012158432A2 (en) | 2011-05-09 | 2012-11-22 | Aptima Inc | Systems and methods for scenario generation and monitoring |
US8694835B2 (en) | 2011-09-21 | 2014-04-08 | International Business Machines Corporation | System health monitoring |
US9590880B2 (en) * | 2013-08-07 | 2017-03-07 | Microsoft Technology Licensing, Llc | Dynamic collection analysis and reporting of telemetry data |
US9893952B2 (en) | 2015-01-09 | 2018-02-13 | Microsoft Technology Licensing, Llc | Dynamic telemetry message profiling and adjustment |
US9626277B2 (en) * | 2015-04-01 | 2017-04-18 | Microsoft Technology Licensing, Llc | Anomaly analysis for software distribution |
US9712418B2 (en) * | 2015-05-26 | 2017-07-18 | Microsoft Technology Licensing, Llc | Automated network control |
US10454796B2 (en) * | 2015-10-08 | 2019-10-22 | Fluke Corporation | Cloud based system and method for managing messages regarding cable test device operation |
US10552282B2 (en) | 2017-03-27 | 2020-02-04 | International Business Machines Corporation | On demand monitoring mechanism to identify root cause of operation problems |
US10416660B2 (en) * | 2017-08-31 | 2019-09-17 | Rockwell Automation Technologies, Inc. | Discrete manufacturing hybrid cloud solution architecture |
US10742486B2 (en) * | 2018-01-08 | 2020-08-11 | Cisco Technology, Inc. | Analyzing common traits in a network assurance system |
US10805185B2 (en) * | 2018-02-14 | 2020-10-13 | Cisco Technology, Inc. | Detecting bug patterns across evolving network software versions |
US10664256B2 (en) * | 2018-06-25 | 2020-05-26 | Microsoft Technology Licensing, Llc | Reducing overhead of software deployment based on existing deployment occurrences |
- 2019
  - 2019-03-12 US US16/351,426 patent/US10785105B1/en active Active
- 2020
  - 2020-03-05 EP EP20714092.2A patent/EP3938909B1/en active Active
  - 2020-03-05 WO PCT/US2020/021056 patent/WO2020185477A1/en unknown
  - 2020-03-05 CN CN202080019950.2A patent/CN113557499A/en active Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230004473A1 (en) * | 2019-12-03 | 2023-01-05 | Siemens Industry Software Inc. | Detecting anomalous latent communications in an integrated circuit chip |
US11983087B2 (en) * | 2019-12-03 | 2024-05-14 | Siemens Industry Software Inc. | Detecting anomalous latent communications in an integrated circuit chip |
US20210264294A1 (en) * | 2020-02-26 | 2021-08-26 | Samsung Electronics Co., Ltd. | Systems and methods for predicting storage device failure using machine learning |
US11657300B2 (en) * | 2020-02-26 | 2023-05-23 | Samsung Electronics Co., Ltd. | Systems and methods for predicting storage device failure using machine learning |
US20230281489A1 (en) * | 2020-02-26 | 2023-09-07 | Samsung Electronics Co., Ltd. | Systems and methods for predicting storage device failure using machine learning |
Also Published As
Publication number | Publication date |
---|---|
WO2020185477A1 (en) | 2020-09-17 |
CN113557499A (en) | 2021-10-26 |
EP3938909A1 (en) | 2022-01-19 |
EP3938909B1 (en) | 2023-05-17 |
US10785105B1 (en) | 2020-09-22 |
Similar Documents
Publication | Title |
---|---|
EP3938909B1 (en) | Dynamic monitoring on cloud based application service |
US20140359593A1 (en) | Maintaining known dependencies for updates |
KR102589649B1 (en) | Machine learning decision-guiding techniques for alerts generated in monitoring systems |
US20150058681A1 (en) | Monitoring, detection and analysis of data from different services |
US20160070555A1 (en) | Automated tenant upgrades for multi-tenant services |
US11720461B2 (en) | Automated detection of code regressions from time-series data |
CN111581005B (en) | Terminal repairing method, terminal and storage medium |
US11650805B2 (en) | Reliable feature graduation |
US20180006881A1 (en) | Fault-Tolerant Configuration of Network Devices |
US20180121174A1 (en) | Centralized coding time tracking and management |
US11526898B2 (en) | Dynamic visualization of product usage tree based on raw telemetry data |
US20180069774A1 (en) | Monitoring and reporting transmission and completeness of data upload from a source location to a destination location |
US10761828B2 (en) | Deviation finder |
US20230385164A1 (en) | Systems and Methods for Disaster Recovery for Edge Devices |
US20230275816A1 (en) | Success rate indicator solution for pipelines |
US11826657B2 (en) | Game performance prediction from real-world performance data |
US10956307B2 (en) | Detection of code defects via analysis of telemetry data across internal validation rings |
US20180335940A1 (en) | Universal graphical user interface objects |
US11093237B2 (en) | Build isolation system in a multi-system environment |
CN112764957A (en) | Application fault delimiting method and device |
US20180150556A1 (en) | Auto-Generation Of Key-Value Clusters To Classify Implicit APP Queries and Increase Coverage for Existing Classified Queries |
US20240223479A1 (en) | Resource anomaly detection |
WO2023235041A1 (en) | Systems and methods for disaster recovery for edge devices |
Legal Events
Code | Title | Description |
---|---|---|
FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
AS | Assignment | Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:RATHINASABAPATHY, MANGALAM;NIGAM, RAHUL;MENON, VINOD;AND OTHERS;REEL/FRAME:048582/0857; Effective date: 20190312 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY; Year of fee payment: 4 |