US20170177468A1 - Anomaly Analysis For Software Distribution - Google Patents
- Publication number
- US20170177468A1 (application US 15/453,782)
- Authority
- US
- United States
- Prior art keywords
- software
- update
- data
- telemetry
- updates
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3692—Test management for test results analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3604—Software analysis for verifying properties of programs
- G06F11/3616—Software analysis for verifying properties of programs using software metrics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3684—Test management for test design, e.g. generating new test cases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3688—Test management for test execution, e.g. scheduling of test suites
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/60—Software deployment
- G06F8/65—Updates
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/064—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
- H04L41/0813—Configuration setting characterised by the conditions triggering a change of settings
- H04L41/082—Configuration setting characterised by the conditions triggering a change of settings the condition being updates or upgrades of network functionality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3409—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
- G06F11/3419—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/554—Detecting local intrusion or implementing counter-measures involving event detection and direct action
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/86—Event-based monitoring
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/88—Monitoring involving counting
Definitions
- In a short time period, a device might receive many software updates and might transmit many telemetry reports to a variety of telemetry collectors.
- a software distribution system might rapidly issue many different software updates to many different devices.
- as devices provide feedback telemetry about performance, crashes, stack dumps, execution traces, etc., many software components on the devices might be changing at around the same time. Therefore, it can be difficult for a software developer to use the telemetry feedback to decide whether a particular software update created or fixed any problems. If an anomaly is occurring on some devices, it can be difficult to determine whether any particular software update is implicated, any conditions under which an update might be linked to an anomaly, or what particular code-level changes in a software update are implicated. In short, high rates of software updating and telemetry reporting, perhaps by devices with varying architectures and operating systems, have made it difficult to find correlations between software updates (or source code changes) and anomalies manifested in telemetry feedback.
- a population of devices provides telemetry data and receives software changes or updates.
- Event buckets for respective events are found.
- Event buckets have counts of event instances, where each event instance is an occurrence of a corresponding event reported as telemetry by a device.
- Records of the software changes are provided, each change record representing a software change on a corresponding device.
- the event buckets are analyzed to identify which indicate an anomaly. Based on the change records and the identified event buckets, correlations between the software changes and the identified event buckets are found.
- FIG. 1 shows an example of a software ecosystem.
- FIG. 2 shows a device receiving updates and transmitting telemetry reports.
- FIG. 3 shows a graph that illustrates a global view of updates and telemetry feedback over time for a population of devices.
- FIG. 4 shows a general process for finding correlations between updates and anomalies.
- FIG. 5 shows a graph
- FIG. 6 shows an example of an update store.
- FIG. 7 shows an example of a telemetry store.
- FIG. 8 shows an example correlation engine
- FIG. 9 shows an example of an association between a source code file and a crash bucket.
- FIG. 10 shows a software architecture that can be used to implement multiple correlation engines for respective telemetry sources.
- FIG. 11 shows a client application that allows a user to provide feedback and display and navigate analysis output captured in the analysis database.
- FIG. 12 shows an example of an anomaly summary user interface.
- FIG. 13 shows an example of a computing device.
- FIG. 1 shows an example of a software ecosystem.
- Update services 100 provide software updates 102 to devices 104 .
- Telemetry collection services 106 collect telemetry reports 108 from the devices 104 via a network 110 .
- the shapes of the graphics representing the devices 104 portray different types of processors, such as ARM, x86, PowerPC™, Apple A4 or A5™, Qualcomm™, or others.
- the shading of the graphics representing the devices 104 indicates different operating system types or versions, for example, Ubuntu™, Apple iOS™, Apple OS X™, Microsoft Windows™, and Android™.
- the updateable devices 104 may be any type of device with communication capability, processing hardware, and storage hardware working in conjunction therewith. Gaming consoles, cellular telephones, networked appliances, notebook computers, server computers, set-top boxes, autonomous sensors, tablets, or other types of devices with communication and computing capabilities are all examples of devices 104 , as referred to herein.
- An update 102 can be implemented in a variety of ways.
- An update 102 can be a package configured or formatted to be parsed and applied by an installation program or service running on a device 104 .
- An update 102 might be one or more files that are copied to appropriate file system locations on devices 104 , possibly replacing prior versions.
- An update 102 might be a script or command that reconfigures software on the device 104 without necessarily changing executable code on the device 104 .
- an update 102 might be a configuration file or other static object used by corresponding software on the device 104 .
- An update 102 can be anything that changes the executable or application. Commonly, an update 102 will involve replacing, adding, or removing at least some executable code on a device 104 .
- An update service 100 in its simplest form, provides software updates 102 to the devices 104 .
- update service 100 might be an HTTP (hypertext transfer protocol) server servicing file download requests.
- An update service 100 might be more complex, for instance, a so-called software store or marketplace that application developers use to propagate updates.
- An update service 100 might also be a backend service working closely with a client side component to transparently select, transmit, and apply updates.
- the update service 100 is a peer-to-peer service where peers share updates with each other.
- an update service 100 is a network application or service running on a group of servers, for instance as a cloud service, that responds to requests from devices 104 by transmitting software updates 102 . Any known technology may be used to implement an update service 100 .
- there may be multiple update services 100 possibly operated by different entities or firms.
- an update service 100 includes an update distributor 112 and an updates database 114 .
- the updates database 114 may have records for respective updates 102 . Each update record identifies the corresponding update 102 , information about the update 102 such as an intended target (e.g., target operating system, target hardware, software version, etc.), a location of the update 102 , and other related data as discussed further below.
- the update distributor 112 cooperates with the updates database 114 to determine what updates are to be made available to, or transferred to, any particular device 104 . As will be described further below, finding correlations between updates and anomalies can be facilitated by keeping track of which particular updates 102 have been installed on which particular devices 104 . This can be tracked in the updates database 114 or elsewhere.
- each time a particular update 102 is provided to a particular device 104, an update instance record is stored identifying the particular update 102, the particular device 104, and the time of the update.
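The update instance tracking described above can be sketched as a minimal record store. This is an illustrative assumption of how such records might look; the class and field names are not taken from the patent.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical shape of one update instance record; the field names
# are illustrative assumptions, not taken from the patent.
@dataclass
class UpdateInstance:
    update_id: str        # identifies the particular update 102
    device_id: str        # identifies the particular device 104
    applied_at: datetime  # when the update was provided

update_instance_store = []

def record_update(update_id: str, device_id: str) -> None:
    # Store one update instance record each time an update is provided.
    update_instance_store.append(
        UpdateInstance(update_id, device_id, datetime.now()))

record_update("update-7", "device-42")
```

In practice, these records would live in the updates database 114 (or elsewhere) rather than an in-memory list.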
- this information is obtained indirectly, for instance from logs received from a device 104 , perhaps well after an update has been applied.
- the telemetry reports 108 are any communications pushed or pulled from the devices 104 . Telemetry reports 108 indicate respective occurrences on the devices 104 , and are collected by a telemetry collector 116 and then stored in a telemetry database 118 . Examples of types of telemetry reports 108 that can be used are: operating system crash dumps (possibly including stack traces), application crash dumps, system log files, application log files, execution trace files, patch logs, performance metrics (e.g., CPU load, memory usage, cache collisions, network performance statistics, etc.), or any other information that can be tied to software behavior on a device 104 . As discussed in detail further below, in one embodiment, text communications published on the Internet are mined as a telemetry source.
- embodiments described herein can improve the diagnostic value—with respect to updates—of telemetry sources that provide information about general system health or that are not specific-to or limited-to any particular application or software.
- a telemetry collection service 106 will usually have some collection mechanism, e.g., a telemetry collector 116 , and a storage or database—e.g., a telemetry database 118 —that collates and stores incoming telemetry reports 108 . Due to potentially large volumes of raw telemetry data, in some embodiments, key items of data can be extracted from incoming telemetry reports 108 and stored in the telemetry database 118 before disposing of the raw telemetry data.
- FIG. 2 shows a device 104 receiving updates 102 and transmitting telemetry reports 108 .
- a device may therefore have a steady flow of incoming software updates 102 and outgoing telemetry reports 108 passing through the network 110 .
- the updates 102 may be received and handled by one or more update elements 140 .
- An update element 140 can be an application, a system service, a file system, or any other software element that can push or pull updates 102 .
- Some update elements 140 may also have logic to apply an update 102 by unpacking files, installing binary files, setting up configuration data, registering software, etc.
- An update element 140 may simply be a storage location that automatically executes incoming updates 102 in the form of scripts.
- an update element 140 can be any known tool or process for enabling software updates 102 .
- a device 104 may also have one or more telemetry reporting elements 142 .
- any type of telemetry data can be emitted in telemetry reports 108 .
- Any available reporting frameworks may be used, for instance Crashlytics™, TestFlight™, Dr. Watson™, the OS X Crash Reporter™, BugSplat™, the Application Crash Report for Android Tool, etc.
- a reporting element 142 can also be any known tool or framework such as the diagnostic and logging service of OS X, the Microsoft Windows instrumentation tools, which will forward events and errors generated by applications or other software, etc.
- a reporting element 142 can also be an application or other software unit that itself sends a message or other communication when the application or software unit encounters diagnostic-relevant information such as a performance glitch, a repeating error or failure, or other events. In other words, software can self-report.
- the telemetry reports 108 from the reporting elements 142 may go to multiple respective collection services 106, or they may go to a same collection service 106.
- the telemetry reports 108 from a device 104 will at times convey indications of events that occur on the device.
- the flow of telemetry reports 108 can collectively function like a signal that includes features that correspond to events or occurrences on the device.
- Such events or anomaly instances can, as discussed above, be any type of information that is potentially helpful for evaluating software performance, such as operating system or application crashes, errors self-reported by software or handled by the device's operating system, excessive or spiky storage, processor, or network usage, reboots, patching failures, or repetitive user actions (e.g., repeated attempts to install an update, repeatedly restarting an application, etc.).
- FIG. 3 shows a graph 160 that illustrates a global view of updates and telemetry feedback over time for a population of devices 104 .
- an update refers to a same change applied to many devices 104 .
- An update instance refers to an application of a particular update on a particular device 104 .
- event instances are things that occur on different devices 104 and which can be recorded by or reflected by telemetry on the devices.
- event instances on different devices 104 are the same (e.g., a same program crashes, a same function generates an error, a same performance measure is excessive, a same item is entered in a log file, etc.), those same event instances are each instances of the same event.
- an event may be uncovered by analyzing and filtering event instances.
- further analysis may indicate that an event is an anomaly, which is an event that is statistically unexpected or unusual or otherwise of interest.
- finding potentially causal relationships between particular updates and particular anomalies in the telemetry data can be difficult. Lag, noise, and other factors discussed below can obscure relationships.
- FIG. 4 shows a general process for finding correlations between updates and anomalies.
- the process assumes the availability of appropriate data/records, without regard for how the records are collected, stored, or accessed.
- the process can be triggered in a variety of ways. For example, the process may be part of a periodic process that attempts to identify problematic updates, or the process may be manually initiated, perhaps with a particular user-selected update being targeted for analysis by user input.
- the process can also run continuously, iteratively processing new data as it becomes available.
- each event bucket is a collection of event instances that correspond to a same event.
- an event bucket may consist of a set of event instances that each represent a system crash, or a same performance trait (e.g., poor network throughput), or any error generated by a same binary executable code (or a same function, or binary code compiled from a same source code file), or a particular type of error such as a memory violation, etc.
- An event bucket can be formed in various ways, for example, by specifying one or more properties of event instance records, by a clustering algorithm that forms clusters of similar event instance records, by selecting an event instance record and searching for event instance records that have some attributes in common with the selected event record, and so forth.
- Event instance records are generally dated (the terms “date” and “time” are used interchangeably herein and refer to either a date, or a date and time, a period of time, etc.). Therefore, each bucket is a chronology of event instances, and counts of those event instances can be obtained for respective regular intervals of time.
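The bucket chronology described above can be sketched as counting dated event instance records per regular interval. The record layout and the per-day interval are illustrative assumptions.

```python
from collections import Counter
from datetime import date

# Illustrative event bucket: a list of dated event instance records that
# all correspond to the same event (here, the same application crash).
bucket = [
    {"event": "crash:app.exe", "when": date(2016, 5, 1)},
    {"event": "crash:app.exe", "when": date(2016, 5, 1)},
    {"event": "crash:app.exe", "when": date(2016, 5, 2)},
]

def daily_counts(bucket):
    # Count event instances per regular interval (per day here),
    # yielding the bucket's chronology.
    return Counter(instance["when"] for instance in bucket)

counts = daily_counts(bucket)
# counts[date(2016, 5, 1)] == 2 and counts[date(2016, 5, 2)] == 1
```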
- one or more filters appropriate to the type of telemetry data in a bucket may be applied to remove event instance records that would be considered noise or otherwise unneeded.
- records may be filtered by date, operating system type or version, processor, presence of particular files on a device or in a stack trace, and/or any other condition.
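Such noise filtering might look like the following sketch; the two conditions shown (operating system type and an earliest date) stand in for whatever conditions a given telemetry type calls for.

```python
def filter_instances(instances, os_type=None, earliest=None):
    # Drop event instance records that fail any configured condition;
    # the conditions shown (OS type, earliest date) are illustrative.
    kept = []
    for instance in instances:
        if os_type is not None and instance.get("os") != os_type:
            continue
        if earliest is not None and instance["when"] < earliest:
            continue
        kept.append(instance)
    return kept
```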
- each event bucket is statistically analyzed to identify which buckets indicate anomalies, or more specifically, to identify any anomalies in each bucket.
- Various types of analysis may be used. For example, a regression analysis may be used.
- anomalies are identified using a modified cumulative distribution function (CDF) with supervised machine learning.
- a standard CDF determines the probability that number X is found in the series of numbers S.
- X may be a count of event instances for a time period Z, such as a day
- the series S is the number of events, for example for previous days 0 to Z−1.
- certain time periods may be removed from the series S if they were independently determined to be a true-positive anomaly (e.g., they may have been identified by a developer).
- an anomaly threshold is set, for instance to 95% or 100%; if a candidate anomaly (an X) has a probability (of not being in the series S) that exceeds the anomaly threshold, then an anomaly has been identified.
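One way to read the CDF-based check above is as an empirical CDF over the historical series S, with the threshold applied to the resulting probability. This is a simplified sketch of that reading, not the patent's exact algorithm; in particular, representing confirmed anomalies as indices into the series is an assumption.

```python
def cdf_probability(x, series):
    # Empirical CDF: the fraction of past interval counts that are <= x.
    if not series:
        return 0.0
    return sum(1 for s in series if s <= x) / len(series)

def is_anomaly(x, series, threshold=0.95, confirmed_indices=()):
    # Periods independently confirmed as true-positive anomalies are
    # removed from the series first, as the text describes.
    cleaned = [s for i, s in enumerate(series) if i not in confirmed_indices]
    return cdf_probability(x, cleaned) > threshold

# A day with 50 events against a history of quiet days is flagged:
# is_anomaly(50, [3, 5, 4, 6, 2]) -> True
```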
- FIG. 5 shows a graph 200 .
- the graph 200 compares the number of events for a given telemetry point over time and the given CDF.
- the graph 200 also shows a developer's feedback and a result of a CDF modified by supervised learning. As noted, other types of correlation analyses may be used.
- corresponding sets of update instance records are found. That is, updates are selected, perhaps by date, operating system, or update content (e.g., does an update correspond to a particular source code file), and the corresponding update instance records are found, where each update instance record represents the application of a particular update to a particular device at a particular time or time span.
- the update instance records may also include records that indicate that an update was rolled back, superseded, or otherwise became moot with respect to the corresponding device. For some devices, update instance records can indicate intermittent presence of the corresponding update (e.g., an update may be applied, removed, and applied again).
- correlations are found between the updates and the event buckets with anomalies.
- Any of a variety of heuristics may be used to find correlations or probabilities of correlations. That is, one or more heuristics are used to determine the probability of a particular anomaly being attributable or correlated to a particular software update.
- some examples are: determining the probability that a particular binary (or other software element) in a software update is a source of a telemetry point; determining the extent to which an anomaly aligns with the release of a software update; finding a deviation between an update deployment pattern and an anomaly pattern; determining whether an anomaly is persistent after deployment of an update (possibly without intervention).
- each applicable heuristic can be given a weight.
- the weight is adjusted via a feedback loop when reported anomalies are marked by a user as either (i) a true-positive or (ii) false-positive.
- These heuristics and weights can be combined to compute a probability or percentage for each respective anomaly and software update pair. In this way, many possible updates can be evaluated against many concurrent anomalies and the most likely causes can be identified. For run-time optimization, a software update can be excluded from calculation once the software update has reached a significant deployment with no issues or, in the case of an update intended to fix an issue, when the corresponding issues are determined to have been resolved.
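Combining weighted heuristics into a single score per anomaly-update pair might be sketched as a weighted average; the heuristic names and weight values below are illustrative assumptions.

```python
def anomaly_update_score(heuristic_scores, weights):
    # Weighted combination of per-heuristic scores (each in [0, 1]) into a
    # single probability for one (anomaly, update) pair; the weights would
    # be adjusted by the true/false-positive feedback loop.
    total_weight = sum(weights.values())
    return sum(weights[h] * heuristic_scores[h] for h in weights) / total_weight

score = anomaly_update_score(
    {"release_alignment": 0.9, "deployment_deviation": 0.4, "persistence": 0.7},
    {"release_alignment": 2.0, "deployment_deviation": 1.0, "persistence": 1.0},
)
# score == (2.0 * 0.9 + 1.0 * 0.4 + 1.0 * 0.7) / 4.0 == 0.725
```

Ranking these scores across all candidate update-anomaly pairs then surfaces the most likely causes.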
- an update may be initially selected, for example by a developer interested in the update, and the identity of the update is used to select event instance records from the devices that have had the update applied.
- buckets can be independently formed and analyzed to identify anomalies.
- An anomaly is selected, for example by a developer or tester. Update instance records are selected based on some relationship with the anomaly (e.g., concurrence in time, presence of an update element in a trace log, stack trace, or crash file, etc.). Correlations between the anomaly and the respective updates are then computed using any of the heuristics mentioned above.
- FIG. 6 shows an example of an update store 220 .
- FIG. 7 shows an example of a telemetry store 240 .
- the update store 220 includes an update table 222 storing records of updates, an update instance table 224 storing records of which updates were applied to which devices and when, and a change table 226 storing records of changes associated with an update.
- the update instance records and the change records link back to related update records.
- the update records store information about the updates and can be created when an update is released or can be initiated by a user when analysis of an update is desired.
- the update records can store any information about the updates, such as versions, release dates, locations of the bits, files, packages, information about target operating systems or architectures, etc.
- the update instance records store any information about an update applied to a device, such as a time of the update, any errors generated by applying the update, and so on.
- the change table 226 stores records indicating which assets are changed by which updates.
- An asset is any object on a device, such as an executable program, a shared library, a configuration file, a device driver, etc.
- the change table 226 is automatically populated by a crawler or other tool that looks for new update records, opens the related package, identifies any affected assets, and records same in the change table 226 .
- the changes can also be obtained from a manifest, if available.
- a source code control system is accessed to obtain information about assets in the update. If an executable file is in the update, the identity of the executable file (e.g., filename and version) is passed to the source code control system.
- the source code control system returns information about any source code files that were changed since the executable file was last compiled. In other words, the source code control system provides information about any source code changes that relate to the update. As will be described further below, this can help extend the correlation analysis to particular source code files or even functions or lines of source code.
- the telemetry store 240 includes a table for each type of telemetry source.
- a logging telemetry source 242 provides logging or trace data, which is stored in a logging table 244 .
- the log or trace files are parsed to extract pertinent information, which is populated into the logging table 244 .
- a crash telemetry source 246 provides application and/or operating system crash dumps.
- a crash parsing engine analyzes the crashes and extracts relevant information such as stack trace data (e.g., which functions or files were active during a crash), key contents of memory when a crash occurred, the specific cause of the crash, and other information.
- the crash information is stored in a crash table 248 .
- links to crash dump files are kept in the crash table 248 and crash dumps are parsed when needed.
- other types of telemetry data, such as performance telemetry data, user-initiated feedback, and custom diagnostics, can be similarly managed.
- a list of binaries, their versions, and bounding dates can be used to find all failures (event instances) within the bounding dates that have one of the binaries in their respective stack traces.
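A minimal sketch of that query, assuming each failure is a flat record with a date and a parsed stack trace; the field names and integer day ordinals are hypothetical simplifications:

```python
def failures_for_binaries(failures, binaries, start, end):
    """Find failures within the bounding dates whose stack trace mentions
    one of the given (name, version) binaries. Record layout is illustrative."""
    wanted = set(binaries)
    return [f for f in failures
            if start <= f["date"] <= end
            and any((frame["binary"], frame["version"]) in wanted
                    for frame in f["stack"])]

# Hypothetical failure records; dates are ordinal day numbers for brevity.
failures = [
    {"date": 5, "stack": [{"binary": "netdrv.sys", "version": "1.2"}]},
    {"date": 6, "stack": [{"binary": "other.dll", "version": "3.0"}]},
    {"date": 9, "stack": [{"binary": "netdrv.sys", "version": "1.2"}]},
]
hits = failures_for_binaries(failures, [("netdrv.sys", "1.2")], start=4, end=7)
print(len(hits))  # 1: only the day-5 failure is in range with a matching frame
```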
- a telemetry source can be provided by monitoring public commentary.
- User feedback is obtained by crawling Internet web pages and accessing public sources (e.g., Twitter™ messages) for text or multimedia content that makes reference to software that is subject to updates or that might be affected by updates.
- For a given update, a set of static and dynamic keywords is searched for in content to identify relevant content.
- Static keywords are any words that are commonly used when people discuss software problems, for instance “install”, “reboot”, “crashing”, “hanging”, “uninstall”, etc.
- Dynamic keywords are generated according to the update, for example, the name or identifier of an update, names of files affected by the update, a publisher of the update, a name of an operating system, and others.
- a record is created indicating the time of the content, the related update (if any), perhaps whether the content was found (using known text analysis techniques) to be a positive or negative comment, or other information that might indicate whether a person is having difficulty or success with an update.
- Counts of dated identified user feedback instances can then serve as another source of telemetry buckets that can be correlated with updates.
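The static/dynamic keyword screen described above can be sketched as follows; the update fields, sample values, and token-based matching are simplifying assumptions of this sketch:

```python
# Static keywords: common words people use when discussing software problems.
STATIC_KEYWORDS = {"install", "reboot", "crashing", "hanging", "uninstall"}

def dynamic_keywords(update):
    """Keywords derived from the update itself (field names are illustrative)."""
    words = {update["name"], update["publisher"], update["os"]}
    words.update(update["files"])
    return {w.lower() for w in words}

def is_relevant(text, update):
    """Flag content that mentions a static or update-specific keyword."""
    tokens = set(text.lower().split())
    return bool(tokens & (STATIC_KEYWORDS | dynamic_keywords(update)))

# Hypothetical update record for illustration.
update = {"name": "kb123", "publisher": "contoso", "os": "exampleos",
          "files": ["netdrv.sys"]}
print(is_relevant("my pc keeps crashing after kb123", update))  # True
print(is_relevant("nice weather today", update))                # False
```

Real content matching would need tokenization and text analysis far beyond a whitespace split, but the static/dynamic split is the structure described above.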
- FIG. 8 shows an example correlation engine 260 for correlating telemetry data with updates and possibly source code elements.
- An anomaly detector 262 may first select from the telemetry store 240 any telemetry event instance records, for example any new telemetry event instance records or any records that occurred after an analysis date or after a date when one or more updates were released. Telemetry buckets are formed and counts over intervals of time are computed for each bucket. Filtering and anomaly detection are then performed. Filtering a bucket generally removes records to focus the bucket on a more relevant set.
- a telemetry bucket represents the first occurrence of an event
- the bucket can be filtered based on basic conditions such as a minimum number of records in a bucket, a minimum number of unique devices represented in a bucket, a minimum number of days over which the corresponding event instances occurred, or others that might be relevant to the type of telemetry data.
- the following filter conditions may be helpful: keep a bucket only if the failure was first seen after a release date for an update (a positive filter); remove bucket records whose current operating system version has insufficient deployment (e.g., less than 70%); and remove bucket records where the distribution of related binaries is insufficient (e.g., less than 70%).
- prior event instances can be used to determine whether the recent event instances in the bucket are normal or not. For example, a regular event spike might indicate that a recent event spike is not an anomaly.
- a baseline historical rate of event instances can also serve as a basis for filtering a bucket, for example, using a standard deviation of the historical rate.
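A sketch combining the basic filters with the historical-baseline filter; the record layout, the two-standard-deviation cutoff, and the default thresholds are illustrative assumptions, with dates given as ordinal day numbers:

```python
from statistics import mean, stdev

def keep_bucket(bucket, release_date, min_records=5, min_devices=3):
    """Apply the basic and historical filters sketched above."""
    records = bucket["records"]  # (day_ordinal, device_id) pairs
    if len(records) < min_records:
        return False
    if len({device for _, device in records}) < min_devices:
        return False
    # Positive filter: failure first seen only after the update's release.
    if min(day for day, _ in records) <= release_date:
        return False
    # Historical baseline: recent daily rate must clear mean + 2 std devs.
    history = bucket["daily_counts_before"]
    return bucket["daily_count_after"] > mean(history) + 2 * stdev(history)

# Hypothetical bucket for illustration.
bucket = {
    "records": [(10, "a"), (11, "b"), (12, "c"), (12, "d"), (13, "a")],
    "daily_counts_before": [1, 2, 1, 2, 1],
    "daily_count_after": 9,
}
print(keep_bucket(bucket, release_date=9))  # True: clear post-release spike
```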
- spikes in a bucket can serve as features potentially indicating anomalies.
- Spikes can be detected in various ways.
- for application crash telemetry, there will often be a many-to-many relationship between application crashes and event buckets. So, for a given crash or failure, related buckets are found. Then, the buckets are iterated over to find the distribution of the failure or crash among that bucket's updates. Hit counts are obtained from the release date back to a prior date (e.g., 60 days), and the hit counts for the crash or failure are adjusted for its distribution.
- hit count mean and variance prior to the update are computed. These are used to determine the cumulative probability of the hit counts after the update was released. Buckets without a sufficiently high hit probability (e.g., 95%) are filtered out.
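That pre-update mean/variance test can be sketched with a normal model fitted to the pre-release hit counts; treating the counts as roughly normal is a simplifying assumption of this sketch, not something the text prescribes:

```python
from statistics import NormalDist, mean, stdev

def spike_probability(pre_counts, post_count):
    """Cumulative probability of `post_count` under a normal model fitted
    to the pre-update hit counts (normality is an assumption here)."""
    return NormalDist(mean(pre_counts), stdev(pre_counts)).cdf(post_count)

# Hypothetical daily hit counts before the update was released.
pre = [4, 5, 6, 5, 4, 6, 5]
print(spike_probability(pre, 12) > 0.95)  # True: well above the pre-update rate
print(spike_probability(pre, 6) > 0.95)   # False: within normal variation
```

Buckets whose post-release counts fail the 95% test would be filtered out before the heuristics below are applied.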
- one or more heuristics, such as statistical regression, are applied to determine whether any of the buckets indicate that an anomaly has occurred.
- Analysis for potential regression conditions can be prioritized. For example, in order: the percentage of times an update is present when a crash occurs; the probability that a spike (potential anomaly) does not occur periodically and that the spike is not related to an installation reboot; the possibility that crashes in a crash bucket have caused overall binary crashes to rise; the possibility that a crash spike is related to an operating system update and not third party software; the probability that a spike is consistent rather than temporary; and the probability that a spike is maximal over an extended time (e.g., months).
- an anomaly analysis can also depend on whether an event corresponding to a bucket is a new event or has occurred previously.
- the buckets determined to have anomalies are passed to an update correlator 264 .
- the update correlator 264 receives the anomalous buckets and accesses update data, such as the update store 220 . Updates can be selected based on criteria used for selecting the telemetry buckets, based on a pivot date, based on a user selection, or any new updates can be selected. Correlations between the updates and the anomalous telemetry buckets are then computed using any known statistical correlation techniques, including those discussed elsewhere herein.
- a source code correlator 266 can be provided to receive correlations between updates and anomalies.
- the source code correlator 266 determines the correlation between the source-code changes of a software update and the update's correlated telemetry anomaly bucket. Each correlation between a source code change and an anomaly is assigned a rank based on the level of match between the source-code change and the source code emitting the telemetry points in the anomaly bucket.
- FIG. 9 shows an example of an association between a source code file 280 and a crash bucket 282 . Because the source code file 280 has a function implicated in the crash bucket 282 , the telemetry event instances that make up the crash bucket 282 can be used to find a likelihood of correlation between the source code file 280 and the crash bucket 282 . In addition, if the crash bucket 282 has information about a specific function (or even a line number or code offset), as shown in FIG. 9 , a correlation with the function can also be computed.
- outputs of the correlation engine 260 are stored in an analysis database 268 .
- Tools, clients, automation processes, etc. can access the analysis database 268 .
- an update monitor 270 monitors the analysis database 268 to identify new anomalies and their associated updates.
- the update monitor 270 can use this information to send instructions to an update controller 272 .
- the update controller 272 controls aspects of updating, such as retracting updates available to be transmitted to devices or constraining the scope of availability for updates (e.g., to a particular region, platform, language, operating system).
- because the correlation engine 260 can analyze telemetry feedback about updates in near real-time, anomaly-update correlations can be identified even as updates are still rolling out to devices. This can allow updates to be withdrawn or throttled soon after their release, potentially preventing problems from occurring on devices yet to receive the updates. In addition, anomalies can be spotted for updates that have different binaries for different operating systems or processor architectures. Put another way, a correlation engine can find correlations using telemetry from, and updates to, heterogeneous devices.
- the update monitor 270 can also be used to evaluate the effectiveness of updates that are flagged as fixes.
- a pre-existing event bucket can be linked to an update, and the effectiveness of the update can be evaluated with feedback provided through a user interface of the update monitor 270 .
- An update can be deemed effective if the associated event/anomaly drops below a threshold occurrence rate.
- the update monitor 270 can also be implemented as a client with a user interface that a developer uses to look for issues that might be related to updates identified by the developer. Or, a user can select an anomaly and see which updates, source code files, etc., are associated with the anomaly.
- a dashboard type of user interface is provided, which can display the overall health of the devices as reflected by anomalies in the analysis database 268 . Anomalies can be evaluated by their extent (by geography, device type, etc.), their rate of increase, and so on.
- the dashboard can display alerts when an update is determined to be causing an anomaly. In the case where an update is flagged as a fix intended to remedy an anomaly, the dashboard can provide an indication that the update was successful.
- FIG. 10 shows a software architecture that can be used to implement multiple correlation engines 260 for respective telemetry sources.
- the multiple correlation engines 260 can use the same update data. Differences in types of telemetry data necessarily lead to partially differing implementations. For example, the filters 298 used by one correlation engine 260 may differ from those used in another correlation engine 260 . Nonetheless, some of the anomaly detection and correlation algorithms can be used by any correlation engine 260 .
- Generic anomaly detection functions, such as a cumulative distribution function (CDF), can be included in an anomaly detection library 300 .
- generic algorithms such as regression algorithms for determining the probability that an anomaly is due to a given software update, can be provided by an update probability library 302 .
- Functions for finding correlations between a telemetry anomaly and a source-code change for a given software update can be provided by a source code correlation library 304 , which is available for any correlation engines 260 that can make use of it.
- the update probability library 302 and the source code correlation library 304 can also provide abstracted access to update data and source code change data.
- the correlation engines operate as online algorithms, processing telemetry data and update data as it becomes available.
- a correlation engine 260 feeds the relevant information into the above-mentioned reporting dashboard (via the analysis database 268 ).
- the update monitor 270 or other tool can include ranking logic to rank anomalies correlated to updates. Types of anomalies, updates, source code files, etc., can be assigned weights to help prioritize the ranking. Other criteria may be used, such as the magnitude or rate of an anomaly, the duration of an anomaly, the number of anomalies correlated to a same update, and so forth. Anomalies and/or updates can be displayed in the order of their priority ranking to help developers and others to quickly identify the most important issues that need to be addressed.
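The weighted ranking might be sketched as follows; the weight keys, feature names, and additive score are illustrative choices, not prescribed above:

```python
def rank_anomalies(anomalies, weights):
    """Order anomalies by a weighted priority score, highest first.
    The per-type weight and the additive feature score are assumptions."""
    def score(a):
        return (weights.get(a["type"], 1.0)
                * (a["magnitude"] + a["duration_days"]
                   + a["updates_correlated"]))
    return sorted(anomalies, key=score, reverse=True)

# Hypothetical anomalies and weights for illustration.
anomalies = [
    {"id": "A1", "type": "crash", "magnitude": 40, "duration_days": 3,
     "updates_correlated": 1},
    {"id": "A2", "type": "perf", "magnitude": 10, "duration_days": 2,
     "updates_correlated": 1},
]
weights = {"crash": 2.0, "perf": 1.0}
print([a["id"] for a in rank_anomalies(anomalies, weights)])  # ['A1', 'A2']
```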
- the client application 320 displays user interfaces 322 , as shown in FIG. 12 .
- the user interfaces 322 can display information discussed above, such as indicia of anomalies ordered by priority, problematic updates, etc.
- the anomaly detection/correlation analysis can be provided as a network feedback service 324 that is available to any entities responsible for some of the updates being applied to the devices.
- An operating system vendor can operate the telemetry infrastructure. Any third party developer is given access to anomaly/update analysis related only to the third party's software for which updates have been issued. It should be noted that the software designs described herein are partly a result of convenience. In practice, different components can be distributed to various responsible parties.
- the anomaly summary user interface 322 might include: a name or identifier of the anomaly; summary information, such as a date the anomaly was identified, the type of anomaly (e.g., a spike), and scope of the anomaly (e.g., an affected operating system or processor); features of the anomaly like the probability that detection of the anomaly is correct and a measure of the anomaly; a listing of correlated updates and their occurrence rate; a telemetry graph showing counts of instances of the anomaly over time; and telemetry statistics such as an average daily count, counts before and after an update, unique machines reporting the event, and others.
- FIG. 13 shows a computing device 350 .
- the computing device 350 can be any type of device having processing hardware 352 (e.g., CPU(s) and/or GPU(s)), storage hardware 354 able to be read from or written to by the processing hardware 352 , possibly an input device 356 such as a mouse, camera, microphone, touch sensitive surface, sensor, or others, and possibly also a display 358 of any known type.
- One or more of the computing devices 350 implement the software components described above that relate to telemetry, updates, and synthesis thereof.
- Embodiments and features discussed above can also be realized in the form of information stored in volatile or non-volatile computer or device readable hardware.
- This is deemed to include at least hardware such as optical storage (e.g., compact-disk read-only memory (CD-ROM)), magnetic storage, flash read-only memory (ROM), or any current or future means of storing digital information.
- the stored information can be in the form of machine executable instructions (e.g., compiled executable binary code), source code, bytecode, or any other information that can be used to enable or configure computing devices to perform the various embodiments discussed above.
- This is also considered to include at least volatile memory such as RAM (random-access memory) and/or virtual memory storing information such as CPU (central processing unit) instructions during execution of a program, as well as non-volatile media storing information that allows a program or executable to be loaded and executed.
- the embodiments and features can be performed on any type of computing device, including portable devices, workstations, servers, mobile wireless devices, and so on.
Abstract
A population of devices provides telemetry data and receives software changes or updates. Event buckets for respective events are found. Event buckets have counts of event instances, where each event instance is an occurrence of a corresponding event reported as telemetry by a device. Records of the software changes are provided, each change record representing a software change on a corresponding device. The event buckets are analyzed to identify which indicate an anomaly. Based on the change records and the identified event buckets, correlations between the software changes and the identified event buckets are found.
Description
- This application is a continuation patent application of copending application with Ser. No. 14/676,214, (attorney docket no. 356778.01) filed Apr. 1, 2015, entitled “ANOMALY ANALYSIS FOR SOFTWARE DISTRIBUTION”, which is now allowed. The aforementioned application(s) are hereby incorporated herein by reference.
- Devices that run software usually require updates over time. The need for software updates may be driven by many factors, such as fixing bugs, adding new functionality, improving performance, maintaining compatibility with other software, and so forth. While many techniques have been used for updating software, an update typically involves changing the source code of a program, compiling the program, and distributing the program to devices where the updated program will be executed.
- It is becoming more common for programs to be compiled for multiple types of devices and operating systems. Executable code compiled from a same source code file might end up executing on devices with different types of processors, and different types or versions of operating systems. Updates for such cross-platform programs can be difficult to assess.
- In addition, the increasing network connectivity of devices has led to higher rates of updating by software developers and more frequent reporting of performance-related data (telemetry) by devices.
- In a short time period, a device might receive many software updates and might transmit many telemetry reports to a variety of telemetry collectors. A software distribution system might rapidly issue many different software updates to many different devices. As devices provide feedback telemetry about performance, crashes, stack dumps, execution traces, etc., around the same time, many software components on the devices might be changing. Therefore, it can be difficult for a software developer to use the telemetry feedback to decide whether a particular software update created or fixed any problems. If an anomaly is occurring on some devices, it can be difficult to determine whether any particular software update is implicated, any conditions under which an update might be linked to an anomaly, or what particular code-level changes in a software update are implicated. In short, high rates of software updating and telemetry reporting, perhaps by devices with varying architectures and operating systems, have made it difficult to find correlations between software updates (or source code changes) and anomalies manifested in telemetry feedback.
- Techniques related to finding anomalies in telemetry data and finding correlations between anomalies and software updates are discussed below.
- The following summary is included only to introduce some concepts discussed in the Detailed Description below. This summary is not comprehensive and is not intended to delineate the scope of the claimed subject matter, which is set forth by the claims presented at the end.
- A population of devices provides telemetry data and receives software changes or updates. Event buckets for respective events are found. Event buckets have counts of event instances, where each event instance is an occurrence of a corresponding event reported as telemetry by a device. Records of the software changes are provided, each change record representing a software change on a corresponding device. The event buckets are analyzed to identify which indicate an anomaly. Based on the change records and the identified event buckets, correlations between the software changes and the identified event buckets are found.
- Many of the attendant features will be explained below with reference to the following detailed description considered in connection with the accompanying drawings.
- The present description will be better understood from the following detailed description read in light of the accompanying drawings, wherein like reference numerals are used to designate like parts in the accompanying description.
- FIG. 1 shows an example of a software ecosystem.
- FIG. 2 shows a device receiving updates and transmitting telemetry reports.
- FIG. 3 shows a graph that illustrates a global view of updates and telemetry feedback over time for a population of devices.
- FIG. 4 shows a general process for finding correlations between updates and anomalies.
- FIG. 5 shows a graph.
- FIG. 6 shows an example of an update store.
- FIG. 7 shows an example of a telemetry store.
- FIG. 8 shows an example correlation engine.
- FIG. 9 shows an example of an association between a source code file and a crash bucket.
- FIG. 10 shows a software architecture that can be used to implement multiple correlation engines for respective telemetry sources.
- FIG. 11 shows a client application that allows a user to provide feedback and display and navigate analysis output captured in the analysis database.
- FIG. 12 shows an example of an anomaly summary user interface.
- FIG. 13 shows an example of a computing device.
FIG. 1 shows an example of a software ecosystem. Update services 100 provide software updates 102 to devices 104. Telemetry collection services 106 collect telemetry reports 108 from the devices 104 via a network 110. The shapes of the graphics representing the devices 104 portray different types of processors, such as ARM, x86, PowerPC™, Apple A4 or A5™, Snapdragon™, or others. The shading of the graphics representing the devices 104 indicates different operating system types or versions, for example, Ubuntu™, Apple iOS™, Apple OS X™, Microsoft Windows™, and Android™. The updateable devices 104 may be any type of device with communication capability, processing hardware, and storage hardware working in conjunction therewith. Gaming consoles, cellular telephones, networked appliances, notebook computers, server computers, set-top boxes, autonomous sensors, tablets, or other types of devices with communication and computing capabilities are all examples of devices 104, as referred to herein. - An
update 102 can be implemented in a variety of ways. An update 102 can be a package configured or formatted to be parsed and applied by an installation program or service running on a device 104. An update 102 might be one or more files that are copied to appropriate file system locations on devices 104, possibly replacing prior versions. An update 102 might be a script or command that reconfigures software on the device 104 without necessarily changing executable code on the device 104. For example, an update 102 might be a configuration file or other static object used by corresponding software on the device 104. An update 102 can be anything that changes the executable or application. Commonly, an update 102 will involve replacing, adding, or removing at least some executable code on a device 104. - An
update service 100, in its simplest form, provides software updates 102 to the devices 104. For example, an update service 100 might be an HTTP (hypertext transfer protocol) server servicing file download requests. An update service 100 might be more complex, for instance, a so-called software store or marketplace that application developers use to propagate updates. An update service 100 might also be a backend service working closely with a client side component to transparently select, transmit, and apply updates. In one embodiment, the update service 100 is a peer-to-peer service where peers share updates with each other. In another embodiment, an update service 100 is a network application or service running on a group of servers, for instance as a cloud service, that responds to requests from devices 104 by transmitting software updates 102. Any known technology may be used to implement an update service 100. Moreover, as shown in FIG. 1 , there may be multiple update services 100, possibly operated by different entities or firms. - In one embodiment, an
update service 100 includes an update distributor 112 and an updates database 114. The updates database 114 may have records for respective updates 102. Each update record identifies the corresponding update 102, information about the update 102 such as an intended target (e.g., target operating system, target hardware, software version, etc.), a location of the update 102, and other related data as discussed further below. The update distributor 112 cooperates with the updates database 114 to determine what updates are to be made available to, or transferred to, any particular device 104. As will be described further below, finding correlations between updates and anomalies can be facilitated by keeping track of which particular updates 102 have been installed on which particular devices 104. This can be tracked in the updates database 114 or elsewhere. For this purpose, in one embodiment, each time a particular update 102 is provided to a particular device 104, an update instance record is stored identifying the particular update, the particular device 104, and the update 102. In some embodiments, this information is obtained indirectly, for instance from logs received from a device 104, perhaps well after an update has been applied. - The telemetry reports 108 are any communications pushed or pulled from the
devices 104. Telemetry reports 108 indicate respective occurrences on the devices 104, and are collected by a telemetry collector 116 and then stored in a telemetry database 118. Examples of types of telemetry reports 108 that can be used are: operating system crash dumps (possibly including stack traces), application crash dumps, system log files, application log files, execution trace files, patch logs, performance metrics (e.g., CPU load, memory usage, cache collisions, network performance statistics, etc.), or any other information that can be tied to software behavior on a device 104. As discussed in detail further below, in one embodiment, text communications published on the Internet are mined as a telemetry source. As will become apparent, embodiments described herein can improve the diagnostic value—with respect to updates—of telemetry sources that provide information about general system health or that are not specific to or limited to any particular application or software. By using such telemetry sources on a large scale, it may become possible to find correlations in mass that might not be discoverable on an individual device basis. - Regardless of the type of telemetry data, a
telemetry collection service 106 will usually have some collection mechanism, e.g., a telemetry collector 116, and a storage or database—e.g., a telemetry database 118—that collates and stores incoming telemetry reports 108. Due to potentially large volumes of raw telemetry data, in some embodiments, key items of data can be extracted from incoming telemetry reports 108 and stored in the telemetry database 118 before disposing of the raw telemetry data. -
FIG. 2 shows a device 104 receiving updates 102 and transmitting telemetry reports 108. As discussed above, modern devices often have the ability to communicate frequently over a short period of time (e.g., minutes, hours, or even a few days). A device may therefore have a steady flow of incoming software updates 102 and outgoing telemetry reports 108 passing through the network 110. The updates 102 may be received and handled by one or more update elements 140. An update element 140 can be an application, a system service, a file system, or any other software element that can push or pull updates 102. Some update elements 140 may also have logic to apply an update 102 by unpacking files, installing binary files, setting up configuration data, registering software, etc. An update element 140 may simply be a storage location that automatically executes incoming updates 102 in the form of scripts. In sum, an update element 140 can be any known tool or process for enabling software updates 102. - A
device 104 may also have one or more telemetry reporting elements 142. As discussed above, any type of telemetry data can be emitted in telemetry reports 108. Any available reporting frameworks may be used, for instance Crashlytics™, TestFlight™, Dr. Watson™, the OS X Crash Reporter™, BugSplat™, the Application Crash Report for Android Tool, etc. A reporting element 142 can also be any known tool or framework such as the diagnostic and logging service of OS X, the Microsoft Windows instrumentation tools, which will forward events and errors generated by applications or other software, etc. A reporting element 142 can also be an application or other software unit that itself sends a message or other communication when the application or software unit encounters diagnostic-relevant information such as a performance glitch, a repeating error or failure, or other events. In other words, software can self-report. The telemetry reports 108 from the reporting elements 142 may go to multiple respective collection services 106, or they may go to a same collection service 106. - The telemetry reports 108 from a
device 104 will at times convey indications of events that occur on the device. The flow of telemetry reports 108 can collectively function like a signal that includes features that correspond to events or occurrences on the device. Such events or anomaly instances can, as discussed above, be any type of information that is potentially helpful for evaluating software performance, such as operating system or application crashes, errors self-reported by software or handled by the device's operating system, excessive or spikey storage, processor, or network usage, reboots, patching failures, repetitive user actions (e.g., repeated attempts to install an update, repeatedly restarting an application, etc.). -
FIG. 3 shows a graph 160 that illustrates a global view of updates and telemetry feedback over time for a population of devices 104. The number of updates applied on devices 104 and the number of event instances conveyed in telemetry reports vary over time. As used herein, an update refers to a same change applied to many devices 104. An update instance refers to an application of a particular update on a particular device 104. As used herein, event instances are things that occur on different devices 104 and which can be recorded by or reflected by telemetry on the devices. As also used herein, when event instances on different devices 104 are the same (e.g., a same program crashes, a same function generates an error, a same performance measure is excessive, a same item is entered in a log file, etc.), those same event instances are each instances of the same event. As will be described below, in some cases, an event may be uncovered by analyzing and filtering event instances. In some cases, further analysis may indicate that an event is an anomaly, which is an event that is statistically unexpected or unusual or otherwise of interest. As shown by the graph 160, for a mix of concurrent update and telemetry "signals", finding potentially causal relationships between particular updates and particular anomalies in the telemetry data can be difficult. Lag, noise, and other factors discussed below can obscure relationships. -
FIG. 4 shows a general process for finding correlations between updates and anomalies. The process assumes the availability of appropriate data/records, without regard for how the records are collected, stored, or accessed. The process can be triggered in a variety of ways. For example, the process may be part of a periodic process that attempts to identify problematic updates, or the process may be manually initiated, perhaps with a particular user-selected update being targeted for analysis by user input. The process can also run continuously, iteratively processing new data as it becomes available. - In any case, the process begins at
step 180 by forming event buckets. Each event bucket is a collection of event instances that correspond to a same event. For example, an event bucket may consist of a set of event instances that each represent a system crash, or a same performance trait (e.g., poor network throughput), or any error generated by a same binary executable code (or a same function, or binary code compiled from a same source code file), or a particular type of error such as a memory violation, etc. - An event bucket can be formed in various ways, for example, by specifying one or more properties of event instance records, by a clustering algorithm that forms clusters of similar event instance records, by selecting an event instance record and searching for event instance records that have some attributes in common with the selected event record, and so forth. Event instance records are generally dated (the terms “date” and “time” are used interchangeably herein and refer to either a date, or a date and time, a period of time, etc.). Therefore, each bucket is a chronology of event instances, and counts of those event instances can be obtained for respective regular intervals of time.
- As will be described later, one or more filters appropriate to the type of telemetry data in a bucket may be applied to remove event instance records that would be considered noise or otherwise unneeded. For instance, records may be filtered by date, operating system type or version, processor, presence of particular files on a device or in a stack trace, and/or any other condition.
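Such record-level filters might be expressed as simple predicates over the event instance records; a minimal sketch with hypothetical field names:

```python
from datetime import date

# Hypothetical event-instance records as dicts; keys are illustrative.
records = [
    {"date": date(2015, 4, 2), "os": "os-x64", "stack_files": ["app.exe", "netio.dll"]},
    {"date": date(2015, 3, 1), "os": "os-x64", "stack_files": ["app.exe"]},
    {"date": date(2015, 4, 3), "os": "os-arm", "stack_files": ["other.dll"]},
]

def apply_filters(records, after=None, os=None, required_file=None):
    """Keep only records matching every given condition; None disables a filter."""
    kept = []
    for r in records:
        if after and r["date"] < after:
            continue            # filtered by date
        if os and r["os"] != os:
            continue            # filtered by operating system type/version
        if required_file and required_file not in r["stack_files"]:
            continue            # filtered by presence of a file in the stack trace
        kept.append(r)
    return kept

filtered = apply_filters(records, after=date(2015, 4, 1), os="os-x64")
```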
- At
step 182, each event bucket is statistically analyzed to identify which buckets indicate anomalies, or more specifically, to identify any anomalies in each bucket. Various types of analysis may be used, for example, a regression analysis. In one embodiment, anomalies are identified using a modified cumulative distribution function (CDF) with supervised machine learning. A standard CDF determines the probability that number X is found in the series of numbers S. For example, X may be a count of event instances for a time period Z, such as a day, and the series S consists of the event counts for previous time periods, e.g., days 0 to Z−1. Additionally, certain time periods may be removed from the series S if they were independently determined to be a true-positive anomaly (e.g., they may have been identified by a developer). To identify an anomaly, an anomaly threshold is set, for instance to 95% or 100%; if a candidate anomaly (an X) has a probability (of not being in the series S) that exceeds the anomaly threshold, then an anomaly has been identified. FIG. 5 shows a graph 200. The graph 200 compares the number of events for a given telemetry point over time and the given CDF. The graph 200 also shows a developer's feedback and a result of a CDF modified by supervised learning. As noted, other types of correlation analyses may be used. - Returning to
FIG. 4 , at step 184, for one or more updates, corresponding sets of update instance records are found. That is, updates are selected, perhaps by date, operating system, or update content (e.g., does an update correspond to a particular source code file), and the corresponding update instance records are found, where each update instance record represents the application of a particular update to a particular device at a particular time or time span. The update instance records may also include records that indicate that an update was rolled back, superseded, or otherwise became moot with respect to the corresponding device. For some devices, update instance records can indicate intermittent presence of the corresponding update (e.g., an update may be applied, removed, and applied again). - At
step 186, correlations are found between the updates and the event buckets with anomalies. Any of a variety of heuristics may be used to find correlations or probabilities of correlations. That is, one or more heuristics are used to determine the probability of a particular anomaly being attributable or correlated to a particular software update. Among the heuristics that can be used, some examples are: determining the probability that a particular binary (or other software element) in a software update is a source of a telemetry point; determining the extent to which an anomaly aligns with the release of a software update; finding a deviation between an update deployment pattern and an anomaly pattern; and determining whether an anomaly persists after deployment of an update (possibly without intervention). In one embodiment, each applicable heuristic can be given a weight. The weight is adjusted via a feedback loop when reported anomalies are marked by a user as either (i) a true-positive or (ii) a false-positive. These heuristics and weights can be combined to compute a probability or percentage for each respective anomaly and software update pair. In this way, many possible updates can be evaluated against many concurrent anomalies and the most likely causes can be identified. For run-time optimization, a software update can be excluded from calculation once the software update has reached a significant deployment with no issues or, in the case of an update intended to fix an issue, when the corresponding issues are determined to have been resolved. - Within the concept of finding links between updates and anomalies, a number of variations of the algorithm in
FIG. 4 can be used. In addition, the steps in FIG. 4 can be performed in different orders and asynchronously. In one embodiment, an update may be initially selected, for example by a developer interested in the update, and the identity of the update is used to select event instance records from the devices that have had the update applied. In another embodiment, buckets can be independently formed and analyzed to identify anomalies. An anomaly is selected, for example by a developer or tester. Update instance records are selected based on some relationship with the anomaly (e.g., concurrence in time, presence of an update element in a trace log, stack trace, or crash file, etc.). Correlations between the anomaly and the respective updates are then computed using any of the heuristics mentioned above. - As discussed with reference to
FIG. 1 , applications of updates and telemetry reports are collected and stored. In one embodiment, raw telemetry data and update records are accessed and the information needed for finding update-anomaly correlations is extracted, normalized, and stored in another data store or database. FIG. 6 shows an example of an update store 220. FIG. 7 shows an example of a telemetry store 240. - The
update store 220 includes an update table 222 storing records of updates, an update instance table 224 storing records of which updates were applied to which devices and when, and a change table 226 storing records of changes associated with an update. The update instance records and the change records link back to related update records. The update records store information about the updates and can be created when an update is released or can be initiated by a user when analysis of an update is desired. The update records can store any information about the updates, such as versions, release dates, locations of the bits, files, packages, information about target operating systems or architectures, etc. The update instance records store any information about an update applied to a device, such as a time of the update, any errors generated by applying the update, and so on. - The change table 226 stores records indicating which assets are changed by which updates. An asset is any object on a device, such as an executable program, a shared library, a configuration file, a device driver, etc. In one embodiment, the change table 226 is automatically populated by a crawler or other tool that looks for new update records, opens the related package, identifies any affected assets, and records same in the change table 226. The changes can also be obtained from a manifest, if available. In one embodiment, a source code control system is accessed to obtain information about assets in the update. If an executable file is in the update, the identity of the executable file (e.g., filename and version) is passed to the source code control system. The source code control system returns information about any source code files that were changed since the executable file was last compiled. In other words, the source code control system provides information about any source code changes that relate to the update. 
As will be described further below, this can help extend the correlation analysis to particular source code files or even functions or lines of source code.
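Under the assumption of a relational store, the update table, update instance table, and change table might be sketched roughly as follows. The column names and sample values are illustrative, not taken from the specification; the final query shows how a change record links an update to a changed source file:

```python
import sqlite3

# A minimal in-memory sketch of the update store's three tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE updates (
    update_id    TEXT PRIMARY KEY,
    version      TEXT,
    release_date TEXT,
    target_os    TEXT
);
CREATE TABLE update_instances (
    update_id   TEXT REFERENCES updates(update_id),
    device_id   TEXT,
    applied_at  TEXT,
    rolled_back INTEGER DEFAULT 0
);
CREATE TABLE changes (
    update_id   TEXT REFERENCES updates(update_id),
    asset       TEXT,   -- e.g., an executable, driver, or config file
    source_file TEXT    -- from the source code control system, if available
);
""")
db.execute("INSERT INTO updates VALUES ('KB100', '1.2', '2015-04-01', 'os-x64')")
db.execute("INSERT INTO changes VALUES ('KB100', 'app.exe', 'netio.c')")

# Which source files does update KB100 touch?
rows = db.execute(
    "SELECT c.source_file FROM updates u JOIN changes c USING (update_id) "
    "WHERE u.update_id = 'KB100'").fetchall()
```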
- Returning to
FIG. 7 , the telemetry store 240 includes a table for each type of telemetry source. For example, a logging telemetry source 242 provides logging or trace data, which is stored in a logging table 244. In one embodiment, the log or trace files are parsed to extract pertinent information, which is populated into the logging table 244. A crash telemetry source 246 provides application and/or operating system crash dumps. A crash parsing engine analyzes the crashes and extracts relevant information such as stack trace data (e.g., which functions or files were active during a crash), key contents of memory when a crash occurred, the specific cause of the crash, and other information. The crash information is stored in a crash table 248. In another embodiment, links to crash dump files are kept in the crash table 248 and crash dumps are parsed when needed. As noted above, other types of telemetry data can be similarly managed, for instance, performance telemetry data, user-initiated feedback, custom diagnostics, or others. - It may be noted that in the case of an application crash telemetry source with stack traces, to help manage the scale of telemetry data, a list of binaries, their versions, and bounding dates can be used to find all failures (event instances) within the bounding dates that have one of the binaries in their respective stack traces.
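The stack-trace lookup noted above (selecting failures within bounding dates whose stack traces contain one of a list of binaries) might look like this; the record shape is a hypothetical simplification:

```python
from datetime import date

# Hypothetical parsed crash records: day of failure plus the binaries
# appearing in the stack trace.
crash_records = [
    {"day": date(2015, 4, 2), "stack": ["app.exe", "netio.dll"]},
    {"day": date(2015, 4, 9), "stack": ["other.exe"]},
    {"day": date(2015, 3, 1), "stack": ["app.exe"]},
]

def find_failures(crash_records, binaries, start, end):
    """Select failures within the bounding dates whose stack traces include
    one of the given binaries."""
    binaries = set(binaries)
    return [r for r in crash_records
            if start <= r["day"] <= end and binaries & set(r["stack"])]

hits = find_failures(crash_records, ["app.exe"], date(2015, 4, 1), date(2015, 4, 30))
```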
- In one embodiment, a telemetry source can be provided by monitoring public commentary. User feedback is obtained by crawling Internet web pages and accessing public sources (e.g., Twitter™ messages) for text or multimedia content that makes reference to software that is subject to updates or that might be affected by updates. For a given update, a set of static and dynamic keywords is searched for in content to identify content that is relevant. Static keywords are any words that are commonly used when people discuss software problems, for instance “install”, “reboot”, “crashing”, “hanging”, “uninstall”, etc. Dynamic keywords are generated according to the update, for example, the name or identifier of an update, names of files affected by the update, a publisher of the update, a name of an operating system, and others. When keywords sufficiently match an item of user-authored content, a record is created indicating the time of the content, the related update (if any), perhaps whether the content was found (using known text analysis techniques) to be a positive or negative comment, or other information that might indicate whether a person is having difficulty or success with an update. Counts of dated identified user feedback instances can then serve as another source of telemetry buckets that can be correlated with updates.
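The static/dynamic keyword match could be sketched as follows; the match rule (a minimum number of keyword hits) and all names are illustrative assumptions:

```python
STATIC_KEYWORDS = {"install", "reboot", "crashing", "hanging", "uninstall"}

def dynamic_keywords(update):
    """Keywords generated from the update itself (field names are illustrative):
    the update's name, its publisher, and the files it affects."""
    return {update["name"].lower(), update["publisher"].lower()} | {
        f.lower() for f in update["files"]}

def matches_update(text, update, min_hits=2):
    """Treat a feedback item as relevant when enough static and dynamic
    keywords appear in it (min_hits is an assumed tuning knob)."""
    words = set(text.lower().split())
    hits = len(words & STATIC_KEYWORDS) + len(words & dynamic_keywords(update))
    return hits >= min_hits

update = {"name": "KB100", "publisher": "ExampleCorp", "files": ["app.exe"]}
```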
-
FIG. 8 shows an example correlation engine 260 for correlating telemetry data with updates and possibly source code elements. An anomaly detector 262 may first select from the telemetry store 240 any telemetry event instance records, for example any new telemetry event instance records or any records that occurred after an analysis date or after a date when one or more updates were released. Telemetry buckets are formed and counts over intervals of time are computed for each bucket. Filtering and anomaly detection are then performed. Filtering a bucket generally removes records to focus the bucket on a more relevant set. In cases where a telemetry bucket represents the first occurrence of an event, the bucket can be filtered based on basic conditions such as a minimum number of records in a bucket, a minimum number of unique devices represented in a bucket, a minimum number of days over which the corresponding event instances occurred, or others that might be relevant to the type of telemetry data. For a telemetry source related to application crashes, several filter conditions may be helpful: keep records where the failure was first seen after a release date for an update (a positive filter); remove bucket records where their current operating system version has insufficient deployment (e.g., less than 70%); and remove bucket records where the distribution of related binaries is insufficient (e.g., less than 70%). - In cases where a telemetry bucket has items that represent an event that has occurred prior to the date of an update, prior event instances can be used to determine whether the recent event instances in the bucket are normal or not. For example, a regular event spike might indicate that a recent event spike is not an anomaly. A baseline historical rate of event instances can also serve as a basis for filtering a bucket, for example, using a standard deviation of the historical rate.
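The basic bucket-level filter conditions (minimum records, unique devices, and days) might be sketched as follows; the thresholds are illustrative, not prescribed values:

```python
def passes_basic_filters(bucket, min_records=10, min_devices=5, min_days=3):
    """bucket: list of (device_id, day) pairs, one per event instance.
    Keep only buckets with enough volume to be worth analyzing."""
    devices = {d for d, _ in bucket}
    days = {day for _, day in bucket}
    return (len(bucket) >= min_records
            and len(devices) >= min_devices
            and len(days) >= min_days)

# A tiny bucket fails the volume filters; a wider one passes.
small = [("dev1", 1), ("dev2", 1)]
big = [("dev%d" % i, i % 4) for i in range(20)]
```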
- As noted above, spikes in a bucket (e.g., rapid increases in occurrences over time) can serve as features potentially indicating anomalies. Spikes can be detected in various ways. In the case of application crash telemetry, often there will be a many-to-many relationship between application crashes and event buckets. So, for a given crash or failure, related buckets are found. Then, the buckets are iterated over to find the distribution of the failure or crash among that bucket's updates. Hit counts are obtained from the release date back to a prior date (e.g., 60 days) and are adjusted for the distribution of the crash or failure. Then, pivoting on the corresponding update release date, the hit count mean and variance prior to the update are computed. These are used to determine the cumulative probability of the hit counts after the update was released. Buckets without a sufficiently high hit probability (e.g., 95%) are filtered out.
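The pivot computation described above can be sketched by modeling pre-release hit counts as roughly normal and taking the cumulative probability of a post-release count; the normal assumption and the example numbers are illustrative:

```python
from statistics import NormalDist, mean, stdev

def spike_probability(pre_counts, post_count):
    """Pivot on the release date: compute mean and variance of pre-release
    daily hit counts, then return the cumulative probability of a
    post-release count (values near 1.0 suggest a spike)."""
    mu, sigma = mean(pre_counts), stdev(pre_counts)
    if sigma == 0:
        return 1.0 if post_count > mu else 0.0
    return NormalDist(mu, sigma).cdf(post_count)

# Stable pre-release counts, then a large jump after the update ships.
pre = [10, 12, 9, 11, 10, 12, 11, 9, 10, 11]
p = spike_probability(pre, 40)
```

A bucket whose post-release probability clears the threshold (e.g., 95%) would be kept as a potential anomaly; others would be filtered out.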
- When the buckets have been filtered, one or more heuristics, such as statistical regression, are applied to determine whether any of the buckets indicate that an anomaly has occurred. Analysis for potential regression conditions can be prioritized. For example, in order: the percentage of times an update is present when a crash occurs; the probability that a spike (potential anomaly) does not occur periodically and that the spike is not related to an installation reboot; the possibility that crashes in a crash bucket have caused the overall crash rate for a binary to rise; the possibility that a crash spike is related to an operating system update and not third party software; the probability that a spike is consistent rather than temporary; the probability that a spike is maximal over an extended time (e.g., months). As suggested above, an anomaly analysis can also depend on whether an event corresponding to a bucket is a new event or has occurred previously. The buckets determined to have anomalies are passed to an
update correlator 264. - The
update correlator 264 receives the anomalous buckets and accesses update data, such as the update store 220. Updates can be selected based on criteria used for selecting the telemetry buckets, based on a pivot date, based on a user selection, or any new updates can be selected. Correlations between the updates and the anomalous telemetry buckets are then computed using any known statistical correlation techniques, including those discussed elsewhere herein. - As mentioned above, if detailed update information is available (see change table 226), then depending on the granularity of that information, more specific correlations with anomalies can be found. A
source code correlator 266 can be provided to receive correlations between updates and anomalies. The source code correlator 266 determines the correlation between the source-code changes of a software update and its correlated telemetry anomaly bucket. Each correlation between a source code change and an anomaly is assigned a rank based on the level of match between the source-code change and the source-code emitting the telemetry points in the anomaly bucket. If a direct match is not found, a search of the call graph (or similar source-code analysis) for the source-code that emitted the telemetry and for the changed source-code list is performed to arrive at a prioritized list of source-code changes that could be causes of the anomaly. FIG. 9 shows an example of an association between a source code file 280 and a crash bucket 282. Because the source code file 280 has a function implicated in the crash bucket 282, the telemetry event instances that make up the crash bucket 282 can be used to find a likelihood of correlation between the source code file 280 and the crash bucket 282. In addition, if the crash bucket 282 has information about a specific function (or even a line number or code offset), as shown in FIG. 9 , a correlation with the function can also be computed. - Returning to
FIG. 8 , outputs of the correlation engine 260, such as records of anomaly buckets, records of update-anomaly correlations, and records of source code-anomaly correlations, are stored in an analysis database 268. Tools, clients, automation processes, etc., can access the analysis database 268. For example, an update monitor 270 monitors the analysis database 268 to identify new anomalies and their associated updates. The update monitor 270 can use this information to send instructions to an update controller 272. The update controller 272 controls aspects of updating, such as retracting updates available to be transmitted to devices or constraining the scope of availability for updates (e.g., to a particular region, platform, language, operating system). Because the correlation engine 260 can analyze telemetry feedback about updates in near real-time, anomaly-update correlations can be identified even as updates are still rolling out to devices. This can allow updates to be withdrawn or throttled soon after their release, potentially avoiding problems on devices yet to receive the updates. In addition, anomalies can be spotted for updates that have different binaries for different operating systems or processor architectures. Put another way, a correlation engine can find correlations using telemetry from, and updates to, heterogeneous devices. - The update monitor 270 can also be used to evaluate the effectiveness of updates that are flagged as fixes. A pre-existing event (bucket) can be linked to an update and the effectiveness of the update can be evaluated with feedback provided through a user interface of the
update monitor 270. An update can be deemed effective if the associated event/anomaly drops below a threshold occurrence rate. - The update monitor 270 can also be implemented as a client with a user interface that a developer uses to look for issues that might be related to updates identified by the developer. Or, a user can select an anomaly and see which updates, source code files, etc., are associated with the anomaly. In one embodiment a dashboard type of user interface is provided, which can display the overall health of the devices as reflected by anomalies in the
analysis database 268. Anomalies can be evaluated by their extent (by geography, device type, etc.), their rate of increase, and so on. The dashboard can display alerts when an update is determined to be causing an anomaly. In the case where an update is flagged as a fix intended to remedy an anomaly, the dashboard can provide an indication that the update was successful. -
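The effectiveness criterion noted earlier (an update deemed effective when the linked event's occurrence rate drops below a threshold after the fix ships) might be sketched as follows; the ratio-based threshold is an assumption:

```python
from statistics import mean

def fix_effective(daily_counts, release_index, threshold_ratio=0.2):
    """Deem a fix effective when the mean daily occurrence rate after the
    update's release falls below threshold_ratio of the rate before it."""
    before = daily_counts[:release_index]
    after = daily_counts[release_index:]
    if not before or not after:
        return False
    return mean(after) < threshold_ratio * mean(before)

counts = [50, 48, 52, 49, 3, 2, 1, 0]   # update released at index 4
effective = fix_effective(counts, 4)
```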
FIG. 10 shows a software architecture that can be used to implement multiple correlation engines 260 for respective telemetry sources. The multiple correlation engines 260 can use the same update data. Differences in types of telemetry data necessarily lead to partially differing implementations. For example, the filters 298 used by one correlation engine 260 may differ from those used in another correlation engine 260. Nonetheless, some of the anomaly detection and correlation algorithms can be used by any correlation engine 260. Generic anomaly detection functions, such as a CDF, can be included in an anomaly detection library 300. Similarly, generic algorithms, such as regression algorithms for determining the probability that an anomaly is due to a given software update, can be provided by an update probability library 302. Functions for finding correlations between a telemetry anomaly and a source-code change for a given software update can be provided by a source code correlation library 304, which is available to any correlation engines 260 that can make use of it. The update probability library 302 and the source code correlation library 304 can also provide abstracted access to update data and source code change data. - In one embodiment the correlation engines operate as online algorithms, processing telemetry data and update data as it becomes available. Once a
correlation engine 260 has determined all anomalies at the current time, it feeds the relevant information into the above-mentioned reporting dashboard (via the analysis database 268). The update monitor 270 or other tool can include ranking logic to rank anomalies correlated to updates. Types of anomalies, updates, source code files, etc., can be assigned weights to help prioritize the ranking. Other criteria may be used, such as the magnitude or rate of an anomaly, the duration of an anomaly, the number of anomalies correlated to a same update, and so forth. Anomalies and/or updates can be displayed in the order of their priority ranking to help developers and others to quickly identify the most important issues that need to be addressed. - Feedback can inform the supervised learning discussed above. The
client application 320 displays user interfaces 322, as shown in FIG. 12. The user interfaces 322 can display information discussed above, such as indicia of anomalies ordered by priority, problematic updates, etc. In one embodiment, the anomaly detection/correlation analysis can be provided as a network feedback service 324 that is available to any entities responsible for some of the updates being applied to the devices. An operating system vendor can operate the telemetry infrastructure. Any third party developer is given access to anomaly/update analysis related only to the third party's software for which updates have been issued. It should be noted that the software designs described herein are partly a result of convenience. In practice, different components can be distributed to various responsible parties. - The anomaly
summary user interface 322 might include: a name or identifier of the anomaly; summary information, such as a date the anomaly was identified, the type of anomaly (e.g., a spike), and scope of the anomaly (e.g., an affected operating system or processor); features of the anomaly like the probability that detection of the anomaly is correct and a measure of the anomaly; a listing of correlated updates and their occurrence rate; a telemetry graph showing counts of instances of the anomaly over time; and telemetry statistics such as an average daily count, counts before and after an update, unique machines reporting the event, and others. -
FIG. 13 shows a computing device 350. The computing device 350 can be any type of device having processing hardware 352 (e.g., CPU(s) and/or GPU(s)), storage hardware 354 able to be read from or written to by the processing hardware 352, possibly an input device 356 such as a mouse, camera, microphone, touch sensitive surface, sensor, or others, and possibly also a display 358 of any known type. One or more of the computing devices 350 implement the software components described above that relate to telemetry, updates, and synthesis thereof. - Embodiments and features discussed above can also be realized in the form of information stored in volatile or non-volatile computer or device readable hardware. This is deemed to include at least hardware such as optical storage (e.g., compact-disk read-only memory (CD-ROM)), magnetic storage, flash read-only memory (ROM), or any current or future means of storing digital information. The stored information can be in the form of machine executable instructions (e.g., compiled executable binary code), source code, bytecode, or any other information that can be used to enable or configure computing devices to perform the various embodiments discussed above. This is also deemed to include at least volatile memory such as random-access memory (RAM) and/or virtual memory storing information such as central processing unit (CPU) instructions during execution of a program carrying out an embodiment, as well as non-volatile media storing information that allows a program or executable to be loaded and executed. The embodiments and features can be performed on any type of computing device, including portable devices, workstations, servers, mobile wireless devices, and so on.
Claims (20)
1. A method performed by one or more computing devices, the method comprising:
accessing a plurality of types of telemetry data sources, each telemetry data source comprising a different type of software performance feedback received from devices;
forming pluralities of data sets from each of the respective telemetry data sources, each data set comprising indicia of counts of software event instances on the devices as a function of time from a corresponding telemetry data source;
analyzing each data set to determine whether the data set comprises an anomaly;
computing correlations of software updates of the devices with the data sets determined to comprise respective anomalies, wherein multiple software updates are applied to the devices during timespans that at least partially overlap with timespans of the data sets; and
automatically controlling distribution, application, or availability of one or more of the software updates based on the one or more of the software updates having been sufficiently correlated with one or more of the data sets determined to comprise anomalies.
2. A method according to claim 1 , wherein the controlling comprises sending messages to the devices, the messages identifying a software update.
3. A method according to claim 1 , wherein the timing of distribution of a software update to the devices depends on the correlations.
4. A method according to claim 1 , further comprising accessing source code data linking a source code file of an update with an element in the pluralities of data sets, and, based thereon, selecting one of the software updates.
5. A method according to claim 1 , wherein the automatically controlling comprises sending a message that causes a software update to be sent to at least some of the devices.
6. A method according to claim 5 , wherein the software update either installs or uninstalls code corresponding to a source code file determined to link the software update with an anomaly.
7. A method performed by one or more server computers accessible to user computing devices via a data network, the method comprising:
receiving from the user computing devices, via the data network, telemetry data reports associated with a software product, each computing device comprising the software product installed therein and a reporting module that builds and transmits the telemetry data reports, and each telemetry data report comprising diagnostic information of the software product;
storing the diagnostic information from the telemetry data reports in a data store;
accessing software update information to identify software updates for the software product that have been applied to the user computing devices;
analyzing the diagnostic information in the data store to identify anomalies associated with the software product;
determining correlations between the anomalies and the identified software updates; and
based on the correlations, communicating with the user computing devices to automatically install and/or uninstall a software update configured to update the software product.
8. A method according to claim 7 , wherein the software update comprises one of the identified software updates.
9. A method according to claim 7 , wherein the telemetry data reports comprise indicia of events on the user computing devices and respective times thereof, and wherein the correlations are determined according to the times of the events.
10. A method according to claim 9 , wherein the correlations are determined further according to frequencies of the events.
11. A method according to claim 7 , wherein the telemetry data comprises crash data captured in association with crashes of the software product on the user computing devices.
12. A method according to claim 7 , further comprising, based on the correlations, inhibiting rollout of a software update correlated with an anomaly.
13. A method according to claim 7 , further comprising accessing update data identifying which software updates have been applied to which user computing devices and determining the correlations according to the update data.
14. A method according to claim 13 , wherein the update data is derived from the telemetry data reports.
15. A method according to claim 7 , wherein the software product comprises the reporting module.
16. A method performed by one or more server devices, the method comprising:
accessing a data store to obtain telemetry data stored in the data store, the telemetry data associated with a software application, the telemetry data provided by client computing devices via a network, each client computing device comprising the software application installed thereon, the telemetry data comprising diagnostic information generated by execution of the software application on the client computing devices;
based on the diagnostic information, identifying anomalies of the software application and times associated with the anomalies;
based on the times associated with the anomalies, automatically selecting a software update configured to update the software application; and
transmitting a message configured to cause the selected software update to be installed or uninstalled on at least some of the client computing devices.
17. A method according to claim 16 , wherein the selected software update comprises a package configured to add, remove, or replace an executable file corresponding to the software application.
18. A method according to claim 16 , wherein the telemetry data comprises information derived from stack traces of the software application.
19. A method according to claim 16 , further comprising enabling display of a user interface, the user interface comprising a graph representing an identified anomaly.
20. A method according to claim 19 , the user interface further comprising a graphic representation of the selected software update.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/453,782 US20170177468A1 (en) | 2015-04-01 | 2017-03-08 | Anomaly Analysis For Software Distribution |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/676,214 US9626277B2 (en) | 2015-04-01 | 2015-04-01 | Anomaly analysis for software distribution |
US15/453,782 US20170177468A1 (en) | 2015-04-01 | 2017-03-08 | Anomaly Analysis For Software Distribution |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/676,214 Continuation US9626277B2 (en) | 2015-04-01 | 2015-04-01 | Anomaly analysis for software distribution |
Publications (1)
Publication Number | Publication Date |
---|---|
US20170177468A1 true US20170177468A1 (en) | 2017-06-22 |
Family
ID=55752711
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/676,214 Expired - Fee Related US9626277B2 (en) | 2015-04-01 | 2015-04-01 | Anomaly analysis for software distribution |
US15/453,782 Abandoned US20170177468A1 (en) | 2015-04-01 | 2017-03-08 | Anomaly Analysis For Software Distribution |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/676,214 Expired - Fee Related US9626277B2 (en) | 2015-04-01 | 2015-04-01 | Anomaly analysis for software distribution |
Country Status (4)
Country | Link |
---|---|
US (2) | US9626277B2 (en) |
EP (1) | EP3278223A1 (en) |
CN (1) | CN107533504A (en) |
WO (1) | WO2016160381A1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107885433A (en) * | 2017-11-23 | 2018-04-06 | 广东欧珀移动通信有限公司 | Control method, device, terminal, server and the storage medium of terminal device |
US10380339B1 (en) * | 2015-06-01 | 2019-08-13 | Amazon Technologies, Inc. | Reactively identifying software products exhibiting anomalous behavior |
US10387899B2 (en) * | 2016-10-26 | 2019-08-20 | New Relic, Inc. | Systems and methods for monitoring and analyzing computer and network activity |
US11341018B2 (en) * | 2018-10-08 | 2022-05-24 | Acer Cyber Security Incorporated | Method and device for detecting abnormal operation of operating system |
US11550925B2 (en) | 2021-03-24 | 2023-01-10 | Bank Of America Corporation | Information security system for identifying potential security threats in software package deployment |
US11625315B2 (en) * | 2019-05-29 | 2023-04-11 | Microsoft Technology Licensing, Llc | Software regression recovery via automated detection of problem change lists |
Families Citing this family (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9172591B2 (en) * | 2012-02-06 | 2015-10-27 | Deepfield Networks | System and method for management of cloud-based systems |
US10133614B2 (en) * | 2015-03-24 | 2018-11-20 | Ca, Inc. | Anomaly classification, analytics and resolution based on annotated event logs |
US9626277B2 (en) * | 2015-04-01 | 2017-04-18 | Microsoft Technology Licensing, Llc | Anomaly analysis for software distribution |
US10536357B2 (en) | 2015-06-05 | 2020-01-14 | Cisco Technology, Inc. | Late data detection in data center |
US10142353B2 (en) | 2015-06-05 | 2018-11-27 | Cisco Technology, Inc. | System for monitoring and managing datacenters |
US10140171B2 (en) * | 2016-04-14 | 2018-11-27 | International Business Machines Corporation | Method and apparatus for downsizing the diagnosis scope for change-inducing errors |
US10191800B2 (en) * | 2016-04-29 | 2019-01-29 | Cisco Technology, Inc. | Metric payload ingestion and replay |
US10685292B1 (en) * | 2016-05-31 | 2020-06-16 | EMC IP Holding Company LLC | Similarity-based retrieval of software investigation log sets for accelerated software deployment |
US10268563B2 (en) * | 2016-06-23 | 2019-04-23 | Vmware, Inc. | Monitoring of an automated end-to-end crash analysis system |
US10331508B2 (en) | 2016-06-23 | 2019-06-25 | Vmware, Inc. | Computer crash risk assessment |
US10338990B2 (en) | 2016-06-23 | 2019-07-02 | Vmware, Inc. | Culprit module detection and signature back trace generation |
US10365959B2 (en) | 2016-06-23 | 2019-07-30 | Vmware, Inc. | Graphical user interface for software crash analysis data |
US10191837B2 (en) | 2016-06-23 | 2019-01-29 | Vmware, Inc. | Automated end-to-end analysis of customer service requests |
US10372434B1 (en) * | 2016-07-22 | 2019-08-06 | Amdocs Development Limited | Apparatus, computer program, and method for communicating an update to a subset of devices |
US20180032905A1 (en) * | 2016-07-29 | 2018-02-01 | Appdynamics Llc | Adaptive Anomaly Grouping |
US10521590B2 (en) * | 2016-09-01 | 2019-12-31 | Microsoft Technology Licensing Llc | Detection dictionary system supporting anomaly detection across multiple operating environments |
US10346762B2 (en) | 2016-12-21 | 2019-07-09 | Ca, Inc. | Collaborative data analytics application |
US10409367B2 (en) | 2016-12-21 | 2019-09-10 | Ca, Inc. | Predictive graph selection |
US10289526B2 (en) | 2017-02-06 | 2019-05-14 | Microsoft Technology Licensing, Llc | Object oriented data tracking on client and remote server |
US10747520B2 (en) | 2017-04-14 | 2020-08-18 | Microsoft Technology Licensing, Llc | Resource deployment using device analytics |
US11176464B1 (en) | 2017-04-25 | 2021-11-16 | EMC IP Holding Company LLC | Machine learning-based recommendation system for root cause analysis of service issues |
US10496469B2 (en) | 2017-07-25 | 2019-12-03 | Aurora Labs Ltd. | Orchestrator reporting of probability of downtime from machine learning process |
US10891219B1 (en) | 2017-08-07 | 2021-01-12 | Electronic Arts Inc. | Code failure prediction system |
US10372438B2 (en) | 2017-11-17 | 2019-08-06 | International Business Machines Corporation | Cognitive installation of software updates based on user context |
US10963330B2 (en) * | 2017-11-24 | 2021-03-30 | Microsoft Technology Licensing, Llc | Correlating failures with performance in application telemetry data |
US11023218B1 (en) * | 2017-12-31 | 2021-06-01 | Wells Fargo Bank, N.A. | Metadata driven product configuration management |
US10929217B2 (en) | 2018-03-22 | 2021-02-23 | Microsoft Technology Licensing, Llc | Multi-variant anomaly detection from application telemetry |
US20190334759A1 (en) * | 2018-04-26 | 2019-10-31 | Microsoft Technology Licensing, Llc | Unsupervised anomaly detection for identifying anomalies in data |
US11651272B2 (en) * | 2018-07-06 | 2023-05-16 | Sap Se | Machine-learning-facilitated conversion of database systems |
KR102610730B1 (en) * | 2018-09-05 | 2023-12-07 | 현대자동차주식회사 | Apparatus for providing update of vehicle and computer-readable storage medium |
US10956307B2 (en) | 2018-09-12 | 2021-03-23 | Microsoft Technology Licensing, Llc | Detection of code defects via analysis of telemetry data across internal validation rings |
CN109582485B (en) * | 2018-10-26 | 2022-05-03 | 创新先进技术有限公司 | Configuration change abnormity detection method and device |
US11204755B2 (en) * | 2018-11-13 | 2021-12-21 | Split Software, Inc. | Systems and methods for providing event attribution in software application |
US11263116B2 (en) | 2019-01-24 | 2022-03-01 | International Business Machines Corporation | Champion test case generation |
US11010282B2 (en) | 2019-01-24 | 2021-05-18 | International Business Machines Corporation | Fault detection and localization using combinatorial test design techniques while adhering to architectural restrictions |
US11106567B2 (en) | 2019-01-24 | 2021-08-31 | International Business Machines Corporation | Combinatoric set completion through unique test case generation |
US11099975B2 (en) | 2019-01-24 | 2021-08-24 | International Business Machines Corporation | Test space analysis across multiple combinatoric models |
US10922210B2 (en) * | 2019-02-25 | 2021-02-16 | Microsoft Technology Licensing, Llc | Automatic software behavior identification using execution record |
US10785105B1 (en) * | 2019-03-12 | 2020-09-22 | Microsoft Technology Licensing, Llc | Dynamic monitoring on service health signals |
US11720461B2 (en) | 2019-03-12 | 2023-08-08 | Microsoft Technology Licensing, Llc | Automated detection of code regressions from time-series data |
US10963366B2 (en) | 2019-06-13 | 2021-03-30 | International Business Machines Corporation | Regression test fingerprints based on breakpoint values |
US10970197B2 (en) * | 2019-06-13 | 2021-04-06 | International Business Machines Corporation | Breakpoint value-based version control |
US11232020B2 (en) | 2019-06-13 | 2022-01-25 | International Business Machines Corporation | Fault detection using breakpoint value-based fingerprints of failing regression test cases |
US11422924B2 (en) | 2019-06-13 | 2022-08-23 | International Business Machines Corporation | Customizable test set selection using code flow trees |
US10970195B2 (en) | 2019-06-13 | 2021-04-06 | International Business Machines Corporation | Reduction of test infrastructure |
US10891128B1 (en) * | 2019-08-07 | 2021-01-12 | Microsoft Technology Licensing, Llc | Software regression detection in computing systems |
US11244012B2 (en) | 2019-11-06 | 2022-02-08 | Kyndryl, Inc. | Compliance by clustering assets according to deviations |
US11294804B2 (en) * | 2020-03-23 | 2022-04-05 | International Business Machines Corporation | Test case failure with root cause isolation |
US11179644B2 (en) | 2020-03-30 | 2021-11-23 | Electronic Arts Inc. | Videogame telemetry data and game asset tracker for session recordings |
JP7073438B2 (en) * | 2020-04-14 | 2022-05-23 | 日本電子株式会社 | Automatic analyzer and control method of automated analyzer |
US11446570B2 (en) | 2020-05-08 | 2022-09-20 | Electronic Arts Inc. | Automated test multiplexing system |
US20210406150A1 (en) * | 2020-06-25 | 2021-12-30 | Segment.io, Inc. | Application instrumentation and event tracking |
US20240069999A1 (en) * | 2022-08-31 | 2024-02-29 | Microsoft Technology Licensing, Llc | Detecting and mitigating cross-layer impact of change events on a cloud computing system |
Family Cites Families (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6141683A (en) | 1998-01-30 | 2000-10-31 | Lucent Technologies, Inc. | Method for remotely and reliably updating of the software on a computer with provision for roll back |
US6650932B1 (en) | 2000-05-15 | 2003-11-18 | Boston Medical Technologies, Inc. | Medical testing telemetry system |
US7409593B2 (en) | 2003-06-30 | 2008-08-05 | At&T Delaware Intellectual Property, Inc. | Automated diagnosis for computer networks |
US20050097199A1 (en) | 2003-10-10 | 2005-05-05 | Keith Woodard | Method and system for scanning network devices |
US20050132351A1 (en) | 2003-12-12 | 2005-06-16 | Randall Roderick K. | Updating electronic device software employing rollback |
KR100750132B1 (en) * | 2005-09-27 | 2007-08-21 | 삼성전자주식회사 | Method and system for booting, updating software automatically and recovering update error, and computer readable medium recording the method |
US8719809B2 (en) | 2006-12-22 | 2014-05-06 | Commvault Systems, Inc. | Point in time rollback and un-installation of software |
US8655623B2 (en) * | 2007-02-13 | 2014-02-18 | International Business Machines Corporation | Diagnostic system and method |
US20080201705A1 (en) | 2007-02-15 | 2008-08-21 | Sun Microsystems, Inc. | Apparatus and method for generating a software dependency map |
US8745703B2 (en) * | 2008-06-24 | 2014-06-03 | Microsoft Corporation | Identifying exploitation of vulnerabilities using error report |
US8561179B2 (en) * | 2008-07-21 | 2013-10-15 | Palo Alto Research Center Incorporated | Method for identifying undesirable features among computing nodes |
US9152530B2 (en) | 2009-05-14 | 2015-10-06 | Oracle America, Inc. | Telemetry data analysis using multivariate sequential probability ratio test |
US20110314138A1 (en) | 2010-06-21 | 2011-12-22 | Hitachi, Ltd. | Method and apparatus for cause analysis configuration change |
US20110321007A1 (en) | 2010-06-29 | 2011-12-29 | International Business Machines Corporation | Targeting code sections for correcting computer program product defects using records of a defect tracking system |
US9766962B2 (en) | 2012-06-07 | 2017-09-19 | Vmware, Inc. | Correlating performance degradation of applications to specific changes made to applications |
EP2862077A4 (en) | 2012-06-15 | 2016-03-02 | Cycle Computing Llc | Method and system for automatically detecting and resolving infrastructure faults in cloud infrastructure |
US20140033174A1 (en) | 2012-07-29 | 2014-01-30 | International Business Machines Corporation | Software bug predicting |
US8984331B2 (en) * | 2012-09-06 | 2015-03-17 | Triumfant, Inc. | Systems and methods for automated memory and thread execution anomaly detection in a computer network |
US9448792B2 (en) * | 2013-03-14 | 2016-09-20 | Microsoft Technology Licensing, Llc | Automatic risk analysis of software |
US9558056B2 (en) * | 2013-07-28 | 2017-01-31 | OpsClarity Inc. | Organizing network performance metrics into historical anomaly dependency data |
US9378079B2 (en) * | 2014-09-02 | 2016-06-28 | Microsoft Technology Licensing, Llc | Detection of anomalies in error signals of cloud based service |
US9904584B2 (en) * | 2014-11-26 | 2018-02-27 | Microsoft Technology Licensing, Llc | Performance anomaly diagnosis |
US10133614B2 (en) * | 2015-03-24 | 2018-11-20 | Ca, Inc. | Anomaly classification, analytics and resolution based on annotated event logs |
US9626277B2 (en) * | 2015-04-01 | 2017-04-18 | Microsoft Technology Licensing, Llc | Anomaly analysis for software distribution |
- 2015
  - 2015-04-01 US US14/676,214 patent/US9626277B2/en not_active Expired - Fee Related
- 2016
  - 2016-03-21 EP EP16716336.9A patent/EP3278223A1/en not_active Withdrawn
  - 2016-03-21 CN CN201680020978.1A patent/CN107533504A/en not_active Withdrawn
  - 2016-03-21 WO PCT/US2016/023337 patent/WO2016160381A1/en active Application Filing
- 2017
  - 2017-03-08 US US15/453,782 patent/US20170177468A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
WO2016160381A1 (en) | 2016-10-06 |
US20160292065A1 (en) | 2016-10-06 |
CN107533504A (en) | 2018-01-02 |
US9626277B2 (en) | 2017-04-18 |
EP3278223A1 (en) | 2018-02-07 |
Similar Documents
Publication | Title |
---|---|
US9626277B2 (en) | Anomaly analysis for software distribution |
US20210294716A1 | Continuous software deployment |
Kabinna et al. | Examining the stability of logging statements |
US10783051B2 | Performance regression framework |
Decan et al. | What do package dependencies tell us about semantic versioning? |
US10346282B2 | Multi-data analysis based proactive defect detection and resolution |
US9836299B2 | Optimizing software change processes using real-time analysis and rule-based hinting |
US10769250B1 | Targeted security monitoring using semantic behavioral change analysis |
US7421490B2 | Uniquely identifying a crashed application and its environment |
WO2019182932A1 | Unified test automation system |
US10984109B2 | Application component auditor |
US20140109053A1 | Identifying high impact bugs |
US9712418B2 | Automated network control |
US11093319B2 | Automated recovery of webpage functionality |
US11106520B2 | Systems and methods for preventing client application crashes due to operating system updates |
US11366713B2 | System and method for automatically identifying and resolving computing errors |
WO2014120192A1 | Error developer association |
CN104657255A | Computer-implemented method and system for monitoring information technology systems |
US11113169B2 | Automatic creation of best known configurations |
Linares-Vásquez | Supporting evolution and maintenance of android apps |
US10305738B2 | System and method for contextual clustering of granular changes in configuration items |
Browne et al. | Comprehensive, open-source resource usage measurement and analysis for HPC systems |
Decan et al. | On the outdatedness of workflows in the GitHub Actions ecosystem |
Bolduc | Lessons learned: Using a static analysis tool within a continuous integration system |
US20230237366A1 | Scalable and adaptive self-healing based architecture for automated observability of machine learning models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:THANGAMANI, AARTHI;NITTA, BRYSTON;DAY, CHRIS;AND OTHERS;REEL/FRAME:041510/0520 Effective date: 20150327 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |