WO2019046996A1 - Java software latency anomaly detection

Info

Publication number: WO2019046996A1
Application number: PCT/CN2017/100457
Authority: WO
Inventors: Kingsum Chow; Wanyi ZHU; Chuansheng LU; Jiapeng LI; Sanhong Li
Original assignee: Alibaba Group Holding Limited
Priority date: 2017-09-05
Filing date: 2017-09-05
Publication date: 2019-03-14

Abstract

Systems and methods provided herein are directed to automatically detecting latency anomaly from a large amount of application method latency data by a combination of statistical and clustering analyses and machine learning methods. Profile data of an application running in a computing device is collected and a predetermined set of data is excluded from the profile data. Application methods included in remaining profile data are analyzed statistically to identify a first set of application methods having abnormal latencies, and analyzed using clustering algorithms to identify a second set of application methods. A list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods is then generated.

Description

JAVA SOFTWARE LATENCY ANOMALY DETECTION

BACKGROUND

As the use of computers, and of the software running on them, continues to grow in everyday transactions, any latency in such transactions can have a range of consequences. For small-scale transactions, such as personal online shopping, the consequences may be limited to an individual user being annoyed by a lack of response. For large-scale transactions, such as commercial, industrial, or security transactions, the consequences may include a critical commercial or security transaction failing to complete. Additionally, many interacting devices are often updated or replaced with newer software and/or hardware, and unintended incompatibility among the devices may introduce latencies.

Data center software developers and operators may use software method latencies to find the sources of poor response times, where each software method, or application method, is a set of code that is referred to by its own name and can be called or invoked at any point in a program by using that name. While commercially available tools, such as the Java Development Kit (JDK), can collect application method latencies, analyzing a large amount of such data may be difficult.

A Java method tracing capability implemented in a customized JDK product may provide diagnostic and profiling options to track the performance of Java applications via instrumentation. To minimize the overhead incurred by instrumentation, instructions may be added at the entry and exit of a compiled Java method by modifying the interpreter and compilers in a Java Virtual Machine (JVM), which allows users to track the entry time and exit time of an execution of the Java application method. The Java method tracing capability may also provide application programming interfaces (APIs) that let a profiling system specify which threads are traced, and that allow the tracing to be enabled and disabled on the fly on a per-thread basis. The Java method tracing capability may allow the user to get comprehensive information regarding the code flow in a Java application.

However, there are challenges, or additional desired capabilities, regarding Java profiling, such as 1) how to automatically detect anomalous performance behaviors, 2) how to automatically identify a method with abnormally high latency, 3) how to identify, in a responsive and timely manner, the workload or incoming web requests affected by a long-latency behavior, and 4) how to prioritize identified issues.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features.

FIG. 1 illustrates an example environment in which a software latency anomaly may be detected.

FIG. 2 illustrates an example flow diagram for identifying an application method having a software latency anomaly.

FIG. 3 illustrates an example process detailing one of the blocks of FIG. 2.

FIG. 4 illustrates an example process detailing one of the blocks of FIG. 2.

FIG. 5 illustrates an example block diagram of a system for identifying an application method having an abnormal latency.

DETAILED DESCRIPTION

Systems and methods discussed herein are directed to automatically detecting latency anomaly from a large amount of application method latency data by a combination of statistical and clustering analyses and machine learning methods.

The systems and methods discussed herein allow users to automatically identify perceptible performance inefficiencies by focusing on the latency of an application method without stepping through code, and may be used by companies that run large data centers and use Java as their primary development language.

FIG. 1 illustrates an example environment 100 in which a software latency anomaly in an application may be detected. Software latency may be interchangeably referred to as application latency, and a software method may be interchangeably referred to as an application method. A data center 102 (which may be a single computing device or a cloud computing center comprising a plurality of computing devices and/or servers) is communicatively coupled, via a network 104, to a plurality of user devices, of which four user devices 106, 108, 110, and 112 are shown. The data center 102 and the plurality of user devices 106, 108, 110, and 112 are also communicatively coupled to a monitoring computing device 114 via the network 104. The monitoring computing device 114 is configured to monitor the application running in the data center 102 for abnormal latencies of application methods in the application. The application running in the data center 102 that is monitored by the monitoring computing device 114 may be a Java application.

FIG. 2 illustrates an example flow diagram 200 for identifying an application method having a software latency anomaly.

In block 202, profile data of the application, such as a Java application running in the data center 102, may be collected. The profile data may comprise various parameters associated with the application, such as, but not limited to, 1) a TraceID, which is a unique identifier of each web request and enables a user to search across all collected logs and trace how a request is passed from one service to the next, 2) a ThreadID, which is a unique numeric identifier of each application thread running in the data center 102, 3) a MethodID, which is a unique identifier for each application method of the application, 4) an inclusive time of each application method, where the inclusive time is a total time interval spent by an application method and one or more called application methods associated with that application method, and 5) an exclusive time of each application method, where the exclusive time is a time interval spent only by the application method itself.
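The five parameters listed above can be pictured as one record per traced method execution. The following is a minimal sketch, not the data layout used by the described system: the class name `ProfileRecord`, its field names, the millisecond unit, and the helper `exclusive_time` are all hypothetical. The helper assumes the common profiler convention that exclusive time equals inclusive time minus the inclusive times of the directly called methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileRecord:
    """One traced method execution (field names are illustrative assumptions)."""
    trace_id: str        # unique identifier of the web request
    thread_id: int       # numeric identifier of the application thread
    method_id: str       # unique identifier of the application method
    inclusive_ms: float  # time spent in the method and in the methods it calls
    exclusive_ms: float  # time spent only in the method itself


def exclusive_time(inclusive_ms, callee_inclusive_ms):
    """Assumed relation: subtract the inclusive times of the direct callees."""
    return inclusive_ms - sum(callee_inclusive_ms)


# A method that ran for 120 ms in total and called two methods (70 ms and 30 ms).
record = ProfileRecord(
    trace_id="trace-1",
    thread_id=17,
    method_id="OrderService.submit",
    inclusive_ms=120.0,
    exclusive_ms=exclusive_time(120.0, [70.0, 30.0]),
)
```

In this example the method itself accounts for 20 ms of the 120 ms, so a high inclusive time with a low exclusive time points at the called methods rather than at the method being examined.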

In block 204, a predetermined set of data may be excluded from the profile data before further evaluation. The predetermined set of data to be excluded may comprise 1) application methods having incomplete call stacks, which are known to occur very rarely, 2) application standard libraries, which are commonly available and used and are not unique to the application running, and 3) application methods having inclusive times that are less than a threshold time, which are considered to be behaving normally.
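The three exclusion rules of block 204 amount to a filter over the collected records. The sketch below is illustrative only: the dictionary keys (`complete_stack`, `method_id`, `inclusive_ms`), the package prefixes used to recognize standard libraries, and the 1 ms threshold are assumptions, since the description does not specify them.

```python
def exclude_predetermined(records,
                          standard_prefixes=("java.", "javax.", "sun."),
                          min_inclusive_ms=1.0):
    """Keep only the records that survive the three exclusion rules."""
    kept = []
    for r in records:
        if not r["complete_stack"]:
            continue  # rule 1: incomplete call stack
        if r["method_id"].startswith(standard_prefixes):
            continue  # rule 2: standard library method (assumed prefixes)
        if r["inclusive_ms"] < min_inclusive_ms:
            continue  # rule 3: below the threshold time, treated as normal
        kept.append(r)
    return kept


sample = [
    {"method_id": "com.shop.Cart.total", "inclusive_ms": 42.0, "complete_stack": True},
    {"method_id": "java.util.HashMap.get", "inclusive_ms": 3.0, "complete_stack": True},
    {"method_id": "com.shop.Cart.size", "inclusive_ms": 0.2, "complete_stack": True},
    {"method_id": "com.shop.Pay.charge", "inclusive_ms": 90.0, "complete_stack": False},
]
remaining = exclude_predetermined(sample)
```

Only `com.shop.Cart.total` remains in this sample; the other three records are removed by rules 2, 3, and 1 respectively.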

In block 206, the application methods included in the remaining profile data may be statistically analyzed to identify a first set of application methods having abnormal latencies, and in block 208, the application methods included in the remaining profile data may be analyzed using clustering algorithms to identify a second set of application methods. A list of application methods having abnormal latencies may then be generated in block 210 based on at least one of the first set of application methods or the second set of application methods.
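Block 210 combines the outputs of blocks 206 and 208. One simple way to do so, shown here purely as an assumption, is to take the union of the two sets and list first the methods that both analyses flagged. The function name `generate_abnormal_list` and the ordering rule are hypothetical; the description only requires that the list be based on at least one of the two sets.

```python
def generate_abnormal_list(first_set, second_set):
    """Union of both analyses, with methods flagged by both listed first."""
    both = first_set & second_set
    either = first_set | second_set
    # False sorts before True, so methods found by both analyses come first;
    # ties are broken alphabetically to keep the output deterministic.
    return sorted(either, key=lambda method_id: (method_id not in both, method_id))


statistical = {"Cart.total", "Pay.charge"}   # first set (block 206)
clustering = {"Pay.charge", "Stock.reserve"}  # second set (block 208)
abnormal_methods = generate_abnormal_list(statistical, clustering)
```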

FIG. 3 illustrates an example process 300 detailing the statistical analysis performed in block 206 of FIG. 2.

In block 302, an application method at the bottom of each call stack (bottom application method) may be selected in order of when the call stack was created, and in block 304, all TraceIDs associated with call stacks having the same bottom application method may be selected. The selected TraceIDs may then be grouped into two groups in block 306: a normal TraceID group, in which each TraceID has the bottom application method with an expected inclusive time, and an abnormal TraceID group, in which each TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range. The TraceIDs with the bottom application method whose inclusive time lies outside the outer fences (Q1 - 3 × IQR, Q3 + 3 × IQR) may be considered to be extreme outliers, where Q1 is the first quartile of the inclusive times of the bottom application method from the selected TraceIDs, Q3 is the third quartile of the inclusive times of the bottom application method from the selected TraceIDs, and IQR is the interquartile range, which is equal to Q3 - Q1.
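The grouping in block 306 can be sketched with the standard library. This is a minimal illustration under stated assumptions: the function name `group_traces` and the input shape (a mapping from TraceID to the inclusive time of the bottom application method) are hypothetical, and the quartiles are computed with the "inclusive" interpolation method of `statistics.quantiles`, which is only one of several quartile conventions.

```python
from statistics import quantiles


def group_traces(inclusive_ms_by_trace, k=3.0):
    """Split TraceIDs into (normal, abnormal) using Tukey's outer fences."""
    q1, _, q3 = quantiles(inclusive_ms_by_trace.values(), n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr  # outer fences for k = 3
    normal, abnormal = [], []
    for trace_id, t in inclusive_ms_by_trace.items():
        (normal if lower <= t <= upper else abnormal).append(trace_id)
    return normal, abnormal


# Inclusive times (ms) of the same bottom application method in six requests.
times = {"t1": 10, "t2": 11, "t3": 12, "t4": 13, "t5": 14, "t6": 200}
normal, abnormal = group_traces(times)
```

With these sample values the quartiles are 11.25 and 13.75, so the outer fences are 3.75 and 21.25 and only `t6` falls in the abnormal group.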

In block 308, a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the same application method in the same call stack from the normal TraceID group may be calculated. The call stacks with the same stack depths or the same first number may be considered to be the same. If the ratio is higher than a threshold value, an indicator for the MethodID associated with the application method used to calculate the ratio may be set to TRUE; otherwise, the indicator for the associated MethodID may be set to FALSE, indicating that the application method is normal.
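The ratio test of block 308 can be sketched as follows. Several details here are assumptions rather than part of the description: the function name `slow_indicators`, the use of the median over the normal group as the reference inclusive time, and the threshold value of 2.0. Each trace is represented as a mapping from MethodID to inclusive time for one matching call stack.

```python
from statistics import median


def slow_indicators(abnormal_trace, normal_traces, threshold=2.0):
    """Return {method_id: True/False}, True when the ratio exceeds the threshold."""
    indicators = {}
    for method_id, abnormal_ms in abnormal_trace.items():
        reference = [t[method_id] for t in normal_traces if method_id in t]
        if not reference:
            continue  # no normal counterpart to compare against
        baseline = median(reference)  # assumed choice of reference time
        indicators[method_id] = baseline > 0 and abnormal_ms / baseline > threshold
    return indicators


abnormal_trace = {"A": 300.0, "B": 280.0, "C": 12.0}
normal_traces = [
    {"A": 100.0, "B": 80.0, "C": 10.0},
    {"A": 110.0, "B": 90.0, "C": 11.0},
]
flags = slow_indicators(abnormal_trace, normal_traces)
```

Here methods A and B are roughly three times slower than in the normal group and are flagged, while C is within the threshold.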

In block 310, a critical MethodID may be identified by selecting the topmost application method at which the corresponding call stack deviates significantly from normal behavior, changing from normal to abnormal. In block 312, the process may repeat from block 304 if there is another application method to be evaluated. If there are no more application methods to be processed, then in block 314, critical MethodIDs may be ranked based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
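Blocks 310 and 314 can be illustrated with the sketch below, which rests on two interpretations that the description does not spell out. First, the critical method is taken to be the flagged method closest to the top of a call stack that is listed from bottom to top. Second, ranking sorts by the number of distinct TraceIDs in which a MethodID was critical, and then by its largest deviation from the normal group's median inclusive time. The function names and the input shapes are hypothetical.

```python
from collections import defaultdict


def critical_method(stack_bottom_to_top, indicators):
    """Return the topmost method whose indicator is TRUE, or None."""
    for method_id in reversed(stack_bottom_to_top):
        if indicators.get(method_id, False):
            return method_id
    return None


def rank_critical(findings):
    """findings: iterable of (trace_id, method_id, deviation_ms) tuples."""
    traces = defaultdict(set)
    worst = defaultdict(float)
    for trace_id, method_id, deviation_ms in findings:
        traces[method_id].add(trace_id)
        worst[method_id] = max(worst[method_id], deviation_ms)
    # More affected TraceIDs first, then larger deviation first.
    return sorted(traces, key=lambda m: (-len(traces[m]), -worst[m]))


critical = critical_method(["A", "B", "C"], {"A": True, "B": True, "C": False})
ranking = rank_critical([
    ("t1", "B", 195.0),
    ("t2", "B", 150.0),
    ("t3", "D", 400.0),
])
```

In the example, A and B are both slow but C, which B calls, is normal, so B is where the extra time is spent. B also ranks ahead of D because it is critical in two requests rather than one.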

FIG. 4 illustrates an example process 400 detailing analysis using clustering algorithms performed in block 208 of FIG. 2.

In block 402, an application method may be selected based on an occurrence frequency of the application method in a log file of the application. In block 404, the profile data associated with the selected application method, such as the associated TraceID and the associated inclusive and exclusive times, may be provided to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm). Based on results from the hierarchical clustering algorithm in block 404, an outlier may be identified in block 406 based on a small or sparse cluster and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods. The outlier may be identified using a machine-learning-based anomaly detection process.
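Blocks 404 and 406 can be sketched with a small, deliberately naive agglomerative clustering routine that uses mean (average) linkage on (inclusive time, exclusive time) points. It is not the implementation used by the described system. The distance cutoff that stops merging, the maximum size of a "small" cluster, and the 90th percentile are assumed parameters, and the function names are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import median, quantiles


def average_linkage(a, b):
    """Mean pairwise Euclidean distance between two clusters."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def hierarchical_clusters(points, cutoff):
    """Merge the two closest clusters until the closest pair exceeds cutoff."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), average_linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        merged = clusters.pop(j)  # j > i, so index i is still valid
        clusters[i].extend(merged)
    return clusters


def outlier_clusters(clusters, max_size=2, percentile=90):
    """Small clusters whose median inclusive or exclusive time is unusually high."""
    points = [p for c in clusters for p in c]
    cuts = [quantiles([p[k] for p in points], n=100, method="inclusive")[percentile - 1]
            for k in (0, 1)]
    return [c for c in clusters
            if len(c) <= max_size
            and any(median(p[k] for p in c) > cuts[k] for k in (0, 1))]


# (inclusive_ms, exclusive_ms) for ten executions of one application method.
points = [(10, 5), (11, 5), (12, 6), (10, 4), (11, 6),
          (13, 5), (12, 5), (11, 4), (10, 6), (95, 80)]
clusters = hierarchical_clusters(points, cutoff=10.0)
outliers = outlier_clusters(clusters)
```

The nine similar executions merge into one cluster, and the single slow execution stays in a cluster of its own whose median times exceed the 90th percentile of the population, so it is reported as the outlier.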

In block 408, the profile data associated with the selected application method may be provided to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, and in block 410, based on results from the DBSCAN algorithm, an outlier having a cluster with a low density may be identified.
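Blocks 408 and 410 can be illustrated with a compact DBSCAN written against the standard library. In this sketch, points that do not belong to any sufficiently dense neighborhood are labeled as noise and treated as the low-density outliers. The parameter values `eps=3.0` and `min_pts=3` and the sample data are assumptions chosen for the example, not values from the description.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster index (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep expanding
    return labels


# (inclusive_ms, exclusive_ms) for ten executions of one application method.
points = [(10, 5), (11, 5), (12, 6), (10, 4), (11, 6),
          (13, 5), (12, 5), (11, 4), (10, 6), (95, 80)]
labels = dbscan(points, eps=3.0, min_pts=3)
low_density = [p for p, label in zip(points, labels) if label == NOISE]
```

The nine executions with similar timings form one dense cluster, while the isolated slow execution has no neighbors within `eps` and is returned as the low-density outlier.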

In block 412, the process repeats from block 402 if there is another application method to be evaluated. If there are no more application methods to be processed, then in block 414, MethodIDs associated with the clusters identified as outliers and associated TraceIDs may be ranked based on a frequency of occurrence of the associated TraceIDs, and based on a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.

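The Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID may be calculated as:

$$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^{2} + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^{2}}$$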

where:

distance (t_incl, t_excl) is the Euclidean distance,

t_incl is the inclusive time of the application method associated with the identified MethodID,

median (t_incl) is a median of inclusive times of the application method in an overall population of the application methods,

t_excl is the exclusive time of the application method associated with the identified MethodID, and

median (t_excl) is a median of exclusive times of the application method in an overall population of the application methods.
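The quantities defined above can be combined in a short function. The sketch assumes that the distance is measured from the point (median (t_incl), median (t_excl)) of the overall population to the point (t_incl, t_excl) of the identified method. The function name `latency_distance` and the sample values are hypothetical.

```python
from math import hypot
from statistics import median


def latency_distance(t_incl, t_excl, population_incl, population_excl):
    """Euclidean distance of (t_incl, t_excl) from the population medians."""
    return hypot(t_incl - median(population_incl),
                 t_excl - median(population_excl))


# Population medians are 10 ms (inclusive) and 5 ms (exclusive).
population_incl = [9.0, 10.0, 11.0]
population_excl = [4.0, 5.0, 6.0]
d = latency_distance(13.0, 9.0, population_incl, population_excl)
```

For this example the differences from the medians are 3 ms and 4 ms, giving a distance of 5. A larger distance indicates a method execution that lies further from typical behavior and may therefore be ranked higher in block 414.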

FIG. 5 illustrates an example block diagram of a system 502 for identifying an application method having an abnormal latency. The system 502 may be, or may reside in, the monitoring computing device 114. The system 502 may comprise one or more processors 504 and memory 506 coupled to the one or more processors 504. The memory 506 may comprise various modules communicatively coupled to each other which are executable by the one or more processors 504. As illustrated in FIG. 1, the system 502 may be communicatively coupled to the data center 102 and the user devices 106, 108, 110, and 112 via the network 104.

The memory 506 may comprise various modules that are communicatively coupled to each other. The various modules may comprise a profile data module 508, a statistical analysis module 510, a clustering algorithm module 512, and a list generator module 514. As discussed above with reference to FIG. 2, the profile data module 508 may be configured to collect profile data of an application, such as a Java application, running in the data center 102, where the profile data may comprise 1) a TraceID, which is a unique identifier of each web request and enables a user to search across all collected logs and trace how a request is passed from one microservice to the next, 2) a ThreadID, which is a unique numeric identifier of each application thread running in the data center 102, 3) a MethodID, which is a unique identifier for each application method of the application, 4) an inclusive time of each application method, where the inclusive time is a total time interval spent by an application method and one or more called application methods associated with the application method, and 5) an exclusive time of each application method, where the exclusive time is a time interval spent only by the application method itself. The profile data module 508 may further be configured to exclude, from the profile data, at least one of application methods having incomplete call stacks, application standard libraries, or application methods having inclusive times that are less than a threshold time.

As discussed above with reference to FIG. 3, the statistical analysis module 510 may be configured to statistically analyze application methods included in the remaining profile data to identify a first set of application methods having abnormal latencies. The statistical analysis module 510 may select an application method at the bottom of each call stack (bottom application method) in order of when the call stack was created, and may select all TraceIDs associated with call stacks having the same bottom application method. The statistical analysis module 510 may group the selected TraceIDs into two groups: a normal TraceID group, in which each TraceID has the bottom application method with an expected inclusive time, and an abnormal TraceID group, in which each TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range. The TraceIDs with the bottom application method whose inclusive time lies outside the outer fences (Q1 - 3 × IQR, Q3 + 3 × IQR) may be considered to be extreme outliers, where Q1 is the first quartile of the inclusive times of the bottom application method from the selected TraceIDs, Q3 is the third quartile of the inclusive times of the bottom application method from the selected TraceIDs, and IQR is the interquartile range, which is equal to Q3 - Q1.

The statistical analysis module 510 may then calculate a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the same application method in the same call stack from the normal TraceID group. The call stacks with the same stack depths or the same first number may be considered to be the same. If the ratio is higher than a threshold value, an indicator for the MethodID associated with the application method used to calculate the ratio may be set to TRUE; otherwise, the indicator for the associated MethodID may be set to FALSE, indicating that the application method is normal.

The statistical analysis module 510 may then identify a critical MethodID by selecting the topmost application method at which the corresponding call stack deviates significantly from normal behavior, changing from normal to abnormal. The statistical analysis module 510 may repeat the process above until all application methods are evaluated, and may rank critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.

As discussed above with reference to FIG. 4, the clustering algorithm module 512 may select an application method based on an occurrence frequency of the application method in a log file of the application. The profile data associated with the selected application method, such as the associated TraceID and the associated inclusive and exclusive times, may be provided to the hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) of the clustering algorithm module 512. The clustering algorithm module 512 may then identify an outlier based on a small or sparse cluster and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods. The outlier may be identified using a machine-learning-based anomaly detection process.

The profile data associated with the selected application method may also be provided to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm of the clustering algorithm module 512. Based on results from the DBSCAN algorithm, the clustering algorithm module 512 may identify an outlier having a cluster with a low density.

The clustering algorithm module 512 may repeat the process above until all application methods are evaluated, and may rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs, and based on a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID, where the Euclidean distance is calculated as discussed above with reference to FIG. 4.

Some or all operations of the methods described above can be performed by execution of computer-readable instructions stored on a computer-readable storage medium, as defined below. The term “computer-readable instructions,” as used in the description and claims, includes routines, applications, application modules, program modules, programs, components, data structures, algorithms, and the like. Computer-readable instructions can be implemented on various system configurations, including single-processor or multiprocessor systems, minicomputers, mainframe computers, personal computers, hand-held computing devices, microprocessor-based or programmable consumer electronics, combinations thereof, and the like.

The computer-readable storage media may include volatile memory (such as random access memory (RAM)) and/or non-volatile memory (such as read-only memory (ROM), flash memory, etc.). The computer-readable storage media may also include additional removable storage and/or non-removable storage including, but not limited to, flash memory, magnetic storage, optical storage, and/or tape storage that may provide non-volatile storage of computer-readable instructions, data structures, program modules, and the like.

A non-transient computer-readable storage medium is an example of computer-readable media. Computer-readable media includes at least two types of computer-readable media, namely computer-readable storage media and communications media. Computer-readable storage media includes volatile and non-volatile, removable and non-removable media implemented in any process or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media includes, but is not limited to, phase change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. In contrast, communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer-readable storage media do not include communication media.

The computer-readable instructions stored on one or more non-transitory computer-readable storage media, when executed by one or more processors, may perform the operations described above with reference to FIGs. 2-5. Generally, computer-readable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.

EXAMPLE CLAUSES

A. A method for identifying an application method, having an abnormal latency, of an application running in a data center, the method comprising: collecting profile data of the application; excluding a predetermined set of data from the profile data; statistically analyzing application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.

B. A method as paragraph A recites, wherein the application method is a Java application method and the application running in the data center is a Java application.

C. A method as paragraph A recites, wherein the data center comprises one or more computing devices, one or more servers, or a cloud computing center.

D. A method as paragraph A recites, wherein the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the data center; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.

E. A method as paragraph D recites, wherein the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.

F. A method as paragraph E recites, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises: selecting a bottom application method at a bottom of each call stack in order of when a call stack is created; selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculating a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the application method in a same call stack from the normal TraceID group, and identifying a critical MethodID by selecting a topmost application method where a corresponding call stack is changing from normal to abnormal; and ranking critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.

G. A method as paragraph F recites, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application; providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm); based on results from the hierarchical clustering algorithm, identifying, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the selected application methods; providing profile data associated with the selected application method to a Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm; based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density; and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.

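H. A method as paragraph G recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is:

$$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^{2} + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^{2}}$$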

where distance (t_incl, t_excl) is the Euclidean distance, t_incl is the inclusive time of the application method associated with the identified MethodID, median (t_incl) is a median of inclusive times of the application method in an overall population of the application methods, t_excl is the exclusive time of the application method associated with the identified MethodID, and median (t_excl) is a median of exclusive times of the application method in an overall population of the application methods.

I. One or more non-transitory computer-readable storage media storing computer-readable instructions executable by one or more processors, that when executed by the one or more processors, cause the one or more processors to perform operations comprising: collecting profile data of an application running in a data center; excluding a predetermined set of data from the profile data; statistically analyzing application methods included in remaining profile data to identify a first set of application methods having abnormal latencies; analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods; and generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.

J. One or more non-transitory computer-readable storage media as paragraph I recites, wherein the application method is a Java application method and the application running in the data center is a Java application.

K. One or more non-transitory computer-readable storage media as paragraph I recites, wherein the data center comprises one or more computing devices, one or more servers, or a cloud computing center.

L. One or more non-transitory computer-readable storage media as paragraph I recites, wherein the profile data comprises: a TraceID being a unique identifier of each web request; a ThreadID being a unique numeric identifier of each thread running in the data center; a MethodID being a unique identifier for each application method; an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method; and an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.

M. One or more non-transitory computer-readable storage media as paragraph L recites, wherein the predetermined set of data excluded from the profile data comprises at least one of: application methods having incomplete call stacks; application standard libraries; or application methods having inclusive times that are less than a threshold time.

N. One or more non-transitory computer-readable storage media as paragraph M recites, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises: selecting a bottom application method at a bottom of each call stack in order of when a call stack is created; selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack; grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range; for all application methods in the remaining profile data: calculating a ratio of an inclusive time of an application method from the abnormal TraceID group to an inclusive time of the application method in a same call stack from the normal TraceID group, and identifying a critical MethodID by selecting a topmost application method where a corresponding call stack is changing from normal to abnormal; and ranking critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.

O. One or more non-transitory computer-readable storage media as paragraph N recites, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises: selecting an application method based on an occurrence frequency of the application method in a log file of the application； providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) ； based on results from the hierarchical clustering algorithm, identifying, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods； providing profile data associated with the selected application method to a Density Based Clustering of Applications with Noise (DBSCAN) algorithm； based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density； and ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
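The DBSCAN step can be illustrated with a small pure-Python implementation of the standard algorithm over (inclusive time, exclusive time) pairs. This is a sketch under assumptions: the `eps` and `min_pts` values and the sample points are hypothetical, and treating DBSCAN noise points as the "cluster with a low density" is one possible reading of the paragraph.

```python
from math import dist
from typing import Dict, List, Sequence

NOISE = -1


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Return a cluster index per point, or NOISE for points in low-density regions."""

    def neighbors(i: int) -> List[int]:
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels: Dict[int, int] = {}
    cluster = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == NOISE:
                labels[j] = cluster  # border point reachable from a core point
            if j in labels:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return [labels[i] for i in range(len(points))]


# Hypothetical (inclusive_ms, exclusive_ms) pairs for one selected method.
timings = [(10, 4), (11, 5), (10, 5), (12, 4), (11, 4), (95, 60)]
labels = dbscan(timings, eps=2.0, min_pts=3)
print(labels)  # [0, 0, 0, 0, 0, -1]
```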

P. One or more non-transitory computer-readable storage media as paragraph O recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is:

$$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^{2} + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^{2}}$$

wherein: distance (t_incl, t_excl) is the Euclidean distance, t_incl is the inclusive time of the application method associated with the identified MethodID, median (t_incl) is a median of inclusive times of the application method in an overall population of the application methods, t_excl is the exclusive time of the application method associated with the identified MethodID, and median (t_excl) is a median of exclusive times of the application method in an overall population of the application methods.
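The distance defined by the terms above can be computed as in the following Python sketch. It assumes the usual Euclidean form, that is, the straight-line distance of the point (t_incl, t_excl) from the point formed by the two population medians; the function signature and the sample populations are hypothetical.

```python
from math import hypot
from statistics import median
from typing import Sequence


def distance(t_incl: float, t_excl: float,
             population_incl: Sequence[float],
             population_excl: Sequence[float]) -> float:
    """Euclidean distance of (t_incl, t_excl) from the population medians."""
    return hypot(t_incl - median(population_incl),
                 t_excl - median(population_excl))


# Hypothetical populations with medians 12 ms (inclusive) and 5 ms (exclusive).
d = distance(15, 9, population_incl=[10, 12, 14], population_excl=[4, 5, 6])
print(d)  # 5.0
```

A larger distance places the associated MethodID higher in the ranking when occurrence frequencies are comparable.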

Q. A system for identifying an application method having an abnormal latency, the system comprising: one or more processors； and memory coupled to the one or more processors, the memory including modules communicatively coupled to each other and executable by the one or more processors, the modules comprising: a profile data module configured to collect profile data of an application running in a data center and to exclude, from the profile data, at least one of: application methods having incomplete call stacks, application standard libraries, or application methods having inclusive times that are less than a threshold time； a statistical analysis module configured to statistically analyze application methods included in remaining profile data to identify a first set of application methods having abnormal latencies； a clustering algorithm module configured to analyze the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods； and a list generator module configured to generate a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
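The list generator module can be sketched as a merge of the two result sets. This Python sketch is illustrative only: the function name, the output layout, and the choice to list methods flagged by both analyses first are assumptions; the paragraph only requires that the list be based on at least one of the two sets.

```python
from typing import Any, Dict, List


def generate_list(first_set: List[str], second_set: List[str]) -> List[Dict[str, Any]]:
    """Merge MethodIDs from both analyses and record which analysis flagged each one."""
    merged: Dict[str, Dict[str, Any]] = {}
    for source, method_ids in (("statistical", first_set), ("clustering", second_set)):
        for method_id in method_ids:
            entry = merged.setdefault(method_id, {"method_id": method_id, "found_by": []})
            entry["found_by"].append(source)
    # Assumed ordering: methods flagged by both analyses come first.
    # sorted() is stable, so ties keep their first-seen order.
    return sorted(merged.values(), key=lambda e: -len(e["found_by"]))


report = generate_list(["Dao.query", "Cache.load"], ["Cache.load", "View.render"])
for entry in report:
    print(entry["method_id"], entry["found_by"])
```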

R. A system as paragraph Q recites, wherein the statistical analysis module is further configured to: select a bottom application method at a bottom of each call stack in order of when a call stack is created； select all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack； group the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range； for all application methods in the remaining profile data: calculate a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and identify a critical MethodID by selecting a topmost application method where a corresponding call stack is changing from normal to abnormal； and rank critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.

S. A system as paragraph R recites, wherein the clustering algorithm module is further configured to: select an application method based on an occurrence frequency of the application method in a log file of the application； provide profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) ； based on results from the hierarchical clustering algorithm, identify, as an outlier, a cluster having a low number of records and a median exclusive time or median inclusive time being higher than a predetermined percentile of an overall population of the application methods； provide profile data associated with the selected application method to a Density Based Clustering of Applications with Noise (DBSCAN) algorithm； based on results from the DBSCAN algorithm, identify, as an outlier, a cluster with a low density； and rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
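The hierarchical clustering step can be illustrated with a small pure-Python agglomerative clustering that uses mean (average) linkage, followed by the outlier test on cluster size and median inclusive time. This is a sketch under assumptions: stopping at a fixed number of clusters, the size limit of 2 records, the 75th percentile cutoff, the nearest-rank style percentile, and the sample points are all hypothetical choices.

```python
from math import dist
from statistics import median
from typing import List, Sequence

Point = Sequence[float]  # (inclusive_ms, exclusive_ms)


def average_linkage(points: Sequence[Point], n_clusters: int) -> List[List[int]]:
    """Merge the two clusters with the smallest mean pairwise distance until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a: List[int], b: List[int]) -> float:
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        x, y = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[x] += clusters.pop(y)  # y > x, so index x stays valid
    return clusters


def outlier_clusters(points: Sequence[Point], clusters: List[List[int]],
                     max_size: int, percentile: float) -> List[List[int]]:
    """Flag small clusters whose median inclusive time reaches the population percentile."""
    incl = sorted(p[0] for p in points)
    cutoff = incl[min(len(incl) - 1, int(percentile / 100 * len(incl)))]
    return [c for c in clusters
            if len(c) <= max_size and median(points[i][0] for i in c) >= cutoff]


timings = [(10, 4), (11, 5), (10, 5), (12, 4), (11, 4), (95, 60), (97, 58)]
clusters = average_linkage(timings, n_clusters=2)
flagged = outlier_clusters(timings, clusters, max_size=2, percentile=75)
print([sorted(c) for c in flagged])  # [[5, 6]]
```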

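T. A system as paragraph S recites, wherein the Euclidean distance of the inclusive time and the exclusive time of the application method associated with the identified MethodID is:

$$\mathrm{distance}(t_{incl}, t_{excl}) = \sqrt{\left(t_{incl} - \mathrm{median}(t_{incl})\right)^{2} + \left(t_{excl} - \mathrm{median}(t_{excl})\right)^{2}}$$

wherein: distance (t_incl, t_excl) is the Euclidean distance, t_incl is the inclusive time of the application method associated with the identified MethodID, median (t_incl) is a median of inclusive times of the application method in an overall population of the application methods, t_excl is the exclusive time of the application method associated with the identified MethodID, and median (t_excl) is a median of exclusive times of the application method in an overall population of the application methods.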

CONCLUSION

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claims.

Claims

1. A method comprising:

collecting profile data of an application running in a computing device；

excluding a predetermined set of data from the profile data； and

identifying a first set of application methods having abnormal latencies.
2. A method of claim 1, wherein the application method is a Java application method and the application running in the computing device is a Java application.
3. A method of claim 1, wherein the computing device comprises one or more computing devices, one or more servers, or a cloud computing center.
4. A method of claim 1, wherein the profile data comprises:

a TraceID being a unique identifier of each web request；

a ThreadID being a unique numeric identifier of each thread running in the computing device；

a MethodID being a unique identifier for each application method；

an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method； and

an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
5. A method of claim 4, wherein the predetermined set of data excluded from the profile data comprises at least one of:

application methods having incomplete call stacks；

application standard libraries； or

application methods having inclusive times that are less than a threshold time.
6. A method of claim 1, wherein identifying the first set of application methods having abnormal latencies comprises:

statistically analyzing application methods included in remaining profile data to identify the first set of application methods having abnormal latencies.
7. A method of claim 6, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises:

selecting a bottom application method at a bottom of each call stack in order of when a call stack is created；

selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack；

grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range；

for all application methods in the remaining profile data:

calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and

identifying a critical MethodID by selecting a topmost application method where a corresponding call stack is changing from normal to abnormal； and

ranking critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
8. A method of claim 6, further comprising:

analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods； and

generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
9. A method of claim 8, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises:

selecting an application method based on an occurrence frequency of the application method in a log file of the application；

providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) ；

based on results from the hierarchical clustering algorithm, identifying, as an outlier, a small or sparse cluster having a median exclusive time or median inclusive time higher than a predetermined percentile of an overall population of the selected application methods；

providing profile data associated with the selected application method to a Density Based Clustering of Applications with Noise (DBSCAN) algorithm；

based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density； and

ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
10. One or more non-transitory computer-readable storage media storing computer-readable instructions executable by one or more processors, that when executed by the one or more processors, cause the one or more processors to perform operations comprising:

collecting profile data of an application running in a computing device；

excluding a predetermined set of data from the profile data； and

identifying a first set of application methods having abnormal latencies.
11. One or more non-transitory computer-readable storage media of claim 10, wherein the application method is a Java application method and the application running in the computing device is a Java application.
12. One or more non-transitory computer-readable storage media of claim 10, wherein the computing device comprises one or more computing devices, one or more servers, or a cloud computing center.
13. One or more non-transitory computer-readable storage media of claim 10, wherein the profile data comprises:

a TraceID being a unique identifier of each web request；

a ThreadID being a unique numeric identifier of each thread running in the computing device；

a MethodID being a unique identifier for each application method；

an inclusive time of each application method, the inclusive time being a total time interval spent by an application method and one or more called application methods associated with the application method； and

an exclusive time of each application method, the exclusive time being a time interval spent only by each application method.
14. One or more non-transitory computer-readable storage media of claim 13, wherein the predetermined set of data excluded from the profile data comprises at least one of:

application methods having incomplete call stacks；

application standard libraries； or

application methods having inclusive times that are less than a threshold time.
15. One or more non-transitory computer-readable storage media of claim 10, wherein identifying a first set of application methods having abnormal latencies comprises:

statistically analyzing application methods included in remaining profile data to identify the first set of application methods having abnormal latencies.
16. One or more non-transitory computer-readable storage media of claim 15, wherein statistically analyzing the application methods included in the remaining profile data to identify the first set of application methods having abnormal latencies comprises:

selecting a bottom application method at a bottom of each call stack in order of when a call stack is created；

selecting all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack；

grouping the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range；

for all application methods in the remaining profile data:

calculating a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and

identifying a critical MethodID by selecting a topmost application method where a corresponding call stack is changing from normal to abnormal； and

ranking critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
17. One or more non-transitory computer-readable storage media of claim 16, wherein the operations further comprise:

analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods； and

generating a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
18. One or more non-transitory computer-readable storage media of claim 17, wherein analyzing the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods comprises:

selecting an application method based on an occurrence frequency of the application method in a log file of the application；

providing profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) ；

based on results from the hierarchical clustering algorithm, identifying, as an outlier, a small or sparse cluster having a median exclusive time or median inclusive time higher than a predetermined percentile of an overall population of the application methods；

providing profile data associated with the selected application method to a Density Based Clustering of Applications with Noise (DBSCAN) algorithm；

based on results from the DBSCAN algorithm, identifying, as an outlier, a cluster with a low density； and

ranking MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.
19. A system for identifying an application method having an abnormal latency, the system comprising:

one or more processors； and

memory coupled to the one or more processors, the memory including modules communicatively coupled to each other and executable by the one or more processors, the modules comprising:

a profile data module configured to collect profile data of an application running in a computing device and to exclude, from the profile data, at least one of:

application methods having incomplete call stacks,

application standard libraries, or

application methods having inclusive times that are less than a threshold time；

a statistical analysis module configured to statistically analyze application methods included in remaining profile data to identify a first set of application methods having abnormal latencies；

a clustering algorithm module configured to analyze the application methods included in the remaining profile data using clustering algorithms to identify a second set of application methods； and

a list generator module configured to generate a list of application methods having abnormal latencies based on at least one of the first set of application methods or the second set of application methods.
20. A system of claim 19, wherein the statistical analysis module is further configured to:

select a bottom application method at a bottom of each call stack in order of when a call stack is created；

select all TraceIDs associated with call stacks having the bottom application method at a bottom of a call stack；

group the selected TraceIDs into a normal TraceID group and an abnormal TraceID group, the normal TraceID group being a group of normal TraceIDs where a normal TraceID has the bottom application method with an expected inclusive time, and the abnormal TraceID group being a group of abnormal TraceIDs where an abnormal TraceID has the bottom application method with an inclusive time outside of a Tukey’s boxplot range；

for all application methods in the remaining profile data:

calculate a ratio of an inclusive time of an application method from the abnormal TraceID group and an inclusive time of the application method in a same call stack from the normal TraceID group, and

identify a critical MethodID by selecting a topmost application method where a corresponding call stack is changing from normal to abnormal； and

rank critical MethodIDs based on a frequency of occurrence in different TraceIDs and a deviation of an inclusive time of a corresponding application method from a median inclusive time of the normal TraceID group.
21. A system of claim 20, wherein the clustering algorithm module is further configured to:

select an application method based on an occurrence frequency of the application method in a log file of the application；

provide profile data associated with the selected application method to a hierarchical clustering with mean linkage clustering algorithm (hierarchical clustering algorithm) ；

based on results from the hierarchical clustering algorithm, identify, as an outlier, a small or sparse cluster having a median exclusive time or median inclusive time higher than a predetermined percentile of an overall population of the application methods；

provide profile data associated with the selected application method to a Density Based Clustering of Applications with Noise (DBSCAN) algorithm；

based on results from the DBSCAN algorithm, identify, as an outlier, a cluster with a low density； and

rank MethodIDs associated with the clusters identified as outliers and associated TraceIDs based on a frequency of occurrence of the associated TraceIDs and a Euclidean distance of an inclusive time and an exclusive time of an application method associated with the identified MethodID.