CN114996111A - Method and system for analyzing influence of configuration items on performance of software system - Google Patents


Publication number
CN114996111A
Authority
CN
China
Prior art keywords
performance
software system
configuration
configuration item
influence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210736612.1A
Other languages
Chinese (zh)
Inventor
陈鹏飞
陈志明
关雅雯
郑子彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Application filed by Sun Yat Sen University
Priority to CN202210736612.1A
Publication of CN114996111A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3447 Performance evaluation by modeling
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing


Abstract

The invention provides a method and a system for analyzing the influence of configuration items on the performance of a software system. The method comprises: identifying and marking all performance operations in the software system according to code patterns preset for the software system, wherein the performance operations are time-intensive and/or space-intensive operations that affect the performance of the software system; identifying the dependency relationship between each performance operation and each configuration item of the software system to obtain a performance operation set corresponding to each configuration item; constructing a feature vector for each configuration item from its performance operation set; and inputting the feature vectors into a trained qualitative performance influence model to judge whether each configuration item affects the performance of the software system, thereby obtaining the set of configuration items that affect performance. The invention can find the configuration items that actually affect the performance of the software system without running it, which improves configuration efficiency and helps users configure the software system correctly so as to improve its performance.

Description

Method and system for analyzing influence of configuration items on performance of software system
Technical Field
The invention relates to the technical field of software systems, in particular to a method and a system for analyzing influence of configuration items on performance of a software system.
Background
A computer's software system comprises the programs, data, and related documentation that the computer runs, and includes system software, supporting software, and application software. Many modern software systems are designed to be highly customizable: they can be configured for the user's hardware platform, operating system, and requirements, and can thereby meet the user's needs in terms of software functionality or performance.
However, software systems have a large number of configuration items, some of which depend on one another, and this complexity makes tuning the configuration of a software system a major challenge. Research has shown that configuration errors have become one of the main causes of system failures and system performance problems. Their consequences can be severe: configuration errors in commercial storage systems and open-source operating systems can cause system crashes, hangs, or significant performance degradation that is difficult to diagnose.
Beyond the prevalence of configuration errors, users often find it difficult to understand the actual effect that changing a configuration item has on a software system. They are therefore frequently forced to tune the configuration through time-consuming trial and error, which makes software system configuration inefficient.
Disclosure of Invention
The invention aims to provide a method and a system for analyzing the influence of configuration items on the performance of a software system, so as to solve the technical problems of configuration errors and low configuration efficiency of the software system in the prior art.
The purpose of the invention can be achieved by the following technical solution:
a method for analyzing the influence of configuration items on the performance of a software system, comprising the following steps:
identifying and marking all performance operations in the software system according to code patterns preset for the software system, wherein the performance operations are time-intensive and/or space-intensive operations that affect the performance of the software system;
identifying the dependency relationship between each performance operation and each configuration item of the software system to obtain a performance operation set corresponding to each configuration item, wherein each performance operation in the performance operation set has a dependency relationship with the configuration item;
constructing a feature vector corresponding to each configuration item according to the performance operation set;
and inputting the feature vectors corresponding to the configuration items into a trained qualitative performance influence model, and judging whether each configuration item affects the performance of the software system, to obtain the set of configuration items that affect the performance of the software system, wherein the qualitative performance influence model is trained on the feature vectors corresponding to the configuration items of a plurality of software systems.
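As a minimal sketch of the feature vector construction step, assume each performance operation is tagged with a category (time-intensive or space-intensive) and with the kind of dependency it has on the configuration item. The field names, the example configuration items, and the count-based feature layout below are illustrative assumptions; this section does not specify the actual feature layout.

```python
from collections import Counter

# Hypothetical categories and dependency kinds used as feature dimensions.
OP_CATEGORIES = ("time_intensive", "space_intensive")
DEP_KINDS = ("data", "control")


def build_feature_vector(perf_ops):
    """Count performance operations per (category, dependency kind) pair."""
    counts = Counter((op["category"], op["dependency"]) for op in perf_ops)
    return [counts[(c, d)] for c in OP_CATEGORIES for d in DEP_KINDS]


# Hypothetical performance operation sets, one per configuration item.
perf_op_sets = {
    "cache.size": [
        {"category": "space_intensive", "dependency": "data"},
        {"category": "time_intensive", "dependency": "control"},
        {"category": "time_intensive", "dependency": "control"},
    ],
    "log.prefix": [],  # no performance operation depends on this item
}

feature_vectors = {
    name: build_feature_vector(ops) for name, ops in perf_op_sets.items()
}
```

Under these assumptions, a configuration item with an empty performance operation set maps to an all-zero vector, which a classifier could plausibly learn to treat as not performance-related.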
Optionally, the qualitative performance impact model comprises:
a random forest classification model and a configuration item dependency detector;
wherein the random forest classification model performs binary classification of whether each configuration item affects the performance of the software system, and the configuration item dependency detector corrects the classification result of the random forest classification model.
Optionally, the dependency relationship comprises:
data dependencies and control dependencies;
wherein a data dependency is a dependency arising from data flow, and a control dependency is a dependency arising from program control flow.
Optionally, identifying a dependency relationship between each of the performance operations and each configuration item of the software system comprises:
identifying data dependencies between each of the performance operations and each configuration item of the software system using taint analysis;
identifying control dependencies between each of the performance operations and each configuration item of the software system using a program dependency graph, wherein the program dependency graph is constructed using program dependence analysis and describes the control dependencies and data dependencies of the program.
Optionally, identifying data dependencies between each of the performance operations and each configuration item of the software system using taint analysis comprises:
starting from the program entry point of the software system, traversing the control flow, and creating a taint at each configuration item loading API as a source;
recording the data propagation path of each source and the sink it finally reaches, wherein the performance operation at that sink has a data dependency on the configuration item; a sink is a program statement that a source is not expected to reach, and a sink is preset before the statement corresponding to each performance operation.
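The source-to-sink propagation described above can be sketched as reachability over data-flow edges. This is a toy illustration, not a real taint analyzer: the data-flow graph, the statement strings, and the function name `find_tainted_sinks` are all hypothetical, and for simplicity the performance operation statement itself is treated as the sink.

```python
from collections import deque


def find_tainted_sinks(flows, source, sinks):
    """Return {sink: propagation path} for every sink reachable from source.

    flows maps a statement or variable to the statements and variables
    that its value flows into (data-flow edges).
    """
    paths = {source: [source]}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for succ in flows.get(node, ()):
            if succ not in paths:  # visit each node once
                paths[succ] = paths[node] + [succ]
                queue.append(succ)
    return {s: paths[s] for s in sinks if s in paths}


# Hypothetical data-flow edges extracted from a program.
flows = {
    "conf.get('buffer.size')": ["size"],
    "size": ["buf = new byte[size]", "log(size)"],
    "conf.get('name')": ["name"],
}
# Sinks stand for statements that perform a performance operation.
sinks = ["buf = new byte[size]", "file.read()"]

result = find_tainted_sinks(flows, "conf.get('buffer.size')", sinks)
```

Here the allocation statement is reported with its propagation path, so the allocation has a data dependency on `buffer.size`, while `file.read()` is never reached by the taint and is not reported.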
Optionally, identifying a control dependency relationship between each performance operation and each configuration item of the software system using a program dependency graph comprises:
traversing all nodes in the program dependency graph and constructing a control region for each configuration item, wherein the control region of a configuration item is a sequence of statements that has a direct control dependency on the configuration item;
identifying a control dependency between each of the performance operations and each configuration item of the software system based on the control region of each of the configuration items.
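A minimal sketch of the control region idea, assuming the program dependency graph has already been reduced to two hypothetical maps: one from each branch statement to the statements directly control-dependent on it, and one from each branch statement to the configuration items read in its condition. All statement strings and names are illustrative.

```python
def control_regions(control_deps, branch_uses):
    """Collect, per configuration item, the statements directly controlled by it."""
    regions = {}
    for branch, items in branch_uses.items():
        for item in items:
            regions.setdefault(item, []).extend(control_deps.get(branch, []))
    return regions


def control_dependent_ops(regions, perf_ops):
    """Keep only the performance operations inside each control region."""
    return {
        item: [stmt for stmt in stmts if stmt in perf_ops]
        for item, stmts in regions.items()
    }


# Hypothetical direct control-dependence edges from a program dependency graph.
control_deps = {
    "if (useCompression)": ["compress(data)", "count += 1"],
    "while (i < retries)": ["send(data)"],
}
# Hypothetical mapping from branch conditions to the configuration items they read.
branch_uses = {
    "if (useCompression)": ["compression.enabled"],
    "while (i < retries)": ["retry.max"],
}
perf_ops = {"compress(data)", "send(data)"}

regions = control_regions(control_deps, branch_uses)
deps = control_dependent_ops(regions, perf_ops)
```

In this toy input, `compress(data)` is control-dependent on `compression.enabled` and `send(data)` on `retry.max`, while the counter update is in a control region but is not a performance operation.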
Optionally, the training process of the random forest classification model includes:
dividing feature vectors corresponding to configuration items of a plurality of software systems into a training set and a test set;
and training the random forest classification model according to the training set and a random forest algorithm.
Optionally, correcting, by the configuration item dependency detector, the classification result of the random forest classification model comprises:
when a first configuration item of the software system depends on a second configuration item, if the random forest classification model judges that the first configuration item affects the performance of the software system and that the second configuration item does not, the configuration item dependency detector corrects the result for the second configuration item to indicate that it affects the performance of the software system.
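The correction rule can be sketched as a small post-processing pass over the classifier output. The function name, the configuration item names, and the dictionary layout are assumptions for illustration. Repeating the pass until nothing changes, so that corrections carry along chains of dependencies, is also an assumption; the text only states the single-step rule.

```python
def correct_predictions(predictions, depends_on):
    """Apply the dependency rule: if item A depends on item B and A is
    predicted to affect performance, B is marked as affecting it too."""
    corrected = dict(predictions)
    changed = True
    while changed:  # assumption: repeat until no prediction changes
        changed = False
        for first, seconds in depends_on.items():
            if not corrected.get(first, False):
                continue
            for second in seconds:
                if not corrected.get(second, False):
                    corrected[second] = True
                    changed = True
    return corrected


# Hypothetical classifier output: True means "affects performance".
predictions = {"cache.size": True, "cache.enabled": False, "log.prefix": False}
# Hypothetical dependencies: cache.size only takes effect if cache.enabled is set.
depends_on = {"cache.size": ["cache.enabled"]}

corrected = correct_predictions(predictions, depends_on)
```

After the pass, `cache.enabled` is flagged as performance-related because `cache.size` depends on it, while `log.prefix` is left unchanged.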
Optionally, before identifying the dependency relationship between each performance operation and each configuration item of the software system, the method further includes:
and extracting configuration item information of the software system, wherein the configuration item information at least comprises the name and the number of configuration items and an API (application programming interface) used when the configuration items are loaded into the software system.
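One way to picture extracting configuration item information without running the program is a static scan for calls to the loading API. The sketch below uses Python's standard `ast` module on a small Python snippet purely for illustration. The loading API `conf.get(...)`, the snippet, and the output layout are hypothetical, and a real tool would need an analyzer for the target system's own language.

```python
import ast

# Hypothetical source code of the system under analysis.
SOURCE = '''
size = conf.get("buffer.size")
if conf.get("compression.enabled"):
    data = compress(data)
name = other.get("ignored")
'''


def extract_config_items(source, loader="conf", method="get"):
    """Statically collect string keys passed to loader.method(...) calls."""
    items = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == method
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == loader
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            items.add(node.args[0].value)
    return sorted(items)


names = extract_config_items(SOURCE)
config_info = {"names": names, "count": len(names), "api": "conf.get"}
```

The snippet is only parsed, never executed, and the call on `other` is ignored because it does not go through the assumed loading API.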
The invention also provides a system for analyzing the influence of configuration items on the performance of a software system, comprising:
a performance operation identification module, configured to identify and mark all performance operations in the software system according to code patterns preset for the software system, wherein the performance operations are time-intensive and/or space-intensive operations that affect the performance of the software system;
a dependency relationship identification module, configured to identify a dependency relationship between each performance operation and each configuration item of the software system, to obtain a performance operation set corresponding to each configuration item, wherein each performance operation in the performance operation set has a dependency relationship with the configuration item;
a feature vector construction module, configured to construct a feature vector corresponding to each configuration item according to the performance operation set;
and a configuration item set determining module, configured to input the feature vectors corresponding to the configuration items into a trained qualitative performance influence model and judge whether each configuration item affects the performance of the software system, to obtain the set of configuration items that affect the performance of the software system, wherein the qualitative performance influence model is trained on the feature vectors corresponding to the configuration items of a plurality of software systems.
The invention provides a method and a system for analyzing the influence of configuration items on the performance of a software system, wherein the method comprises the following steps: identifying and marking all performance operations in the software system according to code patterns preset for the software system, wherein the performance operations are time-intensive and/or space-intensive operations that affect the performance of the software system; identifying the dependency relationship between each performance operation and each configuration item of the software system to obtain a performance operation set corresponding to each configuration item, wherein each performance operation in the performance operation set has a dependency relationship with the configuration item; constructing a feature vector corresponding to each configuration item according to the performance operation set; and inputting the feature vectors corresponding to the configuration items into a trained qualitative performance influence model, and judging whether each configuration item affects the performance of the software system, to obtain the set of configuration items that affect the performance of the software system, wherein the qualitative performance influence model is trained on the feature vectors corresponding to the configuration items of a plurality of software systems.
Therefore, the invention has the beneficial effects that:
The method uses program analysis to track the time-intensive or space-intensive operations that depend on configuration items, and constructs a feature vector for each configuration item from the program analysis results. It analyzes configuration items at a fine granularity, is not limited to Boolean types or enumerable finite-valued numeric types, and supports configuration items of any type. The qualitative performance influence model is built with a random forest and requires no time-consuming local measurement: whether a specific configuration item affects the performance of the configurable system can be judged by analyzing the source code of the software system only once, which greatly reduces the cost of performance analysis. The method can therefore accurately predict, without running the software system, whether each of its configuration items affects performance, and can find the set of configuration items that actually do. This improves the efficiency of software system configuration and helps users configure the software system correctly so as to improve its performance. In addition, the invention is interpretable: the program analysis results and the classification rules of the performance model reveal the underlying reasons why a configuration item affects performance.
Drawings
FIG. 1 is a schematic flow diagram of the method of the present invention;
FIG. 2 is a schematic diagram of the system of the present invention;
FIG. 3 is a program dependency graph of an example program, where solid lines represent control dependencies and dashed lines represent data dependencies;
FIG. 4 is an example diagram of taint analysis with FlowDroid;
FIG. 5 is a schematic flow chart of an embodiment of the method of the present invention;
FIG. 6 is a diagram illustrating exemplary categories of performance operations and their code patterns in the method of the present invention;
FIG. 7 is a diagram illustrating the classification of dependencies and exemplary configuration items for performance operations in the method of the present invention;
FIG. 8 is a schematic flow diagram of the taint analysis process of the method of the present invention;
FIG. 9 is a diagram illustrating exemplary control regions of configuration items according to the method of the present invention;
FIG. 10 is a schematic diagram of the qualitative performance influence model in the method of the present invention;
FIG. 11 is a schematic diagram of the program analysis module of the ConfigAnalyzer tool according to the present invention.
Detailed Description
Interpretation of terms:
Configuration item (Option): a special kind of input that has a type and a value range. For example, a Boolean configuration item has the value range {true, false}, and an Integer configuration item might have the value range {0, 1, 2, 3}. Configuration items let a user change the internal execution logic of a software system without modifying its code, so the user can change the functionality or performance of the software system by changing the values of configuration items. In some literature, configuration items are also called features.
Configuration (Configuration): the complete setting of all configuration items in the software system. All configuration items, each set to a particular value, together constitute a configuration of the software system.
Configuration space (Configuration Space): the set of all possible configurations of the software system.
Configurable system (Configurable System): a software system that offers users configuration for customizing its operation.
Misconfiguration (Misconfiguration): a setting of configuration items to inappropriate values that causes the behavior or performance of the software system to deviate from what is expected. Software system errors caused by misconfiguration are called configuration errors (Configuration Errors).
Environment (Environment): the totality of the hardware and software on which the software system runs. Generally, the environment does not change while the software system is running.
Workload (Workload): the amount of work the software system must complete within a given time. To accomplish a predetermined task or goal, a user submits tasks to the software system; the more work those tasks involve, the larger the workload and the more computing resources the software system requires.
Performance (Performance): the property of a software system that characterizes its operational capability; it is generally directly related to energy consumption and operating cost. Performance is evaluated with different metrics under different quality-of-service requirements. The most intuitive metric is the running time required to complete a task.
Performance impact model (Performance-influence Model): a class of models that describe the performance of a software system running in a particular environment under a particular workload.
Control flow (Control Flow): the order in which statements, instructions, or function calls are executed when a program runs. Programs written in an imperative programming language such as Java have an explicit control flow, unlike programs written in a declarative programming language.
Control-flow statement (Control-flow Statement): a type of program statement whose control-flow decision determines the control flow that the program actually executes. For example, the if, switch, for, and while statements in Java are control-flow statements.
Control-flow decision (Control-flow Decision): the outcome of actually executing a control-flow statement, that is, the selection of a particular branch for execution.
Control-flow graph (Control-flow Graph, CFG): a graph that represents the control flow of a program.
Data flow (Data Flow): an abstraction of the chains of data dependencies in a program.
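To make the definitions of configuration item, configuration, and configuration space concrete, the following sketch enumerates a tiny configuration space as the Cartesian product of the value ranges. The three configuration items and their value ranges are hypothetical examples.

```python
from itertools import product

# Hypothetical configuration items with their value ranges.
options = {
    "cache.enabled": [True, False],       # Boolean configuration item
    "threads": [0, 1, 2, 3],              # Integer configuration item
    "compression": ["none", "gzip"],      # enumerated configuration item
}


def configuration_space(options):
    """Enumerate every configuration: one value chosen per configuration item."""
    names = list(options)
    return [
        dict(zip(names, values))
        for values in product(*(options[n] for n in names))
    ]


space = configuration_space(options)
space_size = len(space)
```

The size of the space is the product of the sizes of the value ranges (here 2 × 4 × 2), which is why the space grows so quickly as configuration items are added.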
The embodiment of the invention provides a method and a system for analyzing the influence of configuration items on the performance of a software system, which aim to solve the technical problems of configuration errors and low configuration efficiency of the software system in the prior art.
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
(1) Configuration of modern software systems and their complexity
Today, many modern software systems are designed to be highly customizable: they can be configured for the user's hardware platform, operating system, and requirements. Configuration lets users conveniently change the behavior or state of a software system without modifying its code, which improves the flexibility and safety of the software and meets users' requirements for software functionality or performance. Software systems that provide configuration for users are called configurable systems (Configurable Systems).
Briefly, a configuration may be represented as a collection of several configuration items (options), where each configuration item represents some property of the software, such as the hardware platform used, the operating system, whether a plug-in is loaded, etc. However, highly customizable configurations present significant challenges to users and developers while providing potential software functionality enrichment or software performance enhancements.
Research has shown that configuration errors of software systems have become one of the main causes of system failures and system performance problems. It is reported that in Google's main production services, configuration errors are the second largest cause of service-level incidents, and that at Facebook, configuration errors cause 16% of service-level incidents and are regarded as a key challenge for reliability. Research on enterprise backup systems has shown that most task failures are caused by configuration errors. The consequences of configuration errors can be severe: studies of configuration errors in commercial storage systems and open-source operating systems have shown that a significant portion of them lead to system crashes, hangs, or significant performance degradation that is difficult to diagnose.
Beyond the prevalence of configuration errors, understanding what a configuration does is also a challenge. Users often find it difficult to understand the actual effect that changing a configuration item has on a software system, and are therefore frequently forced to tune the configuration through time-consuming trial and error. Vendors bear costs as well: reports show that configuration problems are a major source of user support costs for vendors of cloud services and data center software. Configuration also makes the development, testing, operation, and maintenance of software systems more complicated.
To summarize, the configuration of modern software systems can meet the user's requirements in terms of software functionality or performance, but the complexity of the configuration (which is reflected in the large number of configuration items, and the interactions and dependencies between configuration items) makes it a great challenge to configure the software.
(2) Impact of configuration on software system performance
The performance of a software system, along with the energy consumption and operating cost that are generally directly related to it, is an attribute of great interest to both users and developers. Users typically want a system that provides the required functions at lower energy consumption and operating cost; developers typically want to build systems that can be configured to run efficiently and deliver a high-quality user experience. It has been found that in open-source cloud systems, performance problems account for about 50% of configuration-related patches and about 30% of configuration-related forum discussions. In cloud systems, severe performance problems and outages caused by performance misconfiguration have cost hundreds of millions of dollars.
Setting performance-sensitive configuration items is a challenging task that often requires a deep understanding of the software system. For example, setting a configuration item may involve a trade-off between memory usage and system response time, and making that trade-off requires an in-depth understanding of the software system, the hardware in use, or the current workload. To make matters worse, the documentation of a software system often does not explain these relationships clearly, and even when it does, factors such as workload are often complex and changeable, so users find it difficult to choose an appropriate configuration.
In another example, when a configuration item takes a particular value, every write operation by a user may be locked, which increases write latency, yet the documentation states only that this value places no restriction on the objects that write operations support. Unless the user traces the implementation logic through the code, the cause of the performance change cannot be understood.
(3) Prior art solutions
A performance impact model of a software system expresses how configuration affects system performance, and it is an important technical tool for analyzing the relationship between the configuration and the performance of a software system. Different configurations are fed to the model to obtain predicted performance values, from which one can judge whether a configuration affects system performance. The existing technical solutions therefore differ mainly in how they construct the performance impact model.
One type of related work is to build performance impact models using black box research methods. The idea of establishing a performance model by using a black box method is as follows: the software system is treated as a black box, the configuration space is sampled to obtain a subset of configurations, the performance of the system under a specific workload is measured at each configuration of the subset of configurations, and a performance impact model is then learned from these observations.
Existing black-box methods must trade off modeling cost against model accuracy: a more accurate model requires more samples, which means sampling a larger configuration subset, measuring the performance of the target software system under the given workload more times, and paying a higher time cost. In addition, most performance impact models built with black-box methods are deep learning models, whose interpretability is generally low; they cannot explain the root cause of a change in software system performance caused by changing a configuration item.
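The black-box loop of sampling, measuring, and learning can be sketched as follows. Everything here is a stand-in: the three Boolean configuration items are hypothetical, `measure` fakes a benchmark run with a fixed formula instead of executing a real system, and the "model" is just a difference of mean running times per configuration item rather than any specific published technique.

```python
import random
from itertools import product

OPTIONS = ("compress", "cache", "verbose")  # hypothetical Boolean configuration items


def measure(config):
    """Stand-in for running the system under a fixed workload and timing it."""
    return 10.0 + 5.0 * config["compress"] - 2.0 * config["cache"]


def sample_and_learn(sample_size, seed=0):
    """Sample configurations, measure each one, and estimate per-item effects.

    With three Boolean items, sample_size should be at least 5 so that each
    item is guaranteed to appear both enabled and disabled in the sample.
    """
    space = [
        dict(zip(OPTIONS, values))
        for values in product([False, True], repeat=len(OPTIONS))
    ]
    sample = random.Random(seed).sample(space, sample_size)
    # One (costly, in reality) system run per sampled configuration.
    measurements = [(cfg, measure(cfg)) for cfg in sample]
    effects = {}
    for opt in OPTIONS:
        on = [t for cfg, t in measurements if cfg[opt]]
        off = [t for cfg, t in measurements if not cfg[opt]]
        effects[opt] = sum(on) / len(on) - sum(off) / len(off)
    return measurements, effects


measurements, effects = sample_and_learn(sample_size=6)
```

The number of entries in `measurements` is the number of times the system must be run, which is the cost that grows when a more accurate model is wanted.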
Another type of correlation works is to use a white-box approach to build the performance impact model. The performance impact model is constructed by using a white box method, a software system is not regarded as a complete integral black box any more, but is divided into a plurality of components or modules according to the idea of program analysis, and each component or module is analyzed and modeled, so that the performance impact model of the whole system is constructed. In addition to being able to correctly predict performance, it is also possible to explain the reasons for performance appearance, e.g., performance appearance caused by those several components or modules.
The existing white box method for establishing the performance influence model has certain defects. Some white-box methods only support boolean types or exhaustive finite number types of configuration items (the exhaustive finite types of configuration items need to be discretized into a plurality of boolean type configuration items), which is a great limitation, and after discretization, the number of configuration items is greatly increased, and the running time of the tool is exponentially increased.
Some white-box methods can learn a more accurate performance impact model, but still need to prepare a software system operating environment, sample a configuration space, operate a target software system based on a sampled configuration subset, and collect various performance indexes of the software system during operation.
Although the prior-art methods above differ in their implementation details and specific drawbacks, they share one unavoidable cost: running the software system. All of them must prepare a running environment for the software system, select and traverse configuration subsets, and measure the performance of the software system under each configuration subset and a specific load. The time needed to repeatedly run the software system and collect its run-time performance indicators far exceeds the time needed to analyze the data and build the performance influence model.
(4) Program analysis
Program analysis is the process of automatically analyzing the behavior of a program; the properties of interest include correctness, robustness, safety, liveness, and the like. In other words, program analysis systematically examines a program to determine its properties.
Program analysis can be divided into:
1) static program analysis: performing program analysis on the premise of not running the program;
2) dynamic program analysis: the program is run on a real or virtual processor, and program analysis is performed according to the run-time performance of the program.
Although static program analysis cannot obtain run-time information about a program, it does not need to actually run the program and therefore saves a great deal of time and computing resources compared with dynamic program analysis. Moreover, more information from program analysis is not always better; a balance must be struck between benefit and overhead. The method provided by the invention therefore uses static program analysis.
(5) Taint analysis
Taint analysis is a kind of program analysis, also known as information-flow analysis. It detects whether sensitive or private information in the source code can be obtained through injection vulnerabilities. Taint analysis is generally used to track how user input flows through a system in order to understand the security impact of the system design. It can be divided into static taint analysis and dynamic taint analysis.
Taint analysis is defined over a four-tuple (P, SO, SI, SA), where:
1) P represents the program under analysis (Program);
2) SO represents the set of source points (Source), which are the pieces of information that need to be tracked;
3) SI represents the set of sinks (Sink), which are program statements that a source is not expected to reach;
4) SA represents the set of sanitizers (Sanitizer), referred to below as purifiers; if a source passes through a purifier during propagation, its harmfulness is eliminated.
Theorem 1: there is an information leakage vulnerability or taint flow vulnerability in a program if and only if there is a path in the program from a certain source to a certain sink and the path does not pass through any purifier.
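For illustration only, the condition of Theorem 1 can be sketched as a reachability check on a flow graph. The graph encoding, the function name has_taint_flow, and the node names are assumptions made for this sketch and are not part of the described method.

```python
from collections import deque

def has_taint_flow(edges, sources, sinks, purifiers):
    """Return True if some path leads from a source to a sink
    without passing through any purifier (condition of Theorem 1)."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    # Start from every source that is not itself a purifier.
    queue = deque(s for s in sources if s not in purifiers)
    seen = set(queue)
    while queue:
        node = queue.popleft()
        if node in sinks:
            return True
        for nxt in graph.get(node, []):
            # A purifier removes the taint, so paths are cut there.
            if nxt not in seen and nxt not in purifiers:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical flow graph: "read" is a source, "write" a sink, "clean" a purifier.
edges = [("read", "a"), ("a", "clean"), ("clean", "write"),
         ("a", "b"), ("b", "write")]
print(has_taint_flow(edges, {"read"}, {"write"}, {"clean"}))      # path via "b" exists
print(has_taint_flow(edges[:3], {"read"}, {"write"}, {"clean"}))  # only the purified path
```

In the first call an unpurified path read, a, b, write exists; in the second call the only path runs through the purifier, so no taint flow is reported.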
It should be noted that a vulnerability here refers generally to the sinks in the program code. In information security analysis, a vulnerability is any place in the program where information may leak. Outside the information security context, a vulnerability refers to any place where the tracked information is used.
In summary, in the context of performance analysis of a configurable system, a taint flow vulnerability exists when there is a path along which a configuration item affects a space-intensive or time-intensive operation.
(6) Program dependence analysis
There are many ways to define the control dependencies and data dependencies of a program; the following adopts one of the more intuitive definitions:
(6.1) Control dependencies
Definition of control dependency: for any program branch statement S1 and program statement S2:
if S1 is the closest branch statement preceding S2, S1 has multiple branch targets, and S2 may not be executed depending on the branch decision of S1, then S2 is said to be control dependent on S1, or S2 has a control dependency on S1, written S2 δ^c S1.
(6.2) data dependencies
Data dependencies exist between program statements that access or modify the same resource. Data dependencies comprise flow dependencies, anti-dependencies, output dependencies, and input dependencies. Among them, the flow dependency is the most basic data dependency.
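As a minimal sketch, the four kinds of data dependency can be distinguished by comparing the variables read and written by two statements. The dictionary encoding of a statement and the function name classify_dependencies are assumptions for illustration; the sketch ignores redefinitions between the two statements.

```python
def classify_dependencies(first, second):
    """Classify the data dependencies of `second` on an earlier statement
    `first`. Each statement is a dict with 'reads' and 'writes' sets."""
    kinds = set()
    if first["writes"] & second["reads"]:
        kinds.add("flow")    # read after write
    if first["reads"] & second["writes"]:
        kinds.add("anti")    # write after read
    if first["writes"] & second["writes"]:
        kinds.add("output")  # write after write
    if first["reads"] & second["reads"]:
        kinds.add("input")   # read after read
    return kinds

# Hypothetical statements:
s1 = {"reads": {"opt"}, "writes": {"size"}}   # size = opt * 2
s2 = {"reads": {"size"}, "writes": {"buf"}}   # buf = new byte[size]
s3 = {"reads": {"buf"}, "writes": {"size"}}   # size = buf.length
print(classify_dependencies(s1, s2))
print(classify_dependencies(s2, s3))
```

Here s2 has a flow dependency on s1 through size, while s3 has both a flow dependency (through buf) and an anti-dependency (through size) on s2.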
(6.3) program dependence analysis and program dependence graph
The purpose of program dependence analysis is to analyze the control dependencies and data dependencies in a program. In practice, unlike the definitions above, whose granularity is the statement, program dependence analysis generally takes the basic block as the minimum unit.
A Program Dependency Graph (PDG) is used to describe the control dependency and data dependency of a Program.
An example program is as follows:
Figure BDA0003715736630000111
FIG. 3 is the program dependency graph of the example program above, where the solid lines represent control dependencies and the dashed lines represent data dependencies.
(7) Random forest
Decision trees are a white-box predictive model commonly used in data mining or machine learning. The structure of the decision tree is a tree structure similar to a flow chart, in which:
each internal node represents a test for a certain attribute;
each branch represents the result of the test;
each leaf node represents a class label;
the path from the root node to the leaf node represents a classification rule. The classification rule of the decision tree is constructed by a decision tree algorithm according to the feature vector and the classification label.
Decision tree learning is a method for constructing a decision tree from a training data set. During learning, the data set is repeatedly split and the tree is built recursively until no further split is possible or all samples in a branch can be assigned to a single class. A decision tree learned in this way easily overfits the training set.
The random forest is a classifier comprising a plurality of decision trees, and the final output class is determined by the mode of the class output by the contained decision trees. In the process of constructing the random forest, a plurality of decision trees are constructed randomly by using different parts of the training set.
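The voting rule described above can be sketched as follows. The three hand-written trees are hypothetical stand-ins for learned decision trees, and the function name forest_predict is an assumption; an odd number of trees is used so that a binary vote cannot tie.

```python
from collections import Counter

def forest_predict(trees, features):
    """Return the mode of the classes predicted by the individual trees."""
    votes = [tree(features) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical decision stumps; each maps a feature vector to class 0 or 1.
trees = [
    lambda v: 1 if v[0] > 0 else 0,
    lambda v: 1 if v[1] > 2 else 0,
    lambda v: 1 if v[0] + v[1] > 3 else 0,
]
print(forest_predict(trees, [1, 0]))  # votes are 1, 0, 0
print(forest_predict(trees, [2, 2]))  # votes are 1, 0, 1
```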
Random forests, a widely used classifier, have several significant advantages:
1) under a plurality of application scenes, random forests are not easy to over-fit;
2) when random forests are used for processing high-dimensional data (namely data with a large number of characteristics), characteristic selection is generally not needed;
3) for an unbalanced classified data set, a random forest may balance the errors.
(8) Soot: analysis and transformation framework for Java and Android applications
Soot was originally a Java optimization framework and has gradually developed into a framework for analyzing, instrumenting, optimizing, and visualizing Java and Android applications. Briefly, Soot works by converting an input program (Java bytecode) into an intermediate representation (IR), analyzing and transforming the intermediate representation, and then converting the processed code into a target language such as Java bytecode for output.
Using the Soot framework, the following functions can be implemented:
building a call graph (Call graph);
performing points-to analysis;
building def-use chains (the basis of data flow analysis, on which data dependencies can be analyzed);
performing template-driven intraprocedural data flow analysis;
performing template-driven interprocedural data flow analysis;
performing flow-, field-, and context-sensitive pointer analysis.
(9) FlowDroid: taint analysis framework for Java and Android applications
FlowDroid is a static taint analysis framework for Java and Android applications that is context-, flow-, field-, and object-sensitive. The implementation of FlowDroid is based on Soot and Heros, where Heros is a general-purpose multi-threaded solver for IFDS (interprocedural finite distributive subset) and IDE (interprocedural distributive environment) problems.
FlowDroid ensures context and flow sensitivity by constructing a fairly precise call graph, and ensures field and object sensitivity through IFDS-based flow functions. In addition, FlowDroid implements accurate and efficient alias tracking in order to ensure context and field sensitivity.
FIG. 4 is a practical example of FlowDroid taint analysis. In FIG. 4, steps 1 to 7 form the path found by FlowDroid from the source to the sink without passing through a purifier, and it is easy to see that in the process FlowDroid finds that z.g.f, a.g.f, b.g are aliases of x.f.
A program often uses different names, such as local variables, fields of classes, and global variables, to refer to one and the same variable. When the program is not running, it cannot be determined which name is used to refer to a given variable. Static program analysis therefore requires alias analysis to obtain all the names that refer to a variable.
Referring to fig. 1 and fig. 5, an embodiment of a method for analyzing an influence of a configuration item on a performance of a software system according to the present invention includes:
S100: identifying and marking all performance operations in the software system according to code patterns preset for the software system, wherein the performance operations are time-intensive operations and/or space-intensive operations that affect the performance of the software system;
S200: identifying the dependency relationship between each performance operation and each configuration item of the software system to obtain a performance operation set corresponding to each configuration item, wherein each performance operation in the performance operation set has a dependency relationship with the configuration item;
S300: constructing a feature vector corresponding to each configuration item according to the performance operation set;
S400: inputting the feature vectors corresponding to the configuration items into a trained qualitative performance influence model, and judging whether the configuration items influence the performance of the software system to obtain a configuration item set influencing the performance of the software system, wherein the qualitative performance influence model is obtained by training on the feature vectors corresponding to the configuration items of a plurality of software systems.
The method for analyzing the influence of configuration items on the performance of a software system provided by this embodiment is a new white-box configuration performance analysis method, and step S100 identifies and marks all performance operations in the software system according to a code pattern preset by the software system.
The performance operation (PerfOp) in this embodiment refers to a time-intensive operation or a space-intensive operation, and the main difference between the time-intensive operation and the space-intensive operation is that a computer needs a long time to complete the time-intensive operation, and the computer needs a large amount of resources such as a memory and a disk to complete the space-intensive operation.
It will be appreciated that performance operations are strongly correlated with time-consuming and space-consuming operations. It is worth noting that a time-intensive operation and time complexity are two different concepts, as are a space-intensive operation and space complexity. Evaluating time and space complexity requires considering the input size at actual run time. For example, suppose the time complexity and space complexity of a function f() are very low, but an operation o must run f() 1000 times to complete. The time and space complexity of o still depend on the complexity of f() and are therefore also very low, yet o may nevertheless be a time-intensive or space-intensive operation.
Referring to FIG. 6, based on observation and study of software systems, and taking Java software systems as an example, the performance operations of this embodiment can be divided into four classes, whose corresponding code patterns are summarized in FIG. 6. The four classes are Java IO, thread operations, synchronization operations, and array creation. Each class has corresponding code patterns; for example, the code patterns for Java IO are: invoking a method in the java.io package, and invoking a method in the java.nio package.
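As a rough sketch of pattern-based identification, the following matches simplified textual statements against one pattern per class. Only the java.io and java.nio prefixes come from the description; the other three patterns, the textual statement form, and the function name classify_statement are assumptions for illustration and do not reproduce the contents of FIG. 6.

```python
import re

# Only the Java IO package prefixes are taken from the description.
# The remaining patterns are illustrative assumptions, not FIG. 6.
PATTERNS = [
    ("java_io", re.compile(r"\bjava\.n?io\.")),
    ("thread", re.compile(r"\bjava\.lang\.Thread\b")),
    ("synchronization", re.compile(r"\bsynchronized\b")),
    ("array_creation", re.compile(r"\bnew\s+[\w.]+\s*\[")),
]

def classify_statement(stmt):
    """Return the performance operation class of a statement, or None."""
    for name, pattern in PATTERNS:
        if pattern.search(stmt):
            return name
    return None

for stmt in ["java.nio.file.Files.readAllBytes(p)",
             "byte[] buf = new byte[size]",
             "int x = a + b"]:
    print(stmt, "->", classify_statement(stmt))
```

Supporting a new class of performance operation in this sketch amounts to appending one more (name, pattern) pair to PATTERNS.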
It should be noted that different types of software systems involve different performance operations, and the performance operations above do not necessarily cover all software systems. The method provided by this embodiment is general and extensible: a new class of performance operation can be supported simply by defining it and providing its code pattern.
In step S200, a dependency relationship between each performance operation and each configuration item of the software system is identified, so as to obtain a performance operation set corresponding to each configuration item, where each performance operation in the performance operation set has a dependency relationship with the configuration item.
In this embodiment, before identifying the dependency relationship between each performance operation and each configuration item of the software system, the method further includes: extracting configuration item information of the software system, wherein the configuration item information includes at least the names and number of the configuration items and the API (application programming interface) used to load the configuration items into the software system.
In this embodiment, any configuration item that affects the performance of the software system has a data dependency or a control dependency with some performance operation. Referring to fig. 7, the configuration item in fig. 7 is OptionX, and OptionX is an abstract configuration item, and may be any configuration item in the program. In fig. 7, a data dependency relationship and a control dependency relationship exist between the performance operation of the array arr and the configuration item OptionX, where the control dependency relationship includes an if branch control dependency and a loop control dependency, and the loop control dependency includes a loop boundary control dependency and a loop step size control dependency.
It is worth noting that the control dependencies and data dependencies between configuration items and performance operations of the software system are relatively independent yet sometimes intertwined: in most cases they can be identified independently, but in some cases identifying one requires the other.
Taint analysis, which is a means of information flow tracking analysis, is essentially a data flow analysis technique that can be used to identify data dependencies of configuration items in a program. In step S200 of the present embodiment, information flow tracking is performed using taint analysis to identify data dependencies of performance operations and configuration items in a program. The flow of identifying data dependencies is shown in fig. 8, and each step is explained in detail below:
Step 1: enter the program entry.
If a program provides multiple program entries, a virtual entry is created as the only program entry, with control flow edges to all of the original program entries.
Step 2: traverse the control flow and insert sinks to mark performance operations.
Traverse the control flow, identify the relevant branch statements and performance operations through their code patterns, and insert a statement that calls a sink function before each of them.
Step 3: return to the program entry and traverse the control flow again.
The first two steps are preparation (setting the sinks). The third step returns to the program entry and traverses the control flow again, this time analyzing the program in which the sinks have been inserted.
Step 4: create a taint (source) at the configuration loading API.
The configuration loading API is where configuration values first enter the program, so tracing starts from there.
Step 5: taint propagation (information flow propagation).
Taint propagation works as follows: the source point is the initial taint, and during data propagation it marks other variables as tainted. The path traveled by the taint is therefore the data dependency chain.
Step 6: record the sinks reached.
If the taint created for a configuration item at the configuration loading API eventually propagates through data flow to a pre-inserted sink, this indicates that the performance operation (or branch statement) at that sink has a data dependency on the configuration item.
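The source, propagation, and sink-recording steps above can be sketched on a straight-line program as follows. The tuple encoding of statements and the function name propagate_taint are assumptions for this sketch; it handles neither branches, calls, nor aliases, all of which a framework such as FlowDroid takes into account.

```python
def propagate_taint(program):
    """Propagate taints through a straight-line program and return a
    mapping from each reached sink to the options whose taint reached it."""
    taint = {}    # variable name -> set of option names it depends on
    reached = {}  # sink id -> set of option names reaching that sink
    for stmt in program:
        kind = stmt[0]
        if kind == "load":            # configuration loading API: create a taint
            _, var, option = stmt
            taint[var] = {option}
        elif kind == "assign":        # target receives the taints of its operands
            _, target, operands = stmt
            merged = set()
            for v in operands:
                merged |= taint.get(v, set())
            taint[target] = merged    # overwriting also clears any old taint
        elif kind == "sink":          # pre-inserted sink before a performance operation
            _, sink_id, operands = stmt
            hits = set()
            for v in operands:
                hits |= taint.get(v, set())
            if hits:
                reached[sink_id] = reached.get(sink_id, set()) | hits
    return reached

# Hypothetical program: OptionX flows into an array size, then is overwritten.
program = [
    ("load", "n", "OptionX"),
    ("assign", "size", ["n", "k"]),
    ("sink", "array_creation@3", ["size"]),
    ("assign", "size", ["k"]),
    ("sink", "java_io@5", ["size"]),
]
print(propagate_taint(program))
```

Only the first sink is recorded for OptionX, because size is overwritten with an untainted value before the second sink.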
It can be understood that, in the embodiment, the process of identifying data dependency relationships between the performance operations and the configuration items of the software system by using taint analysis specifically includes: entering a program inlet of a software system, traversing control flow, and creating a taint at a configuration item loading API as a source point; recording a data propagation path of a source point and a finally arrived sink point, wherein the performance operation at the sink point has a data dependency relationship on the configuration item; the sink is a program statement that the source point is not expected to reach, and is preset before the statement corresponding to the performance operation.
It should be noted that the control flow is the execution sequence of each statement, instruction or function in the code, and must be combined when identifying the data dependency. The execution order of the various parts of the code in the program code is known through the control flow.
In step S200 of the present embodiment, the control dependency relationship between the performance operation and the configuration item in the program is identified by constructing the control area of the configuration item.
Specifically, the control area of a configuration item is defined as follows: for a given configuration item, the control area is a sequence of statements that have a control dependency relationship with the configuration item, where, in control flow order, each next statement in the sequence is the immediate post-dominator of the statement before it.
In a control flow graph, the post-dominance (postdominate) relationship is defined as follows: for control flow nodes n and m, if every path from n to the program exit must pass through m, then node m post-dominates node n. If node m post-dominates node n, is different from n, and does not post-dominate any other node that post-dominates n, then node m immediately post-dominates node n, and m is the immediate post-dominator of n.
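For illustration, post-dominator sets can be computed with the standard iterative data flow formulation sketched below. The dictionary encoding of the control flow graph and the function names are assumptions; the sketch also assumes that the exit node is the only node without successors and that every node can reach it.

```python
def post_dominators(succ, exit_node):
    """Compute, for every node, the set of nodes that post-dominate it
    (each node post-dominates itself)."""
    nodes = set(succ) | {s for targets in succ.values() for s in targets}
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            succs = succ.get(n, [])
            if not succs:
                continue  # assumption: only the exit node lacks successors
            # m post-dominates n if it post-dominates every successor of n.
            new = set.intersection(*(pdom[s] for s in succs)) | {n}
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom

def immediate_post_dominator(pdom, n):
    """Return the strict post-dominator of n that is closest to n."""
    strict = pdom[n] - {n}
    for m in strict:
        # m is immediate if every strict post-dominator of n post-dominates m.
        if all(other in pdom[m] for other in strict):
            return m
    return None

# Hypothetical control flow graph of an if/else that rejoins at "join".
cfg = {"entry": ["if"], "if": ["then", "else"], "then": ["join"],
       "else": ["join"], "join": ["exit"], "exit": []}
pdom = post_dominators(cfg, "exit")
print(immediate_post_dominator(pdom, "if"))
```

In this graph both branches rejoin at join, so join is the immediate post-dominator of the branch node.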
Intuitively, the control area of a particular configuration item is the part of the program that is directly controlled by that configuration item. FIG. 9 depicts the areas controlled by the four configuration items OptionA, OptionB, OptionC, and OptionD.
In this embodiment, identifying the control dependency relationship between each performance operation and each configuration item of the software system by using the program dependency graph includes: traversing all nodes in the program dependency graph, and constructing a control area of each configuration item, wherein the control area of the configuration item is a statement sequence which has a direct control dependency relationship with the configuration item; and identifying the control dependency relationship between each performance operation and each configuration item of the software system according to the control area of each configuration item.
Specifically, the process of identifying the control dependency relationship between the performance operation and each configuration item of the software system is as follows:
1) constructing a program dependence graph by using a program dependence analysis technology;
2) traversing all nodes in the program dependence graph, and constructing a configuration item control area;
3) for a given configuration item, once its control area has been constructed, the control area contains two kinds of performance operations: performance operations that have both a data dependency and a control dependency on the configuration item; and performance operations that have no data dependency but do have a control dependency on the configuration item.
It is worth noting that control flow can be used to analyze control dependencies, but control flow alone can analyze only part of them. Specifically, simple control flow analysis can identify only the performance operations that have both a data dependency and a control dependency on a configuration item; identifying the performance operations that have a control dependency but no data dependency requires constructing the control area of the configuration item.
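One simplified reading of the control area, offered here only as a sketch, is the set of nodes reachable from a branch on the configuration item before its immediate post-dominator is reached. The graph encoding, the node names, and the function name control_region are assumptions; the immediate post-dominator is taken as given.

```python
def control_region(succ, branch, ipdom):
    """Collect the nodes reachable from `branch` before its immediate
    post-dominator `ipdom` (simplified control area of the branch)."""
    region = set()
    stack = list(succ.get(branch, []))
    while stack:
        node = stack.pop()
        if node == ipdom or node == branch or node in region:
            continue  # stop at the join point; skip loop back edges
        region.add(node)
        stack.extend(succ.get(node, []))
    return region

# Hypothetical graph: a branch on OptionA guards a file read and an array creation.
cfg = {"if_optA": ["read_file", "skip"], "read_file": ["new_array"],
       "new_array": ["join"], "skip": ["join"], "join": ["exit"]}
perf_ops = {"read_file", "new_array", "write_log"}

region = control_region(cfg, "if_optA", "join")
controlled_ops = region & perf_ops
print(sorted(controlled_ops))
```

The operations read_file and new_array fall inside the region and are therefore treated as control dependent on OptionA even if no data flows from the option into them, whereas write_log lies outside the region.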
In step S300, a feature vector corresponding to each configuration item is constructed according to the performance operation set.
For any configuration item, the set of performance operations with which there is a data dependency or a control dependency is known. It is understood that the set of performance operations for a configuration item can be considered a particular set of codes, identified by a particular code pattern.
After the performance operation set corresponding to the configuration item is obtained, a corresponding feature vector is constructed for the configuration item according to the performance operation set, namely the feature vector is used for describing the dependency relationship between the configuration item and the performance operation.
The construction of the feature vector is complex, and the construction of part of the feature vector is explained as follows:
Let the set of configuration items be $Options$, and let the set of the 4 classes of performance operations proposed in this embodiment be $PerfOps$. For any configuration item $option \in Options$ and performance operation $perfOp \in PerfOps$, let the function $f(option, perfOp)$ denote the number of performance operations $perfOp$ in the target software system that have a data dependency or control dependency relationship with the configuration item $option$.
In addition, let $perfOp_{data}$, $perfOp_{if}$, and $perfOp_{loop}$ denote the performance operations $perfOp$ that have, respectively, a data dependency relationship, an if branch control dependency relationship, and a loop control dependency relationship with the configuration item $option$. For $k \in \{data, if, loop\}$, let the function $g(option, perfOp_k)$ denote the number of performance operations $perfOp$ in the target software system that have the dependency corresponding to $k$ on the configuration item $option$.
The first 22 dimensions of the feature vector v are:
Figure BDA0003715736630000181
Figure BDA0003715736630000182
Figure BDA0003715736630000183
Figure BDA0003715736630000184
Figure BDA0003715736630000185
Figure BDA0003715736630000186
Figure BDA0003715736630000187
where i and j are used only as subscript indices. For example, when i is 0, the first formula, $v_0 = f(option, PerfOps_0)$, states that the first dimension of the feature vector is the number of performance operations of the first class, Java IO, that have a dependency relationship with the configuration item option.
As for
Figure BDA0003715736630000188
taking i = 0 as an example, it indicates that a data dependency relationship exists with the first class of performance operations, Java IO. Similarly, for
Figure BDA0003715736630000189
taking i = 0 as an example, it indicates that an if branch control dependency relationship exists with the first class of performance operations, Java IO; and for
Figure BDA00037157366300001810
taking i = 0 as an example, it indicates that a loop control dependency relationship exists with the first class of performance operations, Java IO.
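The counting functions f and g can be sketched over a list of dependency records as follows. The record format, the class names, and in particular the ordering of the resulting vector are assumptions made for illustration; the sketch does not reproduce the 22 dimensions given by the formulas above.

```python
PERF_CLASSES = ["java_io", "thread", "synchronization", "array_creation"]
KINDS = ["data", "if", "loop"]

def f(deps, option, perf_class):
    """Number of performance operations of `perf_class` that have any
    dependency relationship with `option`."""
    return len({op for (o, op, c, k) in deps
                if o == option and c == perf_class})

def g(deps, option, perf_class, kind):
    """Number of performance operations of `perf_class` that have a
    dependency of type `kind` (data, if, loop) on `option`."""
    return len({op for (o, op, c, k) in deps
                if o == option and c == perf_class and k == kind})

def partial_features(deps, option):
    # Assumed layout: the four f-counts, then the g-counts grouped by kind.
    vec = [f(deps, option, c) for c in PERF_CLASSES]
    vec += [g(deps, option, c, k) for k in KINDS for c in PERF_CLASSES]
    return vec

# Hypothetical records: (option, operation id, operation class, dependency kind).
deps = [
    ("OptionX", "op1", "java_io", "data"),
    ("OptionX", "op1", "java_io", "if"),
    ("OptionX", "op2", "java_io", "loop"),
    ("OptionX", "op3", "array_creation", "data"),
    ("OptionY", "op4", "thread", "if"),
]
print(partial_features(deps, "OptionX"))
```

An operation such as op1, which has both a data dependency and an if branch control dependency on OptionX, is counted once by f and once in each matching g.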
In step S400, the feature vectors corresponding to the configuration items are input into a trained qualitative performance influence model, and whether the configuration items influence the performance of the software system is determined, so as to obtain a configuration item set influencing the performance of the software system, where the qualitative performance influence model is obtained by training the feature vectors corresponding to the configuration items of the plurality of software systems.
In this embodiment, a qualitative performance impact model of a configuration item needs to be constructed, and then, for any brand-new target software system, after a corresponding feature vector is calculated for each configuration item of the target software system by the foregoing method, the feature vector is input into the constructed qualitative performance impact model, so that whether the configuration item affects the performance of the software system can be automatically determined, all configuration items affecting the performance of the software system are added to a configuration item set, and finally, all configuration items affecting the performance of the software system are obtained. The specific process is as follows:
First, several software systems are collected to build a training set. Each software system is run multiple times under different configurations, the actual running times are recorded, and from them it is determined whether each configuration item of the software system actually affects its performance. In addition, the feature vector corresponding to each configuration item in the software system is constructed. Together, these labels and feature vectors form the training set.
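The description does not specify how the recorded running times are turned into labels. One possible rule, assumed here purely for illustration, compares the mean running time per value of a configuration item against a relative threshold; the function name, the run format, and the 10% threshold are all hypothetical.

```python
from statistics import mean

def affects_performance(runs, option, threshold=0.10):
    """Label an option as performance-relevant if the mean running time
    differs by more than `threshold` (relative) between its values.
    The rule and the threshold are assumptions, not part of the method."""
    by_value = {}
    for config, seconds in runs:
        by_value.setdefault(config[option], []).append(seconds)
    means = [mean(times) for times in by_value.values()]
    fastest, slowest = min(means), max(means)
    return (slowest - fastest) / fastest > threshold

# Hypothetical measurements: (configuration, running time in seconds).
runs = [
    ({"cache": True, "log": "info"}, 10.0),
    ({"cache": True, "log": "debug"}, 10.2),
    ({"cache": False, "log": "info"}, 15.1),
    ({"cache": False, "log": "debug"}, 14.9),
]
print(affects_performance(runs, "cache"))
print(affects_performance(runs, "log"))
```

With these hypothetical measurements, cache is labeled as affecting performance and log is not.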
Then, this embodiment establishes the qualitative performance influence model of configuration items, which consists of a random forest classification model and the cDEP configuration item dependency detector, as shown in FIG. 10. The random forest model is trained with the random forest algorithm in the sklearn library; the key step is to divide the feature vectors into a training set and a test set, which are used to build the random forest classification model RandomForestClassifier.
In this embodiment, a feature vector is constructed for each configuration item, and the dimension of the feature vector is positively correlated with the number of classes in which performance operations have occurred in the target software system, so the dimension of the feature vector corresponding to the configuration item is usually high. In addition, because the functions of each software system are different, the frequency and the characteristics of performance operations of different classes of software systems (such as computation-intensive or memory-intensive) are obviously different, and the data of the constructed training set is likely to be unbalanced.
Random forests are not prone to overfitting, can handle high-dimensional data, and can balance the errors on unbalanced classification data sets, so they address the problems above. A random forest can also learn multiple interpretable classification rules. This embodiment therefore uses a random forest classification model to perform binary classification of whether a configuration item influences the performance of the software system, giving a preliminary, qualitative answer to the question of how configuration items influence software system performance.
The present embodiment has not considered the dependency relationship between configuration items so far, but considers each configuration item as an independent configuration item. However, in fact, there may be dependency relationships between configuration items of the software system, and when the configuration is adjusted, the configuration items with dependency relationships are usually considered together.
This embodiment considers that if configuration item OptionA depends on configuration item OptionB, and OptionA influences the performance of the software system, then OptionB also influences the performance of the software system.
cDEP is a tool for discovering configuration item dependencies, proposed by Qingrong Chen et al. in 2020. To take the dependencies between configuration items into account, this embodiment uses cDEP to detect the dependencies between configuration items in the software system and uses them to further correct and refine the classification results of the random forest classification model.
In this embodiment, the modifying the classification result of the random forest classification model by the configuration item dependency detector includes:
when a first configuration item of the software system depends on a second configuration item, if the random forest classification model judges that the first configuration item influences the performance of the software system and that the second configuration item does not, the cDEP configuration item dependency detector corrects the result so that the second configuration item is also judged to influence the performance of the software system.
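The correction rule can be sketched as follows. The input format and the function name apply_dependency_correction are assumptions, and the dependency pairs are taken as given rather than obtained from cDEP. Repeating the rule until nothing changes, so that chains of dependencies are covered, is likewise an assumption of this sketch.

```python
def apply_dependency_correction(predicted, depends_on):
    """Correct classifier output using configuration item dependencies.
    `predicted` maps option name -> bool (influences performance);
    `depends_on` lists (first, second) pairs meaning first depends on second."""
    corrected = dict(predicted)
    changed = True
    while changed:  # repeat so that chains such as A -> B -> C are covered
        changed = False
        for first, second in depends_on:
            if corrected.get(first) and not corrected.get(second):
                corrected[second] = True
                changed = True
    return corrected

# Hypothetical classifier output and dependency pairs.
predicted = {"A": True, "B": False, "C": False, "D": False}
depends_on = [("B", "C"), ("A", "B")]
print(apply_dependency_correction(predicted, depends_on))
```

In this example B is corrected because A depends on it, C is corrected in turn because B depends on it, and D is left unchanged.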
The method for analyzing the influence of configuration items on the performance of a software system uses program analysis to track the time-intensive or space-intensive operations that have a dependency relationship with each configuration item, and constructs a corresponding feature vector for each configuration item from the program analysis results. The granularity is as fine as the individual configuration item, and the method is not limited to Boolean types or enumerable finite-valued types but supports configuration items of any type. A qualitative performance influence model is built with a random forest, so no time-consuming local measurement is needed: analyzing the source code of the software system once is enough to judge whether a specific configuration item influences the performance of the configurable system, which greatly reduces the cost of performance analysis. Whether each configuration item influences the performance of the software system can be predicted accurately without running the software system, and the set of configuration items that really influence performance can be found. This improves the efficiency of configuring the software system and helps configure it correctly to improve its performance. In addition, the embodiment of the invention is interpretable: the underlying reason why a configuration item influences performance can be understood from the program analysis results and the classification rules of the performance model.
Referring to fig. 2, the present invention further provides an embodiment of an analysis system for analyzing an influence of a configuration item on a performance of a software system, including:
the performance operation identification module 11 is used for identifying and marking all performance operations in the software system according to a code pattern preset by the software system, wherein the performance operations are time-intensive operations and/or space-intensive operations which affect the performance of the software system;
a dependency relationship identifying module 22, configured to identify a dependency relationship between each performance operation and each configuration item of the software system, to obtain a performance operation set corresponding to each configuration item, where each performance operation in the performance operation set has a dependency relationship with the configuration item;
a feature vector construction module 33, configured to construct a feature vector corresponding to each configuration item according to the performance operation set;
and a configuration item set determining module 44, configured to input the feature vector corresponding to each configuration item into a trained qualitative performance influence model, and determine whether the configuration item influences the performance of the software system, so as to obtain a configuration item set that influences the performance of the software system, where the qualitative performance influence model is obtained by training feature vectors corresponding to configuration items of multiple software systems.
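As a rough illustration of what the feature vector construction module does with a performance operation set, the sketch below counts the operations of one configuration item by operation kind and dependency kind. The four features, the class name and the function name are hypothetical; the actual feature definitions used by the invention are not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceOperation:
    kind: str        # "time" (time-intensive) or "space" (space-intensive)
    dependency: str  # "data" or "control" dependency on the configuration item


# Hypothetical feature layout: one count per (kind, dependency) combination.
FEATURES = [("time", "data"), ("time", "control"),
            ("space", "data"), ("space", "control")]


def build_feature_vector(operations):
    """Map the performance operation set of one configuration item to a count vector."""
    counts = Counter((op.kind, op.dependency) for op in operations)
    return [counts[feature] for feature in FEATURES]


# Performance operation set of a hypothetical configuration item.
ops = [
    PerformanceOperation("time", "data"),
    PerformanceOperation("time", "data"),
    PerformanceOperation("space", "control"),
]
vector = build_feature_vector(ops)
```

A fixed feature order matters here: every configuration item, from every software system, must be mapped to a vector with the same layout before the vectors can be fed to one shared model.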
The analysis system for the influence of configuration items on the performance of a software system provided by the embodiment of the invention uses program analysis techniques to track the time-intensive or space-intensive operations that have a dependency relationship with each configuration item, and constructs a feature vector for each configuration item from the program analysis results. The analysis granularity therefore reaches individual configuration items, and the system is not limited to Boolean types or enumerable finite numerical types but supports configuration items of any type. Because the qualitative performance influence model is built with a random forest, no time-consuming local measurement is needed: analyzing the source code of the software system once is enough to judge whether a specific configuration item influences the performance of the configurable system, which greatly reduces the cost of performance analysis. Whether each configuration item influences performance can be predicted accurately without running the software system, so the set of configuration items that actually influence performance can be found, which makes configuring the software system more efficient and helps configure it correctly to improve its performance. In addition, the invention is interpretable: the underlying reason why a configuration item influences performance can be understood from the program analysis results and the classification rules of the performance model.
In addition, according to the white-box performance analysis method based on program analysis, the invention designs and realizes a configuration analysis tool ConfigAnalyzer for Java application.
The ConfigAnalyzer tool realizes the white box performance analysis method provided by the invention and supports a Java software system. The ConfigAnalyzer uses FlowDroid to realize static taint analysis sensitive to context, flow, field and object, and performs necessary custom expansion on a Soot analysis framework based on the FlowDroid to support analysis logic required by the ConfigAnalyzer.
The ConfigAnalyzer is divided into the following two main modules:
1) a program analysis module, which marks the performance operations and identifies the dependency relationships between configuration items and performance operations;
2) a performance model module, which constructs the feature vectors of the configuration items and the qualitative performance influence model. The performance model module comprises two parts: constructing feature vectors from the results obtained by the program analysis module, and building the qualitative performance influence model from the feature vectors.
The program analysis module of the ConfigAnalyzer tool is shown in fig. 11. It includes three packages with the following functions:
1) sysu.dds.analysis: implements taint analysis, program dependence analysis, information extraction and the like;
2) sysu.dds.visual: implements visualization of intermediate results;
3) sysu.dds.utility: provides helper functionality for the first two packages.
Information extraction is mainly used to extract configuration item information of the software system, such as the names of the configuration items, the number of configuration items, and the APIs used when the configuration items are loaded into the software system.
Visualization of intermediate results mainly includes visualization of the control regions of the configuration items (as shown in fig. 9), visualization of the inserted sink markers, and the like.
Following the evaluation approach used in existing research on white-box performance analysis methods, and balancing evaluation effect against workload, the invention selects 6 representative software systems from 19 existing real software systems to evaluate the qualitative performance influence model established by ConfigAnalyzer. Table 1 gives an overview of the target software systems.
TABLE 1
Figure BDA0003715736630000221
It should be noted that an effective configuration is a configuration under which the software system executes correctly without crashing. Every effective configuration allows the software system to complete the corresponding task, but the resources required to complete the task differ from one configuration to another.
Consider 5 configuration items of Boolean type, each with the value range {false, true}. These 5 configuration items alone can form 2^5 = 32 configurations, and 10 Boolean configuration items can form 2^10 = 1024 configurations. If a configuration item is not of Boolean type, its value space is larger, and the total number of configurations becomes very large.
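The growth of the configuration space can be checked with a short enumeration. The sketch below is illustrative only; the item names (`opt0` to `opt4`, `level`) and the helper functions are hypothetical.

```python
from itertools import product
from math import prod


def enumerate_configurations(domains):
    """List every configuration as a dict, given each item's value domain."""
    names = list(domains)
    return [dict(zip(names, values))
            for values in product(*(domains[name] for name in names))]


def configuration_count(domains):
    """Size of the configuration space without enumerating it."""
    return prod(len(values) for values in domains.values())


# Five Boolean configuration items with the value range {false, true}.
boolean_items = {f"opt{i}": (False, True) for i in range(5)}
configs = enumerate_configurations(boolean_items)

# Adding a single non-Boolean item with 100 possible values multiplies the space by 100.
mixed_items = {**boolean_items, "level": tuple(range(100))}
mixed_count = configuration_count(mixed_items)
```

Here `len(configs)` is 32 and `mixed_count` is 3200, which shows why measuring every configuration quickly becomes impractical once non-Boolean items are involved.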
In the experiment of the invention, the 6 target software systems are divided into two groups: the samples corresponding to the configuration items of the four software systems Batik, H2, Kanzi and Prevayler are used as the training set, and the samples corresponding to the configuration items of the two software systems Catena and Sunflow are used as the test set.
The feature vectors constructed from the program analysis results have 50 dimensions in total. A random forest regressor model is built, and cDEP is then run to supplement the configuration dependency relationships.
In this embodiment, after the program analysis module of the ConfigAnalyzer tool generates the feature vector of a configuration item, the feature vector is input into the random forest model. The random forest consists of multiple decision trees; the feature vector is passed through each decision tree, which yields a predicted classification label according to its classification rules.
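How a forest turns the labels of its individual trees into one prediction can be sketched with hard majority voting, as below. This is a simplified illustration: the three hand-written "trees", their thresholds and the feature names `time_ops` and `space_ops` are hypothetical, and real random forest libraries may combine trees differently (for example by averaging class probabilities).

```python
from collections import Counter


# Three hypothetical one-rule decision trees. Each returns 1 (the configuration
# item influences performance) or -1 (it does not).
def tree_a(x):
    return 1 if x["time_ops"] > 0 else -1


def tree_b(x):
    return 1 if x["space_ops"] > 2 else -1


def tree_c(x):
    return 1 if x["time_ops"] + x["space_ops"] > 1 else -1


def forest_predict(trees, x):
    """Pass x through every tree and return the most common label."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_a, tree_b, tree_c]  # an odd number of trees avoids ties between two labels
label_low = forest_predict(trees, {"time_ops": 1, "space_ops": 0})
label_high = forest_predict(trees, {"time_ops": 2, "space_ops": 0})
```

For the first input only `tree_a` votes 1, so the forest returns -1; for the second input `tree_a` and `tree_c` vote 1, so the forest returns 1. Because each tree is an explicit rule, the path that produced a vote can be inspected, which is the kind of interpretability the classification rules provide.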
The results are shown in Table 2, which compares the influence predicted by the qualitative performance model for the configuration items of the tested software systems with the actual influence. Here y denotes the actual classification of a configuration item and y_predict denotes the classification predicted by the model. A classification of -1 indicates that the configuration item does not affect the performance of the software system, and a classification of 1 indicates that it does.
TABLE 2
Figure BDA0003715736630000231
The experimental results show that, without running the program, ConfigAnalyzer correctly predicts whether a configuration item influences performance for 84.21% of the configuration items in the tested software systems. This accuracy is high and indicates that ConfigAnalyzer can effectively establish a qualitative performance influence model of a software system.
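The accuracy figure is the share of configuration items whose predicted label matches the actual label. A minimal sketch of that computation is given below; the label lists are toy values chosen for illustration and are not the data of Table 2.

```python
def accuracy(y_true, y_pred):
    """Fraction of configuration items whose predicted label equals the actual label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)


# Toy labels only: 1 = influences performance, -1 = does not.
y = [1, -1, 1, 1, -1]
y_predict = [1, -1, -1, 1, -1]
score = accuracy(y, y_predict)
```

With these toy labels four of the five predictions match, so `score` is 0.8.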
The invention provides an analysis system for the influence of configuration items on the performance of a software system, and realizes a ConfigAnalyzer tool, which is a Java application program-oriented configuration analysis tool.
ConfigAnalyzer first statically tracks time or space intensive operations that have dependencies with configuration items through program analysis techniques such as taint analysis, program control analysis, and the like. And then, the ConfigAnalyzer constructs a characteristic vector according to the result of program analysis, and a qualitative performance influence model is established by using a random forest.
ConfigAnalyzer helps a user find the set of configuration items that actually affect system performance without running the software system. Unlike traditional black-box methods, ConfigAnalyzer is interpretable: a user can understand the underlying reason why a configuration item influences performance from the program analysis results and the classification rules of the performance model. Unlike existing white-box methods, ConfigAnalyzer supports configuration items of any type, and because it builds a qualitative performance influence model, it needs no time-consuming local measurement, which greatly reduces the analysis overhead.
The ConfigAnalyzer tool of the present invention has the following advantages:
(1) Interpretability
ConfigAnalyzer first statically tracks the time-intensive or space-intensive operations that have dependencies on configuration items through program analysis techniques such as taint analysis and program control analysis. ConfigAnalyzer then constructs feature vectors from the program analysis results and builds a qualitative performance influence model using a random forest. Unlike traditional black-box methods, ConfigAnalyzer is interpretable: a user can understand the underlying reason why a configuration item influences performance from the program analysis results and the classification rules of the performance model.
(2) Accuracy
The experimental result shows that the ConfigAnalyzer accurately predicts whether 84.21% of configuration items in the tested software system influence the performance on the premise of not running the program, and can effectively establish a qualitative performance influence model of the software system.
(3) Analysis granularity
The method can accurately judge whether a specific configuration item influences the performance of the configurable system. This differs from testing methods that treat the software system as a black box, sample the configuration space to obtain a subset of configurations, and measure the performance of the system under each sampled configuration for a specific workload. Such testing methods can only determine whether a whole configuration affects the performance of the configurable system; they cannot reach the granularity of individual configuration items.
(4) Efficiency and completeness
The white-box performance analysis methods proposed in existing schemes support only Boolean configuration items or enumerable finite-valued configuration items (and the latter must be discretized into multiple Boolean configuration items). This is a severe limitation: after discretization the number of configuration items grows greatly and the running time of the tool increases exponentially. The invention can judge whether a specific configuration item influences the performance of the configurable system by analyzing the source code of the configurable system only once. The type of the configuration item can cover all types allowed by Java programs and is not limited to Boolean types or enumerable finite numerical types. Moreover, the invention requires no hardware to run the software system, no execution environment to be built for the configurable system, and no testing overhead for the configurable system under different configurations and specific workloads.
The embodiment of the invention provides the ConfigAnalyzer tool, which finds the set of configuration items that actually influence system performance by analyzing the source code of the target system once, with no restriction on the type of the configuration items. Unlike traditional methods that follow the black-box approach to build an accurate performance influence model of a software system, ConfigAnalyzer is interpretable: by examining the program analysis results generated by ConfigAnalyzer and the classification rules of the qualitative performance model, together with the relationship between configuration items and specific performance operations, a user can further understand the root cause of each configuration item's influence on system performance.
Compared with existing methods that follow the white-box approach to build a performance influence model, ConfigAnalyzer does not incur the costs of the hardware, software environment, measurement and testing time, energy consumption and the like required to run the configurable software system. It also removes the restriction on the type of configuration items and greatly reduces the cost of analyzing the relationship between configuration items and the performance of the software system.
The qualitative performance influence model established by ConfigAnalyzer was evaluated experimentally. The results show that, without running the program, ConfigAnalyzer correctly predicts whether a configuration item influences performance for 84.21% of the configuration items in the tested software systems, so it can effectively establish a qualitative performance influence model of a software system with good accuracy.
It should be noted that the present invention includes, but is not limited to, the above embodiments. All technical solutions that follow the concept of the present invention fall within its protection scope; for example, the following also fall within the protection scope of the present invention:
(1) The static taint analysis used in the invention is replaced by dynamic taint analysis combined with program testing. When the test coverage reaches a high level (80% to 99%), the results output by the program analysis module of the invention can still be obtained and used as input to the performance model module.
(2) The random forest classification model of the performance model module is replaced by any other classification model to classify the configuration items. Part of the interpretability of the classification results may be lost, but the classification results can still be generated.
(3) Only configuration item sampling and program testing are used to judge whether a configuration item influences the performance of the configurable system. Each configuration item is sampled, and a Cartesian product operation is performed on the sampling results to obtain a subset of the configuration space (within which the configurations being compared differ in the value of only one configuration item, while all other configuration items are set to the same values). A program performance test is run for each configuration in the subset, and whether a configuration item influences the performance or behavior of the configurable system is judged by analyzing the performance test results across the different configurations.
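Alternative (3) can be sketched as a one-item-at-a-time comparison, shown below. This is a sketch under stated assumptions rather than a prescribed procedure: the `measure` callback, the relative tolerance of 5%, the baseline configuration and the item names `threads` and `verbose` are all hypothetical, and `fake_measure` stands in for a real program performance test so that the example is deterministic.

```python
def one_at_a_time_configs(samples, baseline, item):
    """Configurations that differ from the baseline only in the value of `item`."""
    return [{**baseline, item: value} for value in samples[item]]


def item_affects_performance(item, samples, baseline, measure, tolerance=0.05):
    """Judge an item by the relative spread of the measurements across its sampled values.

    `measure` is a hypothetical callback returning a positive cost (for example a
    run time in seconds) for one configuration.
    """
    costs = [measure(cfg) for cfg in one_at_a_time_configs(samples, baseline, item)]
    highest, lowest = max(costs), min(costs)
    return (highest - lowest) / highest > tolerance


def fake_measure(cfg):
    """Deterministic stand-in for a performance test: only `threads` matters."""
    return 10.0 / cfg["threads"]


samples = {"threads": [1, 2, 4], "verbose": [False, True]}
baseline = {"threads": 1, "verbose": False}
affected = {item: item_affects_performance(item, samples, baseline, fake_measure)
            for item in samples}
```

With the stand-in measurement, `affected` marks `threads` as influencing performance and `verbose` as not. A real test would need repeated runs and a tolerance chosen to absorb measurement noise, which is part of the measurement cost that the static approach avoids.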
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for analyzing the influence of configuration items on the performance of a software system is characterized by comprising the following steps:
identifying and marking all performance operations in the software system according to a code pattern preset by the software system, wherein the performance operations are time-intensive operations and/or space-intensive operations which affect the performance of the software system;
identifying the dependency relationship between each performance operation and each configuration item of the software system to obtain a performance operation set corresponding to each configuration item, wherein each performance operation in the performance operation set has a dependency relationship with the configuration item;
constructing a feature vector corresponding to each configuration item according to the performance operation set;
and inputting the feature vectors corresponding to the configuration items into a trained qualitative performance influence model, and judging whether the configuration items influence the performance of the software system to obtain a configuration item set influencing the performance of the software system, wherein the qualitative performance influence model is obtained by training the feature vectors corresponding to the configuration items of a plurality of software systems.
2. The method of claim 1, wherein the qualitative performance impact model comprises:
a random forest classification model and a configuration item dependence detector;
wherein the random forest classification model performs binary classification on whether each configuration item influences the performance of the software system, and the configuration item dependence detector corrects the classification result of the random forest classification model.
3. The method of claim 1, wherein the dependency relationship comprises:
data dependencies and control dependencies;
wherein the data dependency is a dependency arising from data flow, and the control dependency is a dependency arising from program control flow.
4. The method of claim 3, wherein identifying dependencies between the performance operations and the configuration items of the software system comprises:
identifying data dependencies between each of the performance operations and each configuration item of the software system using taint analysis;
identifying control dependencies between each of the performance operations and each configuration item of the software system using a program dependency graph; the program dependence graph is constructed by using a program dependence analysis technology and is used for describing the control dependence and the data dependence of the program.
5. The method of claim 4, wherein identifying data dependencies between the performance operations and the configuration items of the software system using taint analysis comprises:
entering a program entry point of the software system, traversing the control flow, and creating a taint at a configuration item loading API as a source point;
recording a data propagation path of a source point and a finally arrived sink point, wherein the performance operation at the sink point has a data dependency relationship on the configuration item; the sink is a program statement that the source is not expected to reach, and the sink is preset before the statement corresponding to the performance operation.
6. The method of claim 4, wherein identifying control dependencies between the performance operations and the configuration items of the software system using a program dependency graph comprises:
traversing all nodes in the program dependency graph, and constructing a control area of each configuration item, wherein the control area of the configuration item is a section of statement sequence which has a direct control dependency relationship with the configuration item;
identifying a control dependency between each of the performance operations and each configuration item of the software system based on the control region of each of the configuration items.
7. The method for analyzing the influence of configuration items on the performance of a software system according to claim 2, wherein the training process of the random forest classification model comprises the following steps:
dividing feature vectors corresponding to configuration items of a plurality of software systems into a training set and a test set;
and training the random forest classification model according to the training set and a random forest algorithm.
8. The method for analyzing the influence of configuration items on the performance of a software system according to claim 2, wherein the correcting of the classification result of the random forest classification model by the configuration item dependence detector comprises:
when a first configuration item of the software system depends on a second configuration item, if the random forest classification model judges that the first configuration item influences the performance of the software system and the second configuration item does not, the configuration item dependence detector corrects the classification of the second configuration item so that it is also marked as influencing the performance of the software system.
9. The method for analyzing influence of configuration items on performance of a software system according to any one of claims 1 to 8, wherein identifying dependencies between the performance operations and the configuration items of the software system further comprises:
and extracting configuration item information of the software system, wherein the configuration item information at least comprises the name and the number of configuration items and an API (application programming interface) used when the configuration items are loaded into the software system.
10. An analysis system for analyzing the impact of configuration items on the performance of a software system, comprising:
a performance operation identification module, configured to identify and mark all performance operations in a software system according to a code pattern preset by the software system, wherein the performance operations are time-intensive operations and/or space-intensive operations which affect the performance of the software system;
a dependency relationship identification module, configured to identify a dependency relationship between each performance operation and each configuration item of the software system, to obtain a performance operation set corresponding to each configuration item, where each performance operation in the performance operation set has a dependency relationship with the configuration item;
the characteristic vector construction module is used for constructing a characteristic vector corresponding to each configuration item according to the performance operation set;
and a configuration item set determining module, configured to input the feature vector corresponding to each configuration item into a trained qualitative performance influence model and judge whether the configuration item influences the performance of the software system, so as to obtain a configuration item set influencing the performance of the software system, wherein the qualitative performance influence model is obtained by training with the feature vectors corresponding to the configuration items of a plurality of software systems.
CN202210736612.1A 2022-06-27 2022-06-27 Method and system for analyzing influence of configuration items on performance of software system Pending CN114996111A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210736612.1A CN114996111A (en) 2022-06-27 2022-06-27 Method and system for analyzing influence of configuration items on performance of software system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210736612.1A CN114996111A (en) 2022-06-27 2022-06-27 Method and system for analyzing influence of configuration items on performance of software system

Publications (1)

Publication Number Publication Date
CN114996111A true CN114996111A (en) 2022-09-02

Family

ID=83037549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210736612.1A Pending CN114996111A (en) 2022-06-27 2022-06-27 Method and system for analyzing influence of configuration items on performance of software system

Country Status (1)

Country Link
CN (1) CN114996111A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116225965A (en) * 2023-04-11 2023-06-06 中国人民解放军国防科技大学 IO size-oriented database performance problem detection method
CN116225965B (en) * 2023-04-11 2023-10-10 中国人民解放军国防科技大学 IO size-oriented database performance problem detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination