CN110928757B - Performance analysis method for positioning HDFS (Hadoop distributed File System) key low-efficiency function based on Bayesian network - Google Patents

Info

Publication number
CN110928757B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911163380.XA
Other languages
Chinese (zh)
Other versions
CN110928757A (en)
Inventor
杨海龙
刘一
陈鹤
李云春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Application filed by Beihang University
Priority to CN201911163380.XA
Publication of CN110928757A
Application granted
Publication of CN110928757B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 Performance evaluation by statistical analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/29 Graphical models, e.g. Bayesian networks

Abstract

The invention relates to a performance analysis method, based on a Bayesian network, for locating the key inefficient functions of HDFS (Hadoop Distributed File System). Widely used big data platforms such as Hadoop and Spark take HDFS as their default distributed file system. When the distributed file system supports upper-layer applications, the inefficiency of certain functions can lower the execution efficiency of the whole big data application, so detecting the key inefficient functions helps big data application developers improve application performance. The method statistically analyses the function running times and I/O data volumes obtained by instrumenting the system, calculates an inefficiency probability for each function, builds a Bayesian network from these inefficiency probabilities, and uses the network to find the key inefficient functions in the HDFS source code that are worth optimizing.

Description

Performance analysis method for positioning HDFS (Hadoop distributed File System) key low-efficiency function based on Bayesian network
Technical Field
The invention relates to a Bayesian-network-based performance analysis method for locating the key inefficient functions of HDFS (Hadoop Distributed File System). It is intended for performance analysis, resource monitoring, performance bottleneck diagnosis and visualization of big data distributed file systems.
Background
Over the past decades, the rapid development of Internet technology and the spread of terminals such as computers and mobile phones have brought the Internet into ordinary households and caused data volumes to grow exponentially, which calls for big data and distributed computing support. A big data storage system can be analysed at four levels: raw devices, local file systems, distributed file systems and big data applications. The distributed file system is an important link among them: distributed computing depends on it, so analysing and optimizing its performance has significant research and practical value. Today's widely used big data platforms, Hadoop and Spark, both use HDFS (Hadoop Distributed File System) as their default distributed file system, so optimizing HDFS performance is particularly valuable.
Most previous studies treat HDFS as a black box. Such coarse-grained analysis cannot find the root cause of a performance problem, so the optimizations these studies propose cannot fundamentally solve HDFS performance problems. Other studies locate performance bottlenecks through Trace analysis, but their analysis is neither deep nor systematic. Existing HDFS analysis methods fall roughly into the following categories:
(1) Benchmark testing
Much work on distributed file system performance measures read and write times by running benchmarks or simple test programs. Some studies evaluate HDFS with the TestDFSIO benchmark shipped with Hadoop. Some use two typical benchmarks, TeraSort and TestDFSIO, to compare Hadoop performance on the HDFS and Lustre file systems. Others use Performance Evaluation Process Algebra (PEPA), a formal language, to analyse HDFS performance; however, this yields only coarse metrics such as the response time of write operations, which is too coarse-grained to guide optimization inside HDFS;
(2) Statistics-based analysis
These methods instrument the distributed file system to obtain detailed low-level information, compute the time a task spends in each step, and analyse each function statistically to find the causes of inefficiency. Their drawback is that the search for inefficient functions relies entirely on statistics; some merely compare times and select the functions that take longest. The causes of inefficiency they identify are therefore incomplete, and they offer no practical guidance for improvement;
(3) Local optimization
Many studies on HDFS performance optimization address only a single performance problem. For example, some optimize the redundant backup strategy of HDFS, some address the poor performance of HDFS when handling small files, and some specifically optimize the HDFS name node. None of them considers the optimization of HDFS performance globally, at the macro level.
In summary, the prior art is either too coarse-grained or not comprehensive enough, and cannot analyse and optimize HDFS performance as a whole.
Disclosure of Invention
The problem solved by the invention: to overcome the shortcomings of the prior art, a Bayesian-network-based performance analysis method for locating the key inefficient functions of HDFS is provided, which locates key inefficient functions at fine, function-level granularity and takes a global view.
The technical solution of the invention is as follows. In a performance analysis method for locating the key inefficient functions of HDFS (Hadoop Distributed File System) based on a Bayesian network, function feature information is obtained through HTrace-based instrumentation, and the functions are then analysed with a Bayesian network and statistical cause analysis to obtain the key inefficient functions of HDFS. The advantage is that the causes of inefficiency in the HDFS layer are identified at a finer granularity, which makes it easier for users to locate performance bottlenecks and improve the distributed file system. The method comprises the following steps (1) to (8):
step (1), performing function-level source code instrumentation on HDFS;
Instrumentation inserts probes, i.e. code for collecting performance data, into functions in the HDFS source code; the instrumented functions are listed in Table 1. This makes it possible to obtain, while an application runs, the entry and exit timestamps of each instrumented HDFS function and the amount of data the function reads or writes. The code that calls the instrumentation function is inserted at the entry of each function to be instrumented, and the function interface used for instrumentation is:
TraceScope newPathTraceScope(String description,String path);
TABLE 1 Instrumentation target functions
[table image BDA0002286756760000031 in the original]
step (2), sampling the data obtained by instrumentation;
When probes are inserted into code segments that execute briefly but very frequently, the instrumentation code can severely disturb program performance. Moreover, if the program runs for a long time and the system is large, the data generated may be too large to store and analyse. An improved HTrace is therefore used for sampling; the sampling method is a token bucket algorithm, with the token bucket size set to 1000 and the time interval set to 180 ms;
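A minimal sketch of token-bucket sampling with the parameters stated above (bucket size 1000, interval 180 ms) follows. It is not the improved HTrace sampler itself: the class name `TokenBucketSampler`, the injectable `clock`, and the reading of the "time interval" as the time needed to earn one new token are all assumptions made for illustration.

```python
import time


class TokenBucketSampler:
    """Decide whether a trace span is recorded, using a token bucket (hypothetical helper)."""

    def __init__(self, capacity=1000, refill_interval_s=0.180, clock=time.monotonic):
        self.capacity = capacity                  # bucket size from the text
        self.refill_interval_s = refill_interval_s  # 180 ms interval from the text
        self.clock = clock
        self.tokens = capacity                    # start with a full bucket
        self.last_refill = clock()

    def _refill(self):
        # Assumption: one token is earned per elapsed interval, capped at capacity.
        now = self.clock()
        earned = int((now - self.last_refill) / self.refill_interval_s)
        if earned > 0:
            self.tokens = min(self.capacity, self.tokens + earned)
            self.last_refill += earned * self.refill_interval_s

    def should_sample(self):
        """Return True and consume a token if one is available, else False."""
        self._refill()
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False
```

A probe would call `should_sample()` before recording a span, so bursts of very frequent short calls are recorded only until the bucket empties and then at the refill rate.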
step (3), calculating the inefficiency probability of each function;
Statistics are computed over the function execution times obtained in step (2). For functions that involve I/O, the metric is the read/write time per unit of data:
$$t_f^i = \frac{T_f^i}{D_f^i}$$
where $T_f^i$ is the execution time of function f at its i-th execution and $D_f^i$ is the amount of data read or written by function f at its i-th execution. For functions that do not involve I/O, the metric is the execution time itself:
$$t_f^i = T_f^i$$
For each function f, the average of the $t_f^i$ values lying between the 25% and 75% quantiles is calculated; the number of executions whose $t_f^i$ exceeds 1.5 times this average, divided by the total number of executions, is the inefficiency probability of the function;
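The calculation in step (3) can be sketched as follows. The function names are hypothetical, the quantile range is approximated by dropping the lowest and highest quarter of the sorted values (one of several possible quantile conventions), and "exceeds the average by 1.5 times" is read as "greater than 1.5 times the average"; inputs are assumed to be non-empty.

```python
def interquartile_mean(values):
    """Mean of the sorted values after dropping the lowest and highest quarter."""
    ordered = sorted(values)
    n = len(ordered)
    cut = n // 4                      # number of values trimmed from each end
    middle = ordered[cut:n - cut]
    return sum(middle) / len(middle)


def inefficiency_probability(exec_times, data_sizes=None, factor=1.5):
    """Fraction of executions whose metric exceeds factor * interquartile mean.

    exec_times: execution time of each call of one function.
    data_sizes: data volume read/written by each call (I/O functions only).
    """
    if data_sizes is not None:
        # I/O function: read/write time per unit of data.
        metric = [t / d for t, d in zip(exec_times, data_sizes)]
    else:
        # Non-I/O function: the execution time itself.
        metric = list(exec_times)
    baseline = interquartile_mean(metric)
    slow = sum(1 for m in metric if m > factor * baseline)
    return slow / len(metric)
```

For example, seven calls taking 1 time unit and one call taking 10 give an interquartile mean of 1, so one of eight executions counts as inefficient.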
step (4), constructing the function inefficiency probability data set;
Each execution of an application through steps (1) to (3) yields one set of function inefficiency probabilities, which forms one data record. Different experiments use different workloads or different data scales. At least 50 experiments are run to produce at least 50 records, and the experimental data are merged to construct the data set of function inefficiency probabilities;
step (5), constructing a Bayesian network from the inefficiency data set by structure learning;
Here the nodes of the Bayesian network represent functions, the node parameters represent the inefficiency probabilities of the functions, and the directed edges represent how inefficient behaviour in one function influences another. Bayesian structure learning generates, from a data set, a directed acyclic graph that represents conditional probabilities, using statistical methods and Bayesian probability calculation. The method uses a Bayesian scoring strategy and hill-climbing search for structure learning: Bayesian scoring is a classic and effective scoring strategy, while hill climbing reduces the search space and the complexity of the algorithm, making the method more practical. The structure learning formulas cover two aspects, the scoring function and the search strategy:
Bayesian scoring:
[formula image BDA0002286756760000046 in the original]
where G is a directed acyclic graph representing the probabilistic dependencies among the variables $X_i$ in the variable set X; D is the sample data set; [symbol image BDA0002286756760000047 in the original] is a hyperparameter; i indexes the i-th node $X_i$ of the node set X; j indexes the j-th value of $Pa(X_i)$; k indexes the k-th value of $X_i$; and $n_{ijk}$ is the number of instances in the data set that satisfy $X_i = x_{ik}$ and $Pa(X_i) = Pa(X_i)_j$.
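For orientation, a commonly used Bayesian-Dirichlet score that is consistent with these symbol definitions is shown below. It is offered as an assumption about the general form of such a score, not as a transcription of the pictured formula. The symbol $\alpha_{ijk}$ for the hyperparameter, the counts $q_i$ (number of parent configurations of $X_i$) and $r_i$ (number of values of $X_i$), and the sums $\alpha_{ij} = \sum_k \alpha_{ijk}$ and $n_{ij} = \sum_k n_{ijk}$ are notation introduced here.

```latex
P(G \mid D) \;\propto\; P(G)\,
\prod_{i=1}^{n} \prod_{j=1}^{q_i}
\frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})}
\prod_{k=1}^{r_i}
\frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})}
```

Because the score factorizes over nodes, adding one edge changes only the factor of the node that receives the new parent, which is what makes an edge-by-edge search inexpensive.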
Hill-climbing search:
Let E be the set of all candidate edges, and let Δ(e) denote the change in the scoring function after a new edge e (e ∈ E) is added to the network structure. Starting from an empty network, select from the candidate edge set E a new edge e that satisfies Δ(e) ≥ Δ(e') for every other candidate edge e'. If such an edge is found, add e to the current network structure, delete e from the candidate edge set E, and continue searching for the next edge that satisfies the condition; if no edge satisfying the condition can be found, stop;
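The greedy search described above can be sketched as follows. The scoring function is passed in as `local_score(node, parents)`, a hypothetical callback returning the score contribution of one node given a parent set. The sketch further assumes that an edge is only added when it improves the score (Δ(e) > 0) and when it keeps the graph acyclic; neither stopping rule is spelled out in the text.

```python
def creates_cycle(parents, u, v):
    """Return True if adding edge u -> v would create a directed cycle."""
    # A cycle appears exactly when v is already an ancestor of u (or u == v).
    stack, seen = [u], set()
    while stack:
        node = stack.pop()
        if node == v:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def hill_climb(nodes, local_score):
    """Greedy edge addition starting from an empty network.

    Returns a dict mapping each node to the set of its parents.
    """
    parents = {n: set() for n in nodes}                    # empty initial network
    candidates = {(u, v) for u in nodes for v in nodes if u != v}
    while True:
        best_edge, best_gain = None, 0.0
        for u, v in sorted(candidates):                    # sorted for determinism
            if creates_cycle(parents, u, v):
                continue
            # Delta(e): change in score when u becomes a parent of v.
            gain = local_score(v, parents[v] | {u}) - local_score(v, parents[v])
            if gain > best_gain:
                best_edge, best_gain = (u, v), gain
        if best_edge is None:                              # no improving edge left
            return parents
        u, v = best_edge
        parents[v].add(u)
        candidates.discard(best_edge)
```

Any decomposable score can be plugged in as `local_score`, for example the logarithm of a per-node Bayesian score computed from the counts $n_{ijk}$.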
step (6), performing parameter learning on the basis of the learned structure;
Parameter learning learns how strongly each variable depends, probabilistically, on its parent nodes, and thereby obtains the local conditional probability distributions. The two parameter learning methods in common use are maximum likelihood estimation and the Bayesian method. When the data set contains too few records, maximum likelihood estimation is usually not accurate enough, and in some cases its formula fails altogether; the Bayesian method overcomes these shortcomings, so the invention uses the Bayesian method. The formula for parameter learning is as follows:
Assuming that the prior distribution of the parameter θ is a Dirichlet distribution, the posterior distribution of θ is also a Dirichlet distribution, and the maximum a posteriori estimate of θ is:
[formula image BDA0002286756760000051 in the original]
where θ = {θ_1, θ_2, ..., θ_n} denotes the conditional probability tables of the nodes $X_i$ in the network relative to their parent sets $Pa(X_i)$; [symbol image BDA0002286756760000052 in the original] is the estimate for the k-th value of $X_i$ given the j-th value of $Pa(X_i)$; [symbol image BDA0002286756760000053 in the original] is a hyperparameter; i indexes the i-th node $X_i$ of the node set X; j indexes the j-th value of $Pa(X_i)$; k indexes the k-th value of $X_i$; and $n_{ijk}$ is the number of instances in the data set that satisfy $X_i = x_{ik}$ and $Pa(X_i) = Pa(X_i)_j$;
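As an illustration of Bayesian parameter learning with a Dirichlet prior, the sketch below estimates one conditional probability table from counts smoothed by a symmetric hyperparameter `alpha`. It uses the widely used form (alpha + n_ijk) / sum_k (alpha + n_ijk), which is an assumption and may differ from the pictured estimate. The function name, the record layout, and the premise that the inefficiency probabilities have already been discretized into a few states are likewise hypothetical.

```python
from collections import Counter


def estimate_cpt(records, child, parents, values, alpha=1.0):
    """Estimate P(child | parents) from discrete records with Dirichlet smoothing.

    records: list of dicts mapping variable name -> discrete value.
    values:  all values the child variable can take.
    Returns {parent_configuration: {child_value: probability}}.
    """
    counts = Counter()
    for row in records:
        parent_cfg = tuple(row[p] for p in parents)
        counts[(parent_cfg, row[child])] += 1      # this is n_ijk

    cpt = {}
    for cfg in {cfg for cfg, _ in counts}:
        total = sum(counts[(cfg, v)] + alpha for v in values)
        cpt[cfg] = {v: (counts[(cfg, v)] + alpha) / total for v in values}
    return cpt
```

Unlike plain maximum likelihood, the smoothed estimate never yields a zero probability for a value that happens not to occur in a small data set, which is the shortcoming mentioned above.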
step (7), traversing the functions and applying the following decision logic to all of them:
(7-1) for each traversed function f, judge whether its inefficiency probability is greater than a preset threshold $threshold_{low}$, which is set by user configuration; if it is, mark the function as an inefficient function and go to (7-2); otherwise mark it as not inefficient and continue with the other functions;
(7-2) set the inefficiency probability of every inefficient function found to 100%; on this premise, use the conditional probability distributions learned for the Bayesian network to calculate the posterior probability of each non-inefficient function node; if the posterior probability exceeds the given threshold $threshold_{low}$, the function is considered to have changed from non-inefficient to inefficient; the number of functions that change in this way is recorded as $N_{pre}$, and execution continues with (7-3);
(7-3) traverse all inefficient functions; for each inefficient function f, set its inefficiency probability to 0% while keeping that of the other inefficient functions at 100%, and on this premise calculate the posterior probability of every non-inefficient function node; if a posterior probability exceeds the given threshold, the function is considered to have changed from non-inefficient to inefficient, and the number of such functions is recorded as [symbol image BDA0002286756760000061 in the original]. Then judge whether [formula image BDA0002286756760000062 in the original] is greater than a preset threshold $threshold_{node}$, which is set by user configuration; if it is, f is considered a key inefficient function and (7-4) is executed; otherwise f is not a key inefficient function and the other inefficient functions are examined in turn;
(7-4) traverse all the inefficient functions, calculate the proportion of each function's running time in the total running time of the program, and judge whether this proportion is greater than a preset threshold $threshold_{weight}$, which is set by user configuration; if it is, the function is a key inefficient function worth optimizing; otherwise it is not worth optimizing;
step (8), displaying the key inefficient functions worth optimizing and analysing their causes.
Further, in the above performance analysis method, the source code instrumentation of HDFS in step (1) collects Trace information by extending HTrace. All communication Trace information among the name nodes, data nodes and client nodes of HDFS is obtained, and function parameter information, including the data size of send, receive, read and write operations, is collected, providing sufficient data for further analysis.
Further, in the above performance analysis method, HDFS performance analysis has no established definition of an inefficient function; the inefficiency probability of a function described in this patent is calculated as:
$$p_{low}(f) = \frac{F_{low}(f)}{F_{all}(f)}$$
where $p_{low}(f)$ is the inefficiency probability of function f and $F_{low}(f)$ is the number of inefficient executions of f. Specifically, instrumentation yields the execution time, or the read/write time per unit of data, $t_f^i$ of every execution of f; the average of the values lying between the 25% and 75% quantiles is calculated and denoted $E(t_f)$; $F_{low}(f)$ is the number of executions of f that satisfy $t_f^i > 1.5\,E(t_f)$, and $F_{all}(f)$ is the total number of executions of f.
Further, in the above performance analysis method, step (7-1) requires the total number of inefficient functions; the criterion is the preset threshold $threshold_{low}$, and the number of originally inefficient functions is counted as:
$$Node_f = \begin{cases} 1, & p_{low}(f) > threshold_{low} \\ 0, & p_{low}(f) \le threshold_{low} \end{cases}$$
$$N_{low} = \sum_f Node_f$$
where $N_{low}$ is the total number of inefficient functions according to the statistical results obtained before the Bayesian network is built, $threshold_{low}$ is the preset threshold for judging whether a function is inefficient, and $Node_f$ is an intermediate variable indicating whether the inefficiency probability of function f is greater than the threshold.
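The counting rule for $N_{low}$ translates directly into code. In this small sketch the function name and the example function labels `f1` to `f3` are placeholders, not functions from Table 1.

```python
def count_inefficient(p_low, threshold_low):
    """Return (Node_f for every function, N_low) for a dict of inefficiency probabilities."""
    # Node_f is 1 when p_low(f) is strictly greater than the threshold, else 0.
    node = {f: 1 if p > threshold_low else 0 for f, p in p_low.items()}
    n_low = sum(node.values())
    return node, n_low


# Placeholder probabilities for three hypothetical functions.
node, n_low = count_inefficient({"f1": 0.40, "f2": 0.10, "f3": 0.25}, threshold_low=0.25)
```

Note that a function whose probability equals the threshold exactly, such as `f3` here, is not counted as inefficient, matching the strict inequality in the formula.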
Further, in the above performance analysis method, the conversion of non-inefficient functions into inefficient ones in step (7-2) proceeds as follows: with the probability of every inefficient function set to 1 as the initial condition, inference is performed on the network, which changes the posterior probabilities of the non-inefficient functions; if a changed posterior probability exceeds the preset threshold $threshold_{low}$, that non-inefficient function is considered to have become inefficient. The number of functions converted from non-inefficient to inefficient is recorded as $N_{pre}$.
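The inference in step (7-2) can be illustrated on a small network by brute-force enumeration. The sketch assumes that every node is binary (1 = inefficient, 0 = not inefficient), that "setting the inefficiency probability to 100%" means fixing the node to state 1 as evidence, and that the evidence has non-zero probability; the data layout and function names are hypothetical, and enumeration is only practical for small networks.

```python
from itertools import product


def joint_probability(assignment, parents, cpt):
    """Probability of one full 0/1 assignment under the network."""
    prob = 1.0
    for node, value in assignment.items():
        cfg = tuple(assignment[p] for p in parents[node])
        p_one = cpt[node][cfg]                 # P(node = 1 | parent configuration)
        prob *= p_one if value == 1 else 1.0 - p_one
    return prob


def posterior(query, evidence, parents, cpt):
    """P(query = 1 | evidence), summing over all unobserved nodes."""
    hidden = [n for n in parents if n not in evidence and n != query]
    totals = {0: 0.0, 1: 0.0}
    for q in (0, 1):
        for combo in product((0, 1), repeat=len(hidden)):
            assignment = dict(evidence)
            assignment[query] = q
            assignment.update(zip(hidden, combo))
            totals[q] += joint_probability(assignment, parents, cpt)
    return totals[1] / (totals[0] + totals[1])


def newly_inefficient(inefficient, parents, cpt, threshold_low):
    """Non-inefficient functions whose posterior exceeds threshold_low."""
    evidence = {f: 1 for f in inefficient}     # inefficient functions fixed to 100%
    others = [f for f in parents if f not in inefficient]
    return [f for f in others if posterior(f, evidence, parents, cpt) > threshold_low]
```

Under these assumptions, $N_{pre}$ is simply the length of the list returned by `newly_inefficient`.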
Further, in the above performance analysis method, the number of functions converted from non-inefficient to inefficient in step (7-3), [symbol image BDA0002286756760000071 in the original], is the number of functions that change from non-inefficient to inefficient after inference on the Bayesian network when, starting from the state of step (7-2), the probability of the inefficient function f is reduced from 100% to 0%. An inefficient function f that satisfies the following condition is considered a key inefficient function:
[formula image BDA0002286756760000072 in the original]
where $threshold_{node}$ is a preset threshold specified by user configuration.
Further, in the above performance analysis method, step (7-4) judges the share of a function's running time in the total running time of the program; the actual running time of the function can be calculated in the following three ways:
The first way:
[formula image BDA0002286756760000073 in the original]
where [symbol image BDA0002286756760000074 in the original] is the actual running time obtained for function f, [symbol image BDA0002286756760000075 in the original] is the average of all execution times of function f lying between the 25% and 75% quantiles, and [symbol image BDA0002286756760000076 in the original] is the average over all nodes of the running times of function f;
The second way:
[formula image BDA0002286756760000077 in the original]
where [symbol image BDA0002286756760000078 in the original] is the actual running time obtained for function f, [symbol image BDA0002286756760000079 in the original] is the sum of all execution times of function f, and [symbol image BDA00022867567600000710 in the original] is the total number of nodes on which function f runs;
The third way:
[formula image BDA0002286756760000081 in the original]
where [symbol image BDA0002286756760000082 in the original] is the actual running time obtained for function f and [symbol image BDA0002286756760000083 in the original] is the total execution time of function f on a node. Because the Trace information obtained by instrumentation contains the IP address of the node on which each execution takes place, [symbol image BDA0002286756760000084 in the original] can be obtained by simply summing over this information.
In practice, a suitable way of calculating the time share is chosen for the judgment, and in some cases several ways may be used together.
Compared with the prior art, the invention has the following advantages:
(1) HDFS performance is analysed and located at function-level granularity. The analysis is based on function features obtained by instrumenting HDFS, so it is fine-grained and can be carried out deeply and systematically.
(2) Because the instrumentation points are selected manually, the instrumented locations can be refined and extended as needed, which makes the method more comprehensive than methods that analyse one specific performance problem.
Drawings
FIG. 1 is a schematic diagram of a system architecture for implementing the performance analysis method for locating the key inefficient function of the HDFS based on the Bayesian network according to the present invention;
FIG. 2 is a flowchart of a performance analysis method for locating a critical inefficiency function of the HDFS based on the Bayesian network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The basic idea of the method is to extract the running-time features of the functions to calculate the inefficiency probability of each function, build a Bayesian network of the functions from these probabilities, and screen the inefficient functions by their time features to obtain the key inefficient functions.
Fig. 1 is a schematic diagram of a system architecture for implementing the performance analysis method for locating the key inefficient functions of HDFS based on a Bayesian network according to the present invention. The HDFS nodes provide Trace data to the system; the Trace merging module merges the traces; the Trace processing module structures the function call information; the Bayesian network construction module produces a structured representation of the Bayesian network and passes it to the Bayesian network inference module for probabilistic inference; and the visualization module presents the key inefficient functions to users.
FIG. 2 is a flow chart of a performance analysis method for locating a key inefficiency function of an HDFS based on a Bayesian network, the detailed flow includes steps (1) to (8):
step (1), performing function-level source code instrumentation on HDFS;
Instrumentation inserts probes, i.e. code for collecting performance data, into functions in the HDFS source code; the instrumented functions are listed in Table 1. This makes it possible to obtain, while an application runs, the entry and exit timestamps of each instrumented HDFS function and the amount of data the function reads or writes. The code that calls the instrumentation function is inserted at the entry of each function to be instrumented, and the function interface used for instrumentation is:
TraceScope newPathTraceScope(String description,String path);
TABLE 1 Instrumentation target functions
[table images BDA0002286756760000091 and BDA0002286756760000101 in the original]
step (2), sampling the data obtained by instrumentation;
When probes are inserted into code segments that execute briefly but very frequently, the instrumentation code can severely disturb program performance. Moreover, if the program runs for a long time and the system is large, the data generated may be too large to store and analyse. An improved HTrace is therefore used for sampling; the sampling method is a token bucket algorithm, with the token bucket size set to 1000 and the time interval set to 180 ms;
step (3), calculating the inefficiency probability of each function;
Statistics are computed over the function execution times obtained in step (2). For functions that involve I/O, the metric is the read/write time per unit of data:
$$t_f^i = \frac{T_f^i}{D_f^i}$$
where $T_f^i$ is the execution time of function f at its i-th execution and $D_f^i$ is the amount of data read or written by function f at its i-th execution. For functions that do not involve I/O, the metric is the execution time itself:
$$t_f^i = T_f^i$$
For each function f, the average of the $t_f^i$ values lying between the 25% and 75% quantiles is calculated; the number of executions whose $t_f^i$ exceeds 1.5 times this average, divided by the total number of executions, is the inefficiency probability of the function;
step (4), constructing the function inefficiency probability data set;
Each execution of an application through steps (1) to (3) yields one set of function inefficiency probabilities, which forms one data record. Different experiments use different workloads or different data scales. At least 50 experiments are run to produce at least 50 records, and the experimental data are merged to construct the data set of function inefficiency probabilities;
step (5), constructing a Bayesian network from the inefficiency data set by structure learning;
Here the nodes of the Bayesian network represent functions, the node parameters represent the inefficiency probabilities of the functions, and the directed edges represent how inefficient behaviour in one function influences another. Bayesian structure learning generates, from a data set, a directed acyclic graph that represents conditional probabilities, using statistical methods and Bayesian probability calculation. The method uses a Bayesian scoring strategy and hill-climbing search for structure learning: Bayesian scoring is a classic and effective scoring strategy, while hill climbing reduces the search space and the complexity of the algorithm, making the method more practical. The structure learning formulas cover two aspects, the scoring function and the search strategy:
Bayesian scoring:
Figure BDA0002286756760000111
where G is a directed acyclic graph of the probabilistic dependencies among the variables X_i in the variable set X, D is the sample data set,
Figure BDA0002286756760000112
is a hyperparameter, i indexes the i-th node X_i of the node set X, j indexes the j-th value of Pa(X_i), k indexes the k-th value of X_i, and n_ijk denotes the number of instances in the data set that satisfy X_i = x_ik and Pa(X_i) = Pa(X_i)_j.
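The scoring formula is present only as an image in this text. For orientation, a commonly used Bayesian-Dirichlet score that is consistent with the symbol definitions given here has the form below. This is an assumption offered for reference, not a transcription of the patent's formula, which may differ in detail (for example, it may be stated in logarithmic form). The symbols n (number of nodes), q_i (number of values of Pa(X_i)), r_i (number of values of X_i), α_ij, and n_ij are notation introduced for this sketch, with α_ijk standing for the hyperparameter.

```latex
P(G, D) = P(G) \prod_{i=1}^{n} \prod_{j=1}^{q_i}
  \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})}
  \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})},
\qquad
\alpha_{ij} = \sum_{k=1}^{r_i} \alpha_{ijk}, \quad
n_{ij} = \sum_{k=1}^{r_i} n_{ijk}
```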
Hill-climbing search:
Let E be the set of all candidate edges, and let Δ(e) denote the change in the scoring function after a new edge e (e ∈ E) is added to the network structure. Starting from an empty network, a new edge e is selected from the candidate edge set E such that Δ(e) ≥ Δ(e') for every other candidate edge e'. If such an edge is found, e is added to the current network structure and deleted from E, and the search continues for the next edge satisfying the condition; the search stops when no edge satisfying the condition can be found;
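The greedy edge-addition search described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the data are assumed to be discretized (each record is a dict mapping a function name to a discrete state), the score is a log Bayesian-Dirichlet family score with a uniform hyperparameter `alpha`, an edge is only added when it strictly improves the score, and edges that would create a cycle are skipped. All function names are hypothetical.

```python
from math import lgamma


def family_score(data, child, parents, alpha=1.0):
    """Log Bayesian-Dirichlet score of `child` given `parents` (uniform alpha)."""
    states = sorted({row[child] for row in data})
    counts = {}
    for row in data:
        config = tuple(row[p] for p in parents)
        by_state = counts.setdefault(config, {})
        by_state[row[child]] = by_state.get(row[child], 0) + 1
    score = 0.0
    for by_state in counts.values():
        n_ij = sum(by_state.values())
        a_ij = alpha * len(states)
        score += lgamma(a_ij) - lgamma(a_ij + n_ij)
        for s in states:
            score += lgamma(alpha + by_state.get(s, 0)) - lgamma(alpha)
    return score


def creates_cycle(parents, u, v):
    """True if adding the edge u -> v would close a cycle (v is an ancestor of u)."""
    stack, seen = [u], set()
    while stack:
        node = stack.pop()
        if node == v:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def hill_climb(data, variables, alpha=1.0):
    """Start from an empty network and repeatedly add the best-scoring edge."""
    parents = {v: [] for v in variables}
    candidates = {(u, v) for u in variables for v in variables if u != v}
    while True:
        best_edge, best_gain = None, 0.0
        for u, v in sorted(candidates):
            if creates_cycle(parents, u, v):
                continue
            gain = (family_score(data, v, parents[v] + [u], alpha)
                    - family_score(data, v, parents[v], alpha))
            if gain > best_gain:
                best_edge, best_gain = (u, v), gain
        if best_edge is None:
            return parents  # no remaining edge improves the score
        parents[best_edge[1]].append(best_edge[0])
        candidates.discard(best_edge)
```

The returned dictionary maps each node to its list of parents, which is one simple way to represent the learned directed acyclic graph.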
step (6), performing parameter learning on the basis of the structure learning;
Parameter learning learns the degree of probabilistic dependence of each variable on its parent nodes, yielding the local conditional probability distribution functions. The two commonly used parameter learning methods are maximum likelihood estimation and the Bayesian method. When the data set contains too few records, maximum likelihood estimation is usually not accurate enough, and in some cases its formula fails altogether (for example, when a parent configuration never occurs in the data, the estimate divides by zero). The Bayesian method overcomes these shortcomings, so the invention adopts the Bayesian method. The specific formula for parameter learning is as follows:
assuming that the prior distribution of the parameter θ is a Dirichlet distribution, the posterior distribution of θ also follows a Dirichlet distribution, and the maximum a posteriori estimate of θ is:
Figure BDA0002286756760000113
θ = {θ_1, θ_2, ···, θ_n} denotes the conditional probability tables of each node X_i in the network relative to its parent node set Pa(X_i),
Figure BDA0002286756760000114
is the estimate for X_i relative to the j-th value of Pa(X_i),
Figure BDA0002286756760000115
is a hyperparameter, i indexes the i-th node X_i of the node set X, j indexes the j-th value of Pa(X_i), k indexes the k-th value of X_i, and n_ijk denotes the number of instances in the data set that satisfy X_i = x_ik and Pa(X_i) = Pa(X_i)_j;
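Parameter learning with a Dirichlet prior can be sketched as follows. The estimation formula is present only as an image in this text, so the sketch uses the textbook mode of the Dirichlet posterior, (α + n_ijk − 1) / Σ_k (α + n_ijk − 1), with a single uniform hyperparameter `alpha` greater than 1. This form, the function name `map_cpt`, and the data layout (a list of dicts) are assumptions.

```python
from collections import Counter


def map_cpt(data, child, parents, states, alpha=2.0):
    """MAP estimate of P(child | parents) under a uniform Dirichlet(alpha) prior.

    Returns {parent_configuration: {state: probability}} for every parent
    configuration observed in `data`.
    """
    if alpha <= 1.0:
        raise ValueError("the posterior mode used here requires alpha > 1")
    counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    configs = {config for config, _ in counts}
    cpt = {}
    for config in configs:
        total = sum(counts[(config, s)] + alpha - 1 for s in states)
        cpt[config] = {s: (counts[(config, s)] + alpha - 1) / total for s in states}
    return cpt
```

With `alpha=2.0` the estimate reduces to add-one smoothing, so a state that never occurs for some parent configuration still receives a non-zero probability, which is the weakness of maximum likelihood estimation that the Bayesian method avoids.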
step (7), traversing all functions and executing the following judgment logic for each of them:
(7-1) for each traversed function f, judging whether its inefficiency probability is greater than a preset threshold threshold_low, which is set according to the user's configuration; if it is greater than the threshold, marking the function as an inefficient function and executing (7-2); otherwise marking the function as not inefficient and continuing to judge the other functions;
(7-2) setting the inefficiency probability of every inefficient function obtained above to 100%; on this premise, calculating the posterior probabilities of the other, non-inefficient function nodes using the conditional probability distributions learned by the Bayesian network; if a posterior probability exceeds the given threshold threshold_low, the corresponding function is considered to have changed from non-inefficient to inefficient; the number of functions that change in this way after the inefficiency probability of all inefficient functions is set to 100% is recorded as N_pre, and execution continues with (7-3);
(7-3) traversing all the inefficient functions: for each inefficient function f, setting its inefficiency probability to 0% while keeping the inefficiency probability of the other inefficient functions at 100%, and on this premise calculating the posterior probabilities of all non-inefficient function nodes; if a posterior probability exceeds the given threshold, the corresponding function is considered to have changed from non-inefficient to inefficient; the number of functions that change in this way is recorded as
Figure BDA0002286756760000121
It is then judged whether
Figure BDA0002286756760000122
is greater than a preset threshold threshold_node, which is set according to the user's configuration; if it is greater, the function f is considered a key inefficient function and (7-4) is executed; otherwise the function is not a key inefficient function and the other inefficient functions continue to be judged;
(7-4) traversing all the inefficient functions, calculating the proportion of each function's running time in the total running time of the program, and judging whether the proportion is greater than a preset threshold threshold_weight, which is set according to the user's configuration; if it is greater than the threshold, the function is a key inefficient function worth optimizing; otherwise the function is not worth optimizing;
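The judgment logic of (7-1) to (7-4) can be sketched as follows. Several parts are assumptions and are labeled as such: `infer` is a hypothetical caller-supplied function that takes evidence (a dict mapping a function name to True for "inefficient" or False for "not inefficient") and returns the posterior inefficiency probability of every other function; the quantity compared against threshold_node is present only as an image in this text, so the sketch assumes the relative drop (N_pre − N_f) / N_pre; and `runtime_share` is assumed to be precomputed.

```python
def count_flipped(posterior, others, threshold_low):
    """Number of non-inefficient functions whose posterior exceeds the threshold."""
    return sum(1 for g in others if posterior[g] > threshold_low)


def key_inefficient_functions(p_low, infer, runtime_share,
                              threshold_low, threshold_node, threshold_weight):
    # (7-1) mark functions whose inefficiency probability exceeds threshold_low
    inefficient = [f for f, p in p_low.items() if p > threshold_low]
    others = [f for f in p_low if f not in inefficient]

    # (7-2) clamp every inefficient function to 100% and count the flips (N_pre)
    all_on = {f: True for f in inefficient}
    n_pre = count_flipped(infer(all_on), others, threshold_low)
    if n_pre == 0:
        return []

    result = []
    for f in inefficient:
        # (7-3) clamp f alone to 0% and recount the flips (N_f)
        evidence = dict(all_on)
        evidence[f] = False
        n_f = count_flipped(infer(evidence), others, threshold_low)
        drop = (n_pre - n_f) / n_pre  # ASSUMED form of the judged quantity
        # (7-4) keep f only if it also accounts for enough of the running time
        if drop > threshold_node and runtime_share[f] > threshold_weight:
            result.append(f)
    return result
```

In practice `infer` would be backed by an inference routine over the learned Bayesian network; keeping it as a parameter lets the decision logic be exercised independently of any particular inference library.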
step (8), displaying the key inefficient functions worth optimizing and analyzing their causes.
Parts of the invention that are not described in detail are within the common knowledge of those skilled in the art.
The above description is only a part of the embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (5)

1. A performance analysis method for locating key inefficient functions of the HDFS (Hadoop Distributed File System) based on a Bayesian network, characterized by comprising the following steps:
step (1): performing function-level instrumentation on specific target functions in the DFSInputStream, BlockReaderLocalLegacy, BlockReaderRemote2, BlockReaderLocal, BlockReaderRemote, BlockSender, BlockReceiver, FSNamesystem, DFSClient, and DistributedFileSystem classes of the distributed file system HDFS, so as to obtain, as the instrumentation data, the function-entry and function-exit timestamps of the instrumented HDFS functions and the amount of data read and written by those functions;
step (2): sampling the instrumentation data using the improved HTrace to obtain data with an equal number of samples for each instrumented function;
step (3): calculating the inefficiency probability of all the instrumented functions to obtain the inefficiency probability of each instrumented function;
step (4): treating steps (1) to (3) as one pass, repeating the pass 50 to 100 times, taking the result of each pass as one data record, and constructing the function inefficiency probability data set from the 50 to 100 records;
step (5): constructing a Bayesian network from the function inefficiency data set by a structure learning method, the structure learning method being a Bayesian scoring strategy with hill-climbing search;
step (6): performing parameter learning on the basis of the structure learning to obtain the local conditional probability distribution functions, the specific parameter learning algorithm being the Bayesian method;
step (7): traversing each instrumented function obtained in step (3) and performing sensitivity analysis on each of them to obtain the key inefficient functions;
wherein the HDFS source code instrumentation in step (1) extends the Trace information collected by HTrace: it acquires all communication Trace information among the name nodes, data nodes, and client nodes in the HDFS, and acquires parameter information of the instrumented functions, where the instrumented target functions are: constructors of DFSInputStream class, openInfo, fetchBlockAt, constructors of readWithStratage, actualGetFromOneDataNode, BlockReaderLocalLegacy/BlockReaderRemote2/BlockReaderLocal/BlockReaderRemote class, readFully, readAll, sendPage or transferTo of BlockSender class, blockSender # sendPacket or readLocal, BlockSender \ sendPacket or WeToSot, docSendBlock, FlushOrSync of BlockReceivePack, receivePack, receiveBlockBuck, getLocats of FSNameckSystems, remetLocato, retrieveTo, getLogetLogetLortent, Setletestentry, SetlementLortende, SetlertLogetLortegetLortement, SetlementLortement, SetletionSedelockLocate, SegetLocate, Se;
step (7) specifically comprises the following steps:
(7-1) for each traversed function f, judging whether its inefficiency probability is greater than a preset threshold threshold_low, which is set according to the user's configuration; if it is greater than the threshold, marking the function as an inefficient function and executing (7-2); otherwise marking the function as not inefficient and continuing to judge the other functions;
(7-2) setting the inefficiency probability of every inefficient function obtained above to 100%, and calculating the posterior probabilities of the other, non-inefficient function nodes using the conditional probability distributions learned by the Bayesian network; if a posterior probability exceeds the given threshold threshold_low, the corresponding function is considered to have changed from non-inefficient to inefficient; the number of functions that change in this way after the inefficiency probability of all inefficient functions is set to 100% is recorded as N_pre, and execution continues with (7-3);
(7-3) traversing all the inefficient functions: for each inefficient function f, setting its inefficiency probability to 0% while keeping the inefficiency probability of the other inefficient functions at 100%, and on this premise calculating the posterior probabilities of all non-inefficient function nodes; if a posterior probability exceeds the given threshold, the corresponding function is considered to have changed from non-inefficient to inefficient; the number of functions that change in this way is recorded as
Figure FDA0002920509630000021
It is then judged whether
Figure FDA0002920509630000022
is greater than a preset threshold threshold_node, which is set according to the user's configuration; if it is greater, the function f is considered a key inefficient function and (7-4) is executed; otherwise the function is not a key inefficient function and the other inefficient functions continue to be judged;
(7-4) traversing all the inefficient functions, calculating the proportion of each function's running time in the total running time of the program, that is, calculating the actual running time of the function, and judging whether the proportion is greater than a preset threshold threshold_weight, which is set according to the user's configuration; if the proportion is smaller than the threshold, the function is not worth optimizing; if it is greater, the function is a key inefficient function worth optimizing.
2. The performance analysis method for locating the key inefficient functions of the HDFS based on the Bayesian network according to claim 1, wherein: in step (2), sampling is performed with an improved version of HTrace, the distributed-system tracing framework open-sourced by Cloudera; the sampling method is the token bucket algorithm, with the token bucket size set to 1000 and the time interval set to 180 ms.
3. The performance analysis method for locating the key inefficient functions of the HDFS based on the Bayesian network according to claim 1, wherein: in step (3), in the performance detection of the HDFS, the inefficiency probability of a function is calculated as:
p_low(f) = F_low(f) / F_all(f)
where p_low(f) is the inefficiency probability of the function f. F_low(f), the number of inefficient executions of the function f, is calculated as follows: instrumentation yields the execution time or the unit data read-write time t_f^i of every execution of the function f; the mean of the t_f^i values lying between the 25% and 75% quantiles is calculated and denoted E(t_f); F_low(f) is the number of executions of the function f that satisfy
t_f^i > 1.5 · E(t_f)
and F_all(f) is the total number of executions of the function f.
4. The performance analysis method for locating the key inefficient functions of the HDFS based on the Bayesian network according to claim 1, wherein: in step (3), the inefficiency probability of each instrumented function is calculated as follows: the function-entry timestamp obtained in step (1) is subtracted from the function-exit timestamp to obtain the execution time of the function, on which the statistics are computed. For functions involving I/O in the HDFS system, the index calculated is the unit data read-write time:
Figure FDA0002920509630000034
where
Figure FDA0002920509630000035
represents the execution time of the function f in its i-th execution, and
Figure FDA0002920509630000036
represents the amount of data read and written by the function f in its i-th execution. For functions that do not involve I/O, the index calculated is the execution time, namely:
Figure FDA0002920509630000037
For each function f, the mean of the values of
Figure FDA0002920509630000038
lying between the 25% and 75% quantiles is calculated; the number of executions exceeding 1.5 times this mean, divided by the total number of executions, is the inefficiency probability of the function.
5. The performance analysis method for locating the key inefficient functions of the HDFS based on the Bayesian network according to claim 1, wherein: in step (7-4), the proportion of a function's running time in the total program running time is calculated in any one of the following three ways:
the first way:
Figure FDA0002920509630000039
where
Figure FDA00029205096300000310
represents the resulting actual running time of the function f,
Figure FDA00029205096300000311
represents the mean of all execution times of the function f lying between the 25% and 75% quantiles, and
Figure FDA00029205096300000312
represents the average value of the running times of the function f on all the nodes;
the second way:
Figure FDA00029205096300000313
where
Figure FDA00029205096300000314
represents the resulting actual running time of the function f,
Figure FDA00029205096300000315
represents the sum of all execution times of the function f, and
Figure FDA0002920509630000041
represents the total number of nodes on which the function f runs;
the third way:
Figure FDA0002920509630000042
where
Figure FDA0002920509630000043
represents the resulting actual running time of the function f, and
Figure FDA0002920509630000044
represents the total execution time of the function f on the node; because the Trace information obtained by instrumentation contains the IP address of the node on which each execution took place,
Figure FDA0002920509630000045
is obtained simply by summing according to this information.
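The token bucket sampling named in claim 2 can be illustrated with a minimal Python sketch. It is not the HTrace API: the class name `TokenBucketSampler` and its methods are hypothetical, and the reading of the "time interval" as the period at which one token is returned to the bucket is an assumption, since the claim gives only the bucket size (1000) and the interval (180 ms).

```python
import time


class TokenBucketSampler:
    """Admit a trace record only while tokens remain in the bucket."""

    def __init__(self, capacity=1000, interval_s=0.180, clock=time.monotonic):
        self.capacity = capacity      # token bucket size (1000 in claim 2)
        self.interval_s = interval_s  # assumed: one token is added per interval
        self.clock = clock            # injectable clock, convenient for testing
        self.tokens = capacity
        self.last_refill = clock()

    def _refill(self):
        elapsed = self.clock() - self.last_refill
        new_tokens = int(elapsed / self.interval_s)
        if new_tokens > 0:
            self.tokens = min(self.capacity, self.tokens + new_tokens)
            self.last_refill += new_tokens * self.interval_s

    def should_sample(self):
        """Return True and consume a token if the record should be kept."""
        self._refill()
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False
```

One sampler instance per instrumented function would cap the number of records kept for each function, which is one way to arrive at comparable sample counts across functions.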
CN201911163380.XA 2019-11-25 2019-11-25 Performance analysis method for positioning HDFS (Hadoop distributed File System) key low-efficiency function based on Bayesian network Active CN110928757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911163380.XA CN110928757B (en) 2019-11-25 2019-11-25 Performance analysis method for positioning HDFS (Hadoop distributed File System) key low-efficiency function based on Bayesian network

Publications (2)

Publication Number Publication Date
CN110928757A CN110928757A (en) 2020-03-27
CN110928757B true CN110928757B (en) 2021-03-23

Family

ID=69851667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911163380.XA Active CN110928757B (en) 2019-11-25 2019-11-25 Performance analysis method for positioning HDFS (Hadoop distributed File System) key low-efficiency function based on Bayesian network

Country Status (1)

Country Link
CN (1) CN110928757B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110262954A (en) * 2019-06-21 2019-09-20 北京航空航天大学 Method based on the automatic learning system reliability model of Condition Monitoring Data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3072045A1 (en) * 2017-08-02 2019-02-07 Strong Force Iot Portfolio 2016, Llc Methods and systems for detection in an industrial internet of things data collection environment with large data sets
CN110032463B (en) * 2019-03-01 2024-02-06 创新先进技术有限公司 System fault positioning method and system based on Bayesian network

Also Published As

Publication number Publication date
CN110928757A (en) 2020-03-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant