CN109726120B

CN109726120B - Software defect confirmation method based on machine learning

Info

Publication number: CN109726120B
Application number: CN201811477275.9A
Authority: CN
Inventors: 柯文俊; 刘悦悦; 江山; 李雅斯; 王坤龙
Original assignee: Beijing Institute of Computer Technology and Applications
Current assignee: Beijing Institute of Computer Technology and Applications
Priority date: 2018-12-05
Filing date: 2018-12-05
Publication date: 2022-03-08
Anticipated expiration: 2038-12-05
Also published as: CN109726120A

Abstract

The invention relates to a software defect confirmation method based on machine learning, which comprises the following steps. Step one: constructing a feature vector. Step two: constructing a defect code knowledge base based on cluster analysis, comprising: inputting a defect code feature vector set as a data set and clustering it; performing cluster integration on the data set by first generating a plurality of clustering results and then integrating them, which comprises collecting the plurality of clustering results and integrating the clustering results; and forming defect code knowledge base samples. Step three: confirming defect code based on supervised learning, comprising: taking the obtained defect code knowledge base samples as input, constructing a multi-class classifier, and using test samples to judge whether the classifier meets the evaluation indexes; if the evaluation indexes are not met, introducing a cost function to iteratively optimize the classifier until they are met. The invention separates false-alarm defects from non-false-alarm defects, thereby confirming software defects accurately and improving testing efficiency.

Description

Software defect confirmation method based on machine learning

Technical Field

The invention relates to a software technology, in particular to a software defect confirmation method based on machine learning.

Background

As software grows more complex and code volume increases, software defect detection and confirmation become increasingly important. Traditional software static analysis searches for possible errors in program code, or evaluates the code, without executing it. It scans the program text to analyze the program's data flow, control flow and the like, so that the system design meets the requirements of modularization, structuring and object orientation, and it improves code reliability by monitoring the code's conformance to standards and its quality.

Existing software static analysis is usually approximation-based, so the information it provides is not always accurate: the program is not actually executed but only analyzed by statically scanning the code. Judging the detection results manually cannot keep pace with the rapid development of software.

Disclosure of Invention

The present invention aims to provide a software defect confirmation method based on machine learning, which is used for solving the problems of the prior art.

The invention relates to a software defect confirmation method based on machine learning, which comprises the following steps. Step one: constructing a feature vector, comprising: first extracting the defect code segments in a defect code set one by one and filtering them into minimized defect code segments by code filtering based on slice analysis; then converting the code segments into abstract syntax trees by a syntax analysis tree method and selecting a suitable C language keyword set according to the different code rules to form feature matrices for multi-line code; and finally obtaining a defect code feature vector set for subsequent machine learning by a feature matrix merging method. Step two: constructing a defect code knowledge base based on cluster analysis, comprising: inputting the defect code feature vector set as a data set and clustering it; performing cluster integration on the data set by first generating a plurality of clustering results and then integrating them, which comprises collecting the plurality of clustering results and integrating the clustering results; and forming defect code knowledge base samples. Step three: confirming defect code based on supervised learning, comprising: taking the obtained defect code knowledge base samples as input, constructing a multi-class classifier, and using test samples to judge whether the classifier meets the evaluation indexes; if the evaluation indexes are not met, introducing a cost function to iteratively optimize the classifier until they are met.

The invention provides a software defect confirming method based on machine learning, which takes the detection result of a software static analysis tool as input, firstly, extracts a minimized defect code segment corresponding to a defect code line by a slice analysis method, and constructs a minimized defect code characteristic vector based on a syntax tree; then, clustering analysis is carried out to construct a defect code knowledge base through feature selection and clustering integration technologies according to the feature vectors; and finally, constructing a software defect code confirmation model based on a defect code knowledge base and a supervised learning method, training, continuously optimizing the model until the specified accuracy is reached, completing the separation work of the false alarm defect and the non-false alarm defect, and achieving the purposes of accurately confirming the software defect and improving the testing efficiency.

Drawings

FIG. 1 shows a flow of a code feature vector construction method;

FIG. 2 is an exemplary diagram of a simple program static slice;

FIG. 3 is a diagram illustrating the construction of an abstract syntax tree for an example code;

FIG. 4 is a schematic diagram of a clustering integration process;

FIG. 5 is a schematic diagram of a one-versus-rest method classification problem;

FIG. 6 illustrates a four-class problem DAG structure.

Detailed Description

In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.

The invention relates to a software defect confirmation method based on machine learning, which comprises the following steps:

Step one: constructing a feature vector;

FIG. 1 shows the flow of the code feature vector construction method. As shown in FIG. 1, code feature vector construction mainly uses code filtering based on slice analysis and feature vector construction based on a syntax analysis tree. First, the defect code segments in the defect code set are extracted one by one and filtered into minimized defect code segments by slice-analysis-based code filtering. The code segments are then converted into abstract syntax trees by the syntax analysis tree method, and a suitable C language keyword set is selected according to the different code rules to form feature matrices for multi-line code. Finally, a quantized, accurately described defect code feature vector set is obtained by the feature matrix merging method for subsequent machine learning.

1. The slice-analysis-based code filtering includes: performing slice analysis on the defect code with a static slicing method. A group of variables of interest is first extracted from a defect code segment (one or more statements); irrelevant code is then filtered out according to the static backward slicing criterion, and the statements in the source code that affect the values of those variables are extracted to form a new code segment, namely the minimized defect code segment related to the variable values. This provides support for the subsequent construction of code feature vectors.
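The filtering step above can be illustrated with a minimal Python sketch of static backward slicing on a straight-line code segment. It assumes each statement has already been annotated with the variables it defines and uses; the `Stmt` type, the `backward_slice` function and the toy program are hypothetical names introduced for illustration. The sketch follows data dependences only, whereas a system-dependency-graph algorithm would also follow control dependences and procedure calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    line: int
    text: str
    defs: frozenset  # variables written by the statement
    uses: frozenset  # variables read by the statement


def backward_slice(stmts, criterion_line, variables):
    """Return the statements at or before criterion_line that may affect `variables`.

    Assumption: straight-line code (no branches, loops or calls).
    """
    relevant = set(variables)
    kept = []
    for s in reversed([s for s in stmts if s.line <= criterion_line]):
        if s.line == criterion_line or s.defs & relevant:
            kept.append(s)
            # The defined variables are now explained; their inputs become relevant.
            relevant -= s.defs
            relevant |= s.uses
    return list(reversed(kept))


program = [
    Stmt(1, "int a = 1;", frozenset({"a"}), frozenset()),
    Stmt(2, "int b = 2;", frozenset({"b"}), frozenset()),
    Stmt(3, "int c = a + 3;", frozenset({"c"}), frozenset({"a"})),
    Stmt(4, "b = b * 2;", frozenset({"b"}), frozenset({"b"})),
    Stmt(5, 'printf("%d", c);', frozenset(), frozenset({"c"})),
]
sliced = backward_slice(program, criterion_line=5, variables={"c"})
print([s.line for s in sliced])
```

With the variable `c` at line 5 as the slicing criterion, the statements that only touch `b` are filtered out, leaving a minimized segment.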

Fig. 2 is a diagram of a simple example of static slicing of a program, and as shown in fig. 2, a static slicing algorithm based on a System Dependency Graph (SDG) is adopted according to a static backward slicing criterion to implement slicing analysis-based code filtering. The SDG static slicing algorithm contains control and data dependencies and procedure call relationships in a single structure. Point of interest p is the statement "system.

As shown in fig. 1, 2, the feature vector construction based on the parse tree includes:

Feature vector construction based on the syntax analysis tree analyzes the syntactic and semantic features of the core code fragment, extracts specific key node types from the syntax analysis tree to describe the code comprehensively, and constructs the corresponding feature vector. This quantifies the code features and serves as the basis for building the defect code knowledge base and the software defect confirmation model.

As shown in fig. 1, establishing a rule set-based code keyword library includes:

A common C language code rule set covering the code structures, function names, parameter variables, constants, operators and the like of the software system is formulated by analyzing C language programming specifications such as GJB 5369 (C language safety subset for aerospace model software) and GJB 8114-2013 (C/C++ language programming safety subset). A keyword library for the corresponding code is then designed for each rule in the rule set. Table 1 shows an example C language keyword library.

TABLE 1

No.	Key node name	Information represented by the key node
1	for	for loop structure
2	stmtexp	A statement
3	decl	Declaration operation
4	incr	Increment operation
5	cond	Comparison operation
6	vari	Common variable
7	para	Environment variable
8	assign	Assignment operation
9	block	Code enclosed in braces
10	mul	Multiplication operation
11	add	Addition operation
12	cons	Constant
13	type	Variable type
14	fun_call	Function call
15	fname	Function name

Building the parse tree includes:

For the minimized defect code fragment set extracted in the previous stage, code is mapped to a syntax analysis tree according to the keywords of the different rules. The syntax analysis tree expresses the syntactic and semantic structure of the source code in tree form: a subtree represents a continuous section of source code, and each code section is parsed into a syntax analysis tree composed of nodes of multiple types.

Establishing the code feature matrix and the feature vector comprises the following steps:

The corresponding feature matrix is constructed by counting the number of occurrences of each relevant node in the syntax analysis tree. For a target code segment, separate feature vectors must be generated for the contexts of the point of interest in order to construct the feature matrix.

The establishing of the code feature matrix and the feature vector specifically comprises:

the feature vector generation comprises:

For the syntax analysis tree of the code fragment described above, 6 non-key nodes (for, vari, para, cons, type and block) are defined. Within a given overall syntax structure, the differences between loop structures need to be hidden, so the for and block nodes are defined as non-key nodes; parameters and variables may be added to or deleted from the code, so the vari, para and cons nodes are defined as non-key nodes to hide the code differences they cause. For clarity, the feature vector depicted in FIG. 3 omits the 6 non-key nodes. The above example can be described with a 10-dimensional vector (stmtexp, decl, incr, cond, assign, mul, add, fun_call, fname, fpara).
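As a rough illustration of the counting described above, the following Python sketch turns a small syntax tree into the 10-dimensional vector while ignoring the 6 non-key node types. The tuple-based tree encoding, the function names and the example tree (loosely modelled on a loop such as `for (i = 0; i < n; i++) { s = s + i; }`) are assumptions made for illustration, not the tree of FIG. 3.

```python
from collections import Counter

# Key node types of the 10-dimensional example; the order fixes the vector layout.
KEY_NODES = ["stmtexp", "decl", "incr", "cond", "assign",
             "mul", "add", "fun_call", "fname", "fpara"]
# Non-key node types, ignored when counting.
NON_KEY_NODES = {"for", "vari", "para", "cons", "type", "block"}


def node_types(tree):
    """Yield the type of every node in a (type, children) tuple tree."""
    node_type, children = tree
    yield node_type
    for child in children:
        yield from node_types(child)


def feature_vector(tree):
    """Count occurrences of each key node type in the tree."""
    counts = Counter(t for t in node_types(tree) if t not in NON_KEY_NODES)
    return [counts[name] for name in KEY_NODES]


# Hypothetical tree for illustration only.
tree = ("for", [
    ("assign", [("vari", []), ("cons", [])]),
    ("cond", [("vari", []), ("vari", [])]),
    ("incr", [("vari", [])]),
    ("block", [
        ("stmtexp", [
            ("assign", [("vari", []), ("add", [("vari", []), ("vari", [])])]),
        ]),
    ]),
])
vector = feature_vector(tree)
print(dict(zip(KEY_NODES, vector)))
```

Because `vari` and `cons` nodes are not counted, renaming a variable or changing a constant leaves the vector unchanged, which is the hiding effect the text describes.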

The feature matrix generation comprises:

Feature matrix generation extends feature vector generation: each row vector of a feature matrix corresponds to the key node vector of one context of the point of interest. This example describes the generation of the feature matrix for the root node, which is the cumulative sum of the feature matrices of all child nodes and the initial feature matrix of the for node. The model requires one post-order traversal of the entire tree to generate the feature matrices of some nodes.
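The accumulation from child nodes to the root can be sketched in Python as a post-order traversal. As a simplifying assumption, each node's feature matrix is reduced here to a single row, so the root's row is the sum of its own initial row and the rows of all its children; `init_vector`, `accumulate` and the example tree are hypothetical.

```python
KEY_NODES = ["stmtexp", "decl", "incr", "cond", "assign",
             "mul", "add", "fun_call", "fname", "fpara"]


def init_vector(node_type):
    """One-hot row for a key node type; all zeros for non-key types such as `for`."""
    return [1 if node_type == name else 0 for name in KEY_NODES]


def accumulate(tree, out):
    """Post-order traversal: children are processed before the node itself.

    Appends (node_type, subtree_vector) to `out` and returns the subtree vector.
    """
    node_type, children = tree
    total = init_vector(node_type)
    for child in children:
        child_vec = accumulate(child, out)
        total = [a + b for a, b in zip(total, child_vec)]
    out.append((node_type, total))
    return total


# Hypothetical tree for illustration only.
tree = ("for", [
    ("assign", []),
    ("cond", []),
    ("incr", []),
    ("block", [("stmtexp", [("assign", [("add", [])])])]),
])
visited = []
root_vector = accumulate(tree, visited)
print([node_type for node_type, _ in visited])
print(root_vector)
```

A single pass is enough because every subtree vector is available by the time its parent is visited.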

The feature matrix combination comprises the following steps:

In this process, a parameter must be set to control the number of merged nodes. The parameter is related to the total number of lexical units of a node and to the number of merged nodes, and it is chosen so as to reduce the false alarm rate as far as possible. The result is a quantized, accurately described code feature vector.

Step two, building a defect code knowledge base based on cluster analysis, which comprises the following steps:

The defect code feature vector set from step one is input as the data set. The cluster integration process is as follows: suppose a data set X has n data objects, X = {x₁, x₂, ..., x_n}. First, the clustering algorithm is applied to the data set X N times to obtain N clustering results P = {P₁, P₂, ..., P_N}, where P_i (i = 1, 2, 3, ..., N) is the clustering result obtained by the i-th clustering algorithm. Then, the consistency function T integrates the clustering results in P to obtain a new data partition P'.

As known from the clustering process, clustering a data set first generates a plurality of clustering results, and then these clusters are integrated.

(1) Multiple clustering result collections

Since the clustering effect varies with the clustering algorithm, the data set, and the feature vector, the selection of each base clusterer is based on the analysis of the data set when clustering.

(2) Multiple clustering result integration

The clustering results are integrated with a voting-based method: for a data point, if most of the clustering results in the set assign it to the i-th class, it is finally assigned to the i-th class.
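A minimal Python sketch of voting-based integration follows. Cluster labels produced by different runs are arbitrary names, so the sketch first relabels each clustering result to agree as far as possible with the first one; this alignment step, the brute-force search over label permutations (practical only for a small number of clusters) and the function names `align` and `vote` are assumptions for illustration and are not specified in the text.

```python
from collections import Counter
from itertools import permutations


def align(reference, labels, k):
    """Relabel `labels` with the permutation of 0..k-1 that best matches `reference`."""
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(perm[l] == r for l, r in zip(labels, reference)),
    )
    return [best[l] for l in labels]


def vote(partitions, k):
    """Consensus partition P' by majority vote over the aligned clustering results."""
    reference = partitions[0]
    aligned = [reference] + [align(reference, p, k) for p in partitions[1:]]
    return [Counter(column).most_common(1)[0][0] for column in zip(*aligned)]


partitions = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 2, 2],  # same grouping, different label names
    [0, 0, 1, 2, 2, 2],  # disagrees on the fourth point
]
consensus = vote(partitions, k=3)
print(consensus)
```

The fourth point is assigned to cluster 1 because two of the three aligned results place it there.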

Step three, defect code confirmation based on supervised learning comprises the following steps:

taking the defect code knowledge base sample obtained in the second step as input, constructing a multi-class classifier, and judging whether the classifier meets the evaluation index by using the test sample; if the evaluation index is not met, a cost function is introduced to carry out iterative optimization on the classifier until the evaluation index is met, and the classification accuracy is improved as much as possible on the premise of ensuring the code recall ratio.

1. The defect classifier construction based on the defect code knowledge base comprises the following steps:

In steps one and two, the defect codes are cluster-analyzed according to the extracted feature vectors to obtain defect code knowledge base samples with labeled categories. To construct the defect code classifier, the sample knowledge base is divided into training samples and test samples: the former are input to the learning algorithm to learn the classifier; the latter are input to the classifier test to evaluate whether the classifier satisfies the corresponding indexes.

The software static analysis result of the invention needs to be divided into three classes, a multi-class classifier is designed by adopting a method of dividing a multi-class classification problem into a plurality of two-class classification problems, and the defect classifier is constructed by mainly adopting the following three schemes.

FIG. 5 is a schematic diagram of a one-versus-rest method classification problem. As shown in FIG. 5, (1) the one-versus-rest method (OVR) includes:

The one-versus-rest method constructs k two-class classifiers (for k classes), where the i-th classifier separates the i-th class from the remaining classes: during training it takes the i-th class in the training set as the positive class ("+1") and the points of the remaining classes as the negative class ("-1"). During discrimination, a test sample is passed through the k classifiers to obtain k output values. If exactly one +1 occurs, the corresponding class is the class of the test sample. If the classes overlap (more than one +1) or the sample cannot be classified (no output is +1), the distance from the test sample to the training samples of each class is compared, and the class with the smallest distance is the class of the test sample.
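The OVR decision rule, including the fallback for overlapping or unclassifiable samples, might be sketched in Python as below. The binary classifiers are hand-written threshold functions standing in for trained models, and the distance to each class is measured to a class centroid; both choices, along with the class names `A`, `B`, `C` and the function name `ovr_predict`, are assumptions for illustration.

```python
import math


def ovr_predict(x, classifiers, centroids):
    """One-versus-rest decision.

    classifiers: dict class -> function returning +1 (this class) or -1 (the rest).
    centroids:   dict class -> mean training point, used only for the fallback.
    """
    positives = [c for c, clf in classifiers.items() if clf(x) == 1]
    if len(positives) == 1:
        return positives[0]
    # Overlap (several +1) or no +1 at all: fall back to the nearest class.
    return min(classifiers, key=lambda c: math.dist(x, centroids[c]))


# Hypothetical stand-ins for three trained two-class classifiers.
classifiers = {
    "A": lambda x: 1 if x[0] < 1.5 else -1,
    "B": lambda x: 1 if 1.0 < x[0] < 3.0 else -1,
    "C": lambda x: 1 if x[0] > 3.5 else -1,
}
centroids = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (4.0, 0.0)}

unique = ovr_predict((0.5, 0.0), classifiers, centroids)       # only A outputs +1
overlap = ovr_predict((1.4, 0.0), classifiers, centroids)      # A and B both output +1
unclassified = ovr_predict((3.2, 0.0), classifiers, centroids)  # no classifier outputs +1
print(unique, overlap, unclassified)
```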

(2) One-to-One method (One against One)

This method trains one classifier between every two classes, so a k-class problem has k(k-1)/2 classifiers. When an unknown sample is classified, each classifier judges its class and votes for the corresponding class, and the class with the most votes is finally taken as the class of the unknown sample.
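The pairwise voting can be sketched in Python as follows. The pairwise classifiers here are nearest-centroid rules on one-dimensional values, used purely as stand-ins for trained two-class classifiers; the class names, centroids and `ovo_predict` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ovo_predict(x, classes, pairwise):
    """pairwise[(a, b)](x) returns the winner, a or b, for that pair of classes."""
    votes = Counter(pairwise[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]


classes = ["A", "B", "C"]
centroids = {"A": 0.0, "B": 2.0, "C": 4.0}

# One stand-in classifier per unordered pair of classes.
pairwise = {
    (a, b): (lambda x, a=a, b=b:
             a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b)
    for a, b in combinations(classes, 2)
}

k = len(classes)
print(len(pairwise), k * (k - 1) // 2)
label = ovo_predict(1.8, classes, pairwise)
print(label)
```

For the sample 1.8, the pairs (A, B) and (B, C) both vote for B and the pair (A, C) votes for A, so B wins with two votes.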

(3) DAG method (directed acyclic graph)

FIG. 6 shows the DAG structure for a four-class problem. As shown in FIG. 6, the DAG method is derived from the decision directed acyclic graph. Its training process is similar to the one-to-one method, but for a k-class problem only (k-1) classifiers are called when a sample is actually classified.
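A short Python sketch of DAG evaluation for a four-class problem is given below. It assumes the common list-based formulation in which each pairwise test compares the first and last remaining candidates and eliminates the loser; the nearest-centroid pairwise classifiers, class names and `dag_predict` are hypothetical stand-ins.

```python
from itertools import combinations


def dag_predict(x, classes, pairwise):
    """Evaluate a decision DAG: each pairwise test eliminates one candidate class."""
    remaining = list(classes)
    calls = 0
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        winner = pairwise[(first, last)](x)
        calls += 1
        # Drop the class that lost this comparison.
        remaining.remove(last if winner == first else first)
    return remaining[0], calls


classes = ["A", "B", "C", "D"]
centroids = {"A": 0.0, "B": 2.0, "C": 4.0, "D": 6.0}

# Trained as in the one-to-one method: one classifier per pair of classes.
pairwise = {
    (a, b): (lambda x, a=a, b=b:
             a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b)
    for a, b in combinations(classes, 2)
}

label, calls = dag_predict(3.7, classes, pairwise)
print(label, calls)
```

Although six pairwise classifiers exist for four classes, only three of them are evaluated for any one sample.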

Due to learning techniques and noisy data in the sample knowledge base, the ideal classifier is often difficult to obtain. Therefore, it is necessary to select an appropriate evaluation index for the classifier to measure the performance of the classifier.

TABLE 2 classifier results versus actual tag comparison Table

Table 2 shows the classification result of the multi-class classifier and the actual label comparison result of the test sample, and the evaluation indexes of the multi-class classifier defined according to the table are as follows.

(1)Accuracy

Accuracy, also called the overall success rate, is the ratio of the number of correctly classified samples to the total number of samples in the test sample set, i.e.:

Accuracy = (number of correctly classified samples) / (total number of samples in the test sample set)

(2)Precision

Precision is the percentage of the samples classified as "need to be modified" that the classifier classified correctly, i.e.:

Precision = (number of samples correctly classified as "need to be modified") / (number of samples classified as "need to be modified")

(3)Recall

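Recall reflects how completely the classifier retrieves the "need to be modified" category, i.e.:

Recall = (number of samples correctly classified as "need to be modified") / (number of samples that actually belong to "need to be modified")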

These three indexes are calculated from the test results on the test sample set. If they do not meet the standard, the learning model is trained again until a multi-class classifier meeting the performance indexes is obtained, providing technical support for software defect confirmation.
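The three indexes can be computed from the actual and predicted labels of the test samples, as in the Python sketch below. The label `need_modify` stands for the "need to be modified" category; the other two label names, the sample data and the `evaluate` function are hypothetical.

```python
def evaluate(actual, predicted, target="need_modify"):
    """Accuracy over all classes; precision and recall for the `target` class."""
    total = len(actual)
    correct = sum(a == p for a, p in zip(actual, predicted))
    true_pos = sum(a == target and p == target for a, p in zip(actual, predicted))
    predicted_pos = sum(p == target for p in predicted)
    actual_pos = sum(a == target for a in actual)
    return {
        "accuracy": correct / total,
        "precision": true_pos / predicted_pos if predicted_pos else 0.0,
        "recall": true_pos / actual_pos if actual_pos else 0.0,
    }


# Hypothetical test-set labels for illustration only.
actual = ["need_modify", "need_modify", "need_modify",
          "false_alarm", "false_alarm", "other", "other", "other"]
predicted = ["need_modify", "need_modify", "false_alarm",
             "false_alarm", "need_modify", "other", "other", "other"]

metrics = evaluate(actual, predicted)
print(metrics)
```

Here six of eight samples are classified correctly, while two of the three samples predicted as `need_modify` and two of the three that actually are `need_modify` are hits.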

2. Iterative optimization of defect classifier based on cost function

When the classifier designed in item 1 of step three confirms software defects, the costs incurred by misclassifying different classes are asymmetric. Misclassification cost is therefore the focus: based on the classification performance of the designed multi-class classifier and on the common classifier evaluation indexes, a new multi-class classifier evaluation index is defined to measure classifier performance, and the cost function and its parameters are adjusted according to the evaluation results to iteratively optimize the classifier until a multi-class classifier meeting the indexes is obtained. The construction of a cost-function-based classifier is briefly described below, taking a two-class classifier as an example.

Constructing a cost function

The binary cost function may be constructed as follows:

where x_i is the proportion of class c_i, x_j is the proportion of class c_j, and F(c_i,c_j) is the cost of misjudging class c_i as class c_j.

For this cost function, a cost matrix as shown in Table 3 is constructed, where c₀ is the positive class and c₁ is the negative class. F(c₀,c₀) and F(c₁,c₁) have the value 0, and the values of F(c₀,c₁) and F(c₁,c₀) are given by the above formula. F(c₀,c₁) represents the cost of misclassifying the positive class as the negative class, and F(c₁,c₀) represents the cost of misclassifying the negative class as the positive class.

TABLE 3 cost matrix

Actual class	Judged as c₀	Judged as c₁
c₀	F(c₀,c₀) = 0	F(c₀,c₁)
c₁	F(c₁,c₀)	F(c₁,c₁) = 0

(1) Constructing a risk function

For the aforementioned dichotomy problem, the risk function can be expressed as:

R(c₀|X)＝P(c₀|X)F(c₀,c₀)+P(c₁|X)F(c₁,c₀)＝P(c₁|X)F(c₁,c₀)

R(c₁|X)＝P(c₀|X)F(c₀,c₁)+P(c₁|X)F(c₁,c₁)＝P(c₀|X)F(c₀,c₁)
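The two risk expressions can be evaluated numerically as in the Python sketch below, where `cost[i][j]` plays the role of F(c_i,c_j) and `posterior[i]` the role of P(c_i|X). Choosing the class with the smaller conditional risk is the standard minimum-risk decision rule and is an assumption here, since the text defines the risks without stating the decision rule; the numeric values are hypothetical.

```python
def min_risk_decision(posterior, cost):
    """posterior[i] = P(c_i | X); cost[i][j] = F(c_i, c_j), cost of judging c_i as c_j.

    Returns the index of the class with minimum conditional risk and all risks,
    where R(c_j | X) = sum_i P(c_i | X) * F(c_i, c_j).
    """
    n = len(posterior)
    risks = [sum(posterior[i] * cost[i][j] for i in range(n)) for j in range(n)]
    decision = min(range(n), key=risks.__getitem__)
    return decision, risks


# Hypothetical values: misjudging c1 as c0 is five times as costly as the reverse.
posterior = [0.7, 0.3]
cost = [
    [0.0, 1.0],  # F(c0, c0), F(c0, c1)
    [5.0, 0.0],  # F(c1, c0), F(c1, c1)
]
decision, risks = min_risk_decision(posterior, cost)
print(decision, risks)
```

Although c₀ is the more probable class, the asymmetric costs give R(c₀|X) = 0.3 × 5 and R(c₁|X) = 0.7 × 1, so the lower-risk decision is c₁.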

(2) adjusting cost function parameters

The cost function related parameters can be determined only through multiple experiments. The method comprises the steps of firstly setting initial parameters according to the distribution of various types in a training sample and related experience, and then determining cost function parameters through multiple tests.
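One possible shape for the repeated tests is sketched below in Python: candidate values of the misclassification cost are tried in turn until recall on the protected class reaches a required level. The fixed value F(c₀,c₁) = 1, the candidate list, the recall threshold, the posterior values and the function name `tune_cost` are all assumptions for illustration; the text only states that the parameters are determined through multiple experiments.

```python
def tune_cost(posteriors, actual, candidates, min_recall):
    """Try values of F(c1, c0) in order; return the first that meets the recall target.

    posteriors[k] = P(c1 | X_k), where c1 is the class whose recall must be protected.
    Assumes F(c0, c1) = 1, F(c0, c0) = F(c1, c1) = 0, and at least one actual c1 sample.
    """
    actual_pos = sum(a == 1 for a in actual)
    for cost_10 in candidates:
        # Decide c1 when R(c1|X) <= R(c0|X), i.e. (1 - p) * 1 <= p * cost_10.
        predicted = [1 if (1 - p) <= p * cost_10 else 0 for p in posteriors]
        hits = sum(a == 1 and d == 1 for a, d in zip(actual, predicted))
        recall = hits / actual_pos
        if recall >= min_recall:
            return cost_10, recall
    return None, 0.0


# Hypothetical classifier outputs and true labels for five test samples.
posteriors = [0.9, 0.4, 0.25, 0.1, 0.6]
actual = [1, 1, 1, 0, 0]
best_cost, recall = tune_cost(posteriors, actual, candidates=[1, 2, 4, 8], min_recall=1.0)
print(best_cost, recall)
```

Raising the cost of missing a c₁ sample lowers the posterior needed to decide c₁, so recall rises from 1/3 at cost 1 to 2/3 at cost 2 and reaches the target at cost 4.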

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. A software defect confirmation method based on machine learning is characterized by comprising the following steps:

step one: constructing a feature vector, comprising:

firstly, extracting defect code segments in a defect code set one by one, filtering the defect code segments into minimized defect code segments by adopting code filtering based on slice analysis, then converting the code segments into an abstract syntax tree by using a syntax analysis tree method, selecting a proper C language keyword set to form a characteristic matrix of a plurality of lines of codes according to different code rules, and finally obtaining a defect code characteristic vector set for subsequent machine learning according to a characteristic matrix merging method;

step two: the defect code knowledge base construction based on cluster analysis comprises the following steps:

inputting a defect code feature vector set as a data set, and clustering; performing cluster integration on a data set, firstly generating a plurality of clustering results, and then integrating the clusters; the method comprises the steps of collecting a plurality of clustering results and integrating the clustering results; forming a defect code knowledge base sample;

step three: supervised learning based defect code validation, comprising:

taking the obtained defect code knowledge base sample as input, constructing a multi-class classifier, and judging whether the classifier meets evaluation indexes or not by using a test sample; if the evaluation index is not met, a cost function is introduced to carry out iterative optimization on the classifier until the evaluation index is met.

2. The machine-learning based software bug validation method of claim 1, wherein the slicing analysis based code filtering comprises: the method comprises the steps of adopting a static slicing method to carry out slicing analysis on defect codes, firstly extracting a corresponding group of concerned variables from a section of defect code segment, then filtering out irrelevant codes according to a static backward slicing criterion, extracting statements which influence the concerned group of variable values in source codes to form a new code segment, and obtaining a minimized defect code segment related to the variable values.

3. The machine-learning-based software defect validation method of claim 2, wherein the slicing analysis-based code filtering is implemented using a static slicing algorithm based on a system dependency graph according to a static backward slicing criterion.

4. The machine-learning-based software bug validation method of claim 2, wherein the parsing tree based feature vector construction comprises:

building a syntax analysis tree, comprising:

aiming at the extracted minimized defect code fragment set, mapping codes to a syntactic analysis tree according to keywords of different rules, wherein the syntactic analysis tree adopts a tree form to express source code grammar and semantic structure logic information, a subtree represents a section of continuous source codes, each section of codes is analyzed into the syntactic analysis tree formed by multiple types of nodes, and a code characteristic matrix and a characteristic vector are established;

a corresponding feature matrix is constructed by counting the number of occurrences of each relevant node in the syntax analysis tree, and for a target code segment, separate feature vectors of the contexts of the point of interest are generated to construct the feature matrix; the feature matrix generation comprises: performing one post-order traversal of the whole tree to generate the feature matrices of some nodes; and feature matrix merging is performed.

5. The machine-learning-based software bug validation method of claim 1, wherein the clustering integration process comprises: suppose a data set X has n data objects, X = {x₁, x₂, ..., x_n}; first, the clustering algorithm is applied to the data set X N times to obtain N clustering results P = {P₁, P₂, ..., P_N}, where P_i (i = 1, 2, 3, ..., N) is the clustering result obtained by the i-th clustering algorithm; and the clustering results in P are integrated through a consistency function T to obtain a new data partition P'.

6. The machine learning-based software bug validation method of claim 1, wherein the bug classifier construction based on the bug code knowledge base comprises: performing cluster analysis on the defect codes according to the extracted feature vectors in the first step and the second step to obtain defect code knowledge base samples with labeled categories, dividing the sample knowledge base into a training sample and a test sample by the structure of a defect code classifier, and inputting the training sample as a learning algorithm to learn the classifier; the test samples are used as input for the classifier test to evaluate whether the classifier satisfies the corresponding index.

7. The machine learning-based software defect validation method of claim 1, wherein the method of constructing a defect classifier comprises:

a one-versus-rest method comprising:

constructing k two-class classifiers, wherein the i-th classifier separates the i-th class from the remaining classes, and during training the i-th classifier takes the i-th class in the training set as the positive class and the points of the remaining classes as the negative class; during discrimination, a test sample is passed through the k classifiers to obtain k output values, and if exactly one +1 occurs, the corresponding class is the class of the test sample; if the classes overlap or the sample cannot be classified, the distance from the test sample to the training samples of each class is compared, and the class with the smallest distance is the class of the test sample;

a one-to-one method comprising:

training a classifier between every two classes, so that for a k-class problem, k (k-1)/2 classifiers exist, when an unknown sample is classified, each classifier judges the class and votes for the corresponding class, and the class with the most votes is finally used as the class of the unknown sample.

8. The software defect validation method based on machine learning of claim 1, wherein selecting a suitable evaluation index for the classifier to measure the performance of the classifier comprises:

the evaluation indexes for defining the multi-class classifier comprise:

the accuracy rate represents the ratio of the number of all correctly classified samples to the number of all samples in the test sample set;

the precision represents the percentage of the number of samples correctly classified by the classifier as needing to be modified to the number of all samples classified as needing to be modified;

the recall represents the recall of the classifier with respect to the class that needs to be modified;

and calculating the test result of the test sample set to obtain the evaluation index of the multi-class classifier, and if the test result does not meet the standard, re-training the learning model to obtain the multi-class classifier meeting the performance index.

9. The machine learning-based software defect validation method of claim 1, wherein when the classifier in step three validates software defects, the costs generated by different classes are asymmetric, according to the classification effect of the multi-class classifier, in combination with the classifier evaluation index, a new multi-class classifier evaluation index is defined to measure the performance of the classifier, and the cost function and parameters are adjusted according to the evaluation result to iteratively optimize the classifier, so as to obtain the multi-class classifier satisfying the index.

10. The machine-learning-based software bug validation method of claim 9, further comprising:

constructing a cost function includes:

the binary cost function is constructed as follows:

wherein x_i is the proportion of class c_i, x_j is the proportion of class c_j, and F(c_i,c_j) is the cost of misjudging class c_i as class c_j; and constructing a cost matrix for the cost function;

constructing a risk function comprising:

for the binary problem, the risk function is expressed as:

R(c₀|X)＝P(c₀|X)F(c₀,c₀)+P(c₁|X)F(c₁,c₀)＝P(c₁|X)F(c₁,c₀)；

R(c₁|X)＝P(c₀|X)F(c₀,c₁)+P(c₁|X)F(c₁,c₁)＝P(c₀|X)F(c₀,c₁)；

wherein c₀ is the positive class, c₁ is the negative class, F(c₀,c₀) and F(c₁,c₁) have the value 0, F(c₀,c₁) represents the cost of misclassifying the positive class as the negative class, and F(c₁,c₀) represents the cost of misclassifying the negative class as the positive class;

and adjusting cost function parameters, comprising: setting initial parameters according to the distribution of the classes in the training samples and related experience, and determining the cost function parameters through multiple tests.